2024-06-19 13:37:48,379 INFO [train.py:1096] (0/2) Training started
2024-06-19 13:37:48,386 INFO [train.py:1106] (0/2) Device: cuda:0
2024-06-19 13:37:48,437 INFO [train.py:1118] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '8f976a1e1407e330e2a233d68f81b1eb5269fdaa', 'k2-git-date': 'Thu Jun 6 02:13:08 2024', 'lhotse-version': '1.24.0.dev+git.4d57d53d.dirty', 'torch-version': '2.3.1+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.9', 'icefall-git-branch': 'feature/ksponspeech_zipformer', 'icefall-git-sha1': '7dda45c9-dirty', 'icefall-git-date': 'Tue Jun 18 16:40:30 2024', 'icefall-path': '/home/ubuntu/icefall', 'k2-path': '/home/ubuntu/miniforge3/envs/lhotse/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/home/ubuntu/lhotse/lhotse/__init__.py', 'hostname': 'gpu-1', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 23456, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_5000/bpe.model', 'base_lr': 0.035, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 550, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 24, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 5000}
2024-06-19 13:37:48,438 INFO [train.py:1120] (0/2) About to create model
2024-06-19 13:37:49,355 INFO [train.py:1124] (0/2) Number of model parameters: 74778511
2024-06-19 13:37:50,216 INFO [train.py:1139] (0/2) Using DDP
2024-06-19 13:37:51,966 INFO [asr_datamodule.py:391] (0/2) About to get train cuts.
2024-06-19 13:37:52,022 INFO [asr_datamodule.py:215] (0/2) Enable MUSAN
2024-06-19 13:37:52,022 INFO [asr_datamodule.py:216] (0/2) About to get Musan cuts
2024-06-19 13:37:54,119 INFO [asr_datamodule.py:240] (0/2) Enable SpecAugment
2024-06-19 13:37:54,119 INFO [asr_datamodule.py:241] (0/2) Time warp factor: 80
2024-06-19 13:37:54,119 INFO [asr_datamodule.py:251] (0/2) Num frame mask: 10
2024-06-19 13:37:54,119 INFO [asr_datamodule.py:264] (0/2) About to create train dataset
2024-06-19 13:37:54,120 INFO [asr_datamodule.py:291] (0/2) Using DynamicBucketingSampler.
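The comma-separated strings in the dump above (num_encoder_layers='2,2,3,4,3,2', downsampling_factor, encoder_dim, ...) encode one value per Zipformer encoder stack. A minimal sketch of how such fields are expanded before the encoder is built; to_int_tuple is an illustrative name, not necessarily the recipe's exact helper:

def to_int_tuple(s: str) -> tuple:
    """'2,2,3,4,3,2' -> (2, 2, 3, 4, 3, 2), one entry per encoder stack."""
    return tuple(int(x) for x in s.split(","))

params = {
    "num_encoder_layers": "2,2,3,4,3,2",
    "downsampling_factor": "1,2,4,8,4,2",
    "encoder_dim": "192,256,384,512,384,256",
    "num_heads": "4,4,4,8,4,4",
}
expanded = {k: to_int_tuple(v) for k, v in params.items()}
assert len(expanded["num_encoder_layers"]) == 6  # six Zipformer stacks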
2024-06-19 13:37:54,677 INFO [asr_datamodule.py:308] (0/2) About to create train dataloader
2024-06-19 13:37:54,678 INFO [asr_datamodule.py:398] (0/2) About to get dev cuts
2024-06-19 13:37:54,680 INFO [asr_datamodule.py:339] (0/2) About to create dev dataset
2024-06-19 13:37:54,839 INFO [asr_datamodule.py:356] (0/2) About to create dev dataloader
2024-06-19 13:37:54,839 INFO [train.py:1330] (0/2) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-06-19 13:46:44,023 INFO [scaling.py:1023] (0/2) Whitening: name=None, num_groups=1, num_channels=192, metric=39.11 vs. limit=7.5
2024-06-19 13:46:44,112 INFO [train.py:1358] (0/2) Maximum memory allocated so far is 15013MB
2024-06-19 13:46:46,240 INFO [train.py:1358] (0/2) Maximum memory allocated so far is 15219MB
2024-06-19 13:46:51,239 INFO [train.py:1358] (0/2) Maximum memory allocated so far is 15219MB
2024-06-19 13:46:54,098 INFO [train.py:1358] (0/2) Maximum memory allocated so far is 15219MB
2024-06-19 13:47:12,143 INFO [train.py:1358] (0/2) Maximum memory allocated so far is 15219MB
2024-06-19 13:47:15,469 INFO [train.py:1358] (0/2) Maximum memory allocated so far is 15219MB
2024-06-19 13:49:24,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.82 vs. limit=7.5
2024-06-19 13:49:25,116 INFO [train.py:1028] (0/2) Epoch 1, batch 0, loss[loss=9.826, simple_loss=8.918, pruned_loss=9.062, over 12956.00 frames. ], tot_loss[loss=9.826, simple_loss=8.918, pruned_loss=9.062, over 12956.00 frames. ], batch size: 36, lr: 1.75e-02, grad_scale: 1.0
2024-06-19 13:49:25,116 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-19 13:49:40,386 INFO [train.py:1060] (0/2) Epoch 1, validation: loss=9.767, simple_loss=8.861, pruned_loss=9.037, over 351949.00 frames.
2024-06-19 13:49:40,386 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 15219MB
2024-06-19 13:49:41,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.75 vs. limit=7.5
2024-06-19 13:49:42,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=0.0, ans=0.25
2024-06-19 13:49:42,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=0.0, ans=0.5
2024-06-19 13:49:43,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.75 vs. limit=7.5
2024-06-19 13:49:46,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=0.0, ans=0.9
2024-06-19 13:49:46,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=7.5
2024-06-19 13:49:47,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=18.333333333333332, ans=0.1993125
2024-06-19 13:49:49,864 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.798e+03 2.880e+03 3.055e+03 3.457e+03 3.724e+03, threshold=1.222e+04, percent-clipped=0.0
2024-06-19 13:49:50,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=18.333333333333332, ans=0.2998166666666667
2024-06-19 13:49:54,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=52.80 vs. limit=5.004583333333334
2024-06-19 13:49:55,773 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=365.14 vs. limit=7.5275
2024-06-19 13:49:57,672 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.515e+02 2.290e+03 2.974e+03 3.166e+03 3.853e+03, threshold=1.190e+04, percent-clipped=0.0
2024-06-19 13:50:03,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=24.86 vs. limit=4.022
2024-06-19 13:50:04,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=372.37 vs. limit=7.520625
2024-06-19 13:50:06,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=346.28 vs. limit=7.520625
2024-06-19 13:50:07,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=378.43 vs. limit=5.0275
2024-06-19 13:50:16,789 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 5.515e+02 1.101e+03 3.024e+03 7.840e+03, threshold=4.406e+03, percent-clipped=0.0
2024-06-19 13:50:17,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=73.33333333333333, ans=0.2011
2024-06-19 13:50:18,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=371.90 vs. limit=7.555
2024-06-19 13:50:23,408 INFO [train.py:1028] (0/2) Epoch 1, batch 50, loss[loss=1.72, simple_loss=1.594, pruned_loss=1.186, over 12798.00 frames. ], tot_loss[loss=4.256, simple_loss=3.966, pruned_loss=2.905, over 575124.10 frames. ], batch size: 29, lr: 1.93e-02, grad_scale: 0.5
2024-06-19 13:50:26,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=91.66666666666667, ans=3.01375
2024-06-19 13:50:28,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=91.66666666666667, ans=0.09793750000000001
2024-06-19 13:50:29,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=204.65 vs. limit=7.534375
2024-06-19 13:50:30,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91.66666666666667, ans=0.48854166666666665
2024-06-19 13:50:33,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=339.13 vs. limit=7.54125
2024-06-19 13:50:36,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=353.35 vs. limit=7.54125
2024-06-19 13:50:38,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=110.38 vs. limit=4.044
2024-06-19 13:50:38,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=235.00 vs. limit=7.54125
2024-06-19 13:50:40,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=36.65 vs. limit=4.051333333333333
2024-06-19 13:50:40,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=128.33333333333334, ans=0.493984375
2024-06-19 13:50:41,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=128.33333333333334, ans=0.1951875
2024-06-19 13:50:42,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=177.70 vs. limit=5.0320833333333335
2024-06-19 13:50:46,486 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=242.38 vs. limit=5.064166666666667
2024-06-19 13:50:47,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=128.33333333333334, ans=7.548125
2024-06-19 13:50:48,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=374.18 vs. limit=7.555
2024-06-19 13:50:49,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=221.94 vs. limit=7.555
2024-06-19 13:50:55,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=146.66666666666666, ans=0.09908333333333334
2024-06-19 13:51:01,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=388.11 vs. limit=7.561875
2024-06-19 13:51:02,749 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=331.90 vs. limit=7.62375
2024-06-19 13:51:04,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=355.92 vs. limit=7.561875
2024-06-19 13:51:05,858 INFO [train.py:1028] (0/2) Epoch 1, batch 100, loss[loss=1.02, simple_loss=0.898, pruned_loss=0.9953, over 13338.00 frames. ], tot_loss[loss=2.454, simple_loss=2.264, pruned_loss=1.81, over 1018015.75 frames. ], batch size: 46, lr: 2.10e-02, grad_scale: 1.0
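The per-batch lines above print three numbers: simple_loss, pruned_loss, and their combination loss. A hedged sketch of the pruned-RNN-T warm-up weighting used by icefall-style recipes, with warm_step=2000 and simple_loss_scale=0.5 from the config dump; the exact ramp is an assumption, but it reproduces the printed values here (e.g. batch 50: 0.9875*1.594 + 0.1225*1.186 ≈ 1.72):

def combined_loss(simple_loss, pruned_loss, batch_idx_train,
                  warm_step=2000, simple_loss_scale=0.5):
    # During warm-up the simple (non-pruned) loss dominates; its weight decays
    # from 1.0 to simple_loss_scale while the pruned-loss weight grows 0.1 -> 1.0.
    w = min(batch_idx_train / warm_step, 1.0)
    s_scale = 1.0 - w * (1.0 - simple_loss_scale)
    p_scale = 0.1 + 0.9 * w
    return s_scale * simple_loss + p_scale * pruned_loss

# Batch 50 above: loss[loss=1.72, simple_loss=1.594, pruned_loss=1.186]
print(round(combined_loss(1.594, 1.186, 50), 3))  # ~1.719, printed as 1.72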
2024-06-19 13:51:06,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=279.69 vs. limit=7.56875
2024-06-19 13:51:07,461 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.229e+01 6.855e+01 4.562e+02 9.632e+02 7.840e+03, threshold=9.124e+02, percent-clipped=0.0
2024-06-19 13:51:08,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=30.68 vs. limit=7.6375
2024-06-19 13:51:12,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=183.33333333333334, ans=0.19312500000000002
2024-06-19 13:51:13,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=360.39 vs. limit=7.575625
2024-06-19 13:51:15,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=298.59 vs. limit=7.575625
2024-06-19 13:51:20,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=12.45 vs. limit=4.080666666666667
2024-06-19 13:51:28,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=93.05 vs. limit=7.5825
2024-06-19 13:51:36,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=20.40 vs. limit=4.0953333333333335
2024-06-19 13:51:37,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=286.51 vs. limit=7.589375
2024-06-19 13:51:37,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=183.75 vs. limit=7.589375
2024-06-19 13:51:40,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=14.42 vs. limit=4.0953333333333335
2024-06-19 13:51:40,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=238.33333333333334, ans=0.488828125
2024-06-19 13:51:41,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=337.67 vs. limit=7.589375
2024-06-19 13:51:42,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=256.6666666666667, ans=0.09839583333333334
2024-06-19 13:51:47,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=32.02 vs. limit=5.064166666666667
2024-06-19 13:51:50,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=64.65 vs. limit=5.128333333333333
2024-06-19 13:51:51,322 INFO [train.py:1028] (0/2) Epoch 1, batch 150, loss[loss=0.9293, simple_loss=0.8046, pruned_loss=0.9233, over 12600.00 frames. ], tot_loss[loss=1.825, simple_loss=1.664, pruned_loss=1.441, over 1365893.74 frames. ], batch size: 29, lr: 2.28e-02, grad_scale: 1.0
2024-06-19 13:51:51,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=75.26 vs. limit=7.603125
2024-06-19 13:51:52,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=98.93 vs. limit=7.603125
2024-06-19 13:51:53,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=17.45 vs. limit=4.11
2024-06-19 13:51:54,343 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=37.37 vs. limit=7.603125
2024-06-19 13:51:56,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=4.11
2024-06-19 13:51:59,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=83.00 vs. limit=7.61
2024-06-19 13:52:01,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.60 vs. limit=5.073333333333333
2024-06-19 13:52:02,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=250.88 vs. limit=7.61
2024-06-19 13:52:11,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=311.6666666666667, ans=0.20467500000000002
2024-06-19 13:52:14,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=7.73375
2024-06-19 13:52:19,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=20.91 vs. limit=4.132
2024-06-19 13:52:21,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=330.0, ans=0.18762500000000001
2024-06-19 13:52:23,559 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=66.05 vs. limit=7.7475
2024-06-19 13:52:24,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348.3333333333333, ans=0.29651666666666665
2024-06-19 13:52:25,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=348.3333333333333, ans=0.483671875
2024-06-19 13:52:30,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=7.76125
2024-06-19 13:52:33,318 INFO [train.py:1028] (0/2) Epoch 1, batch 200, loss[loss=0.9078, simple_loss=0.7826, pruned_loss=0.8637, over 12508.00 frames. ], tot_loss[loss=1.505, simple_loss=1.357, pruned_loss=1.25, over 1635198.28 frames. ], batch size: 202, lr: 2.45e-02, grad_scale: 2.0
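The grad_scale printed at the end of each per-batch line (0.5, 1.0, 2.0, ...) is the loss scale of mixed-precision training (use_fp16=True in the config): it halves after an overflowing step and doubles after a run of clean steps. A minimal sketch of the mechanism using the standard torch.cuda.amp.GradScaler; the tiny model and the init_scale/growth_interval values are illustrative, not the recipe's exact settings, and a CUDA device is assumed as in this run:

import torch

model = torch.nn.Linear(80, 5000).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.035)
scaler = torch.cuda.amp.GradScaler(init_scale=1.0, growth_interval=500)

for _ in range(10):
    x = torch.randn(8, 80, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # internally skipped if inf/nan gradients were produced
    scaler.update()         # backoff (x0.5) on overflow, growth (x2) after clean steps
print(scaler.get_scale())   # the value train.py reports as grad_scale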
2024-06-19 13:52:35,224 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.236e+01 3.699e+01 4.197e+01 4.868e+01 7.672e+01, threshold=8.394e+01, percent-clipped=0.0
2024-06-19 13:52:39,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=396.69 vs. limit=7.6375
2024-06-19 13:52:43,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=366.6666666666667, ans=0.8871666666666667
2024-06-19 13:52:44,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=24.73 vs. limit=7.775
2024-06-19 13:52:44,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=82.56 vs. limit=7.644375
2024-06-19 13:52:47,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=385.0, ans=7.78875
2024-06-19 13:52:49,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=385.0, ans=0.048796875
2024-06-19 13:52:51,739 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=71.37 vs. limit=7.78875
2024-06-19 13:53:05,503 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=169.24 vs. limit=7.658125
2024-06-19 13:53:08,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=136.94 vs. limit=5.210833333333333
2024-06-19 13:53:13,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.67 vs. limit=4.176
2024-06-19 13:53:13,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=338.62 vs. limit=7.665
2024-06-19 13:53:17,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=7.83
2024-06-19 13:53:18,956 INFO [train.py:1028] (0/2) Epoch 1, batch 250, loss[loss=0.7884, simple_loss=0.6711, pruned_loss=0.7484, over 13032.00 frames. ], tot_loss[loss=1.311, simple_loss=1.168, pruned_loss=1.131, over 1846440.11 frames. ], batch size: 144, lr: 2.63e-02, grad_scale: 2.0
2024-06-19 13:53:24,732 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=373.30 vs. limit=7.671875
2024-06-19 13:53:25,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=458.3333333333333, ans=0.18281250000000002
2024-06-19 13:53:28,952 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=50.99 vs. limit=7.67875
2024-06-19 13:53:30,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=24.09 vs. limit=5.238333333333333
2024-06-19 13:53:33,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=476.6666666666667, ans=0.09702083333333333
2024-06-19 13:53:38,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=495.0, ans=0.882675
2024-06-19 13:53:39,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=17.86 vs. limit=7.685625
2024-06-19 13:53:45,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=14.66 vs. limit=5.128333333333333
2024-06-19 13:53:47,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=24.77 vs. limit=7.6925
2024-06-19 13:53:49,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=45.67 vs. limit=7.6925
2024-06-19 13:53:52,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=531.6666666666666, ans=0.7553166666666666
2024-06-19 13:53:56,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=531.6666666666666, ans=0.0880375
2024-06-19 13:54:03,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=531.6666666666666, ans=0.475078125
2024-06-19 13:54:05,543 INFO [train.py:1028] (0/2) Epoch 1, batch 300, loss[loss=0.7611, simple_loss=0.6384, pruned_loss=0.7252, over 13179.00 frames. ], tot_loss[loss=1.177, simple_loss=1.036, pruned_loss=1.041, over 2009471.39 frames. ], batch size: 112, lr: 2.80e-02, grad_scale: 4.0
2024-06-19 13:54:05,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.59 vs. limit=4.22
2024-06-19 13:54:06,873 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=33.76 vs. limit=7.70625
2024-06-19 13:54:07,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.01 vs. limit=7.70625
2024-06-19 13:54:07,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.314e+01 4.826e+01 6.437e+01 1.085e+02 3.879e+02, threshold=1.287e+02, percent-clipped=36.0
2024-06-19 13:54:10,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=225.63 vs. limit=7.70625
2024-06-19 13:54:13,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=87.05 vs. limit=7.70625
2024-06-19 13:54:16,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=93.41 vs. limit=7.92625
2024-06-19 13:54:21,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=17.09 vs. limit=7.713125
2024-06-19 13:54:22,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=586.6666666666666, ans=0.17800000000000002
2024-06-19 13:54:23,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=21.34 vs. limit=7.72
2024-06-19 13:54:23,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.84 vs. limit=7.94
2024-06-19 13:54:24,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.29 vs. limit=7.94
2024-06-19 13:54:25,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=81.59 vs. limit=7.94
2024-06-19 13:54:26,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=99.15 vs. limit=7.72
2024-06-19 13:54:28,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=361.30 vs. limit=7.72
2024-06-19 13:54:42,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.25 vs. limit=7.9675
2024-06-19 13:54:43,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=31.05 vs. limit=7.9675
2024-06-19 13:54:44,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=51.66 vs. limit=7.73375
2024-06-19 13:54:46,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=623.3333333333334, ans=0.47078125
2024-06-19 13:54:47,532 INFO [train.py:1028] (0/2) Epoch 1, batch 350, loss[loss=0.853, simple_loss=0.7023, pruned_loss=0.8233, over 12949.00 frames. ], tot_loss[loss=1.082, simple_loss=0.9414, pruned_loss=0.9728, over 2138601.29 frames. ], batch size: 33, lr: 2.98e-02, grad_scale: 4.0
2024-06-19 13:54:49,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=641.6666666666666, ans=0.469921875
2024-06-19 13:55:01,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.43 vs. limit=7.995
2024-06-19 13:55:05,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.64 vs. limit=4.271333333333334
2024-06-19 13:55:14,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=678.3333333333334, ans=0.468203125
2024-06-19 13:55:18,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=696.6666666666666, ans=0.41291666666666665
2024-06-19 13:55:20,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=696.6666666666666, ans=0.2930333333333333
2024-06-19 13:55:24,834 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=8.03625
2024-06-19 13:55:24,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=37.35 vs. limit=8.03625
2024-06-19 13:55:28,753 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=6.580e+00
2024-06-19 13:55:29,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.01 vs. limit=8.03625
2024-06-19 13:55:32,089 INFO [train.py:1028] (0/2) Epoch 1, batch 400, loss[loss=0.8059, simple_loss=0.6615, pruned_loss=0.7513, over 13182.00 frames. ], tot_loss[loss=1.015, simple_loss=0.8738, pruned_loss=0.9215, over 2239306.46 frames. ], batch size: 63, lr: 3.15e-02, grad_scale: 8.0
2024-06-19 13:55:33,582 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.392e+01 7.864e+01 9.618e+01 1.333e+02 3.416e+02, threshold=1.924e+02, percent-clipped=28.0
2024-06-19 13:55:36,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=733.3333333333334, ans=0.465625
2024-06-19 13:55:41,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=751.6666666666666, ans=0.2924833333333333
2024-06-19 13:55:46,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=751.6666666666666, ans=0.1718125
2024-06-19 13:55:46,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=751.6666666666666, ans=0.464765625
2024-06-19 13:56:05,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=35.62 vs. limit=5.403333333333333
2024-06-19 13:56:07,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=806.6666666666666, ans=0.4621875
2024-06-19 13:56:07,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.18 vs. limit=5.201666666666666
2024-06-19 13:56:12,876 INFO [train.py:1028] (0/2) Epoch 1, batch 450, loss[loss=0.8029, simple_loss=0.6544, pruned_loss=0.7339, over 13223.00 frames. ], tot_loss[loss=0.9668, simple_loss=0.8235, pruned_loss=0.8803, over 2314031.45 frames. ], batch size: 67, lr: 3.33e-02, grad_scale: 8.0
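The lr column above ramps from 1.75e-02 at batch 0 toward base_lr=0.035 around batch 500 and then decays slowly (3.49e-02 by batch 600). A hedged sketch of icefall's Eden schedule with the config's base_lr/lr_batches/lr_epochs; warmup_batches=500 and the 0-based epoch counter used here are assumptions that happen to fit this log:

def eden_lr(batch, epoch, base_lr=0.035, lr_batches=7500,
            lr_epochs=3.5, warmup_batches=500):
    # Linear warm-up from 0.5*lr to lr over warmup_batches, then a slow
    # polynomial decay in both the batch and epoch counters.
    warmup = min(1.0, 0.5 + 0.5 * batch / warmup_batches)
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor * warmup

for b in (0, 100, 500, 600):
    print(b, f"{eden_lr(b, epoch=0):.2e}")
# 0 1.75e-02, 100 2.10e-02, 500 3.50e-02, 600 3.49e-02 -- matching the lr column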
2024-06-19 13:56:13,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=825.0, ans=0.871125
2024-06-19 13:56:17,540 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=70.39 vs. limit=7.809375
2024-06-19 13:56:18,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=158.54 vs. limit=7.809375
2024-06-19 13:56:27,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=843.3333333333334, ans=0.21265
2024-06-19 13:56:28,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.58 vs. limit=7.81625
2024-06-19 13:56:30,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=843.3333333333334, ans=0.39458333333333334
2024-06-19 13:56:32,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=8.14625
2024-06-19 13:56:38,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=172.79 vs. limit=7.823125
2024-06-19 13:56:39,208 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=37.02 vs. limit=7.823125
2024-06-19 13:56:46,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=8.16
2024-06-19 13:56:50,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=898.3333333333334, ans=0.24101666666666666
2024-06-19 13:56:51,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=8.46 vs. limit=8.17375
2024-06-19 13:56:52,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=8.17375
2024-06-19 13:56:53,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=898.3333333333334, ans=0.09438541666666667
2024-06-19 13:56:54,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=898.3333333333334, ans=0.07978750000000001
2024-06-19 13:56:56,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=916.6666666666666, ans=0.45703125
2024-06-19 13:56:57,330 INFO [train.py:1028] (0/2) Epoch 1, batch 500, loss[loss=0.7544, simple_loss=0.6139, pruned_loss=0.6681, over 13091.00 frames. ], tot_loss[loss=0.9298, simple_loss=0.784, pruned_loss=0.8461, over 2375420.18 frames. ], batch size: 121, lr: 3.50e-02, grad_scale: 8.0
2024-06-19 13:56:58,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=916.6666666666666, ans=0.3854166666666667
2024-06-19 13:56:58,893 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.563e+01 6.632e+01 8.943e+01 1.139e+02 2.753e+02, threshold=1.789e+02, percent-clipped=4.0
2024-06-19 13:57:01,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.71 vs. limit=5.229166666666667
2024-06-19 13:57:05,357 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=8.20125
2024-06-19 13:57:08,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=935.0, ans=0.456171875
2024-06-19 13:57:11,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.95 vs. limit=4.374
2024-06-19 13:57:15,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=4.381333333333333
2024-06-19 13:57:17,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.16 vs. limit=7.8575
2024-06-19 13:57:22,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.20 vs. limit=7.864375
2024-06-19 13:57:24,673 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.79 vs. limit=8.22875
2024-06-19 13:57:25,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=971.6666666666666, ans=0.37854166666666667
2024-06-19 13:57:27,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=7.864375
2024-06-19 13:57:31,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=3.1485
2024-06-19 13:57:33,644 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.68 vs. limit=5.495
2024-06-19 13:57:35,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.59 vs. limit=8.2425
2024-06-19 13:57:38,760 INFO [train.py:1028] (0/2) Epoch 1, batch 550, loss[loss=0.7602, simple_loss=0.619, pruned_loss=0.6509, over 12958.00 frames. ], tot_loss[loss=0.9011, simple_loss=0.7526, pruned_loss=0.8164, over 2421010.71 frames. ], batch size: 158, lr: 3.50e-02, grad_scale: 8.0
2024-06-19 13:57:43,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.98 vs. limit=5.252083333333333
2024-06-19 13:57:45,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=27.37 vs. limit=7.878125
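For monitoring, the per-batch lines are regular enough to parse with a small script. A sketch (not an icefall utility) that extracts epoch, batch, smoothed tot_loss, and lr from a log file in the format shown above:

import re

PAT = re.compile(
    r"Epoch (\d+), batch (\d+),.*?"
    r"tot_loss\[loss=([\d.]+).*?\].*?lr: ([\d.e-]+)"
)

def parse(path: str):
    """Return [(epoch, batch, tot_loss, lr), ...] from a train.py log."""
    rows = []
    with open(path) as f:
        for line in f:
            m = PAT.search(line)
            if m:
                epoch, batch, loss, lr = m.groups()
                rows.append((int(epoch), int(batch), float(loss), float(lr)))
    return rows

# e.g. parse("zipformer/exp/log-train") -> [(1, 0, 9.826, 0.0175), (1, 50, 4.256, 0.0193), ...]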
2024-06-19 13:57:47,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=1008.3333333333334, ans=0.19328125000000002
2024-06-19 13:57:48,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1008.3333333333334, ans=0.28991666666666666
2024-06-19 13:57:49,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=9.31 vs. limit=5.0
2024-06-19 13:57:53,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.35 vs. limit=7.885
2024-06-19 13:57:54,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=30.06 vs. limit=7.885
2024-06-19 13:58:00,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=27.78 vs. limit=7.891875
2024-06-19 13:58:00,122 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=173.82 vs. limit=7.891875
2024-06-19 13:58:00,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1045.0, ans=0.863425
2024-06-19 13:58:02,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1045.0, ans=0.369375
2024-06-19 13:58:04,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1045.0, ans=0.45101562500000003
2024-06-19 13:58:07,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=58.26 vs. limit=7.89875
2024-06-19 13:58:14,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=8.31125
2024-06-19 13:58:20,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1081.6666666666667, ans=0.449296875
2024-06-19 13:58:23,109 INFO [train.py:1028] (0/2) Epoch 1, batch 600, loss[loss=0.7555, simple_loss=0.6064, pruned_loss=0.6485, over 12976.00 frames. ], tot_loss[loss=0.8808, simple_loss=0.7286, pruned_loss=0.7932, over 2458170.47 frames. ], batch size: 144, lr: 3.49e-02, grad_scale: 8.0
2024-06-19 13:58:24,727 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.020e+01 7.979e+01 1.031e+02 1.409e+02 2.698e+02, threshold=2.062e+02, percent-clipped=7.0
2024-06-19 13:58:28,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1100.0, ans=0.4484375
2024-06-19 13:58:28,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1100.0, ans=0.4484375
2024-06-19 13:58:29,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=7.9125
2024-06-19 13:58:31,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=1118.3333333333333, ans=0.18709375
2024-06-19 13:58:34,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.48 vs. limit=5.279583333333333
2024-06-19 13:58:36,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.85 vs. limit=8.33875
2024-06-19 13:58:43,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.20 vs. limit=8.3525
2024-06-19 13:58:48,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.93 vs. limit=7.933125
2024-06-19 13:58:48,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1155.0, ans=0.28845
2024-06-19 13:58:48,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1155.0, ans=0.445859375
2024-06-19 13:59:04,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=7.94
2024-06-19 13:59:05,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=30.15 vs. limit=7.94
2024-06-19 13:59:07,944 INFO [train.py:1028] (0/2) Epoch 1, batch 650, loss[loss=0.8706, simple_loss=0.6863, pruned_loss=0.7532, over 13181.00 frames. ], tot_loss[loss=0.8678, simple_loss=0.7108, pruned_loss=0.7761, over 2489007.34 frames. ], batch size: 59, lr: 3.49e-02, grad_scale: 8.0
2024-06-19 13:59:11,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=7.946875
2024-06-19 13:59:11,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=1191.6666666666667, ans=0.444140625
2024-06-19 13:59:13,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1191.6666666666667, ans=0.8582916666666667
2024-06-19 13:59:16,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=8.4075
2024-06-19 13:59:20,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=1210.0, ans=5.605
2024-06-19 13:59:22,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=81.77 vs. limit=7.95375
2024-06-19 13:59:31,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=26.96 vs. limit=7.960625
2024-06-19 13:59:35,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1246.6666666666667, ans=0.07195
2024-06-19 13:59:36,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.11 vs. limit=7.9675
2024-06-19 13:59:36,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=46.43 vs. limit=7.9675
2024-06-19 13:59:40,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1265.0, ans=0.440703125
2024-06-19 13:59:44,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1265.0, ans=0.28735
2024-06-19 13:59:48,838 INFO [train.py:1028] (0/2) Epoch 1, batch 700, loss[loss=0.8542, simple_loss=0.663, pruned_loss=0.74, over 13276.00 frames. ], tot_loss[loss=0.8559, simple_loss=0.6949, pruned_loss=0.7586, over 2511174.90 frames. ], batch size: 46, lr: 3.49e-02, grad_scale: 8.0
2024-06-19 13:59:50,393 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.630e+01 9.866e+01 1.210e+02 1.738e+02 9.321e+02, threshold=2.421e+02, percent-clipped=18.0
2024-06-19 13:59:51,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=25.97 vs. limit=7.98125
2024-06-19 13:59:52,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.67 vs. limit=8.4625
2024-06-19 13:59:53,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.39 vs. limit=8.4625
2024-06-19 13:59:53,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1283.3333333333333, ans=0.8550833333333334
2024-06-19 13:59:58,555 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=26.14 vs. limit=7.988125
2024-06-19 14:00:11,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1320.0, ans=0.5
2024-06-19 14:00:11,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.16 vs. limit=7.995
2024-06-19 14:00:12,011 WARNING [optim.py:503] (0/2) Scaling gradients by 0.0939645916223526, model_norm_threshold=242.07273864746094
2024-06-19 14:00:12,187 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.87, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=5.750e+06, grad_sumsq=1.477e+08, orig_rms_sq=3.893e-02
2024-06-19 14:00:14,071 WARNING [optim.py:503] (0/2) Scaling gradients by 0.07021020352840424, model_norm_threshold=242.07273864746094
2024-06-19 14:00:14,258 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.83, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=9.850e+06, grad_sumsq=2.530e+08, orig_rms_sq=3.893e-02
2024-06-19 14:00:15,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=39.50 vs. limit=8.001875
2024-06-19 14:00:16,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338.3333333333333, ans=0.28661666666666663
2024-06-19 14:00:18,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.41 vs. limit=3.20075
2024-06-19 14:00:24,595 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=39.84 vs. limit=8.001875
2024-06-19 14:00:26,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1356.6666666666667, ans=0.43640625
2024-06-19 14:00:31,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.48 vs. limit=8.5175
2024-06-19 14:00:34,294 INFO [train.py:1028] (0/2) Epoch 1, batch 750, loss[loss=0.8263, simple_loss=0.6387, pruned_loss=0.7022, over 13238.00 frames. ], tot_loss[loss=0.8515, simple_loss=0.6845, pruned_loss=0.7486, over 2526757.25 frames. ], batch size: 63, lr: 3.49e-02, grad_scale: 2.0
2024-06-19 14:00:36,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=1375.0, ans=0.0690625
2024-06-19 14:00:43,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=29.50 vs. limit=8.0225
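The paired optim.py warnings above are the optimizer's gradient-norm control at work: when the gradient magnitude exceeds model_norm_threshold, all gradients are multiplied by the printed "Scaling gradients by ..." factor (threshold divided by the gradient norm), and the companion line names the parameter that carries most of the squared-gradient mass, here module.encoder_embed.conv.0.weight with proportion 0.87. The implied gradient norm can be recovered from the printed numbers:

# Recover the gradient norm implied by the first warning above.
model_norm_threshold = 242.07273864746094
scale = 0.0939645916223526
grad_norm = model_norm_threshold / scale  # since scale = threshold / grad_norm
print(f"{grad_norm:.1f}")  # ~2576.2, i.e. this step's gradients were ~10x over the threshold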
2024-06-19 14:00:56,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1411.6666666666667, ans=0.433828125
2024-06-19 14:00:57,563 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=1.304e-01
2024-06-19 14:01:04,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=1430.0, ans=0.1695625
2024-06-19 14:01:06,709 WARNING [optim.py:503] (0/2) Scaling gradients by 0.0816488191485405, model_norm_threshold=242.07273864746094
2024-06-19 14:01:06,872 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.81, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=7.115e+06, grad_sumsq=1.943e+08, orig_rms_sq=3.662e-02
2024-06-19 14:01:07,669 WARNING [optim.py:503] (0/2) Scaling gradients by 0.0905035063624382, model_norm_threshold=242.07273864746094
2024-06-19 14:01:07,825 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.89, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=6.392e+06, grad_sumsq=1.746e+08, orig_rms_sq=3.662e-02
2024-06-19 14:01:08,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1448.3333333333333, ans=0.221725
2024-06-19 14:01:09,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=36.96 vs. limit=8.043125
2024-06-19 14:01:12,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1448.3333333333333, ans=0.432109375
2024-06-19 14:01:16,204 WARNING [optim.py:503] (0/2) Scaling gradients by 0.08115486800670624, model_norm_threshold=242.07273864746094
2024-06-19 14:01:16,361 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.81, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=7.232e+06, grad_sumsq=1.982e+08, orig_rms_sq=3.648e-02
2024-06-19 14:01:16,395 INFO [train.py:1028] (0/2) Epoch 1, batch 800, loss[loss=0.8739, simple_loss=0.6604, pruned_loss=0.7508, over 12822.00 frames. ], tot_loss[loss=0.847, simple_loss=0.6749, pruned_loss=0.7376, over 2538944.96 frames. ], batch size: 36, lr: 3.49e-02, grad_scale: 4.0
2024-06-19 14:01:16,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=8.05
2024-06-19 14:01:19,874 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.297e+01 2.238e+02 4.222e+02 7.801e+02 3.448e+03, threshold=8.443e+02, percent-clipped=71.0
2024-06-19 14:01:20,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1466.6666666666667, ans=0.8486666666666667
2024-06-19 14:01:21,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=58.96 vs. limit=8.05
2024-06-19 14:01:23,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.89 vs. limit=8.6
2024-06-19 14:01:25,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.46 vs. limit=5.37125
2024-06-19 14:01:33,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=32.56 vs. limit=8.06375
2024-06-19 14:01:34,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1503.3333333333333, ans=0.06617500000000001
2024-06-19 14:01:42,834 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.95 vs. limit=8.6275
2024-06-19 14:01:46,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=38.41 vs. limit=8.070625
2024-06-19 14:01:52,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.64 vs. limit=8.0775
2024-06-19 14:01:52,960 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=27.33 vs. limit=8.0775
2024-06-19 14:01:57,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=31.53 vs. limit=8.0775
2024-06-19 14:01:58,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=32.02 vs. limit=8.0775
2024-06-19 14:02:01,451 INFO [train.py:1028] (0/2) Epoch 1, batch 850, loss[loss=0.8064, simple_loss=0.618, pruned_loss=0.6623, over 13158.00 frames. ], tot_loss[loss=0.8465, simple_loss=0.6678, pruned_loss=0.7308, over 2550760.21 frames. ], batch size: 95, lr: 3.49e-02, grad_scale: 4.0
2024-06-19 14:02:04,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.50 vs. limit=8.084375
2024-06-19 14:02:20,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1595.0, ans=0.425234375
2024-06-19 14:02:21,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1595.0, ans=0.0641125
2024-06-19 14:02:23,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1595.0, ans=0.425234375
2024-06-19 14:02:27,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=131.39 vs. limit=8.105
2024-06-19 14:02:29,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=15.63 vs. limit=8.105
2024-06-19 14:02:30,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1613.3333333333333, ans=0.1395
2024-06-19 14:02:33,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=8.111875
2024-06-19 14:02:40,637 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=41.01 vs. limit=8.111875
2024-06-19 14:02:42,731 INFO [train.py:1028] (0/2) Epoch 1, batch 900, loss[loss=0.8632, simple_loss=0.644, pruned_loss=0.7206, over 12878.00 frames. ], tot_loss[loss=0.8449, simple_loss=0.6609, pruned_loss=0.7221, over 2555964.12 frames. ], batch size: 36, lr: 3.49e-02, grad_scale: 4.0
2024-06-19 14:02:46,743 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 4.980e+02 8.348e+02 1.361e+03 5.092e+03, threshold=1.670e+03, percent-clipped=48.0
2024-06-19 14:02:48,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1650.0, ans=0.84225
2024-06-19 14:02:52,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=1668.3333333333333, ans=0.28331666666666666
2024-06-19 14:02:58,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=8.125625
2024-06-19 14:03:02,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1686.6666666666667, ans=0.13674999999999998
2024-06-19 14:03:03,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=8.765
2024-06-19 14:03:04,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1686.6666666666667, ans=0.08945833333333333
2024-06-19 14:03:05,215 WARNING [optim.py:503] (0/2) Scaling gradients by 0.08229032158851624, model_norm_threshold=1669.5751953125
2024-06-19 14:03:05,405 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.91, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=3.743e+08, grad_sumsq=1.163e+10, orig_rms_sq=3.219e-02
2024-06-19 14:03:16,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=8.139375
2024-06-19 14:03:16,672 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=18.82 vs. limit=8.139375
2024-06-19 14:03:18,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1723.3333333333333, ans=0.41921875
2024-06-19 14:03:24,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=8.14625
2024-06-19 14:03:24,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.10 vs. limit=8.14625
2024-06-19 14:03:26,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=1741.6666666666667, ans=0.22612500000000002
2024-06-19 14:03:26,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.41 vs. limit=8.153125
2024-06-19 14:03:27,386 INFO [train.py:1028] (0/2) Epoch 1, batch 950, loss[loss=0.8899, simple_loss=0.6599, pruned_loss=0.7328, over 12914.00 frames. ], tot_loss[loss=0.8479, simple_loss=0.6575, pruned_loss=0.7172, over 2559035.32 frames. ], batch size: 39, lr: 3.49e-02, grad_scale: 1.0
2024-06-19 14:03:30,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=1741.6666666666667, ans=0.8390416666666667
2024-06-19 14:03:34,615 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=3.220e+02
2024-06-19 14:03:38,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=49.39 vs. limit=8.16
2024-06-19 14:03:39,086 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=4.308e-01
2024-06-19 14:03:44,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1778.3333333333333, ans=0.04444270833333334
2024-06-19 14:03:50,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=1778.3333333333333, ans=0.1333125
2024-06-19 14:03:52,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1796.6666666666667, ans=0.41578125
2024-06-19 14:03:56,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1796.6666666666667, ans=0.41578125
2024-06-19 14:03:59,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=22.77 vs. limit=8.17375
2024-06-19 14:04:03,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1815.0, ans=0.044328125
2024-06-19 14:04:06,488 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=18.28 vs. limit=8.180625
2024-06-19 14:04:07,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.14 vs. limit=5.9075
2024-06-19 14:04:11,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1833.3333333333333, ans=0.8358333333333333
2024-06-19 14:04:12,044 INFO [train.py:1028] (0/2) Epoch 1, batch 1000, loss[loss=0.9052, simple_loss=0.6746, pruned_loss=0.7255, over 13296.00 frames. ], tot_loss[loss=0.8478, simple_loss=0.6524, pruned_loss=0.7093, over 2562347.76 frames. ], batch size: 49, lr: 3.48e-02, grad_scale: 2.0
2024-06-19 14:04:17,714 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 5.068e+02 7.710e+02 1.208e+03 2.029e+04, threshold=1.542e+03, percent-clipped=14.0
2024-06-19 14:04:22,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=1851.6666666666667, ans=0.14584375
2024-06-19 14:04:23,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=88.73 vs.
limit=8.194375 2024-06-19 14:04:28,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=1870.0, ans=0.41234375 2024-06-19 14:04:37,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=8.208125 2024-06-19 14:04:43,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1888.3333333333333, ans=0.1291875 2024-06-19 14:04:43,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.56 vs. limit=8.91625 2024-06-19 14:04:53,369 INFO [train.py:1028] (0/2) Epoch 1, batch 1050, loss[loss=0.8808, simple_loss=0.6479, pruned_loss=0.7036, over 13132.00 frames. ], tot_loss[loss=0.8559, simple_loss=0.6525, pruned_loss=0.7094, over 2565368.35 frames. ], batch size: 77, lr: 3.48e-02, grad_scale: 2.0 2024-06-19 14:05:01,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1943.3333333333333, ans=0.40890625 2024-06-19 14:05:03,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1943.3333333333333, ans=6.214583333333334 2024-06-19 14:05:14,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1961.6666666666667, ans=8.97125 2024-06-19 14:05:16,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=180.17 vs. limit=8.235625 2024-06-19 14:05:19,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1980.0, ans=0.4071875 2024-06-19 14:05:23,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1980.0, ans=8.2425 2024-06-19 14:05:25,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1980.0, ans=6.2375 2024-06-19 14:05:29,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.43 vs. limit=8.249375 2024-06-19 14:05:30,481 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=4.642e-02 2024-06-19 14:05:37,082 INFO [train.py:1028] (0/2) Epoch 1, batch 1100, loss[loss=0.8433, simple_loss=0.6152, pruned_loss=0.6674, over 13272.00 frames. ], tot_loss[loss=0.8614, simple_loss=0.6513, pruned_loss=0.7067, over 2569905.47 frames. 
], batch size: 52, lr: 3.48e-02, grad_scale: 4.0 2024-06-19 14:05:42,721 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 4.247e+02 6.049e+02 8.531e+02 7.194e+03, threshold=1.210e+03, percent-clipped=14.0 2024-06-19 14:05:42,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2016.6666666666667, ans=0.2798333333333333 2024-06-19 14:05:43,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=2016.6666666666667, ans=0.40546875 2024-06-19 14:05:44,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.92 vs. limit=4.806666666666667 2024-06-19 14:05:47,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=9.68 vs. limit=9.026250000000001 2024-06-19 14:05:52,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=2053.3333333333335, ans=0.13449999999999998 2024-06-19 14:05:54,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.62 vs. limit=8.27 2024-06-19 14:05:55,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.65 vs. limit=8.27 2024-06-19 14:05:58,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=58.75 vs. limit=8.27 2024-06-19 14:06:11,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=8.28375 2024-06-19 14:06:12,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=2090.0, ans=0.13243749999999999 2024-06-19 14:06:15,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=2090.0, ans=0.13243749999999999 2024-06-19 14:06:17,977 INFO [train.py:1028] (0/2) Epoch 1, batch 1150, loss[loss=0.9228, simple_loss=0.6669, pruned_loss=0.7246, over 13272.00 frames. ], tot_loss[loss=0.8663, simple_loss=0.65, pruned_loss=0.7032, over 2570792.73 frames. ], batch size: 52, lr: 3.48e-02, grad_scale: 1.0 2024-06-19 14:06:24,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=27.22 vs. limit=8.290625 2024-06-19 14:06:32,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.19 vs. limit=8.2975 2024-06-19 14:06:51,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.44 vs. limit=9.1225 2024-06-19 14:06:53,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=28.18 vs. 
limit=9.13625 2024-06-19 14:06:55,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=2181.6666666666665, ans=0.12728125 2024-06-19 14:06:59,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2181.6666666666665, ans=0.08636458333333334 2024-06-19 14:07:01,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=9.15 2024-06-19 14:07:01,563 INFO [train.py:1028] (0/2) Epoch 1, batch 1200, loss[loss=0.8632, simple_loss=0.6268, pruned_loss=0.663, over 13205.00 frames. ], tot_loss[loss=0.8687, simple_loss=0.6475, pruned_loss=0.6974, over 2573183.59 frames. ], batch size: 77, lr: 3.48e-02, grad_scale: 2.0 2024-06-19 14:07:08,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 4.499e+02 6.061e+02 9.470e+02 6.765e+03, threshold=1.212e+03, percent-clipped=14.0 2024-06-19 14:07:10,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.79 vs. limit=4.887333333333333 2024-06-19 14:07:12,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=9.16375 2024-06-19 14:07:15,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=8.331875 2024-06-19 14:07:23,870 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.58 vs. limit=4.894666666666667 2024-06-19 14:07:26,800 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.47 vs. limit=9.19125 2024-06-19 14:07:28,723 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=3.268e+02 2024-06-19 14:07:32,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2255.0, ans=0.394296875 2024-06-19 14:07:36,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=9.205 2024-06-19 14:07:37,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=22.40 vs. limit=8.3525 2024-06-19 14:07:41,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=8.3525 2024-06-19 14:07:44,622 INFO [train.py:1028] (0/2) Epoch 1, batch 1250, loss[loss=0.8157, simple_loss=0.5955, pruned_loss=0.6131, over 13131.00 frames. ], tot_loss[loss=0.872, simple_loss=0.6451, pruned_loss=0.6929, over 2582140.64 frames. 
], batch size: 112, lr: 3.48e-02, grad_scale: 2.0 2024-06-19 14:07:45,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=2291.6666666666665, ans=0.12109375 2024-06-19 14:07:45,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=2291.6666666666665, ans=0.8197916666666667 2024-06-19 14:07:50,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=9.21875 2024-06-19 14:07:52,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=9.2325 2024-06-19 14:08:04,817 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=28.16 vs. limit=9.24625 2024-06-19 14:08:05,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2328.3333333333335, ans=0.27671666666666667 2024-06-19 14:08:08,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=98.58 vs. limit=8.38 2024-06-19 14:08:09,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=106.92 vs. limit=8.38 2024-06-19 14:08:09,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=8.38 2024-06-19 14:08:18,490 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=8.386875 2024-06-19 14:08:20,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=106.19 vs. limit=8.386875 2024-06-19 14:08:21,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2365.0, ans=0.23547500000000002 2024-06-19 14:08:24,332 INFO [train.py:1028] (0/2) Epoch 1, batch 1300, loss[loss=0.9172, simple_loss=0.6563, pruned_loss=0.692, over 12742.00 frames. ], tot_loss[loss=0.879, simple_loss=0.6445, pruned_loss=0.6926, over 2582429.48 frames. ], batch size: 176, lr: 3.47e-02, grad_scale: 2.0 2024-06-19 14:08:25,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=112.55 vs. 
limit=8.39375 2024-06-19 14:08:26,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2383.3333333333335, ans=0.046375 2024-06-19 14:08:31,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2383.3333333333335, ans=0.04949747468305833 2024-06-19 14:08:32,502 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 4.415e+02 5.926e+02 9.033e+02 6.266e+03, threshold=1.185e+03, percent-clipped=14.0 2024-06-19 14:08:33,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2401.6666666666665, ans=0.387421875 2024-06-19 14:08:48,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2438.3333333333335, ans=0.385703125 2024-06-19 14:08:51,300 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.47 vs. limit=9.32875 2024-06-19 14:09:02,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.93 vs. limit=9.3425 2024-06-19 14:09:07,272 INFO [train.py:1028] (0/2) Epoch 1, batch 1350, loss[loss=0.9556, simple_loss=0.6558, pruned_loss=0.7363, over 13207.00 frames. ], tot_loss[loss=0.8915, simple_loss=0.6452, pruned_loss=0.6988, over 2584043.64 frames. ], batch size: 59, lr: 3.47e-02, grad_scale: 1.0 2024-06-19 14:09:08,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=2475.0, ans=0.383984375 2024-06-19 14:09:10,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.07 vs. limit=8.428125 2024-06-19 14:09:11,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=2475.0, ans=0.383984375 2024-06-19 14:09:11,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=9.35625 2024-06-19 14:09:18,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=42.07 vs. limit=9.370000000000001 2024-06-19 14:09:28,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2511.6666666666665, ans=0.1058125 2024-06-19 14:09:29,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=3.37675 2024-06-19 14:09:30,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.58 vs. limit=9.38375 2024-06-19 14:09:36,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. 
limit=8.44875 2024-06-19 14:09:41,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2548.3333333333335, ans=0.04203645833333333 2024-06-19 14:09:47,390 INFO [train.py:1028] (0/2) Epoch 1, batch 1400, loss[loss=0.9908, simple_loss=0.6732, pruned_loss=0.7576, over 12312.00 frames. ], tot_loss[loss=0.8995, simple_loss=0.6454, pruned_loss=0.6991, over 2586456.99 frames. ], batch size: 25, lr: 3.47e-02, grad_scale: 2.0 2024-06-19 14:09:49,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=2566.6666666666665, ans=0.2743333333333333 2024-06-19 14:09:51,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2566.6666666666665, ans=0.10375 2024-06-19 14:09:59,099 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 5.050e+02 8.148e+02 1.171e+03 6.097e+03, threshold=1.630e+03, percent-clipped=24.0 2024-06-19 14:09:59,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2585.0, ans=0.809525 2024-06-19 14:10:00,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.21 vs. limit=9.43875 2024-06-19 14:10:02,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2585.0, ans=0.378828125 2024-06-19 14:10:03,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=18.66 vs. limit=8.469375 2024-06-19 14:10:25,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2640.0, ans=0.37625 2024-06-19 14:10:27,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.00 vs. limit=8.49 2024-06-19 14:10:28,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2640.0, ans=0.37625 2024-06-19 14:10:29,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2658.3333333333335, ans=0.239875 2024-06-19 14:10:30,442 INFO [train.py:1028] (0/2) Epoch 1, batch 1450, loss[loss=0.8325, simple_loss=0.5874, pruned_loss=0.6085, over 13103.00 frames. ], tot_loss[loss=0.9054, simple_loss=0.6447, pruned_loss=0.6972, over 2585615.51 frames. ], batch size: 121, lr: 3.47e-02, grad_scale: 2.0 2024-06-19 14:10:42,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=41.18 vs. limit=8.50375 2024-06-19 14:10:50,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=40.41 vs. limit=9.52125 2024-06-19 14:10:54,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=49.90 vs. 
limit=9.535 2024-06-19 14:10:59,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2713.3333333333335, ans=0.27286666666666665 2024-06-19 14:11:00,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.24 vs. limit=9.535 2024-06-19 14:11:07,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2731.6666666666665, ans=0.240975 2024-06-19 14:11:08,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.61 vs. limit=6.365833333333333 2024-06-19 14:11:10,144 INFO [train.py:1028] (0/2) Epoch 1, batch 1500, loss[loss=0.9313, simple_loss=0.637, pruned_loss=0.6877, over 13182.00 frames. ], tot_loss[loss=0.9124, simple_loss=0.6441, pruned_loss=0.6969, over 2588221.48 frames. ], batch size: 83, lr: 3.47e-02, grad_scale: 2.0 2024-06-19 14:11:13,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2750.0, ans=0.096875 2024-06-19 14:11:16,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.15 vs. limit=6.375 2024-06-19 14:11:18,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=2750.0, ans=0.09899494936611666 2024-06-19 14:11:21,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.10 vs. limit=9.57625 2024-06-19 14:11:23,526 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.511e+02 9.620e+02 1.404e+03 2.479e+03 1.351e+04, threshold=2.808e+03, percent-clipped=42.0 2024-06-19 14:11:38,199 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.60 vs. limit=8.551874999999999 2024-06-19 14:11:38,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=8.551874999999999 2024-06-19 14:11:47,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=2823.3333333333335, ans=0.0911875 2024-06-19 14:11:54,189 INFO [train.py:1028] (0/2) Epoch 1, batch 1550, loss[loss=0.9637, simple_loss=0.6604, pruned_loss=0.701, over 13054.00 frames. ], tot_loss[loss=0.9215, simple_loss=0.6449, pruned_loss=0.6982, over 2584083.53 frames. ], batch size: 102, lr: 3.46e-02, grad_scale: 1.0 2024-06-19 14:11:54,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2841.6666666666665, ans=0.27158333333333334 2024-06-19 14:11:57,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=8.565625 2024-06-19 14:12:00,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.74 vs. 
limit=9.63125 2024-06-19 14:12:04,199 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.53 vs. limit=6.43 2024-06-19 14:12:09,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2878.3333333333335, ans=0.03523749999999999 2024-06-19 14:12:16,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2878.3333333333335, ans=0.09206249999999998 2024-06-19 14:12:20,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=2896.6666666666665, ans=0.1379166666666667 2024-06-19 14:12:21,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=3.4345 2024-06-19 14:12:23,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2896.6666666666665, ans=0.27103333333333335 2024-06-19 14:12:33,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=2915.0, ans=0.0344125 2024-06-19 14:12:37,441 INFO [train.py:1028] (0/2) Epoch 1, batch 1600, loss[loss=0.966, simple_loss=0.6472, pruned_loss=0.7043, over 13205.00 frames. ], tot_loss[loss=0.9299, simple_loss=0.6454, pruned_loss=0.6987, over 2579648.51 frames. ], batch size: 77, lr: 3.46e-02, grad_scale: 1.0 2024-06-19 14:12:37,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.37 vs. limit=9.7 2024-06-19 14:12:45,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.55 vs. limit=8.606875 2024-06-19 14:12:46,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2951.6666666666665, ans=0.0893125 2024-06-19 14:12:48,321 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.621e+02 1.632e+03 2.277e+03 3.244e+03 1.287e+04, threshold=4.555e+03, percent-clipped=37.0 2024-06-19 14:12:52,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.11 vs. limit=8.61375 2024-06-19 14:12:53,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2970.0, ans=0.2703 2024-06-19 14:12:58,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2970.0, ans=0.36078125 2024-06-19 14:13:02,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=14.65 vs. 
limit=5.747083333333333 2024-06-19 14:13:05,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=2988.3333333333335, ans=0.7954083333333334 2024-06-19 14:13:12,310 WARNING [optim.py:503] (0/2) Scaling gradients by 0.07586484402418137, model_norm_threshold=4554.50439453125 2024-06-19 14:13:12,485 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.53, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.904e+09, grad_sumsq=9.839e+10, orig_rms_sq=1.935e-02 2024-06-19 14:13:13,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.81 vs. limit=6.503333333333333 2024-06-19 14:13:13,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.68 vs. limit=9.754999999999999 2024-06-19 14:13:15,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=34.26 vs. limit=8.6275 2024-06-19 14:13:16,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=9.76875 2024-06-19 14:13:16,585 INFO [train.py:1028] (0/2) Epoch 1, batch 1650, loss[loss=0.985, simple_loss=0.66, pruned_loss=0.7088, over 13151.00 frames. ], tot_loss[loss=0.935, simple_loss=0.6442, pruned_loss=0.6966, over 2576095.08 frames. ], batch size: 95, lr: 3.46e-02, grad_scale: 0.25 2024-06-19 14:13:26,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=3043.3333333333335, ans=0.35734374999999996 2024-06-19 14:13:27,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=3043.3333333333335, ans=0.35734374999999996 2024-06-19 14:13:30,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=39.70 vs. limit=9.7825 2024-06-19 14:13:30,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=3043.3333333333335, ans=0.24565 2024-06-19 14:13:33,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=3061.6666666666665, ans=9.79625 2024-06-19 14:13:42,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=50.93 vs. limit=8.648125 2024-06-19 14:13:48,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=3080.0, ans=0.35562499999999997 2024-06-19 14:13:54,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=3098.3333333333335, ans=0.354765625 2024-06-19 14:13:57,035 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=8.661875 2024-06-19 14:13:59,672 INFO [train.py:1028] (0/2) Epoch 1, batch 1700, loss[loss=0.9786, simple_loss=0.6367, pruned_loss=0.7079, over 12322.00 frames. 
], tot_loss[loss=0.9453, simple_loss=0.6457, pruned_loss=0.6986, over 2581381.91 frames. ], batch size: 25, lr: 3.46e-02, grad_scale: 0.5 2024-06-19 14:14:02,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=3116.6666666666665, ans=0.7811666666666667 2024-06-19 14:14:04,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=8.66875 2024-06-19 14:14:10,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=8.675625 2024-06-19 14:14:12,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.04 vs. limit=8.675625 2024-06-19 14:14:12,627 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.901e+02 1.889e+03 2.542e+03 4.756e+03 6.003e+04, threshold=5.083e+03, percent-clipped=27.0 2024-06-19 14:14:15,440 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.69 vs. limit=9.865 2024-06-19 14:14:24,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.10 vs. limit=9.87875 2024-06-19 14:14:25,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=3171.6666666666665, ans=0.351328125 2024-06-19 14:14:25,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=3171.6666666666665, ans=0.351328125 2024-06-19 14:14:26,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.37 vs. limit=9.87875 2024-06-19 14:14:26,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=32.40 vs. limit=6.585833333333333 2024-06-19 14:14:42,999 INFO [train.py:1028] (0/2) Epoch 1, batch 1750, loss[loss=1.087, simple_loss=0.6893, pruned_loss=0.7875, over 12553.00 frames. ], tot_loss[loss=0.9551, simple_loss=0.6471, pruned_loss=0.7002, over 2581861.45 frames. ], batch size: 22, lr: 3.45e-02, grad_scale: 0.5 2024-06-19 14:14:52,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=3226.6666666666665, ans=0.027399999999999994 2024-06-19 14:14:53,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.05 vs. 
limit=6.613333333333333 2024-06-19 14:15:04,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=3245.0, ans=0.026987499999999998 2024-06-19 14:15:05,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=3245.0, ans=7.028125 2024-06-19 14:15:06,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=3263.3333333333335, ans=0.34703125 2024-06-19 14:15:07,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=3263.3333333333335, ans=0.34703125 2024-06-19 14:15:10,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=3263.3333333333335, ans=0.34703125 2024-06-19 14:15:20,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=48.88 vs. limit=9.96125 2024-06-19 14:15:23,372 INFO [train.py:1028] (0/2) Epoch 1, batch 1800, loss[loss=1.013, simple_loss=0.6597, pruned_loss=0.7143, over 13275.00 frames. ], tot_loss[loss=0.9635, simple_loss=0.6477, pruned_loss=0.7006, over 2581634.42 frames. ], batch size: 67, lr: 3.45e-02, grad_scale: 1.0 2024-06-19 14:15:26,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=3300.0, ans=0.3453125 2024-06-19 14:15:28,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=8.7375 2024-06-19 14:15:37,529 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.089e+02 2.109e+03 3.076e+03 4.523e+03 2.012e+04, threshold=6.152e+03, percent-clipped=20.0 2024-06-19 14:15:49,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=3355.0, ans=0.06128125000000001 2024-06-19 14:15:52,789 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=51.13 vs. limit=10.01625 2024-06-19 14:15:55,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=5.349333333333334 2024-06-19 14:16:02,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.24 vs. limit=10.03 2024-06-19 14:16:02,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=30.95 vs. limit=8.765 2024-06-19 14:16:06,432 INFO [train.py:1028] (0/2) Epoch 1, batch 1850, loss[loss=1.005, simple_loss=0.6413, pruned_loss=0.7079, over 13182.00 frames. ], tot_loss[loss=0.9718, simple_loss=0.6477, pruned_loss=0.7013, over 2582837.68 frames. ], batch size: 83, lr: 3.45e-02, grad_scale: 0.25 2024-06-19 14:16:06,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=3391.6666666666665, ans=0.341015625 2024-06-19 14:16:09,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.49 vs. 
limit=8.771875 2024-06-19 14:16:09,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.08 vs. limit=8.771875 2024-06-19 14:16:09,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=3391.6666666666665, ans=0.341015625 2024-06-19 14:16:17,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=3410.0, ans=0.07374999999999998 2024-06-19 14:16:17,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=3410.0, ans=0.34015625 2024-06-19 14:16:20,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=3410.0, ans=0.072125 2024-06-19 14:16:27,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3428.3333333333335, ans=0.339296875 2024-06-19 14:16:29,086 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.79 vs. limit=8.785625 2024-06-19 14:16:36,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=3446.6666666666665, ans=0.06916666666666671 2024-06-19 14:16:39,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3465.0, ans=0.26535 2024-06-19 14:16:43,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=3465.0, ans=0.07006249999999997 2024-06-19 14:16:44,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=184.43 vs. limit=8.799375 2024-06-19 14:16:45,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=3483.3333333333335, ans=0.33671875 2024-06-19 14:16:46,315 INFO [train.py:1028] (0/2) Epoch 1, batch 1900, loss[loss=0.9417, simple_loss=0.6061, pruned_loss=0.6527, over 13180.00 frames. ], tot_loss[loss=0.9745, simple_loss=0.645, pruned_loss=0.6976, over 2585859.27 frames. ], batch size: 95, lr: 3.45e-02, grad_scale: 0.5 2024-06-19 14:16:47,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=3483.3333333333335, ans=0.33671875 2024-06-19 14:17:04,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=3501.6666666666665, ans=7.188541666666667 2024-06-19 14:17:05,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.833e+02 4.394e+03 6.622e+03 1.141e+04 5.665e+04, threshold=1.324e+04, percent-clipped=53.0 2024-06-19 14:17:05,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=76.53 vs. 
limit=8.82 2024-06-19 14:17:09,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=3520.0, ans=0.33499999999999996 2024-06-19 14:17:10,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.38 vs. limit=5.88 2024-06-19 14:17:13,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=3538.3333333333335, ans=0.06731249999999997 2024-06-19 14:17:19,550 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=8.826875 2024-06-19 14:17:20,354 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=5.8845833333333335 2024-06-19 14:17:26,449 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=41.28 vs. limit=10.1675 2024-06-19 14:17:28,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.79 vs. limit=5.889166666666666 2024-06-19 14:17:29,163 INFO [train.py:1028] (0/2) Epoch 1, batch 1950, loss[loss=1.044, simple_loss=0.6523, pruned_loss=0.7254, over 13271.00 frames. ], tot_loss[loss=0.9768, simple_loss=0.6423, pruned_loss=0.6935, over 2592122.23 frames. ], batch size: 52, lr: 3.44e-02, grad_scale: 0.25 2024-06-19 14:17:29,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=3575.0, ans=10.18125 2024-06-19 14:17:33,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=3575.0, ans=0.04890625000000001 2024-06-19 14:17:34,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=10.18125 2024-06-19 14:17:40,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=3593.3333333333335, ans=0.26406666666666667 2024-06-19 14:17:44,486 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=95.78 vs. limit=8.8475 2024-06-19 14:17:50,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.91 vs. limit=10.20875 2024-06-19 14:17:52,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.25 vs. limit=8.854375000000001 2024-06-19 14:18:02,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=3648.3333333333335, ans=0.017912499999999984 2024-06-19 14:18:02,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=3648.3333333333335, ans=0.044781249999999995 2024-06-19 14:18:05,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.03 vs. 
limit=10.23625 2024-06-19 14:18:09,468 INFO [train.py:1028] (0/2) Epoch 1, batch 2000, loss[loss=1.074, simple_loss=0.6649, pruned_loss=0.7415, over 12667.00 frames. ], tot_loss[loss=0.9847, simple_loss=0.6432, pruned_loss=0.6933, over 2588130.80 frames. ], batch size: 22, lr: 3.44e-02, grad_scale: 0.25 2024-06-19 14:18:12,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=3666.6666666666665, ans=0.0175 2024-06-19 14:18:18,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=3666.6666666666665, ans=0.06249999999999997 2024-06-19 14:18:19,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.91 vs. limit=10.26375 2024-06-19 14:18:21,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=3685.0, ans=0.7710250000000001 2024-06-19 14:18:22,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=3685.0, ans=0.06181249999999999 2024-06-19 14:18:27,841 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.017e+03 7.127e+03 1.010e+04 1.722e+04 6.991e+04, threshold=2.019e+04, percent-clipped=40.0 2024-06-19 14:18:36,357 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.10 vs. limit=5.930416666666667 2024-06-19 14:18:38,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=33.63 vs. limit=10.29125 2024-06-19 14:18:41,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.80 vs. limit=10.29125 2024-06-19 14:18:41,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.94 vs. limit=8.895624999999999 2024-06-19 14:18:44,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=3740.0, ans=0.05974999999999997 2024-06-19 14:18:46,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3740.0, ans=0.2626 2024-06-19 14:18:46,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=3740.0, ans=5.9350000000000005 2024-06-19 14:18:51,868 INFO [train.py:1028] (0/2) Epoch 1, batch 2050, loss[loss=1.099, simple_loss=0.6664, pruned_loss=0.7659, over 12667.00 frames. ], tot_loss[loss=0.9944, simple_loss=0.6448, pruned_loss=0.6955, over 2584600.36 frames. ], batch size: 29, lr: 3.44e-02, grad_scale: 0.125 2024-06-19 14:18:51,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=3758.3333333333335, ans=0.323828125 2024-06-19 14:19:11,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.32 vs. limit=3.5692500000000003 2024-06-19 14:19:12,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=21.88 vs. 
limit=8.923125 2024-06-19 14:19:12,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=5.518 2024-06-19 14:19:16,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=80.89 vs. limit=10.36 2024-06-19 14:19:18,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3813.3333333333335, ans=0.32125000000000004 2024-06-19 14:19:26,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.68 vs. limit=6.9158333333333335 2024-06-19 14:19:32,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=3831.6666666666665, ans=0.013787499999999994 2024-06-19 14:19:34,401 INFO [train.py:1028] (0/2) Epoch 1, batch 2100, loss[loss=1.075, simple_loss=0.672, pruned_loss=0.7388, over 13166.00 frames. ], tot_loss[loss=1.004, simple_loss=0.6468, pruned_loss=0.6994, over 2587050.15 frames. ], batch size: 59, lr: 3.43e-02, grad_scale: 0.125 2024-06-19 14:19:36,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=3850.0, ans=0.013374999999999998 2024-06-19 14:19:39,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=8.94375 2024-06-19 14:19:40,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.28 vs. limit=6.925 2024-06-19 14:19:42,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.28 vs. limit=10.401250000000001 2024-06-19 14:19:43,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.77 vs. limit=10.401250000000001 2024-06-19 14:19:43,986 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=8.950625 2024-06-19 14:19:52,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=10.415 2024-06-19 14:19:52,291 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.170e+03 5.486e+03 9.618e+03 1.340e+04 1.020e+05, threshold=1.924e+04, percent-clipped=11.0 2024-06-19 14:20:01,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=55.75 vs. limit=10.42875 2024-06-19 14:20:01,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=8.964375 2024-06-19 14:20:03,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3905.0, ans=0.26095 2024-06-19 14:20:04,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=2.63 vs. 
limit=5.562 2024-06-19 14:20:04,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=8.964375 2024-06-19 14:20:11,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=3923.3333333333335, ans=0.31609375 2024-06-19 14:20:14,577 INFO [train.py:1028] (0/2) Epoch 1, batch 2150, loss[loss=1.087, simple_loss=0.6728, pruned_loss=0.7504, over 13258.00 frames. ], tot_loss[loss=1.01, simple_loss=0.6472, pruned_loss=0.7007, over 2589222.45 frames. ], batch size: 52, lr: 3.43e-02, grad_scale: 0.125 2024-06-19 14:20:15,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.17 vs. limit=5.985416666666667 2024-06-19 14:20:22,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=3960.0, ans=0.037625000000000006 2024-06-19 14:20:29,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.60 vs. limit=5.99 2024-06-19 14:20:31,389 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. limit=8.991875 2024-06-19 14:20:35,469 WARNING [optim.py:503] (0/2) Scaling gradients by 0.09887780249118805, model_norm_threshold=19236.69140625 2024-06-19 14:20:35,648 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.65, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=2.449e+10, grad_sumsq=1.444e+12, orig_rms_sq=1.696e-02 2024-06-19 14:20:41,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.79 vs. limit=8.99875 2024-06-19 14:20:41,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=31.13 vs. limit=10.4975 2024-06-19 14:20:53,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=4015.0, ans=0.035 2024-06-19 14:20:56,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=41.28 vs. limit=7.0075 2024-06-19 14:20:57,930 INFO [train.py:1028] (0/2) Epoch 1, batch 2200, loss[loss=0.9816, simple_loss=0.624, pruned_loss=0.6696, over 13209.00 frames. ], tot_loss[loss=1.015, simple_loss=0.649, pruned_loss=0.7018, over 2589891.48 frames. ], batch size: 83, lr: 3.43e-02, grad_scale: 0.125 2024-06-19 14:21:15,954 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.117e+03 1.038e+04 1.339e+04 2.109e+04 1.946e+05, threshold=2.678e+04, percent-clipped=26.0 2024-06-19 14:21:37,522 INFO [train.py:1028] (0/2) Epoch 1, batch 2250, loss[loss=1.037, simple_loss=0.6507, pruned_loss=0.7113, over 13236.00 frames. ], tot_loss[loss=1.018, simple_loss=0.6494, pruned_loss=0.7021, over 2589958.35 frames. 
], batch size: 63, lr: 3.43e-02, grad_scale: 0.125 2024-06-19 14:21:46,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=4143.333333333333, ans=10.6075 2024-06-19 14:21:51,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=9.053749999999999 2024-06-19 14:21:52,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.99 vs. limit=6.035833333333333 2024-06-19 14:21:54,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=9.060625 2024-06-19 14:21:56,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=4161.666666666667, ans=0.304921875 2024-06-19 14:22:04,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=4180.0, ans=0.161335 2024-06-19 14:22:04,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=4180.0, ans=0.07 2024-06-19 14:22:07,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=60.90 vs. limit=10.635 2024-06-19 14:22:12,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4198.333333333333, ans=0.25801666666666667 2024-06-19 14:22:12,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=3.62975 2024-06-19 14:22:12,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=10.64875 2024-06-19 14:22:13,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=10.64875 2024-06-19 14:22:19,953 INFO [train.py:1028] (0/2) Epoch 1, batch 2300, loss[loss=1.08, simple_loss=0.6639, pruned_loss=0.7485, over 12984.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6484, pruned_loss=0.7011, over 2584757.27 frames. ], batch size: 33, lr: 3.42e-02, grad_scale: 0.125 2024-06-19 14:22:22,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=4216.666666666667, ans=0.26325 2024-06-19 14:22:23,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.09 vs. limit=10.662500000000001 2024-06-19 14:22:31,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.27 vs. limit=9.088125 2024-06-19 14:22:31,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.98 vs. 
limit=7.1175 2024-06-19 14:22:35,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=4253.333333333333, ans=0.04894444444444445 2024-06-19 14:22:35,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=4253.333333333333, ans=0.04949747468305833 2024-06-19 14:22:37,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.35 vs. limit=10.69 2024-06-19 14:22:38,937 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.177e+03 1.387e+04 2.044e+04 3.046e+04 1.525e+05, threshold=4.088e+04, percent-clipped=32.0 2024-06-19 14:22:51,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4290.0, ans=0.2571 2024-06-19 14:22:53,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=4290.0, ans=0.74985 2024-06-19 14:22:59,682 INFO [train.py:1028] (0/2) Epoch 1, batch 2350, loss[loss=1.099, simple_loss=0.6834, pruned_loss=0.7575, over 13259.00 frames. ], tot_loss[loss=1.02, simple_loss=0.6479, pruned_loss=0.7015, over 2587373.33 frames. ], batch size: 67, lr: 3.42e-02, grad_scale: 0.125 2024-06-19 14:23:16,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=38.07 vs. limit=10.745000000000001 2024-06-19 14:23:27,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.68 vs. limit=10.772499999999999 2024-06-19 14:23:39,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=10.786249999999999 2024-06-19 14:23:39,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=9.143125 2024-06-19 14:23:40,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=4381.666666666667, ans=0.036307291666666665 2024-06-19 14:23:43,066 INFO [train.py:1028] (0/2) Epoch 1, batch 2400, loss[loss=1.076, simple_loss=0.6652, pruned_loss=0.743, over 13277.00 frames. ], tot_loss[loss=1.018, simple_loss=0.6462, pruned_loss=0.6987, over 2588955.95 frames. ], batch size: 46, lr: 3.42e-02, grad_scale: 0.25 2024-06-19 14:23:46,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=37.68 vs. limit=9.15 2024-06-19 14:23:58,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=4436.666666666667, ans=0.04818055555555556 2024-06-19 14:23:58,531 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=37.63 vs. 
limit=10.8275 2024-06-19 14:24:05,694 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.426e+03 1.668e+04 2.149e+04 3.437e+04 1.491e+05, threshold=4.299e+04, percent-clipped=21.0 2024-06-19 14:24:10,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.27 vs. limit=7.2275 2024-06-19 14:24:22,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=58.17 vs. limit=9.1775 2024-06-19 14:24:24,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=4491.666666666667, ans=0.025 2024-06-19 14:24:25,470 INFO [train.py:1028] (0/2) Epoch 1, batch 2450, loss[loss=1.064, simple_loss=0.6594, pruned_loss=0.7342, over 13236.00 frames. ], tot_loss[loss=1.012, simple_loss=0.6434, pruned_loss=0.6931, over 2584660.15 frames. ], batch size: 63, lr: 3.41e-02, grad_scale: 0.0625 2024-06-19 14:24:28,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=4491.666666666667, ans=0.289453125 2024-06-19 14:24:30,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=4491.666666666667, ans=0.025 2024-06-19 14:24:31,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.76 vs. limit=9.184375 2024-06-19 14:24:37,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4510.0, ans=0.28859375 2024-06-19 14:24:40,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=42.29 vs. limit=10.89625 2024-06-19 14:24:42,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=4528.333333333333, ans=0.009885144927536232 2024-06-19 14:24:43,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.89 vs. limit=10.89625 2024-06-19 14:24:49,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=72.79 vs. limit=9.205 2024-06-19 14:24:50,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=4546.666666666667, ans=0.286875 2024-06-19 14:24:51,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=10.91 2024-06-19 14:25:04,603 INFO [train.py:1028] (0/2) Epoch 1, batch 2500, loss[loss=0.966, simple_loss=0.6254, pruned_loss=0.6533, over 13210.00 frames. ], tot_loss[loss=1.007, simple_loss=0.6404, pruned_loss=0.6891, over 2588568.61 frames. ], batch size: 83, lr: 3.41e-02, grad_scale: 0.125 2024-06-19 14:25:08,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=4583.333333333333, ans=0.7958333333333333 2024-06-19 14:25:12,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.33 vs. 
limit=10.95125 2024-06-19 14:25:27,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=4620.0, ans=0.2834375 2024-06-19 14:25:28,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.45 vs. limit=10.965 2024-06-19 14:25:28,259 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.430e+03 1.773e+04 2.264e+04 3.120e+04 1.795e+05, threshold=4.529e+04, percent-clipped=14.0 2024-06-19 14:25:34,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.03 vs. limit=10.97875 2024-06-19 14:25:36,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=5.855333333333333 2024-06-19 14:25:41,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=4656.666666666667, ans=0.28171875 2024-06-19 14:25:47,226 INFO [train.py:1028] (0/2) Epoch 1, batch 2550, loss[loss=1.003, simple_loss=0.6164, pruned_loss=0.6945, over 12568.00 frames. ], tot_loss[loss=1.006, simple_loss=0.6395, pruned_loss=0.6879, over 2589291.68 frames. ], batch size: 22, lr: 3.41e-02, grad_scale: 0.125 2024-06-19 14:26:00,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.40 vs. limit=9.26 2024-06-19 14:26:01,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.71 vs. limit=9.26 2024-06-19 14:26:03,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=41.12 vs. limit=11.03375 2024-06-19 14:26:10,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4730.0, ans=0.2527 2024-06-19 14:26:20,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=4730.0, ans=0.04695833333333334 2024-06-19 14:26:23,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=9.280625 2024-06-19 14:26:28,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.55 vs. limit=11.06125 2024-06-19 14:26:29,499 INFO [train.py:1028] (0/2) Epoch 1, batch 2600, loss[loss=0.9267, simple_loss=0.5706, pruned_loss=0.6414, over 13264.00 frames. ], tot_loss[loss=1.002, simple_loss=0.6366, pruned_loss=0.6848, over 2588791.62 frames. ], batch size: 52, lr: 3.40e-02, grad_scale: 0.125 2024-06-19 14:26:39,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4785.0, ans=0.275703125 2024-06-19 14:26:46,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.75 vs. limit=11.1025 2024-06-19 14:26:47,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.01 vs. 
limit=11.1025 2024-06-19 14:26:48,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.98 vs. limit=9.30125 2024-06-19 14:26:48,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=9.30125 2024-06-19 14:26:52,196 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.214e+04 2.319e+04 2.865e+04 4.018e+04 1.762e+05, threshold=5.729e+04, percent-clipped=20.0 2024-06-19 14:26:53,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=9.308125 2024-06-19 14:27:00,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=11.129999999999999 2024-06-19 14:27:02,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=4840.0, ans=0.2516 2024-06-19 14:27:08,999 INFO [train.py:1028] (0/2) Epoch 1, batch 2650, loss[loss=0.9324, simple_loss=0.5987, pruned_loss=0.633, over 13047.00 frames. ], tot_loss[loss=0.9993, simple_loss=0.6345, pruned_loss=0.6832, over 2587771.39 frames. ], batch size: 144, lr: 3.40e-02, grad_scale: 0.03125 2024-06-19 14:27:11,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.58 vs. limit=11.14375 2024-06-19 14:27:12,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=9.321875 2024-06-19 14:27:14,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=4858.333333333333, ans=0.272265625 2024-06-19 14:27:14,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=30.71 vs. limit=9.321875 2024-06-19 14:27:17,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.26 vs. limit=9.32875 2024-06-19 14:27:26,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.43 vs. limit=9.335625 2024-06-19 14:27:27,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=4895.0, ans=0.0 2024-06-19 14:27:37,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.56 vs. 
limit=11.185 2024-06-19 14:27:38,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=4913.333333333333, ans=0.26968749999999997 2024-06-19 14:27:40,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4913.333333333333, ans=0.26968749999999997 2024-06-19 14:27:40,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=4913.333333333333, ans=0.26968749999999997 2024-06-19 14:27:50,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.95 vs. limit=7.475 2024-06-19 14:27:50,496 INFO [train.py:1028] (0/2) Epoch 1, batch 2700, loss[loss=0.9418, simple_loss=0.604, pruned_loss=0.6398, over 13239.00 frames. ], tot_loss[loss=0.9934, simple_loss=0.6307, pruned_loss=0.6789, over 2584005.94 frames. ], batch size: 89, lr: 3.39e-02, grad_scale: 0.0625 2024-06-19 14:27:52,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=4950.0, ans=0.03453125 2024-06-19 14:27:54,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=4950.0, ans=0.72675 2024-06-19 14:27:55,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=4950.0, ans=0.04604166666666667 2024-06-19 14:27:55,432 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=42.95 vs. limit=11.2125 2024-06-19 14:27:56,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=4950.0, ans=0.26796875 2024-06-19 14:27:57,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.91 vs. limit=7.484166666666667 2024-06-19 14:28:00,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=9.363125 2024-06-19 14:28:10,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=4986.666666666667, ans=0.009785507246376812 2024-06-19 14:28:11,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4986.666666666667, ans=0.2501333333333333 2024-06-19 14:28:13,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.068e+03 1.255e+04 1.591e+04 2.053e+04 1.587e+05, threshold=3.182e+04, percent-clipped=5.0 2024-06-19 14:28:21,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=5023.333333333333, ans=0.045736111111111116 2024-06-19 14:28:23,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=56.33 vs. limit=11.2675 2024-06-19 14:28:26,434 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=11.2675 2024-06-19 14:28:32,960 INFO [train.py:1028] (0/2) Epoch 1, batch 2750, loss[loss=0.9979, simple_loss=0.628, pruned_loss=0.6839, over 13304.00 frames. 
], tot_loss[loss=0.9888, simple_loss=0.627, pruned_loss=0.676, over 2581553.21 frames. ], batch size: 43, lr: 3.39e-02, grad_scale: 0.0625 2024-06-19 14:28:39,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=5041.666666666667, ans=0.263671875 2024-06-19 14:28:40,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=59.15 vs. limit=9.3975 2024-06-19 14:28:41,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.61 vs. limit=11.295 2024-06-19 14:28:48,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.49 vs. limit=9.404375 2024-06-19 14:28:59,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=5096.666666666667, ans=0.26109375 2024-06-19 14:29:02,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5096.666666666667, ans=0.26109375 2024-06-19 14:29:02,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=11.3225 2024-06-19 14:29:03,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=5115.0, ans=0.04535416666666667 2024-06-19 14:29:12,218 INFO [train.py:1028] (0/2) Epoch 1, batch 2800, loss[loss=0.8914, simple_loss=0.6084, pruned_loss=0.5872, over 10910.00 frames. ], tot_loss[loss=0.9861, simple_loss=0.626, pruned_loss=0.6737, over 2579006.80 frames. ], batch size: 304, lr: 3.39e-02, grad_scale: 0.125 2024-06-19 14:29:13,408 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.59 vs. limit=11.35 2024-06-19 14:29:20,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=5151.666666666667, ans=0.258515625 2024-06-19 14:29:27,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.52 vs. limit=7.585 2024-06-19 14:29:30,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=5170.0, ans=0.0 2024-06-19 14:29:37,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.80 vs. limit=11.3775 2024-06-19 14:29:37,677 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.341e+03 1.417e+04 1.784e+04 2.288e+04 9.876e+04, threshold=3.569e+04, percent-clipped=12.0 2024-06-19 14:29:41,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=9.445625 2024-06-19 14:29:53,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=11.41875 2024-06-19 14:29:53,813 INFO [train.py:1028] (0/2) Epoch 1, batch 2850, loss[loss=1.044, simple_loss=0.6486, pruned_loss=0.7195, over 13290.00 frames. 
], tot_loss[loss=0.9818, simple_loss=0.6241, pruned_loss=0.6701, over 2576585.95 frames. ], batch size: 49, lr: 3.38e-02, grad_scale: 0.125 2024-06-19 14:29:56,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=5225.0, ans=0.0 2024-06-19 14:30:05,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=9.46625 2024-06-19 14:30:08,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.80 vs. limit=9.46625 2024-06-19 14:30:11,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=5261.666666666667, ans=0.253359375 2024-06-19 14:30:14,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=5261.666666666667, ans=0.025 2024-06-19 14:30:17,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=5280.0, ans=0.2525 2024-06-19 14:30:18,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=5280.0, ans=0.04466666666666667 2024-06-19 14:30:21,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.50 vs. limit=11.46 2024-06-19 14:30:24,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=9.486875 2024-06-19 14:30:26,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5298.333333333333, ans=0.251640625 2024-06-19 14:30:35,780 INFO [train.py:1028] (0/2) Epoch 1, batch 2900, loss[loss=1.018, simple_loss=0.6598, pruned_loss=0.6884, over 13115.00 frames. ], tot_loss[loss=0.9753, simple_loss=0.6209, pruned_loss=0.6652, over 2584704.95 frames. ], batch size: 55, lr: 3.38e-02, grad_scale: 0.125 2024-06-19 14:30:39,555 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=9.49375 2024-06-19 14:30:43,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=9.49375 2024-06-19 14:30:43,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=9.500625 2024-06-19 14:30:49,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.84 vs. 
limit=7.6675 2024-06-19 14:30:50,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=5335.0, ans=0.7132750000000001 2024-06-19 14:30:51,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=5353.333333333333, ans=0.2803 2024-06-19 14:30:51,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=5353.333333333333, ans=0.06654166666666667 2024-06-19 14:30:52,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=3.803 2024-06-19 14:30:57,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=5353.333333333333, ans=0.025 2024-06-19 14:31:00,632 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.046e+03 2.074e+04 2.789e+04 4.836e+04 2.596e+05, threshold=5.579e+04, percent-clipped=33.0 2024-06-19 14:31:00,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=5371.666666666667, ans=0.8037166666666666 2024-06-19 14:31:00,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=5371.666666666667, ans=0.24628333333333333 2024-06-19 14:31:04,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=5371.666666666667, ans=0.044284722222222225 2024-06-19 14:31:10,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.56 vs. limit=7.695 2024-06-19 14:31:11,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.21 vs. limit=11.5425 2024-06-19 14:31:14,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.43 vs. limit=7.695 2024-06-19 14:31:15,767 INFO [train.py:1028] (0/2) Epoch 1, batch 2950, loss[loss=0.9741, simple_loss=0.6114, pruned_loss=0.6684, over 13284.00 frames. ], tot_loss[loss=0.9765, simple_loss=0.6207, pruned_loss=0.6664, over 2578311.03 frames. ], batch size: 43, lr: 3.38e-02, grad_scale: 0.0625 2024-06-19 14:31:19,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.87 vs. limit=9.528125 2024-06-19 14:31:26,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=5426.666666666667, ans=0.7100666666666667 2024-06-19 14:31:26,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=50.19 vs. 
limit=11.57 2024-06-19 14:31:28,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=5426.666666666667, ans=0.8042666666666667 2024-06-19 14:31:34,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=5445.0, ans=0.009685869565217392 2024-06-19 14:31:34,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=5445.0, ans=0.07 2024-06-19 14:31:35,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=5445.0, ans=0.025 2024-06-19 14:31:46,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=5463.333333333333, ans=9.54875 2024-06-19 14:31:50,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=5481.666666666667, ans=0.243046875 2024-06-19 14:31:59,946 INFO [train.py:1028] (0/2) Epoch 1, batch 3000, loss[loss=0.9716, simple_loss=0.615, pruned_loss=0.6641, over 13205.00 frames. ], tot_loss[loss=0.9716, simple_loss=0.6175, pruned_loss=0.663, over 2576608.42 frames. ], batch size: 59, lr: 3.37e-02, grad_scale: 0.125 2024-06-19 14:31:59,947 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 14:32:07,989 INFO [train.py:1060] (0/2) Epoch 1, validation: loss=1.03, simple_loss=0.6516, pruned_loss=0.704, over 351949.00 frames. 2024-06-19 14:32:07,989 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16599MB 2024-06-19 14:32:09,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=5500.0, ans=0.035 2024-06-19 14:32:12,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=5500.0, ans=0.009673913043478262 2024-06-19 14:32:19,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5518.333333333333, ans=0.241328125 2024-06-19 14:32:19,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=32.62 vs. limit=9.569375 2024-06-19 14:32:20,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.43 vs. limit=6.379583333333333 2024-06-19 14:32:24,049 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. limit=9.57625 2024-06-19 14:32:32,361 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.151e+04 1.896e+04 2.552e+04 3.548e+04 1.643e+05, threshold=5.104e+04, percent-clipped=9.0 2024-06-19 14:32:32,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=5555.0, ans=0.7055750000000001 2024-06-19 14:32:35,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=48.44 vs. limit=9.583124999999999 2024-06-19 14:32:38,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=50.54 vs. 
limit=9.583124999999999 2024-06-19 14:32:40,665 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.484e+00 2024-06-19 14:32:43,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=5573.333333333333, ans=8.483333333333333 2024-06-19 14:32:50,947 INFO [train.py:1028] (0/2) Epoch 1, batch 3050, loss[loss=0.9611, simple_loss=0.6066, pruned_loss=0.6577, over 13306.00 frames. ], tot_loss[loss=0.9656, simple_loss=0.6151, pruned_loss=0.6582, over 2576714.72 frames. ], batch size: 46, lr: 3.37e-02, grad_scale: 0.125 2024-06-19 14:32:51,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=5591.666666666667, ans=0.23789062500000002 2024-06-19 14:32:53,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=41.14 vs. limit=9.596875 2024-06-19 14:32:55,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.59 vs. limit=9.596875 2024-06-19 14:32:57,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.28 vs. limit=6.397916666666667 2024-06-19 14:32:59,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5610.0, ans=0.23703125000000003 2024-06-19 14:33:02,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=9.60375 2024-06-19 14:33:02,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=5610.0, ans=0.025 2024-06-19 14:33:07,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=5628.333333333333, ans=8.517708333333333 2024-06-19 14:33:12,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=9.610625 2024-06-19 14:33:13,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.23 vs. limit=11.721250000000001 2024-06-19 14:33:13,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=5628.333333333333, ans=0.009646014492753624 2024-06-19 14:33:17,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=5646.666666666667, ans=0.23531249999999998 2024-06-19 14:33:20,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=5646.666666666667, ans=0.23531249999999998 2024-06-19 14:33:30,008 INFO [train.py:1028] (0/2) Epoch 1, batch 3100, loss[loss=0.9077, simple_loss=0.5944, pruned_loss=0.6105, over 13026.00 frames. ], tot_loss[loss=0.9601, simple_loss=0.6123, pruned_loss=0.6541, over 2577889.23 frames. 
], batch size: 144, lr: 3.36e-02, grad_scale: 0.125 2024-06-19 14:33:31,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=5683.333333333333, ans=0.23359375 2024-06-19 14:33:33,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.78 vs. limit=9.63125 2024-06-19 14:33:40,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=11.776250000000001 2024-06-19 14:33:46,000 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.67 vs. limit=11.79 2024-06-19 14:33:46,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.59 vs. limit=9.645 2024-06-19 14:33:46,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=5720.0, ans=0.231875 2024-06-19 14:33:47,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.00 vs. limit=9.645 2024-06-19 14:33:50,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.45 vs. limit=9.645 2024-06-19 14:33:55,512 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.730e+03 1.916e+04 2.485e+04 3.355e+04 2.110e+05, threshold=4.970e+04, percent-clipped=10.0 2024-06-19 14:33:57,644 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=55.51 vs. limit=9.651875 2024-06-19 14:34:06,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.41 vs. limit=11.817499999999999 2024-06-19 14:34:08,646 INFO [train.py:1028] (0/2) Epoch 1, batch 3150, loss[loss=0.8935, simple_loss=0.5929, pruned_loss=0.5971, over 12884.00 frames. ], tot_loss[loss=0.9596, simple_loss=0.6112, pruned_loss=0.6541, over 2579398.21 frames. ], batch size: 158, lr: 3.36e-02, grad_scale: 0.0625 2024-06-19 14:34:11,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=9.665625 2024-06-19 14:34:13,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.98 vs. limit=9.665625 2024-06-19 14:34:16,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=9.6725 2024-06-19 14:34:21,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=9.6725 2024-06-19 14:34:23,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=91.36 vs. 
limit=11.844999999999999 2024-06-19 14:34:32,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.06 vs. limit=9.679375 2024-06-19 14:34:34,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.26 vs. limit=6.452916666666667 2024-06-19 14:34:39,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=5830.0, ans=0.22671875000000002 2024-06-19 14:34:39,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5830.0, ans=0.2417 2024-06-19 14:34:41,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=26.28 vs. limit=11.872499999999999 2024-06-19 14:34:41,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=28.35 vs. limit=9.68625 2024-06-19 14:34:41,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.04 vs. limit=9.68625 2024-06-19 14:34:46,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=22.93 vs. limit=9.693125 2024-06-19 14:34:49,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=31.20 vs. limit=9.693125 2024-06-19 14:34:50,749 INFO [train.py:1028] (0/2) Epoch 1, batch 3200, loss[loss=0.9844, simple_loss=0.6097, pruned_loss=0.6795, over 13107.00 frames. ], tot_loss[loss=0.9585, simple_loss=0.6105, pruned_loss=0.6533, over 2579071.60 frames. ], batch size: 55, lr: 3.36e-02, grad_scale: 0.0625 2024-06-19 14:34:57,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=5866.666666666667, ans=0.09899494936611666 2024-06-19 14:34:58,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.28 vs. limit=7.9425 2024-06-19 14:35:07,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.59 vs. limit=11.9275 2024-06-19 14:35:12,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.63 vs. limit=9.713750000000001 2024-06-19 14:35:12,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.54 vs. 
limit=11.9275 2024-06-19 14:35:17,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=5921.666666666667, ans=0.222421875 2024-06-19 14:35:20,187 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.236e+03 1.850e+04 2.496e+04 3.320e+04 2.574e+05, threshold=4.993e+04, percent-clipped=12.0 2024-06-19 14:35:22,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=5921.666666666667, ans=0.222421875 2024-06-19 14:35:29,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5940.0, ans=0.24059999999999998 2024-06-19 14:35:32,925 INFO [train.py:1028] (0/2) Epoch 1, batch 3250, loss[loss=1.017, simple_loss=0.6439, pruned_loss=0.6953, over 13215.00 frames. ], tot_loss[loss=0.9562, simple_loss=0.6095, pruned_loss=0.6515, over 2583116.04 frames. ], batch size: 72, lr: 3.35e-02, grad_scale: 0.0625 2024-06-19 14:35:42,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=5976.666666666667, ans=0.0 2024-06-19 14:35:45,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=5976.666666666667, ans=0.21984375 2024-06-19 14:35:49,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=5995.0, ans=0.218984375 2024-06-19 14:36:04,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.79 vs. limit=12.02375 2024-06-19 14:36:09,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6031.666666666667, ans=0.23968333333333333 2024-06-19 14:36:12,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=27.76 vs. limit=9.76875 2024-06-19 14:36:12,770 INFO [train.py:1028] (0/2) Epoch 1, batch 3300, loss[loss=0.9169, simple_loss=0.6007, pruned_loss=0.6165, over 12811.00 frames. ], tot_loss[loss=0.9546, simple_loss=0.6074, pruned_loss=0.651, over 2580933.78 frames. ], batch size: 176, lr: 3.35e-02, grad_scale: 0.125 2024-06-19 14:36:20,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=6068.333333333333, ans=0.21554687500000003 2024-06-19 14:36:22,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6068.333333333333, ans=0.23931666666666668 2024-06-19 14:36:23,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.39 vs. limit=6.517083333333333 2024-06-19 14:36:29,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=6068.333333333333, ans=0.031036458333333336 2024-06-19 14:36:32,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. 
limit=3.9130000000000003 2024-06-19 14:36:32,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=6086.666666666667, ans=0.21468749999999998 2024-06-19 14:36:33,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=23.59 vs. limit=9.7825 2024-06-19 14:36:35,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.67 vs. limit=12.065000000000001 2024-06-19 14:36:37,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=24.85 vs. limit=9.7825 2024-06-19 14:36:38,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6105.0, ans=0.23895 2024-06-19 14:36:41,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=71.27 vs. limit=12.07875 2024-06-19 14:36:42,063 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.598e+03 1.394e+04 2.186e+04 4.132e+04 2.733e+05, threshold=4.373e+04, percent-clipped=17.0 2024-06-19 14:36:44,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=6105.0, ans=0.009542391304347825 2024-06-19 14:36:46,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=6123.333333333333, ans=0.04115277777777778 2024-06-19 14:36:54,214 INFO [train.py:1028] (0/2) Epoch 1, batch 3350, loss[loss=0.8958, simple_loss=0.5932, pruned_loss=0.5991, over 12937.00 frames. ], tot_loss[loss=0.9488, simple_loss=0.6055, pruned_loss=0.6461, over 2576730.96 frames. ], batch size: 158, lr: 3.34e-02, grad_scale: 0.125 2024-06-19 14:36:56,072 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.125e+02 2024-06-19 14:37:00,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=6141.666666666667, ans=9.803125 2024-06-19 14:37:02,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=6160.0, ans=0.6844 2024-06-19 14:37:05,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.87 vs. limit=12.120000000000001 2024-06-19 14:37:05,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.07 vs. limit=9.81 2024-06-19 14:37:11,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=6178.333333333333, ans=0.210390625 2024-06-19 14:37:12,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.71 vs. limit=8.089166666666667 2024-06-19 14:37:14,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.32 vs. 
limit=12.13375 2024-06-19 14:37:16,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=6196.666666666667, ans=0.0 2024-06-19 14:37:18,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=6196.666666666667, ans=0.6831166666666667 2024-06-19 14:37:28,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=6215.0, ans=0.208671875 2024-06-19 14:37:36,685 INFO [train.py:1028] (0/2) Epoch 1, batch 3400, loss[loss=1.059, simple_loss=0.6372, pruned_loss=0.7406, over 12583.00 frames. ], tot_loss[loss=0.9452, simple_loss=0.6034, pruned_loss=0.6435, over 2575094.00 frames. ], batch size: 22, lr: 3.34e-02, grad_scale: 0.0625 2024-06-19 14:37:37,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.68 vs. limit=9.8375 2024-06-19 14:37:56,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=12.2025 2024-06-19 14:37:58,579 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.296e-02 2024-06-19 14:38:04,334 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.282e+03 2.316e+04 2.958e+04 4.175e+04 3.129e+05, threshold=5.917e+04, percent-clipped=22.0 2024-06-19 14:38:05,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.34 vs. limit=9.858125 2024-06-19 14:38:08,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=6.522666666666667 2024-06-19 14:38:10,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.71 vs. limit=12.23 2024-06-19 14:38:15,196 INFO [train.py:1028] (0/2) Epoch 1, batch 3450, loss[loss=0.9378, simple_loss=0.6257, pruned_loss=0.625, over 12719.00 frames. ], tot_loss[loss=0.9432, simple_loss=0.6019, pruned_loss=0.6423, over 2575917.34 frames. ], batch size: 177, lr: 3.34e-02, grad_scale: 0.0625 2024-06-19 14:38:19,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.22 vs. 
limit=9.871875 2024-06-19 14:38:25,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=6343.333333333333, ans=0.20265624999999998 2024-06-19 14:38:29,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=6361.666666666667, ans=0.201796875 2024-06-19 14:38:34,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=6361.666666666667, ans=0.04015972222222222 2024-06-19 14:38:35,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=6361.666666666667, ans=0.201796875 2024-06-19 14:38:46,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=6380.0, ans=0.04008333333333333 2024-06-19 14:38:49,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=6398.333333333333, ans=0.200078125 2024-06-19 14:38:52,518 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=24.18 vs. limit=9.899375 2024-06-19 14:38:53,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=6398.333333333333, ans=0.200078125 2024-06-19 14:38:54,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.23 vs. limit=9.899375 2024-06-19 14:38:55,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6398.333333333333, ans=0.23601666666666665 2024-06-19 14:38:56,652 INFO [train.py:1028] (0/2) Epoch 1, batch 3500, loss[loss=0.9634, simple_loss=0.5971, pruned_loss=0.6648, over 12820.00 frames. ], tot_loss[loss=0.9414, simple_loss=0.6005, pruned_loss=0.6412, over 2574456.14 frames. ], batch size: 33, lr: 3.33e-02, grad_scale: 0.125 2024-06-19 14:38:58,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=53.14 vs. limit=9.90625 2024-06-19 14:39:00,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.25 vs. limit=6.566666666666666 2024-06-19 14:39:05,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=6435.0, ans=0.009470652173913043 2024-06-19 14:39:11,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=6453.333333333333, ans=0.1975 2024-06-19 14:39:11,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=6453.333333333333, ans=0.1975 2024-06-19 14:39:19,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=6471.666666666667, ans=0.196640625 2024-06-19 14:39:23,389 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=22.62 vs. 
limit=9.926874999999999 2024-06-19 14:39:25,183 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.867e+03 1.621e+04 1.966e+04 2.572e+04 1.665e+05, threshold=3.932e+04, percent-clipped=4.0 2024-06-19 14:39:26,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=6471.666666666667, ans=0.196640625 2024-06-19 14:39:31,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=3.9735 2024-06-19 14:39:35,234 INFO [train.py:1028] (0/2) Epoch 1, batch 3550, loss[loss=0.8576, simple_loss=0.5571, pruned_loss=0.579, over 13123.00 frames. ], tot_loss[loss=0.9402, simple_loss=0.6001, pruned_loss=0.6402, over 2576059.85 frames. ], batch size: 95, lr: 3.33e-02, grad_scale: 0.0625 2024-06-19 14:39:39,379 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.07 vs. limit=9.940625 2024-06-19 14:39:50,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=6526.666666666667, ans=0.00945072463768116 2024-06-19 14:39:57,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=6545.0, ans=0.029546875000000004 2024-06-19 14:40:00,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=6563.333333333333, ans=0.19234374999999998 2024-06-19 14:40:16,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=3.99 2024-06-19 14:40:17,098 INFO [train.py:1028] (0/2) Epoch 1, batch 3600, loss[loss=1.002, simple_loss=0.6364, pruned_loss=0.684, over 13348.00 frames. ], tot_loss[loss=0.9369, simple_loss=0.5993, pruned_loss=0.6372, over 2579983.63 frames. ], batch size: 49, lr: 3.32e-02, grad_scale: 0.125 2024-06-19 14:40:18,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=6600.0, ans=0.190625 2024-06-19 14:40:19,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.48 vs. limit=12.45 2024-06-19 14:40:21,220 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=9.504e-01 2024-06-19 14:40:21,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.24 vs. 
limit=6.65 2024-06-19 14:40:21,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=6600.0, ans=0.190625 2024-06-19 14:40:22,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=6600.0, ans=0.190625 2024-06-19 14:40:22,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=6600.0, ans=0.29900000000000004 2024-06-19 14:40:25,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=6618.333333333333, ans=0.0 2024-06-19 14:40:31,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=6618.333333333333, ans=0.03909027777777778 2024-06-19 14:40:43,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=3.73 vs. limit=8.3275 2024-06-19 14:40:45,523 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=9.995625 2024-06-19 14:40:46,426 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.863e+03 1.410e+04 1.710e+04 2.163e+04 1.056e+05, threshold=3.419e+04, percent-clipped=4.0 2024-06-19 14:40:46,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=15.09 vs. limit=6.66375 2024-06-19 14:40:51,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=6673.333333333333, ans=0.1871875 2024-06-19 14:40:54,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=6673.333333333333, ans=0.009418840579710146 2024-06-19 14:40:55,536 INFO [train.py:1028] (0/2) Epoch 1, batch 3650, loss[loss=0.8531, simple_loss=0.5449, pruned_loss=0.5807, over 13060.00 frames. ], tot_loss[loss=0.9383, simple_loss=0.5993, pruned_loss=0.6387, over 2579588.05 frames. ], batch size: 102, lr: 3.32e-02, grad_scale: 0.0625 2024-06-19 14:41:00,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=73.86 vs. limit=10.009375 2024-06-19 14:41:01,802 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. limit=4.00375 2024-06-19 14:41:04,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=6710.0, ans=0.2 2024-06-19 14:41:04,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=6710.0, ans=0.09899494936611666 2024-06-19 14:41:11,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=56.37 vs. limit=10.01625 2024-06-19 14:41:11,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.42 vs. 
limit=10.01625 2024-06-19 14:41:15,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=6728.333333333333, ans=12.54625 2024-06-19 14:41:15,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=6728.333333333333, ans=12.54625 2024-06-19 14:41:20,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6728.333333333333, ans=0.23271666666666668 2024-06-19 14:41:29,882 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=2.537e-03 2024-06-19 14:41:33,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.86 vs. limit=4.01475 2024-06-19 14:41:35,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.85 vs. limit=10.036875 2024-06-19 14:41:37,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=10.036875 2024-06-19 14:41:38,002 INFO [train.py:1028] (0/2) Epoch 1, batch 3700, loss[loss=0.9281, simple_loss=0.5939, pruned_loss=0.6312, over 13203.00 frames. ], tot_loss[loss=0.9328, simple_loss=0.5959, pruned_loss=0.6349, over 2584313.54 frames. ], batch size: 72, lr: 3.31e-02, grad_scale: 0.125 2024-06-19 14:41:38,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6783.333333333333, ans=0.18203124999999998 2024-06-19 14:41:40,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.98 vs. limit=10.04375 2024-06-19 14:41:46,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=6801.666666666667, ans=0.03832638888888889 2024-06-19 14:41:51,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=34.17 vs. limit=12.60125 2024-06-19 14:41:56,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=6820.0, ans=0.009386956521739131 2024-06-19 14:41:57,960 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=17.19 vs. limit=10.057500000000001 2024-06-19 14:41:58,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=6820.0, ans=0.1803125 2024-06-19 14:41:58,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6820.0, ans=0.1803125 2024-06-19 14:41:59,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=6820.0, ans=0.1803125 2024-06-19 14:42:03,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.37 vs. 
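
The "Whitening:" entries compare a metric of how white (isotropic) a module's output covariance is against a scheduled limit; a value of 1.0 means the per-group covariance has equal eigenvalues. Below is a sketch of a metric with that fixed point. This is the general form the log implies, not necessarily the exact expression in scaling.py.

# Sketch of a whitening metric: 1.0 when each group's feature covariance has
# equal eigenvalues, growing as the eigenvalue spread grows.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    # x: (num_frames, num_channels); channels split into num_groups groups
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups  # channels per group
    x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)                  # center
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames
    eigs = torch.linalg.eigvalsh(cov)                    # (num_groups, cpg)
    # cpg * sum(eig^2) / sum(eig)^2 == 1.0 iff all eigenvalues are equal
    metric = cpg * (eigs ** 2).sum(dim=1) / (eigs.sum(dim=1) ** 2)
    return metric.mean()

x = torch.randn(10000, 384)        # nearly white: metric only slightly above 1
print(whitening_metric(x, num_groups=1))
x[:, 0] *= 30.0                    # one dominant direction: metric >> 1
print(whitening_metric(x, num_groups=1))
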
limit=12.62875 2024-06-19 14:42:09,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=6838.333333333333, ans=0.038173611111111116 2024-06-19 14:42:11,653 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.024e+03 1.385e+04 1.736e+04 2.252e+04 1.767e+05, threshold=3.473e+04, percent-clipped=14.0 2024-06-19 14:42:12,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=6856.666666666667, ans=0.03809722222222223 2024-06-19 14:42:15,842 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.986e-02 2024-06-19 14:42:16,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.46 vs. limit=8.428333333333335 2024-06-19 14:42:20,012 INFO [train.py:1028] (0/2) Epoch 1, batch 3750, loss[loss=0.9969, simple_loss=0.6054, pruned_loss=0.6942, over 12642.00 frames. ], tot_loss[loss=0.93, simple_loss=0.5949, pruned_loss=0.6326, over 2586821.45 frames. ], batch size: 22, lr: 3.31e-02, grad_scale: 0.0625 2024-06-19 14:42:25,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=6875.0, ans=0.009375 2024-06-19 14:42:26,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=5.375 2024-06-19 14:42:36,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=6911.666666666667, ans=0.03786805555555556 2024-06-19 14:42:37,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=63.10 vs. limit=12.68375 2024-06-19 14:42:40,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6911.666666666667, ans=0.23088333333333333 2024-06-19 14:42:45,792 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=34.46 vs. limit=6.7325 2024-06-19 14:42:56,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=6948.333333333333, ans=0.17429687500000002 2024-06-19 14:42:58,012 INFO [train.py:1028] (0/2) Epoch 1, batch 3800, loss[loss=0.8606, simple_loss=0.5548, pruned_loss=0.5832, over 13245.00 frames. ], tot_loss[loss=0.9321, simple_loss=0.5956, pruned_loss=0.6343, over 2583853.76 frames. ], batch size: 83, lr: 3.31e-02, grad_scale: 0.125 2024-06-19 14:43:00,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=4.045 2024-06-19 14:43:07,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=6985.0, ans=0.17257812500000003 2024-06-19 14:43:11,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.34 vs. 
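
Each batch line reports loss, simple_loss and pruned_loss. Past warm-up the totals are consistent with loss = simple_loss_scale * simple_loss + pruned_loss, using simple_loss_scale=0.5 from the config (0.5 * 0.5971 + 0.6648 = 0.9634 for the batch-3500 sample above). A sketch follows; the pre-warm-up ramp is an assumption, not something this stretch of the log shows.

# Sketch of how the logged loss relates to its parts after warm-up
# (warm_step=2000 and simple_loss_scale=0.5 in this run's config).
def combine_losses(simple_loss, pruned_loss, batch_idx_train,
                   simple_loss_scale=0.5, warm_step=2000):
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        # assumed ramp: weight the simple loss more heavily early on
        frac = batch_idx_train / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss

# batch 3500 above: loss=0.9634, simple_loss=0.5971, pruned_loss=0.6648
print(combine_losses(0.5971, 0.6648, batch_idx_train=3500))  # ~0.9634
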
limit=10.119375 2024-06-19 14:43:17,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=7003.333333333333, ans=0.009347101449275363 2024-06-19 14:43:21,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7021.666666666667, ans=0.22978333333333334 2024-06-19 14:43:31,790 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.430e+03 1.220e+04 1.669e+04 2.499e+04 1.319e+05, threshold=3.339e+04, percent-clipped=11.0 2024-06-19 14:43:32,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=7040.0, ans=0.009339130434782609 2024-06-19 14:43:33,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.36 vs. limit=10.14 2024-06-19 14:43:39,574 INFO [train.py:1028] (0/2) Epoch 1, batch 3850, loss[loss=0.8188, simple_loss=0.5473, pruned_loss=0.5452, over 13064.00 frames. ], tot_loss[loss=0.9302, simple_loss=0.5946, pruned_loss=0.6329, over 2583501.15 frames. ], batch size: 144, lr: 3.30e-02, grad_scale: 0.0625 2024-06-19 14:43:39,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7058.333333333333, ans=0.22941666666666666 2024-06-19 14:43:41,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7058.333333333333, ans=0.169140625 2024-06-19 14:43:48,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.25 vs. limit=12.807500000000001 2024-06-19 14:43:49,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=8.63 vs. limit=4.0615000000000006 2024-06-19 14:43:53,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=7076.666666666667, ans=0.03718055555555556 2024-06-19 14:44:01,583 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=1.054e-02 2024-06-19 14:44:03,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=23.39 vs. limit=8.556666666666667 2024-06-19 14:44:06,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=7.98 vs. limit=4.067 2024-06-19 14:44:07,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=7113.333333333333, ans=0.1665625 2024-06-19 14:44:08,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=26.73 vs. 
limit=10.1675 2024-06-19 14:44:10,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=7131.666666666667, ans=0.165703125 2024-06-19 14:44:12,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=7131.666666666667, ans=0.165703125 2024-06-19 14:44:14,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=7131.666666666667, ans=0.009319202898550726 2024-06-19 14:44:16,959 INFO [train.py:1028] (0/2) Epoch 1, batch 3900, loss[loss=0.9149, simple_loss=0.587, pruned_loss=0.6214, over 13204.00 frames. ], tot_loss[loss=0.9308, simple_loss=0.5945, pruned_loss=0.6335, over 2586568.32 frames. ], batch size: 83, lr: 3.30e-02, grad_scale: 0.125 2024-06-19 14:44:21,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=7150.0, ans=0.036875000000000005 2024-06-19 14:44:24,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=7168.333333333333, ans=0.6491083333333334 2024-06-19 14:44:36,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.38 vs. limit=12.89 2024-06-19 14:44:51,387 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.119e+03 1.522e+04 1.834e+04 2.422e+04 1.135e+05, threshold=3.668e+04, percent-clipped=10.0 2024-06-19 14:44:58,967 INFO [train.py:1028] (0/2) Epoch 1, batch 3950, loss[loss=0.8599, simple_loss=0.5622, pruned_loss=0.5788, over 13112.00 frames. ], tot_loss[loss=0.9296, simple_loss=0.5937, pruned_loss=0.6327, over 2587883.35 frames. ], batch size: 132, lr: 3.29e-02, grad_scale: 0.125 2024-06-19 14:45:19,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=7278.333333333333, ans=0.158828125 2024-06-19 14:45:19,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.25 vs. limit=10.229375000000001 2024-06-19 14:45:21,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.64 vs. limit=12.9725 2024-06-19 14:45:22,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=7296.666666666667, ans=0.17703333333333332 2024-06-19 14:45:29,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.17 vs. limit=12.98625 2024-06-19 14:45:36,951 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-4000.pt 2024-06-19 14:45:43,061 INFO [train.py:1028] (0/2) Epoch 1, batch 4000, loss[loss=0.9064, simple_loss=0.5647, pruned_loss=0.624, over 12900.00 frames. ], tot_loss[loss=0.9271, simple_loss=0.5928, pruned_loss=0.6307, over 2581755.57 frames. ], batch size: 39, lr: 3.29e-02, grad_scale: 0.25 2024-06-19 14:45:45,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.53 vs. 
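
The checkpoint.py line above ("Saving checkpoint to zipformer/exp/checkpoint-4000.pt") matches save_every_n=4000 from the config. A minimal sketch of that periodic save; the fields stored here are illustrative, not the trainer's actual checkpoint contents.

# Sketch of batch-count-triggered checkpointing.
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                          exp_dir=Path("zipformer/exp"), save_every_n=4000):
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    ckpt = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(ckpt, path)
    print(f"Saving checkpoint to {path}")
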
limit=13.0 2024-06-19 14:45:47,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=10.25 2024-06-19 14:45:50,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=58.53 vs. limit=10.256875 2024-06-19 14:45:51,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7351.666666666667, ans=0.22648333333333331 2024-06-19 14:45:54,015 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=13.80 vs. limit=6.837916666666667 2024-06-19 14:45:56,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=13.01375 2024-06-19 14:46:01,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=7370.0, ans=0.053937500000000006 2024-06-19 14:46:03,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=7370.0, ans=0.15453125 2024-06-19 14:46:03,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=7370.0, ans=0.035958333333333335 2024-06-19 14:46:06,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=7370.0, ans=0.15453125 2024-06-19 14:46:10,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=7388.333333333333, ans=0.153671875 2024-06-19 14:46:14,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=7388.333333333333, ans=10.270624999999999 2024-06-19 14:46:15,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7388.333333333333, ans=0.153671875 2024-06-19 14:46:16,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.75 vs. limit=6.847083333333333 2024-06-19 14:46:18,764 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.068e+04 1.900e+04 2.371e+04 3.165e+04 1.518e+05, threshold=4.741e+04, percent-clipped=16.0 2024-06-19 14:46:20,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.47 vs. limit=13.055 2024-06-19 14:46:24,589 INFO [train.py:1028] (0/2) Epoch 1, batch 4050, loss[loss=0.8691, simple_loss=0.5944, pruned_loss=0.5719, over 10859.00 frames. ], tot_loss[loss=0.9253, simple_loss=0.5918, pruned_loss=0.6294, over 2579819.87 frames. ], batch size: 304, lr: 3.28e-02, grad_scale: 0.0625 2024-06-19 14:46:28,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=7425.0, ans=0.15195312500000002 2024-06-19 14:46:28,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=24.52 vs. 
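
The lr values fall batch by batch (3.33e-02 near batch 3500 down to 3.28e-02 near batch 4050). They are consistent with base_lr=0.035 scaled by an Eden-style batch factor using lr_batches=7500 from the config; the epoch factor is ~1 during epoch 1 and is omitted below as an assumption.

# Sketch of the learning-rate schedule these "lr:" values are consistent with.
def eden_lr(batch_idx, base_lr=0.035, lr_batches=7500.0):
    return base_lr * ((batch_idx ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25

print(f"{eden_lr(3500):.2e}")  # ~3.33e-02, cf. batch 3500 above
print(f"{eden_lr(4050):.2e}")  # ~3.28e-02, cf. batch 4050
print(f"{eden_lr(5000):.2e}")  # ~3.19e-02, cf. batch 5000 below
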
limit=10.284375 2024-06-19 14:46:30,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=7425.0, ans=0.0 2024-06-19 14:46:38,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.86 vs. limit=13.0825 2024-06-19 14:46:57,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=10.305 2024-06-19 14:46:57,897 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.72 vs. limit=10.311875 2024-06-19 14:47:01,436 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.572e+02 2024-06-19 14:47:05,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=13.1375 2024-06-19 14:47:06,049 INFO [train.py:1028] (0/2) Epoch 1, batch 4100, loss[loss=0.907, simple_loss=0.5826, pruned_loss=0.6157, over 13060.00 frames. ], tot_loss[loss=0.9246, simple_loss=0.5916, pruned_loss=0.6288, over 2577309.97 frames. ], batch size: 102, lr: 3.28e-02, grad_scale: 0.125 2024-06-19 14:47:06,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.66 vs. limit=13.1375 2024-06-19 14:47:06,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=7516.666666666667, ans=0.14765625 2024-06-19 14:47:11,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.48 vs. limit=10.31875 2024-06-19 14:47:11,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=7516.666666666667, ans=0.035347222222222224 2024-06-19 14:47:14,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=7535.0, ans=0.2 2024-06-19 14:47:17,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.73 vs. limit=13.151250000000001 2024-06-19 14:47:20,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=7535.0, ans=0.009231521739130435 2024-06-19 14:47:20,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.15 vs. 
limit=6.88375 2024-06-19 14:47:23,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=7553.333333333333, ans=0.1459375 2024-06-19 14:47:23,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=7553.333333333333, ans=0.8255333333333333 2024-06-19 14:47:26,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7553.333333333333, ans=0.22446666666666665 2024-06-19 14:47:30,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=7553.333333333333, ans=0.22446666666666665 2024-06-19 14:47:35,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=59.69 vs. limit=10.339375 2024-06-19 14:47:43,042 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.462e+03 1.364e+04 2.076e+04 3.257e+04 1.638e+05, threshold=4.152e+04, percent-clipped=12.0 2024-06-19 14:47:44,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=10.34625 2024-06-19 14:47:44,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=30.43 vs. limit=10.34625 2024-06-19 14:47:45,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.35 vs. limit=13.192499999999999 2024-06-19 14:47:48,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=7590.0, ans=0.2241 2024-06-19 14:47:50,569 INFO [train.py:1028] (0/2) Epoch 1, batch 4150, loss[loss=0.8959, simple_loss=0.5679, pruned_loss=0.6119, over 13073.00 frames. ], tot_loss[loss=0.9245, simple_loss=0.5908, pruned_loss=0.6291, over 2577166.19 frames. ], batch size: 55, lr: 3.27e-02, grad_scale: 0.125 2024-06-19 14:47:52,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.90 vs. limit=10.353125 2024-06-19 14:47:54,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=16.36 vs. limit=10.353125 2024-06-19 14:47:56,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=13.20625 2024-06-19 14:47:58,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=35.48 vs. limit=10.353125 2024-06-19 14:48:05,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.08 vs. limit=8.813333333333333 2024-06-19 14:48:10,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.61 vs. limit=10.366875 2024-06-19 14:48:15,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.69 vs. 
limit=13.23375 2024-06-19 14:48:20,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.33 vs. limit=7.065333333333333 2024-06-19 14:48:24,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7663.333333333333, ans=0.0 2024-06-19 14:48:25,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=7681.666666666667, ans=0.13992187499999997 2024-06-19 14:48:29,756 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.69 vs. limit=10.380625 2024-06-19 14:48:35,746 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.090e-02 2024-06-19 14:48:36,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.77 vs. limit=13.26125 2024-06-19 14:48:38,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=28.93 vs. limit=7.072666666666667 2024-06-19 14:48:40,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.38 vs. limit=13.275 2024-06-19 14:48:40,976 INFO [train.py:1028] (0/2) Epoch 1, batch 4200, loss[loss=0.8818, simple_loss=0.573, pruned_loss=0.5954, over 13178.00 frames. ], tot_loss[loss=0.9221, simple_loss=0.5896, pruned_loss=0.6273, over 2579276.28 frames. ], batch size: 103, lr: 3.27e-02, grad_scale: 0.0625 2024-06-19 14:48:41,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7700.0, ans=0.13906249999999998 2024-06-19 14:48:51,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=13.28875 2024-06-19 14:48:52,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=48.03 vs. limit=10.394375 2024-06-19 14:48:53,717 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=10.394375 2024-06-19 14:48:55,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=7718.333333333333, ans=0.009191666666666667 2024-06-19 14:49:11,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.96 vs. limit=13.31625 2024-06-19 14:49:13,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=7773.333333333333, ans=10.415 2024-06-19 14:49:13,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7773.333333333333, ans=0.135625 2024-06-19 14:49:15,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.71 vs. 
limit=13.33 2024-06-19 14:49:16,008 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.831e+03 1.227e+04 1.611e+04 2.697e+04 3.867e+05, threshold=3.223e+04, percent-clipped=11.0 2024-06-19 14:49:16,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=7773.333333333333, ans=0.135625 2024-06-19 14:49:18,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.61 vs. limit=10.415 2024-06-19 14:49:19,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=7773.333333333333, ans=0.135625 2024-06-19 14:49:20,805 INFO [train.py:1028] (0/2) Epoch 1, batch 4250, loss[loss=0.8782, simple_loss=0.5596, pruned_loss=0.5984, over 13326.00 frames. ], tot_loss[loss=0.9184, simple_loss=0.5879, pruned_loss=0.6245, over 2580781.78 frames. ], batch size: 46, lr: 3.26e-02, grad_scale: 0.0625 2024-06-19 14:49:22,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=7791.666666666667, ans=0.134765625 2024-06-19 14:49:22,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.00 vs. limit=13.34375 2024-06-19 14:49:24,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.47 vs. limit=13.34375 2024-06-19 14:49:30,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.67 vs. limit=13.34375 2024-06-19 14:49:35,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=7810.0, ans=0.009171739130434783 2024-06-19 14:49:40,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=7828.333333333333, ans=0.133046875 2024-06-19 14:49:43,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=7828.333333333333, ans=0.03404861111111111 2024-06-19 14:49:44,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=7828.333333333333, ans=0.133046875 2024-06-19 14:49:45,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=46.07 vs. limit=10.435625 2024-06-19 14:49:46,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.61 vs. limit=10.435625 2024-06-19 14:49:52,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=7846.666666666667, ans=0.03397222222222222 2024-06-19 14:49:56,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=22.62 vs. limit=10.449375 2024-06-19 14:49:56,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.78 vs. 
limit=13.39875 2024-06-19 14:50:01,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7865.0, ans=0.13132812500000002 2024-06-19 14:50:01,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.40 vs. limit=13.39875 2024-06-19 14:50:01,762 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.08 vs. limit=10.449375 2024-06-19 14:50:02,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=4.1825 2024-06-19 14:50:02,751 INFO [train.py:1028] (0/2) Epoch 1, batch 4300, loss[loss=0.8982, simple_loss=0.5691, pruned_loss=0.6137, over 13173.00 frames. ], tot_loss[loss=0.9146, simple_loss=0.5857, pruned_loss=0.6217, over 2579973.99 frames. ], batch size: 59, lr: 3.26e-02, grad_scale: 0.0625 2024-06-19 14:50:02,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7883.333333333333, ans=0.13046875000000002 2024-06-19 14:50:02,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7883.333333333333, ans=0.0 2024-06-19 14:50:14,747 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.41 vs. limit=10.463125 2024-06-19 14:50:15,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=46.55 vs. limit=10.463125 2024-06-19 14:50:17,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.35 vs. limit=6.98 2024-06-19 14:50:18,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7920.0, ans=0.12874999999999998 2024-06-19 14:50:24,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=7920.0, ans=0.009147826086956521 2024-06-19 14:50:24,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=48.16 vs. limit=10.47 2024-06-19 14:50:25,061 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=46.16 vs. limit=10.47 2024-06-19 14:50:27,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=7938.333333333333, ans=0.0 2024-06-19 14:50:31,418 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.87 vs. 
limit=13.45375 2024-06-19 14:50:32,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=7938.333333333333, ans=0.6221583333333334 2024-06-19 14:50:33,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7956.666666666667, ans=0.22043333333333331 2024-06-19 14:50:38,628 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.302e+03 1.350e+04 1.658e+04 2.610e+04 2.186e+05, threshold=3.317e+04, percent-clipped=21.0 2024-06-19 14:50:42,935 INFO [train.py:1028] (0/2) Epoch 1, batch 4350, loss[loss=0.9364, simple_loss=0.6035, pruned_loss=0.6347, over 13145.00 frames. ], tot_loss[loss=0.9112, simple_loss=0.5847, pruned_loss=0.6189, over 2585200.14 frames. ], batch size: 59, lr: 3.26e-02, grad_scale: 0.0625 2024-06-19 14:50:45,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.42 vs. limit=4.19625 2024-06-19 14:50:49,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=107.00 vs. limit=10.490625 2024-06-19 14:51:06,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=8011.666666666667, ans=0.009127898550724638 2024-06-19 14:51:20,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.24 vs. limit=10.51125 2024-06-19 14:51:29,739 INFO [train.py:1028] (0/2) Epoch 1, batch 4400, loss[loss=0.9528, simple_loss=0.6075, pruned_loss=0.6491, over 13210.00 frames. ], tot_loss[loss=0.9109, simple_loss=0.5845, pruned_loss=0.6187, over 2584914.67 frames. ], batch size: 83, lr: 3.25e-02, grad_scale: 0.125 2024-06-19 14:51:30,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.79 vs. limit=13.55 2024-06-19 14:51:32,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.14 vs. limit=7.226666666666667 2024-06-19 14:51:36,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=35.77 vs. limit=9.033333333333333 2024-06-19 14:51:49,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=8103.333333333333, ans=0.6163833333333334 2024-06-19 14:52:03,726 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=37.63 vs. limit=9.060833333333335 2024-06-19 14:52:04,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=67.91 vs. limit=10.545625 2024-06-19 14:52:04,674 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=16.19 vs. limit=10.545625 2024-06-19 14:52:13,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.46 vs. 
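
The grad_scale in the batch lines moves between 0.0625, 0.125 and 0.25, the signature of dynamic fp16 loss scaling (use_fp16=True in the config): back off when a step overflows, grow again after a run of clean steps. PyTorch's GradScaler implements this policy; the constants below are illustrative, not this run's settings.

# Sketch of dynamic loss-scale adjustment for fp16 training.
class DynamicGradScaler:
    def __init__(self, init_scale=1.0, growth_interval=500):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= 0.5      # back off on overflow; the step is skipped
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= 2.0  # cautiously grow again
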
limit=7.035 2024-06-19 14:52:16,351 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.629e+03 1.189e+04 1.635e+04 2.249e+04 9.656e+04, threshold=3.270e+04, percent-clipped=5.0 2024-06-19 14:52:16,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.40 vs. limit=13.605 2024-06-19 14:52:21,373 INFO [train.py:1028] (0/2) Epoch 1, batch 4450, loss[loss=0.9981, simple_loss=0.619, pruned_loss=0.6886, over 12893.00 frames. ], tot_loss[loss=0.9107, simple_loss=0.5842, pruned_loss=0.6186, over 2580368.52 frames. ], batch size: 33, lr: 3.25e-02, grad_scale: 0.125 2024-06-19 14:52:26,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=41.59 vs. limit=13.61875 2024-06-19 14:52:26,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.56 vs. limit=10.559375 2024-06-19 14:52:34,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=10.56625 2024-06-19 14:52:41,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=10.573125000000001 2024-06-19 14:52:44,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=27.64 vs. limit=10.573125000000001 2024-06-19 14:52:46,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.84 vs. limit=10.58 2024-06-19 14:52:53,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=13.66 2024-06-19 14:52:55,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8231.666666666666, ans=0.21768333333333334 2024-06-19 14:52:56,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.87 vs. limit=10.586875 2024-06-19 14:53:04,420 INFO [train.py:1028] (0/2) Epoch 1, batch 4500, loss[loss=0.8786, simple_loss=0.5551, pruned_loss=0.601, over 13206.00 frames. ], tot_loss[loss=0.9096, simple_loss=0.5832, pruned_loss=0.618, over 2584657.42 frames. ], batch size: 89, lr: 3.24e-02, grad_scale: 0.125 2024-06-19 14:53:05,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=8250.0, ans=0.03229166666666667 2024-06-19 14:53:05,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=10.59375 2024-06-19 14:53:11,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.44 vs. 
limit=13.6875 2024-06-19 14:53:15,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=8268.333333333334, ans=0.6106083333333334 2024-06-19 14:53:21,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=10.6075 2024-06-19 14:53:26,738 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=7.327e-01 2024-06-19 14:53:27,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.06 vs. limit=13.715 2024-06-19 14:53:41,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=8323.333333333334, ans=0.04949747468305833 2024-06-19 14:53:46,550 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.264e+03 1.316e+04 1.672e+04 2.176e+04 6.204e+04, threshold=3.345e+04, percent-clipped=10.0 2024-06-19 14:53:48,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=8323.333333333334, ans=0.009060144927536233 2024-06-19 14:53:49,837 INFO [train.py:1028] (0/2) Epoch 1, batch 4550, loss[loss=0.9274, simple_loss=0.5846, pruned_loss=0.6351, over 13236.00 frames. ], tot_loss[loss=0.9102, simple_loss=0.5833, pruned_loss=0.6186, over 2588241.78 frames. ], batch size: 52, lr: 3.24e-02, grad_scale: 0.125 2024-06-19 14:53:52,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=8341.666666666666, ans=0.21658333333333335 2024-06-19 14:54:01,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=64.40 vs. limit=7.343999999999999 2024-06-19 14:54:02,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=8360.0, ans=0.009052173913043478 2024-06-19 14:54:03,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=8360.0, ans=0.035 2024-06-19 14:54:10,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.75 vs. limit=10.641875 2024-06-19 14:54:11,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=8378.333333333334, ans=0.125 2024-06-19 14:54:15,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=8396.666666666666, ans=0.05 2024-06-19 14:54:23,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.58 vs. limit=13.811250000000001 2024-06-19 14:54:30,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8415.0, ans=0.125 2024-06-19 14:54:33,805 INFO [train.py:1028] (0/2) Epoch 1, batch 4600, loss[loss=0.9135, simple_loss=0.5964, pruned_loss=0.6152, over 12577.00 frames. ], tot_loss[loss=0.9109, simple_loss=0.5832, pruned_loss=0.6193, over 2584615.14 frames. 
], batch size: 202, lr: 3.23e-02, grad_scale: 0.25 2024-06-19 14:54:41,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=8451.666666666666, ans=0.0 2024-06-19 14:54:47,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=8451.666666666666, ans=0.326775 2024-06-19 14:54:52,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=8470.0, ans=0.60355 2024-06-19 14:54:57,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=8488.333333333334, ans=0.125 2024-06-19 14:54:58,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=8488.333333333334, ans=0.6029083333333334 2024-06-19 14:54:59,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=17.28 vs. limit=10.683125 2024-06-19 14:54:59,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=8488.333333333334, ans=0.03129861111111111 2024-06-19 14:55:01,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.84 vs. limit=4.27325 2024-06-19 14:55:10,335 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.269e+03 1.447e+04 1.887e+04 2.301e+04 1.060e+05, threshold=3.774e+04, percent-clipped=7.0 2024-06-19 14:55:13,238 INFO [train.py:1028] (0/2) Epoch 1, batch 4650, loss[loss=0.7962, simple_loss=0.5222, pruned_loss=0.5351, over 13125.00 frames. ], tot_loss[loss=0.9092, simple_loss=0.5823, pruned_loss=0.618, over 2588198.00 frames. ], batch size: 132, lr: 3.23e-02, grad_scale: 0.25 2024-06-19 14:55:20,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8543.333333333334, ans=0.21456666666666666 2024-06-19 14:55:25,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=8543.333333333334, ans=0.32815 2024-06-19 14:55:26,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=8543.333333333334, ans=0.6009833333333334 2024-06-19 14:55:28,306 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=10.70375 2024-06-19 14:55:32,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=10.710625 2024-06-19 14:55:56,880 INFO [train.py:1028] (0/2) Epoch 1, batch 4700, loss[loss=0.977, simple_loss=0.6092, pruned_loss=0.6724, over 12318.00 frames. ], tot_loss[loss=0.9091, simple_loss=0.5825, pruned_loss=0.6178, over 2584255.33 frames. ], batch size: 25, lr: 3.22e-02, grad_scale: 0.25 2024-06-19 14:56:05,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=47.39 vs. 
limit=10.73125 2024-06-19 14:56:15,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=10.738125 2024-06-19 14:56:19,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=10.745000000000001 2024-06-19 14:56:25,799 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.171e+00 2024-06-19 14:56:25,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=8671.666666666666, ans=0.025 2024-06-19 14:56:29,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.99 vs. limit=7.167916666666667 2024-06-19 14:56:35,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=41.55 vs. limit=10.75875 2024-06-19 14:56:41,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=8690.0, ans=0.025 2024-06-19 14:56:41,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=8690.0, ans=0.030458333333333337 2024-06-19 14:56:42,692 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.233e+03 1.564e+04 2.027e+04 2.703e+04 1.479e+05, threshold=4.054e+04, percent-clipped=10.0 2024-06-19 14:56:42,727 INFO [train.py:1028] (0/2) Epoch 1, batch 4750, loss[loss=0.9021, simple_loss=0.6071, pruned_loss=0.5986, over 12577.00 frames. ], tot_loss[loss=0.9044, simple_loss=0.5808, pruned_loss=0.6141, over 2580800.03 frames. ], batch size: 202, lr: 3.22e-02, grad_scale: 0.03125 2024-06-19 14:56:47,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8708.333333333334, ans=0.21291666666666664 2024-06-19 14:56:47,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.52 vs. limit=7.177083333333334 2024-06-19 14:56:50,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=10.765625 2024-06-19 14:57:00,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=8726.666666666666, ans=0.04949747468305833 2024-06-19 14:57:00,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=8726.666666666666, ans=0.125 2024-06-19 14:57:03,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=48.57 vs. limit=10.772499999999999 2024-06-19 14:57:03,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=18.94 vs. 
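
The fractional frame counts in tot_loss (~2.58e6 frames throughout this stretch) are consistent with exponentially decayed running totals: scale the accumulated sums by (1 - 1/200) each batch before adding the new batch, where 200 matches reset_interval in the config and ~12.9k is a typical per-batch frame count (steady state 200 * 12900 = 2.58e6). The exact bookkeeping is an assumption.

# Sketch of decayed running loss totals with fractional frame counts.
class RunningLoss:
    def __init__(self, decay=200):
        self.alpha = 1.0 - 1.0 / decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum, batch_frames):
        self.loss_sum = self.loss_sum * self.alpha + batch_loss_sum
        self.frames = self.frames * self.alpha + batch_frames

    @property
    def loss(self):
        return self.loss_sum / max(self.frames, 1.0)

rl = RunningLoss()
for _ in range(2000):                 # many batches of ~12900 frames
    rl.update(batch_loss_sum=0.9 * 12900, batch_frames=12900)
print(f"{rl.frames:.2f}")             # ~2.58e+06, cf. the frame counts above
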
limit=7.490666666666666 2024-06-19 14:57:07,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=8745.0, ans=0.025 2024-06-19 14:57:08,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=8745.0, ans=0.008968478260869566 2024-06-19 14:57:14,762 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=10.779375 2024-06-19 14:57:23,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=8763.333333333334, ans=0.5932833333333334 2024-06-19 14:57:33,289 INFO [train.py:1028] (0/2) Epoch 1, batch 4800, loss[loss=0.9017, simple_loss=0.5645, pruned_loss=0.6195, over 13261.00 frames. ], tot_loss[loss=0.9025, simple_loss=0.5796, pruned_loss=0.6127, over 2576690.49 frames. ], batch size: 63, lr: 3.21e-02, grad_scale: 0.0625 2024-06-19 14:57:34,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.40 vs. limit=14.1 2024-06-19 14:57:38,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.34 vs. limit=7.2 2024-06-19 14:57:40,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=8818.333333333334, ans=0.5913583333333334 2024-06-19 14:57:41,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=8818.333333333334, ans=0.8381833333333333 2024-06-19 14:57:50,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=8836.666666666666, ans=0.029847222222222226 2024-06-19 14:58:00,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=8855.0, ans=0.029770833333333337 2024-06-19 14:58:06,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=8873.333333333334, ans=10.8275 2024-06-19 14:58:14,994 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.507e+03 1.145e+04 1.437e+04 1.934e+04 1.429e+05, threshold=2.875e+04, percent-clipped=5.0 2024-06-19 14:58:15,029 INFO [train.py:1028] (0/2) Epoch 1, batch 4850, loss[loss=0.896, simple_loss=0.5716, pruned_loss=0.6103, over 13246.00 frames. ], tot_loss[loss=0.9051, simple_loss=0.5809, pruned_loss=0.6147, over 2573928.61 frames. ], batch size: 89, lr: 3.21e-02, grad_scale: 0.0625 2024-06-19 14:58:17,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=8891.666666666666, ans=0.5887916666666667 2024-06-19 14:58:17,468 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=48.04 vs. limit=10.834375 2024-06-19 14:58:26,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=26.17 vs. 
limit=14.182500000000001 2024-06-19 14:58:27,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=8910.0, ans=0.008932608695652175 2024-06-19 14:58:32,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.76 vs. limit=10.848125 2024-06-19 14:58:33,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=40.92 vs. limit=14.19625 2024-06-19 14:58:45,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=44.24 vs. limit=10.848125 2024-06-19 14:58:53,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.82 vs. limit=9.473333333333333 2024-06-19 14:58:54,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.67 vs. limit=10.855 2024-06-19 14:59:00,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=8965.0, ans=10.0 2024-06-19 14:59:00,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=8965.0, ans=0.025 2024-06-19 14:59:04,801 INFO [train.py:1028] (0/2) Epoch 1, batch 4900, loss[loss=0.9672, simple_loss=0.6044, pruned_loss=0.665, over 13142.00 frames. ], tot_loss[loss=0.9074, simple_loss=0.5813, pruned_loss=0.6168, over 2574350.70 frames. ], batch size: 59, lr: 3.20e-02, grad_scale: 0.125 2024-06-19 14:59:19,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=9001.666666666666, ans=0.125 2024-06-19 14:59:25,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=10.8825 2024-06-19 14:59:35,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9020.0, ans=0.125 2024-06-19 14:59:38,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=42.57 vs. limit=10.889375 2024-06-19 14:59:39,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=9038.333333333334, ans=0.125 2024-06-19 14:59:44,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=5.807666666666667 2024-06-19 14:59:45,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=9056.666666666666, ans=0.33585 2024-06-19 14:59:46,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.78 vs. limit=7.264166666666666 2024-06-19 14:59:50,639 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.70 vs. 
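
The balancer entries (min_positive, max_positive, min_abs, prob) express per-channel activation constraints: the fraction of positive values and the mean magnitude of each channel should stay inside configured ranges, enforced with some probability per batch. Real balancers act by modifying gradients in the backward pass; the sketch below only turns violations into a measurable penalty, which is an assumption, and its default bounds are merely drawn from values seen in these entries.

# Sketch of the constraints a balancer expresses, as an explicit penalty.
import torch

def balancer_penalty(x, min_positive=0.05, max_positive=0.95, min_abs=0.2):
    # x: (num_frames, num_channels)
    pos_frac = (x > 0).float().mean(dim=0)   # fraction of positive values
    mean_abs = x.abs().mean(dim=0)           # mean magnitude per channel
    penalty = (
        (min_positive - pos_frac).clamp(min=0.0)
        + (pos_frac - max_positive).clamp(min=0.0)
        + (min_abs - mean_abs).clamp(min=0.0)
    )
    return penalty.sum()

x = torch.randn(1000, 256) * 0.05     # magnitudes too small
print(balancer_penalty(x))            # > 0: min_abs constraint violated
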
limit=10.89625 2024-06-19 14:59:51,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=9056.666666666666, ans=0.05 2024-06-19 14:59:53,193 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.526e+03 8.688e+03 1.131e+04 1.534e+04 7.408e+04, threshold=2.262e+04, percent-clipped=5.0 2024-06-19 14:59:53,229 INFO [train.py:1028] (0/2) Epoch 1, batch 4950, loss[loss=0.8295, simple_loss=0.5611, pruned_loss=0.5489, over 10988.00 frames. ], tot_loss[loss=0.9029, simple_loss=0.5797, pruned_loss=0.613, over 2567618.90 frames. ], batch size: 304, lr: 3.20e-02, grad_scale: 0.125 2024-06-19 14:59:59,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=9075.0, ans=0.125 2024-06-19 15:00:01,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.12 vs. limit=10.91 2024-06-19 15:00:06,585 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.36 vs. limit=10.91 2024-06-19 15:00:14,576 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 15:00:34,335 INFO [train.py:1028] (0/2) Epoch 1, batch 5000, loss[loss=0.8307, simple_loss=0.5416, pruned_loss=0.5599, over 13140.00 frames. ], tot_loss[loss=0.9026, simple_loss=0.5793, pruned_loss=0.6129, over 2572493.37 frames. ], batch size: 95, lr: 3.19e-02, grad_scale: 0.125 2024-06-19 15:00:39,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=7.666666666666666 2024-06-19 15:00:42,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=22.01 vs. limit=10.944375 2024-06-19 15:00:46,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=9185.0, ans=0.008872826086956522 2024-06-19 15:01:04,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9221.666666666666, ans=0.125 2024-06-19 15:01:19,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9240.0, ans=0.2076 2024-06-19 15:01:19,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.66 vs. limit=7.3100000000000005 2024-06-19 15:01:24,617 INFO [train.py:1028] (0/2) Epoch 1, batch 5050, loss[loss=0.9016, simple_loss=0.5729, pruned_loss=0.6152, over 13042.00 frames. ], tot_loss[loss=0.9063, simple_loss=0.5808, pruned_loss=0.6159, over 2571238.30 frames. ], batch size: 36, lr: 3.19e-02, grad_scale: 0.0625 2024-06-19 15:01:25,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.87 vs. 
limit=10.971875 2024-06-19 15:01:26,236 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.044e+03 1.222e+04 1.607e+04 2.151e+04 1.008e+05, threshold=3.214e+04, percent-clipped=21.0 2024-06-19 15:01:30,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=9258.333333333334, ans=0.125 2024-06-19 15:01:34,369 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=10.97875 2024-06-19 15:01:35,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9276.666666666666, ans=0.125 2024-06-19 15:01:35,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=9276.666666666666, ans=0.028013888888888894 2024-06-19 15:01:37,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. limit=4.3915 2024-06-19 15:01:41,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=9295.0, ans=0.027937500000000004 2024-06-19 15:01:44,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=9295.0, ans=0.125 2024-06-19 15:01:44,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=9295.0, ans=0.125 2024-06-19 15:01:45,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=14.471250000000001 2024-06-19 15:01:47,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=24.98 vs. limit=9.6475 2024-06-19 15:01:48,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=10.985625 2024-06-19 15:01:48,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.47 vs. limit=14.471250000000001 2024-06-19 15:01:51,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.73 vs. limit=14.485 2024-06-19 15:01:53,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=9313.333333333334, ans=0.125 2024-06-19 15:01:54,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=9313.333333333334, ans=0.125 2024-06-19 15:02:13,329 INFO [train.py:1028] (0/2) Epoch 1, batch 5100, loss[loss=1.005, simple_loss=0.6178, pruned_loss=0.6957, over 12869.00 frames. ], tot_loss[loss=0.9046, simple_loss=0.58, pruned_loss=0.6146, over 2566831.20 frames. 
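
The per-batch summaries above report the objective decomposed into simple_loss and pruned_loss. Past the warm-up point, the totals are consistent with a fixed 0.5 weight on the simple loss: for batch 5100 just above, 0.5 * 0.58 + 0.6146 = 0.9046, exactly the reported loss. A quick check of that reading, assuming the weighting stays fixed at the simple_loss_scale of 0.5 this run was launched with:

    # Past warm-up the reported loss should equal
    # 0.5 * simple_loss + pruned_loss (assuming simple_loss_scale = 0.5).
    for loss, simple, pruned in [(0.9046, 0.58, 0.6146),    # batch 5100
                                 (0.9025, 0.5796, 0.6127)]: # batch 4800
        assert abs(loss - (0.5 * simple + pruned)) < 5e-4

The scaling.py:214 lines track ScheduledFloat quantities: dropout probabilities, skip rates, and balancer bounds that are interpolated from a batch counter rather than held constant. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between breakpoints (the class name and breakpoints are illustrative, not icefall's actual ScheduledFloat):

    from bisect import bisect_right

    class ScheduledFloatSketch:
        """A float that is piecewise-linear in the batch counter."""

        def __init__(self, *points):
            self.points = sorted(points)  # (batch_count, value) pairs

        def value(self, batch_count: float) -> float:
            xs = [x for x, _ in self.points]
            i = bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]
            if i == len(self.points):
                return self.points[-1][1]
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)  # linear interpolation

    # E.g. a skip-rate decaying from 0.5 at batch 0 toward 0.025 later on,
    # in the spirit of the downward-drifting skip_rate values above.
    skip_rate = ScheduledFloatSketch((0.0, 0.5), (20000.0, 0.025))

Scheduling the regularizers this way lets training run with heavy stochasticity early and relax it as optimization stabilizes, which is why the skip-rate and dropout values above drift steadily downward as batch_count grows.
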
], batch size: 39, lr: 3.18e-02, grad_scale: 0.125 2024-06-19 15:02:13,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=9350.0, ans=0.20650000000000002 2024-06-19 15:02:22,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=9368.333333333334, ans=0.125 2024-06-19 15:02:28,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.30 vs. limit=11.013125 2024-06-19 15:02:31,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=13.92 vs. limit=7.346666666666666 2024-06-19 15:02:35,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=31.11 vs. limit=11.02 2024-06-19 15:02:37,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9386.666666666666, ans=0.125 2024-06-19 15:02:40,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.59 vs. limit=14.55375 2024-06-19 15:02:43,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=9405.0, ans=0.02747916666666667 2024-06-19 15:02:43,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9405.0, ans=0.125 2024-06-19 15:02:45,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=9423.333333333334, ans=0.0 2024-06-19 15:02:49,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=9423.333333333334, ans=0.125 2024-06-19 15:02:49,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=24.57 vs. limit=11.03375 2024-06-19 15:02:50,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=47.00 vs. limit=11.03375 2024-06-19 15:02:53,547 INFO [train.py:1028] (0/2) Epoch 1, batch 5150, loss[loss=0.8458, simple_loss=0.5523, pruned_loss=0.5697, over 13076.00 frames. ], tot_loss[loss=0.9012, simple_loss=0.5785, pruned_loss=0.612, over 2569877.18 frames. ], batch size: 132, lr: 3.18e-02, grad_scale: 0.0625 2024-06-19 15:02:56,008 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.970e+03 1.167e+04 1.854e+04 2.650e+04 1.531e+05, threshold=3.709e+04, percent-clipped=18.0 2024-06-19 15:02:56,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=9441.666666666666, ans=0.125 2024-06-19 15:03:02,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=9460.0, ans=11.0475 2024-06-19 15:03:07,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=24.89 vs. 
limit=9.73 2024-06-19 15:03:10,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=9478.333333333334, ans=0.027173611111111114 2024-06-19 15:03:13,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=9478.333333333334, ans=0.125 2024-06-19 15:03:13,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=11.054375 2024-06-19 15:03:21,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=9496.666666666666, ans=0.008805072463768117 2024-06-19 15:03:26,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=11.06125 2024-06-19 15:03:29,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=9515.0, ans=0.20485 2024-06-19 15:03:32,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9515.0, ans=0.125 2024-06-19 15:03:42,973 INFO [train.py:1028] (0/2) Epoch 1, batch 5200, loss[loss=0.8656, simple_loss=0.5576, pruned_loss=0.5868, over 13167.00 frames. ], tot_loss[loss=0.9025, simple_loss=0.5786, pruned_loss=0.6132, over 2573923.04 frames. ], batch size: 95, lr: 3.17e-02, grad_scale: 0.125 2024-06-19 15:03:47,897 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.63 vs. limit=14.65 2024-06-19 15:03:51,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.74 vs. limit=14.66375 2024-06-19 15:03:52,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=9551.666666666666, ans=0.008793115942028987 2024-06-19 15:03:58,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.88 vs. limit=14.6775 2024-06-19 15:04:01,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=41.83 vs. limit=11.088750000000001 2024-06-19 15:04:12,010 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=14.69125 2024-06-19 15:04:12,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.16 vs. limit=7.397083333333334 2024-06-19 15:04:17,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=9606.666666666666, ans=0.026638888888888893 2024-06-19 15:04:17,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9606.666666666666, ans=0.20393333333333336 2024-06-19 15:04:21,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.64 vs. 
limit=11.1025 2024-06-19 15:04:25,521 INFO [train.py:1028] (0/2) Epoch 1, batch 5250, loss[loss=0.8626, simple_loss=0.5555, pruned_loss=0.5848, over 13265.00 frames. ], tot_loss[loss=0.9025, simple_loss=0.5778, pruned_loss=0.6136, over 2570426.42 frames. ], batch size: 52, lr: 3.17e-02, grad_scale: 0.0625 2024-06-19 15:04:28,979 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.544e+03 1.334e+04 1.737e+04 2.229e+04 1.798e+05, threshold=3.474e+04, percent-clipped=8.0 2024-06-19 15:04:29,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=9625.0, ans=11.109375 2024-06-19 15:04:31,898 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=3.874e-01 2024-06-19 15:04:38,489 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.92 vs. limit=11.109375 2024-06-19 15:04:41,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=9643.333333333334, ans=0.0 2024-06-19 15:04:42,968 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.54 vs. limit=14.7325 2024-06-19 15:04:42,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.14 vs. limit=11.11625 2024-06-19 15:04:48,245 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.51 vs. limit=11.123125 2024-06-19 15:04:48,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.37 vs. limit=11.123125 2024-06-19 15:04:51,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=9661.666666666666, ans=0.008769202898550725 2024-06-19 15:04:53,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=11.123125 2024-06-19 15:04:58,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=9680.0, ans=0.026333333333333337 2024-06-19 15:05:01,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=18.34 vs. limit=11.129999999999999 2024-06-19 15:05:05,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=9698.333333333334, ans=9.849166666666667 2024-06-19 15:05:10,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. limit=11.136875 2024-06-19 15:05:12,962 INFO [train.py:1028] (0/2) Epoch 1, batch 5300, loss[loss=0.8558, simple_loss=0.5582, pruned_loss=0.5767, over 12988.00 frames. ], tot_loss[loss=0.901, simple_loss=0.5767, pruned_loss=0.6126, over 2566529.93 frames. 
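
The recurring Whitening: ... metric=X vs. limit=Y lines are diagnostics from modules that nudge activations toward a white (isotropic) covariance; an entry is logged when the measured metric exceeds its currently scheduled limit. A hedged sketch of one such metric, assuming it is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue, which is 1.0 for perfectly white activations and grows as the spectrum spreads (this mirrors the idea behind the Whiten module in scaling.py, not its exact code):

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """x: (num_frames, num_channels) activations from one module."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # (C, C) covariance estimate
        d = cov.shape[0]
        # d * trace(C @ C) / trace(C)**2 == mean(eig**2) / mean(eig)**2 >= 1
        return (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

    x_white = torch.randn(1000, 256)                   # metric close to 1
    x_skew = x_white * torch.linspace(0.1, 3.0, 256)   # metric well above 1
    print(whitening_metric(x_white), whitening_metric(x_skew))

Read this way, an entry such as metric=48.04 vs. limit=10.83 earlier in this epoch flags a feed-forward output whose covariance spectrum is far more spread out than the module is currently configured to tolerate; the limits themselves are ScheduledFloat values (the whitening_limit entries above) that move with batch_count.
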
], batch size: 144, lr: 3.16e-02, grad_scale: 0.125 2024-06-19 15:05:21,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.93 vs. limit=9.8675 2024-06-19 15:05:21,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=11.150625 2024-06-19 15:05:21,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=9735.0, ans=0.125 2024-06-19 15:05:25,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=11.150625 2024-06-19 15:05:42,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.37 vs. limit=14.82875 2024-06-19 15:05:44,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.69 vs. limit=11.164375 2024-06-19 15:05:45,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=11.164375 2024-06-19 15:05:48,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=37.35 vs. limit=11.17125 2024-06-19 15:05:56,299 INFO [train.py:1028] (0/2) Epoch 1, batch 5350, loss[loss=0.9631, simple_loss=0.5816, pruned_loss=0.6723, over 11101.00 frames. ], tot_loss[loss=0.8977, simple_loss=0.575, pruned_loss=0.6102, over 2572486.43 frames. ], batch size: 16, lr: 3.16e-02, grad_scale: 0.03125 2024-06-19 15:05:56,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9808.333333333334, ans=0.20191666666666666 2024-06-19 15:05:58,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=33.15 vs. limit=14.85625 2024-06-19 15:05:59,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=9808.333333333334, ans=0.125 2024-06-19 15:05:59,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.94 vs. limit=7.452083333333333 2024-06-19 15:06:01,533 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.631e+03 1.383e+04 1.863e+04 2.866e+04 2.751e+05, threshold=3.726e+04, percent-clipped=14.0 2024-06-19 15:06:14,656 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=43.87 vs. limit=11.185 2024-06-19 15:06:17,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.38 vs. 
limit=14.88375 2024-06-19 15:06:20,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=9845.0, ans=0.0 2024-06-19 15:06:24,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=9845.0, ans=0.125 2024-06-19 15:06:37,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=9881.666666666666, ans=0.02549305555555556 2024-06-19 15:06:41,165 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=86.53 vs. limit=11.205625 2024-06-19 15:06:43,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.13 vs. limit=14.925 2024-06-19 15:06:43,465 INFO [train.py:1028] (0/2) Epoch 1, batch 5400, loss[loss=0.8552, simple_loss=0.5722, pruned_loss=0.5691, over 12239.00 frames. ], tot_loss[loss=0.8946, simple_loss=0.5741, pruned_loss=0.6076, over 2564401.92 frames. ], batch size: 240, lr: 3.15e-02, grad_scale: 0.0625 2024-06-19 15:06:48,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=9900.0, ans=0.849 2024-06-19 15:06:50,998 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.78 vs. limit=14.925 2024-06-19 15:07:00,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=9918.333333333334, ans=0.125 2024-06-19 15:07:01,587 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=14.938749999999999 2024-06-19 15:07:05,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.26 vs. limit=14.938749999999999 2024-06-19 15:07:06,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=78.82 vs. limit=9.968333333333334 2024-06-19 15:07:11,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=11.22625 2024-06-19 15:07:16,503 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=33.69 vs. limit=14.966249999999999 2024-06-19 15:07:25,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=11.24 2024-06-19 15:07:32,063 INFO [train.py:1028] (0/2) Epoch 1, batch 5450, loss[loss=0.9618, simple_loss=0.5929, pruned_loss=0.6653, over 12227.00 frames. ], tot_loss[loss=0.897, simple_loss=0.5752, pruned_loss=0.6094, over 2568328.37 frames. ], batch size: 25, lr: 3.15e-02, grad_scale: 0.0625 2024-06-19 15:07:35,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. 
limit=11.246875 2024-06-19 15:07:36,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.08 vs. limit=14.99375 2024-06-19 15:07:37,627 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.314e+03 9.296e+03 1.443e+04 1.969e+04 7.796e+04, threshold=2.886e+04, percent-clipped=3.0 2024-06-19 15:07:38,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=9991.666666666666, ans=0.008697463768115941 2024-06-19 15:07:41,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=10010.0, ans=0.125 2024-06-19 15:07:42,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10010.0, ans=0.1999 2024-06-19 15:07:43,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=29.48 vs. limit=11.25375 2024-06-19 15:07:51,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10028.333333333334, ans=0.19971666666666665 2024-06-19 15:08:02,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10046.666666666666, ans=0.0 2024-06-19 15:08:15,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=10065.0, ans=15.04875 2024-06-19 15:08:16,354 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=9.337e-01 2024-06-19 15:08:17,058 INFO [train.py:1028] (0/2) Epoch 1, batch 5500, loss[loss=0.8712, simple_loss=0.5874, pruned_loss=0.5775, over 12226.00 frames. ], tot_loss[loss=0.8981, simple_loss=0.5757, pruned_loss=0.6103, over 2563178.10 frames. ], batch size: 240, lr: 3.14e-02, grad_scale: 0.125 2024-06-19 15:08:19,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=10083.333333333334, ans=0.125 2024-06-19 15:08:23,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.95 vs. limit=15.0625 2024-06-19 15:08:24,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=10101.666666666666, ans=0.125 2024-06-19 15:08:46,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=11.301874999999999 2024-06-19 15:08:58,030 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=14.63 vs. limit=7.539166666666667 2024-06-19 15:08:59,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=10156.666666666666, ans=0.04949747468305833 2024-06-19 15:09:05,232 INFO [train.py:1028] (0/2) Epoch 1, batch 5550, loss[loss=0.9009, simple_loss=0.5587, pruned_loss=0.6216, over 13231.00 frames. ], tot_loss[loss=0.8998, simple_loss=0.576, pruned_loss=0.6118, over 2565807.97 frames. 
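
The optim.py:487 warnings summarize the recent distribution of gradient norms (min, the three quartiles, max), the clipping threshold in force, and the share of batches clipped. A sketch of that bookkeeping, under the assumption that the threshold is a multiple of a running quantile over a sliding window; the window size and the exact rule here are assumptions, and the production logic lives in icefall's ScaledAdam optimizer:

    from collections import deque

    import torch

    class GradNormClipper:
        """Track recent grad norms; clip against a median-based threshold."""

        def __init__(self, window: int = 200, clipping_scale: float = 2.0):
            self.norms = deque(maxlen=window)       # recent per-batch norms
            self.clipping_scale = clipping_scale    # cf. Clipping_scale=2.0
            self.clipped = 0
            self.seen = 0

        def step(self, params) -> None:
            params = [p for p in params if p.grad is not None]
            norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
            self.norms.append(norm.item())
            self.seen += 1
            median = sorted(self.norms)[len(self.norms) // 2]
            threshold = self.clipping_scale * median
            if norm.item() > threshold:
                self.clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm)

        def report(self):
            s, n = sorted(self.norms), len(self.norms)
            quartiles = (s[0], s[n // 4], s[n // 2], s[3 * n // 4], s[-1])
            return quartiles, 100.0 * self.clipped / max(self.seen, 1)

The lr column decays slowly as well (3.21e-02 around batch 4800 down to 3.15e-02 by batch 5450), consistent with icefall's Eden schedule, which damps a base learning rate by both batch and epoch counts. The formula below is the standard Eden rule with the base_lr=0.035, lr_batches=7500 and lr_epochs=3.5 this job was launched with; treat it as an approximation of this run's scheduler rather than a quote of it:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        batch_f = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_f = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_f * epoch_f

    # Around batch 5450 of epoch 1 this gives about 3.1e-02, the same
    # ballpark as the lr: 3.15e-02 reported above (the training loop's
    # exact batch/epoch accounting differs slightly).
    print(f"{eden_lr(0.035, 5450, 1.0):.2e}")
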
], batch size: 43, lr: 3.14e-02, grad_scale: 0.125 2024-06-19 15:09:10,036 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.765e+03 8.437e+03 1.109e+04 1.602e+04 8.377e+04, threshold=2.218e+04, percent-clipped=4.0 2024-06-19 15:09:20,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=41.87 vs. limit=11.3225 2024-06-19 15:09:24,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10211.666666666666, ans=0.19788333333333336 2024-06-19 15:09:28,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=10211.666666666666, ans=0.125 2024-06-19 15:09:35,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.99 vs. limit=15.1725 2024-06-19 15:09:42,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=26.35 vs. limit=11.343125 2024-06-19 15:09:43,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.23 vs. limit=7.562083333333334 2024-06-19 15:09:50,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=10266.666666666666, ans=0.354 2024-06-19 15:09:50,906 INFO [train.py:1028] (0/2) Epoch 1, batch 5600, loss[loss=0.8475, simple_loss=0.5518, pruned_loss=0.5716, over 13214.00 frames. ], tot_loss[loss=0.8961, simple_loss=0.5741, pruned_loss=0.6091, over 2569005.23 frames. ], batch size: 89, lr: 3.13e-02, grad_scale: 0.25 2024-06-19 15:09:59,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=30.86 vs. limit=11.356875 2024-06-19 15:09:59,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=10285.0, ans=0.125 2024-06-19 15:10:06,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=10285.0, ans=0.008633695652173912 2024-06-19 15:10:09,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=10303.333333333334, ans=0.125 2024-06-19 15:10:13,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.62 vs. limit=11.36375 2024-06-19 15:10:17,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=27.47 vs. limit=11.370625 2024-06-19 15:10:20,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=40.20 vs. limit=11.370625 2024-06-19 15:10:28,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.81 vs. limit=8.136 2024-06-19 15:10:28,389 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.33 vs. 
limit=11.3775 2024-06-19 15:10:33,535 INFO [train.py:1028] (0/2) Epoch 1, batch 5650, loss[loss=0.899, simple_loss=0.6019, pruned_loss=0.5981, over 12595.00 frames. ], tot_loss[loss=0.8994, simple_loss=0.5751, pruned_loss=0.6118, over 2574419.12 frames. ], batch size: 203, lr: 3.13e-02, grad_scale: 0.0625 2024-06-19 15:10:40,212 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.543e+03 1.199e+04 1.630e+04 2.347e+04 1.398e+05, threshold=3.261e+04, percent-clipped=27.0 2024-06-19 15:10:45,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=10376.666666666666, ans=0.125 2024-06-19 15:10:45,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.08 vs. limit=4.5565 2024-06-19 15:10:51,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=10395.0, ans=0.125 2024-06-19 15:10:53,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=10395.0, ans=0.008609782608695653 2024-06-19 15:10:59,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=10413.333333333334, ans=0.5355333333333334 2024-06-19 15:11:00,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.28 vs. limit=8.165333333333333 2024-06-19 15:11:01,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=10413.333333333334, ans=0.125 2024-06-19 15:11:06,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=10431.666666666666, ans=0.023201388888888893 2024-06-19 15:11:11,334 INFO [train.py:1028] (0/2) Epoch 1, batch 5700, loss[loss=0.9514, simple_loss=0.6135, pruned_loss=0.6446, over 13278.00 frames. ], tot_loss[loss=0.8964, simple_loss=0.5739, pruned_loss=0.6095, over 2579069.68 frames. ], batch size: 63, lr: 3.12e-02, grad_scale: 0.125 2024-06-19 15:11:12,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=10450.0, ans=0.125 2024-06-19 15:11:36,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=10505.0, ans=0.125 2024-06-19 15:11:43,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=11.44625 2024-06-19 15:11:51,998 INFO [train.py:1028] (0/2) Epoch 1, batch 5750, loss[loss=0.8536, simple_loss=0.5689, pruned_loss=0.5692, over 12731.00 frames. ], tot_loss[loss=0.8955, simple_loss=0.5747, pruned_loss=0.6082, over 2578758.93 frames. ], batch size: 176, lr: 3.12e-02, grad_scale: 0.0625 2024-06-19 15:12:01,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.03 vs. limit=10.270833333333332 2024-06-19 15:12:02,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=17.33 vs. 
limit=11.453125 2024-06-19 15:12:04,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=10541.666666666666, ans=0.008577898550724638 2024-06-19 15:12:05,615 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.521e+03 1.315e+04 1.752e+04 2.484e+04 1.271e+05, threshold=3.503e+04, percent-clipped=9.0 2024-06-19 15:12:07,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=10560.0, ans=7.640000000000001 2024-06-19 15:12:09,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=10560.0, ans=0.008573913043478262 2024-06-19 15:12:11,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10560.0, ans=0.1944 2024-06-19 15:12:19,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=10578.333333333334, ans=0.125 2024-06-19 15:12:19,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=10578.333333333334, ans=0.19421666666666665 2024-06-19 15:12:35,158 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=37.47 vs. limit=11.480625 2024-06-19 15:12:41,588 INFO [train.py:1028] (0/2) Epoch 1, batch 5800, loss[loss=0.8693, simple_loss=0.5809, pruned_loss=0.5789, over 12777.00 frames. ], tot_loss[loss=0.8947, simple_loss=0.5757, pruned_loss=0.6068, over 2579048.45 frames. ], batch size: 176, lr: 3.11e-02, grad_scale: 0.125 2024-06-19 15:12:42,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=36.63 vs. limit=11.4875 2024-06-19 15:12:45,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=11.4875 2024-06-19 15:12:46,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=10633.333333333334, ans=0.09899494936611666 2024-06-19 15:12:50,082 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.58 vs. limit=15.48875 2024-06-19 15:12:52,021 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=27.53 vs. limit=11.494375 2024-06-19 15:12:57,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=10651.666666666666, ans=0.022284722222222227 2024-06-19 15:13:02,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.12 vs. limit=15.5025 2024-06-19 15:13:16,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.73 vs. limit=11.515 2024-06-19 15:13:18,977 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.18 vs. 
limit=15.53 2024-06-19 15:13:23,917 INFO [train.py:1028] (0/2) Epoch 1, batch 5850, loss[loss=0.9281, simple_loss=0.6181, pruned_loss=0.619, over 12459.00 frames. ], tot_loss[loss=0.9005, simple_loss=0.5794, pruned_loss=0.6108, over 2577115.23 frames. ], batch size: 202, lr: 3.11e-02, grad_scale: 0.0625 2024-06-19 15:13:28,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.60 vs. limit=11.521875 2024-06-19 15:13:29,167 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.45 vs. limit=11.521875 2024-06-19 15:13:31,595 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.146e+03 1.160e+04 1.714e+04 2.777e+04 1.818e+05, threshold=3.429e+04, percent-clipped=10.0 2024-06-19 15:13:39,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=10743.333333333334, ans=0.008534057971014492 2024-06-19 15:13:40,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=10743.333333333334, ans=0.125 2024-06-19 15:13:44,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=10761.666666666666, ans=0.0 2024-06-19 15:14:02,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=10798.333333333334, ans=0.025 2024-06-19 15:14:11,442 INFO [train.py:1028] (0/2) Epoch 1, batch 5900, loss[loss=0.8972, simple_loss=0.588, pruned_loss=0.6033, over 13184.00 frames. ], tot_loss[loss=0.9087, simple_loss=0.5846, pruned_loss=0.6164, over 2577974.49 frames. ], batch size: 121, lr: 3.10e-02, grad_scale: 0.125 2024-06-19 15:14:11,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.89 vs. limit=11.55625 2024-06-19 15:14:14,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.30 vs. limit=11.55625 2024-06-19 15:14:20,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=10835.0, ans=0.00851413043478261 2024-06-19 15:14:21,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.31 vs. limit=4.62525 2024-06-19 15:14:21,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.45 vs. limit=7.70875 2024-06-19 15:14:26,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=10835.0, ans=0.19165 2024-06-19 15:14:40,038 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=25.44 vs. limit=11.57 2024-06-19 15:14:42,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.96 vs. 
limit=11.576875000000001 2024-06-19 15:14:50,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.14 vs. limit=15.65375 2024-06-19 15:14:50,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.02 vs. limit=15.65375 2024-06-19 15:14:52,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.66 vs. limit=11.58375 2024-06-19 15:14:55,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.69 vs. limit=7.7225 2024-06-19 15:14:55,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.26 vs. limit=11.58375 2024-06-19 15:14:57,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=10890.0, ans=0.125 2024-06-19 15:14:58,922 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.83 vs. limit=11.58375 2024-06-19 15:15:01,126 INFO [train.py:1028] (0/2) Epoch 1, batch 5950, loss[loss=0.8257, simple_loss=0.5466, pruned_loss=0.5524, over 13071.00 frames. ], tot_loss[loss=0.9136, simple_loss=0.5881, pruned_loss=0.6195, over 2582919.64 frames. ], batch size: 121, lr: 3.10e-02, grad_scale: 0.125 2024-06-19 15:15:03,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.77 vs. limit=15.68125 2024-06-19 15:15:05,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=10908.333333333334, ans=0.008498188405797101 2024-06-19 15:15:09,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=10908.333333333334, ans=0.09899494936611666 2024-06-19 15:15:10,676 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.148e+03 1.226e+04 1.519e+04 1.920e+04 4.746e+04, threshold=3.037e+04, percent-clipped=5.0 2024-06-19 15:15:12,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.14 vs. limit=15.695 2024-06-19 15:15:16,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.61 vs. limit=11.5975 2024-06-19 15:15:25,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. limit=4.64175 2024-06-19 15:15:30,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=10963.333333333334, ans=0.125 2024-06-19 15:15:35,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.73 vs. limit=11.61125 2024-06-19 15:15:43,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=28.61 vs. 
limit=11.618125 2024-06-19 15:15:44,277 INFO [train.py:1028] (0/2) Epoch 1, batch 6000, loss[loss=0.95, simple_loss=0.6352, pruned_loss=0.6324, over 12202.00 frames. ], tot_loss[loss=0.9184, simple_loss=0.5915, pruned_loss=0.6226, over 2575342.26 frames. ], batch size: 240, lr: 3.09e-02, grad_scale: 0.25 2024-06-19 15:15:44,278 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 15:15:54,152 INFO [train.py:1060] (0/2) Epoch 1, validation: loss=0.9935, simple_loss=0.6369, pruned_loss=0.6751, over 351949.00 frames. 2024-06-19 15:15:54,153 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB 2024-06-19 15:16:13,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=21.93 vs. limit=11.63875 2024-06-19 15:16:36,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=11073.333333333334, ans=0.0 2024-06-19 15:16:36,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=11073.333333333334, ans=0.125 2024-06-19 15:16:44,892 INFO [train.py:1028] (0/2) Epoch 1, batch 6050, loss[loss=0.9961, simple_loss=0.6298, pruned_loss=0.6812, over 13279.00 frames. ], tot_loss[loss=0.9241, simple_loss=0.5947, pruned_loss=0.6268, over 2577569.10 frames. ], batch size: 40, lr: 3.09e-02, grad_scale: 0.125 2024-06-19 15:16:45,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=11091.666666666666, ans=0.020451388888888894 2024-06-19 15:16:49,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=26.87 vs. limit=11.659375 2024-06-19 15:16:53,446 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.967e+03 1.562e+04 2.095e+04 3.031e+04 1.028e+05, threshold=4.190e+04, percent-clipped=23.0 2024-06-19 15:17:10,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11146.666666666666, ans=0.18853333333333333 2024-06-19 15:17:10,875 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.36 vs. limit=11.68 2024-06-19 15:17:28,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=11165.0, ans=0.07 2024-06-19 15:17:33,828 INFO [train.py:1028] (0/2) Epoch 1, batch 6100, loss[loss=0.8863, simple_loss=0.5883, pruned_loss=0.5922, over 13059.00 frames. ], tot_loss[loss=0.9279, simple_loss=0.5973, pruned_loss=0.6293, over 2580339.20 frames. ], batch size: 121, lr: 3.08e-02, grad_scale: 0.25 2024-06-19 15:17:36,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.23 vs. 
limit=15.8875 2024-06-19 15:17:46,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=11201.666666666666, ans=0.125 2024-06-19 15:17:50,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=11201.666666666666, ans=0.125 2024-06-19 15:17:54,499 WARNING [optim.py:503] (0/2) Scaling gradients by 0.07710374146699905, model_norm_threshold=41897.69140625 2024-06-19 15:17:54,692 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.4.weight with proportion 0.44, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.304e+11, grad_sumsq=7.881e+10, orig_rms_sq=1.654e+00 2024-06-19 15:17:59,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.32 vs. limit=11.7075 2024-06-19 15:17:59,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=11238.333333333334, ans=0.019840277777777773 2024-06-19 15:18:12,092 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.17 vs. limit=11.721250000000001 2024-06-19 15:18:15,380 INFO [train.py:1028] (0/2) Epoch 1, batch 6150, loss[loss=0.893, simple_loss=0.6084, pruned_loss=0.5887, over 10860.00 frames. ], tot_loss[loss=0.9351, simple_loss=0.6014, pruned_loss=0.6344, over 2579591.89 frames. ], batch size: 304, lr: 3.08e-02, grad_scale: 0.03125 2024-06-19 15:18:18,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.80 vs. limit=11.728125 2024-06-19 15:18:27,527 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=3.805e-01 2024-06-19 15:18:28,103 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.390e+03 1.395e+04 1.990e+04 2.832e+04 5.434e+05, threshold=3.980e+04, percent-clipped=9.0 2024-06-19 15:18:30,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=11293.333333333334, ans=0.125 2024-06-19 15:18:32,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.95 vs. limit=15.98375 2024-06-19 15:18:35,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=11311.666666666666, ans=0.125 2024-06-19 15:18:49,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=45.22 vs. limit=11.748750000000001 2024-06-19 15:18:53,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.01 vs. limit=11.755625 2024-06-19 15:19:07,375 INFO [train.py:1028] (0/2) Epoch 1, batch 6200, loss[loss=0.9892, simple_loss=0.6538, pruned_loss=0.6623, over 13202.00 frames. ], tot_loss[loss=0.9419, simple_loss=0.6049, pruned_loss=0.6394, over 2575615.59 frames. 
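
The optim.py:503/575 warning pair just above fires when the rms-weighted gradient norm overshoots model_norm_threshold: every gradient is rescaled (here by about 0.077) and the parameter contributing the largest share of the total sum of squares is named, in this case module.encoder_embed.conv.4.weight at proportion 0.44. A sketch of that diagnostic, with each parameter's contribution weighted by its mean square to match the grad_sumsq * orig_rms_sq quantities printed above (the function name and structure are illustrative):

    import torch

    def scale_and_diagnose(model: torch.nn.Module,
                           model_norm_threshold: float) -> None:
        params = [(n, p) for n, p in model.named_parameters()
                  if p.grad is not None]
        # Contribution of each parameter: grad_sumsq * orig_rms_sq.
        contrib = {n: (p.grad ** 2).sum().item() * (p ** 2).mean().item()
                   for n, p in params}
        tot_sumsq = sum(contrib.values())
        model_norm = tot_sumsq ** 0.5
        if model_norm > model_norm_threshold:
            scale = model_norm_threshold / model_norm
            for _, p in params:
                p.grad.mul_(scale)
            name, dom = max(contrib.items(), key=lambda kv: kv[1])
            print(f"Scaling gradients by {scale}, parameter dominating "
                  f"tot_sumsq {name} with proportion {dom / tot_sumsq:.2f}")

The grad_scale column in the batch summaries (0.0625, 0.125, 0.25, ...) is fp16 loss-scale bookkeeping: the scale is halved when a batch overflows and grown back after a run of clean steps, so it walks up and down a power-of-two ladder. A minimal sketch of that policy; torch.cuda.amp.GradScaler implements the production version:

    class LossScalerSketch:
        def __init__(self, scale: float = 1.0, growth_interval: int = 2000):
            self.scale = scale
            self.growth_interval = growth_interval
            self.good_steps = 0

        def update(self, found_inf: bool) -> None:
            if found_inf:
                self.scale *= 0.5      # back off after an overflowing batch
                self.good_steps = 0
            else:
                self.good_steps += 1
                if self.good_steps == self.growth_interval:
                    self.scale *= 2.0  # cautiously grow back
                    self.good_steps = 0

Keeping the scale a power of two makes the scaling exactly invertible in floating point, which is why the logged values stay on the 0.0625 / 0.125 / 0.25 ladder rather than taking arbitrary values.
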
], batch size: 89, lr: 3.07e-02, grad_scale: 0.0625 2024-06-19 15:19:26,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=11403.333333333334, ans=0.125 2024-06-19 15:19:29,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.72 vs. limit=11.776250000000001 2024-06-19 15:19:35,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.07 vs. limit=7.855416666666667 2024-06-19 15:19:41,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=11440.0, ans=0.019000000000000003 2024-06-19 15:19:58,236 INFO [train.py:1028] (0/2) Epoch 1, batch 6250, loss[loss=0.9432, simple_loss=0.6052, pruned_loss=0.6407, over 13197.00 frames. ], tot_loss[loss=0.945, simple_loss=0.6064, pruned_loss=0.6418, over 2567738.44 frames. ], batch size: 83, lr: 3.07e-02, grad_scale: 0.0625 2024-06-19 15:20:00,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.01 vs. limit=7.864583333333334 2024-06-19 15:20:01,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=11458.333333333334, ans=0.18541666666666667 2024-06-19 15:20:09,855 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.563e+03 7.199e+03 9.948e+03 1.234e+04 7.441e+04, threshold=1.990e+04, percent-clipped=1.0 2024-06-19 15:20:20,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=21.42 vs. limit=11.810625 2024-06-19 15:20:27,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=11513.333333333334, ans=0.018694444444444444 2024-06-19 15:20:35,922 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.09 vs. limit=16.14875 2024-06-19 15:20:40,901 INFO [train.py:1028] (0/2) Epoch 1, batch 6300, loss[loss=1.011, simple_loss=0.6225, pruned_loss=0.6994, over 11667.00 frames. ], tot_loss[loss=0.9494, simple_loss=0.6091, pruned_loss=0.6448, over 2564098.26 frames. ], batch size: 17, lr: 3.06e-02, grad_scale: 0.125 2024-06-19 15:20:45,768 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.957e+00 2024-06-19 15:20:49,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.73 vs. limit=11.838125 2024-06-19 15:20:52,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11568.333333333334, ans=0.18431666666666666 2024-06-19 15:21:02,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=25.19 vs. 
limit=11.844999999999999 2024-06-19 15:21:05,458 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=4.063e+00 2024-06-19 15:21:24,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=11641.666666666666, ans=0.125 2024-06-19 15:21:24,914 INFO [train.py:1028] (0/2) Epoch 1, batch 6350, loss[loss=0.937, simple_loss=0.6279, pruned_loss=0.623, over 12525.00 frames. ], tot_loss[loss=0.9574, simple_loss=0.6131, pruned_loss=0.6509, over 2573533.47 frames. ], batch size: 202, lr: 3.06e-02, grad_scale: 0.125 2024-06-19 15:21:26,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=11641.666666666666, ans=0.025 2024-06-19 15:21:29,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.66 vs. limit=16.23125 2024-06-19 15:21:34,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.93 vs. limit=11.872499999999999 2024-06-19 15:21:36,503 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.593e+03 9.663e+03 1.217e+04 1.640e+04 5.662e+04, threshold=2.434e+04, percent-clipped=17.0 2024-06-19 15:21:36,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11660.0, ans=0.1834 2024-06-19 15:21:53,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11678.333333333334, ans=0.18321666666666667 2024-06-19 15:21:54,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=11696.666666666666, ans=0.01793055555555556 2024-06-19 15:22:00,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=31.69 vs. limit=11.88625 2024-06-19 15:22:03,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=11.893125000000001 2024-06-19 15:22:11,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=11733.333333333334, ans=0.025 2024-06-19 15:22:12,038 INFO [train.py:1028] (0/2) Epoch 1, batch 6400, loss[loss=0.922, simple_loss=0.593, pruned_loss=0.6255, over 13174.00 frames. ], tot_loss[loss=0.967, simple_loss=0.6185, pruned_loss=0.6578, over 2575304.18 frames. ], batch size: 67, lr: 3.05e-02, grad_scale: 0.25 2024-06-19 15:22:17,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=11733.333333333334, ans=0.125 2024-06-19 15:22:20,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.25 vs. limit=16.31375 2024-06-19 15:22:37,869 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.424e-02 2024-06-19 15:22:44,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=38.62 vs. 
limit=11.920625000000001 2024-06-19 15:22:49,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=11788.333333333334, ans=0.017548611111111105 2024-06-19 15:23:00,986 INFO [train.py:1028] (0/2) Epoch 1, batch 6450, loss[loss=0.9902, simple_loss=0.6608, pruned_loss=0.6598, over 12568.00 frames. ], tot_loss[loss=0.9718, simple_loss=0.6215, pruned_loss=0.6611, over 2581056.01 frames. ], batch size: 202, lr: 3.05e-02, grad_scale: 0.25 2024-06-19 15:23:08,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=11.94125 2024-06-19 15:23:12,654 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.158e+03 1.193e+04 1.508e+04 2.229e+04 6.080e+04, threshold=3.015e+04, percent-clipped=19.0 2024-06-19 15:23:15,683 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.91 vs. limit=16.396250000000002 2024-06-19 15:23:18,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. limit=7.965416666666666 2024-06-19 15:23:29,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=11898.333333333334, ans=0.025 2024-06-19 15:23:36,950 INFO [train.py:1028] (0/2) Epoch 1, batch 6500, loss[loss=0.9382, simple_loss=0.6386, pruned_loss=0.6189, over 10688.00 frames. ], tot_loss[loss=0.9763, simple_loss=0.6242, pruned_loss=0.6642, over 2584441.39 frames. ], batch size: 303, lr: 3.04e-02, grad_scale: 0.25 2024-06-19 15:23:43,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=11916.666666666666, ans=0.18083333333333335 2024-06-19 15:23:50,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.33 vs. limit=11.975625 2024-06-19 15:23:54,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=11953.333333333334, ans=0.125 2024-06-19 15:23:56,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=11953.333333333334, ans=0.4816333333333333 2024-06-19 15:24:19,005 INFO [train.py:1028] (0/2) Epoch 1, batch 6550, loss[loss=1.056, simple_loss=0.6464, pruned_loss=0.7328, over 12534.00 frames. ], tot_loss[loss=0.9798, simple_loss=0.6262, pruned_loss=0.6667, over 2588056.32 frames. ], batch size: 22, lr: 3.04e-02, grad_scale: 0.125 2024-06-19 15:24:26,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=12.003125 2024-06-19 15:24:38,486 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.576e+03 1.680e+04 2.097e+04 2.799e+04 9.664e+04, threshold=4.195e+04, percent-clipped=17.0 2024-06-19 15:24:48,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.09 vs. 
limit=12.016874999999999 2024-06-19 15:24:56,077 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.28 vs. limit=16.5475 2024-06-19 15:25:00,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12081.666666666666, ans=0.17918333333333333 2024-06-19 15:25:07,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12100.0, ans=0.0 2024-06-19 15:25:07,749 INFO [train.py:1028] (0/2) Epoch 1, batch 6600, loss[loss=0.8903, simple_loss=0.5717, pruned_loss=0.6045, over 13255.00 frames. ], tot_loss[loss=0.9804, simple_loss=0.6267, pruned_loss=0.6671, over 2590335.73 frames. ], batch size: 72, lr: 3.03e-02, grad_scale: 0.25 2024-06-19 15:25:10,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=12100.0, ans=0.47650000000000003 2024-06-19 15:25:13,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=12100.0, ans=0.01625 2024-06-19 15:25:37,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=25.05 vs. limit=12.05125 2024-06-19 15:25:43,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.15 vs. limit=8.03875 2024-06-19 15:25:50,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=16.70 vs. limit=12.065000000000001 2024-06-19 15:25:54,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=12191.666666666666, ans=0.025 2024-06-19 15:25:55,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=12191.666666666666, ans=12.071875 2024-06-19 15:25:55,426 INFO [train.py:1028] (0/2) Epoch 1, batch 6650, loss[loss=0.979, simple_loss=0.6436, pruned_loss=0.6572, over 12919.00 frames. ], tot_loss[loss=0.9852, simple_loss=0.6298, pruned_loss=0.6704, over 2585297.46 frames. ], batch size: 158, lr: 3.03e-02, grad_scale: 0.125 2024-06-19 15:26:01,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=12191.666666666666, ans=0.125 2024-06-19 15:26:07,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.74 vs. limit=16.6575 2024-06-19 15:26:07,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=12210.0, ans=0.025 2024-06-19 15:26:09,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12210.0, ans=0.1779 2024-06-19 15:26:09,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=35.38 vs. 
limit=12.07875 2024-06-19 15:26:10,340 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.596e+03 1.318e+04 1.703e+04 2.370e+04 1.907e+05, threshold=3.406e+04, percent-clipped=8.0 2024-06-19 15:26:18,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=12228.333333333334, ans=0.125 2024-06-19 15:26:20,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.23 vs. limit=16.67125 2024-06-19 15:26:27,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=12246.666666666666, ans=0.125 2024-06-19 15:26:30,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=12265.0, ans=12.099375 2024-06-19 15:26:36,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=18.08 vs. limit=12.099375 2024-06-19 15:26:39,846 INFO [train.py:1028] (0/2) Epoch 1, batch 6700, loss[loss=0.9843, simple_loss=0.6565, pruned_loss=0.656, over 12714.00 frames. ], tot_loss[loss=0.9883, simple_loss=0.6322, pruned_loss=0.6722, over 2583376.58 frames. ], batch size: 176, lr: 3.02e-02, grad_scale: 0.25 2024-06-19 15:26:42,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=8.913333333333334 2024-06-19 15:26:43,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.30 vs. limit=12.10625 2024-06-19 15:26:43,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=32.90 vs. limit=12.10625 2024-06-19 15:26:45,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=12.10625 2024-06-19 15:26:50,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=12301.666666666666, ans=0.125 2024-06-19 15:27:01,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.16 vs. limit=8.08 2024-06-19 15:27:15,795 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=12.126875 2024-06-19 15:27:22,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=12356.666666666666, ans=0.015180555555555558 2024-06-19 15:27:26,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=12356.666666666666, ans=0.38534999999999997 2024-06-19 15:27:26,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=12356.666666666666, ans=0.125 2024-06-19 15:27:27,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.49 vs. 
limit=12.13375 2024-06-19 15:27:27,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=12375.0, ans=0.125 2024-06-19 15:27:28,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.76 vs. limit=8.09375 2024-06-19 15:27:28,245 INFO [train.py:1028] (0/2) Epoch 1, batch 6750, loss[loss=1.027, simple_loss=0.6882, pruned_loss=0.6824, over 12237.00 frames. ], tot_loss[loss=0.9895, simple_loss=0.6332, pruned_loss=0.6729, over 2576660.34 frames. ], batch size: 241, lr: 3.02e-02, grad_scale: 0.125 2024-06-19 15:27:30,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.14 vs. limit=8.09375 2024-06-19 15:27:31,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12375.0, ans=0.17625 2024-06-19 15:27:43,386 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.471e+03 1.488e+04 2.069e+04 2.751e+04 1.024e+05, threshold=4.138e+04, percent-clipped=14.0 2024-06-19 15:27:45,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=12411.666666666666, ans=0.008171376811594203 2024-06-19 15:28:01,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=12430.0, ans=0.125 2024-06-19 15:28:06,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=12430.0, ans=0.125 2024-06-19 15:28:09,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.16 vs. limit=12.168125 2024-06-19 15:28:15,957 INFO [train.py:1028] (0/2) Epoch 1, batch 6800, loss[loss=1.02, simple_loss=0.6521, pruned_loss=0.6936, over 13165.00 frames. ], tot_loss[loss=0.9944, simple_loss=0.6365, pruned_loss=0.6761, over 2578828.10 frames. ], batch size: 67, lr: 3.01e-02, grad_scale: 0.25 2024-06-19 15:28:18,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=12466.666666666666, ans=0.014722222222222227 2024-06-19 15:28:29,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.97 vs. limit=16.86375 2024-06-19 15:28:42,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12521.666666666666, ans=0.17478333333333335 2024-06-19 15:28:42,226 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.93 vs. limit=12.195625 2024-06-19 15:29:00,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.38 vs. limit=11.279166666666667 2024-06-19 15:29:00,193 INFO [train.py:1028] (0/2) Epoch 1, batch 6850, loss[loss=1.097, simple_loss=0.6961, pruned_loss=0.749, over 13245.00 frames. ], tot_loss[loss=0.9994, simple_loss=0.6387, pruned_loss=0.6801, over 2582974.46 frames. 
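A note on reading these entries: each `train.py:1028` record reports the current batch's loss and a running `tot_loss` averaged over recent frames, so the convergence curve can be scraped directly from the log. A minimal sketch, assuming one record per line (wrapped logs would need rejoining first) and a log path that is only a guess:

```python
import re

# Matches the recurring "Epoch E, batch B, ... tot_loss[loss=...]" records.
PATTERN = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    r"tot_loss\[loss=(?P<loss>[0-9.]+)"
)

def scrape_tot_loss(path="zipformer/exp/log/log-train"):
    """Yield (epoch, batch, tot_loss) tuples from a training log."""
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                yield int(m["epoch"]), int(m["batch"]), float(m["loss"])

# e.g. the record above would yield (1, 6850, 0.9994).
```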
], batch size: 63, lr: 3.01e-02, grad_scale: 0.0625 2024-06-19 15:29:01,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=12558.333333333334, ans=0.014340277777777771 2024-06-19 15:29:04,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=12558.333333333334, ans=0.125 2024-06-19 15:29:06,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=12558.333333333334, ans=0.014340277777777771 2024-06-19 15:29:09,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.60 vs. limit=16.932499999999997 2024-06-19 15:29:13,096 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=26.19 vs. limit=16.932499999999997 2024-06-19 15:29:13,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.29 vs. limit=16.932499999999997 2024-06-19 15:29:18,022 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.490e+03 1.554e+04 2.261e+04 3.516e+04 1.635e+05, threshold=4.522e+04, percent-clipped=18.0 2024-06-19 15:29:27,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.39 vs. limit=16.96 2024-06-19 15:29:36,648 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.708e+00 2024-06-19 15:29:38,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=12631.666666666666, ans=0.17368333333333333 2024-06-19 15:29:41,015 INFO [train.py:1028] (0/2) Epoch 1, batch 6900, loss[loss=1.023, simple_loss=0.6434, pruned_loss=0.7009, over 13238.00 frames. ], tot_loss[loss=1.002, simple_loss=0.6407, pruned_loss=0.6818, over 2585198.42 frames. ], batch size: 49, lr: 3.00e-02, grad_scale: 0.125 2024-06-19 15:29:41,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=12650.0, ans=0.008119565217391305 2024-06-19 15:29:45,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.72 vs. limit=16.9875 2024-06-19 15:29:47,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=12650.0, ans=0.1735 2024-06-19 15:30:02,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=12668.333333333334, ans=0.125 2024-06-19 15:30:03,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=12686.666666666666, ans=0.45596666666666674 2024-06-19 15:30:08,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=12686.666666666666, ans=0.013805555555555557 2024-06-19 15:30:20,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.19 vs. 
limit=12.264375000000001 2024-06-19 15:30:21,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=12.27125 2024-06-19 15:30:23,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.11 vs. limit=9.089333333333332 2024-06-19 15:30:25,648 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.40 vs. limit=12.27125 2024-06-19 15:30:27,225 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.77 vs. limit=17.0425 2024-06-19 15:30:28,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=12723.333333333334, ans=0.45468333333333333 2024-06-19 15:30:29,993 INFO [train.py:1028] (0/2) Epoch 1, batch 6950, loss[loss=0.9522, simple_loss=0.5845, pruned_loss=0.6599, over 10942.00 frames. ], tot_loss[loss=1.003, simple_loss=0.6409, pruned_loss=0.6824, over 2578734.68 frames. ], batch size: 16, lr: 3.00e-02, grad_scale: 0.125 2024-06-19 15:30:30,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=12741.666666666666, ans=0.013576388888888895 2024-06-19 15:30:40,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=12741.666666666666, ans=0.013576388888888895 2024-06-19 15:30:41,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.36 vs. limit=12.285 2024-06-19 15:30:45,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=12760.0, ans=0.4534 2024-06-19 15:30:50,987 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.023e+03 8.847e+03 1.233e+04 1.711e+04 6.541e+04, threshold=2.466e+04, percent-clipped=2.0 2024-06-19 15:30:57,994 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=27.90 vs. limit=11.389166666666668 2024-06-19 15:31:17,769 INFO [train.py:1028] (0/2) Epoch 1, batch 7000, loss[loss=0.9591, simple_loss=0.6403, pruned_loss=0.6389, over 12919.00 frames. ], tot_loss[loss=1.004, simple_loss=0.6411, pruned_loss=0.6833, over 2575804.96 frames. ], batch size: 158, lr: 2.99e-02, grad_scale: 0.25 2024-06-19 15:31:19,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=12833.333333333334, ans=0.125 2024-06-19 15:31:20,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=12833.333333333334, ans=0.008079710144927536 2024-06-19 15:31:20,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=12833.333333333334, ans=0.125 2024-06-19 15:31:21,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=40.70 vs. limit=12.3125 2024-06-19 15:31:22,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.49 vs. 
limit=17.125 2024-06-19 15:31:22,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=12833.333333333334, ans=0.008079710144927536 2024-06-19 15:31:26,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=12851.666666666666, ans=0.125 2024-06-19 15:31:38,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=12870.0, ans=0.125 2024-06-19 15:31:40,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=12870.0, ans=0.125 2024-06-19 15:31:45,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.31 vs. limit=12.333124999999999 2024-06-19 15:31:49,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=12888.333333333334, ans=0.125 2024-06-19 15:32:03,231 INFO [train.py:1028] (0/2) Epoch 1, batch 7050, loss[loss=0.9772, simple_loss=0.6391, pruned_loss=0.6577, over 12701.00 frames. ], tot_loss[loss=1.012, simple_loss=0.6455, pruned_loss=0.6891, over 2581918.54 frames. ], batch size: 176, lr: 2.99e-02, grad_scale: 0.25 2024-06-19 15:32:06,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12925.0, ans=0.17074999999999999 2024-06-19 15:32:06,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.58 vs. limit=17.19375 2024-06-19 15:32:07,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=12925.0, ans=0.447625 2024-06-19 15:32:14,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=12943.333333333334, ans=0.012736111111111108 2024-06-19 15:32:18,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=31.11 vs. limit=17.2075 2024-06-19 15:32:21,876 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.414e+03 1.230e+04 1.674e+04 2.295e+04 8.542e+04, threshold=3.349e+04, percent-clipped=17.0 2024-06-19 15:32:44,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=12998.333333333334, ans=0.17001666666666665 2024-06-19 15:32:46,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=26.81 vs. limit=12.374375 2024-06-19 15:32:49,141 INFO [train.py:1028] (0/2) Epoch 1, batch 7100, loss[loss=0.9872, simple_loss=0.649, pruned_loss=0.6627, over 13144.00 frames. ], tot_loss[loss=1.01, simple_loss=0.6453, pruned_loss=0.6869, over 2573706.06 frames. ], batch size: 112, lr: 2.98e-02, grad_scale: 0.25 2024-06-19 15:32:49,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=13016.666666666666, ans=0.025 2024-06-19 15:32:54,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.76 vs. 
limit=17.2625 2024-06-19 15:32:55,887 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.92 vs. limit=4.9525 2024-06-19 15:33:04,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.66 vs. limit=12.388124999999999 2024-06-19 15:33:21,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.73 vs. limit=12.395 2024-06-19 15:33:21,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.92 vs. limit=12.395 2024-06-19 15:33:24,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.97 vs. limit=11.535833333333333 2024-06-19 15:33:26,579 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.74 vs. limit=8.267916666666666 2024-06-19 15:33:32,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.13 vs. limit=17.317500000000003 2024-06-19 15:33:33,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=13090.0, ans=0.025 2024-06-19 15:33:38,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=13090.0, ans=0.125 2024-06-19 15:33:39,321 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.97 vs. limit=17.317500000000003 2024-06-19 15:33:40,291 INFO [train.py:1028] (0/2) Epoch 1, batch 7150, loss[loss=1.031, simple_loss=0.6941, pruned_loss=0.6837, over 12527.00 frames. ], tot_loss[loss=1.014, simple_loss=0.6469, pruned_loss=0.6901, over 2572821.26 frames. ], batch size: 202, lr: 2.98e-02, grad_scale: 0.25 2024-06-19 15:33:42,490 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.53 vs. limit=12.415625 2024-06-19 15:33:49,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=13126.666666666666, ans=0.125 2024-06-19 15:33:54,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=12.4225 2024-06-19 15:33:57,856 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.179e+03 1.152e+04 1.599e+04 2.117e+04 7.402e+04, threshold=3.198e+04, percent-clipped=9.0 2024-06-19 15:34:02,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=13145.0, ans=0.439925 2024-06-19 15:34:23,991 INFO [train.py:1028] (0/2) Epoch 1, batch 7200, loss[loss=1.049, simple_loss=0.6774, pruned_loss=0.7098, over 13185.00 frames. ], tot_loss[loss=1.017, simple_loss=0.6493, pruned_loss=0.6922, over 2577508.57 frames. 
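The recurring `optim.py:487` warnings summarize recent gradient norms as order statistics; in every instance in this section the logged `threshold` is exactly `Clipping_scale` (2.0) times the median quartile, e.g. 2.0 × 1.599e+04 = 3.198e+04 in the warning just above, and `percent-clipped` reports how often that threshold was exceeded. A hedged sketch of such a diagnostic (the history handling is simplified; the real optimizer keeps its own running statistics):

```python
import torch

def clipping_diagnostics(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Order statistics over recent per-batch gradient norms.
    quartiles = torch.quantile(
        grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2]   # scale times the median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

# Norms shaped like the warning above give threshold = 2 * median = 3.198e+04.
norms = torch.tensor([5.179e3, 1.152e4, 1.599e4, 2.117e4, 7.402e4])
q, thr, pct = clipping_diagnostics(norms)
print(q.tolist(), thr.item(), pct.item())
```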
], batch size: 112, lr: 2.97e-02, grad_scale: 0.125 2024-06-19 15:34:35,233 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.91 vs. limit=9.287333333333333 2024-06-19 15:34:37,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=13218.333333333334, ans=0.125 2024-06-19 15:34:39,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13218.333333333334, ans=0.16781666666666667 2024-06-19 15:34:41,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.96 vs. limit=12.463750000000001 2024-06-19 15:34:41,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=13236.666666666666, ans=9.294666666666666 2024-06-19 15:34:43,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=12.463750000000001 2024-06-19 15:34:43,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=13236.666666666666, ans=0.007992028985507247 2024-06-19 15:34:47,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.55 vs. limit=12.463750000000001 2024-06-19 15:34:50,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=13255.0, ans=0.43607500000000005 2024-06-19 15:34:52,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=13255.0, ans=0.011437500000000003 2024-06-19 15:34:55,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=13255.0, ans=0.125 2024-06-19 15:34:58,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=13273.333333333334, ans=0.011361111111111107 2024-06-19 15:34:59,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=30.09 vs. limit=17.455 2024-06-19 15:35:00,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=13273.333333333334, ans=0.43543333333333334 2024-06-19 15:35:06,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=13291.666666666666, ans=0.04949747468305833 2024-06-19 15:35:06,740 INFO [train.py:1028] (0/2) Epoch 1, batch 7250, loss[loss=0.9237, simple_loss=0.5809, pruned_loss=0.6332, over 12976.00 frames. ], tot_loss[loss=1.018, simple_loss=0.6501, pruned_loss=0.6932, over 2578386.83 frames. 
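The `scaling.py:1023` "Whitening" messages fire when a module's output covariance becomes too anisotropic: `metric` measures how far the channel covariance is from a multiple of the identity (1.0 when perfectly white), and a message is logged whenever it exceeds the scheduled `limit`. The following is a paraphrase of the idea, not the exact scaling.py code (which works per group and with smoothing):

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """Covariance anisotropy: ~1.0 for 'white' features, growing as a
    few directions dominate the channel covariance."""
    x = x.reshape(-1, x.shape[-1])        # (frames, channels)
    cov = (x.T @ x) / x.shape[0]          # channel covariance
    d = cov.shape[0]
    return d * (cov ** 2).sum() / cov.trace() ** 2

x = torch.randn(2000, 384)
print(whitening_metric(x))               # close to 1: already white
x[:, 0] *= 20.0                          # one channel dominates
print(whitening_metric(x))               # ~100, far above limits like 7.5-13
```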
], batch size: 36, lr: 2.97e-02, grad_scale: 0.125 2024-06-19 15:35:09,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=13291.666666666666, ans=0.399375 2024-06-19 15:35:10,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=13291.666666666666, ans=0.125 2024-06-19 15:35:12,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=13291.666666666666, ans=0.125 2024-06-19 15:35:27,610 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 15:35:33,244 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.443e+03 1.599e+04 1.986e+04 2.820e+04 1.891e+05, threshold=3.971e+04, percent-clipped=22.0 2024-06-19 15:35:38,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=13328.333333333334, ans=0.125 2024-06-19 15:35:45,999 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.30 vs. limit=17.509999999999998 2024-06-19 15:35:53,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=13365.0, ans=0.125 2024-06-19 15:35:57,879 INFO [train.py:1028] (0/2) Epoch 1, batch 7300, loss[loss=0.9981, simple_loss=0.6252, pruned_loss=0.6855, over 12981.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6513, pruned_loss=0.6929, over 2577875.83 frames. ], batch size: 36, lr: 2.96e-02, grad_scale: 0.25 2024-06-19 15:36:14,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.44 vs. limit=9.360666666666667 2024-06-19 15:36:16,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13401.666666666666, ans=0.16598333333333334 2024-06-19 15:36:17,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.63 vs. limit=12.532499999999999 2024-06-19 15:36:28,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=13438.333333333334, ans=0.4296583333333333 2024-06-19 15:36:34,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=23.35 vs. limit=12.54625 2024-06-19 15:36:43,979 INFO [train.py:1028] (0/2) Epoch 1, batch 7350, loss[loss=1.109, simple_loss=0.6999, pruned_loss=0.7588, over 13330.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6524, pruned_loss=0.693, over 2579292.25 frames. ], batch size: 46, lr: 2.96e-02, grad_scale: 0.125 2024-06-19 15:36:50,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=13475.0, ans=0.125 2024-06-19 15:36:55,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=13493.333333333334, ans=0.125 2024-06-19 15:37:01,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=24.58 vs. 
limit=12.566875 2024-06-19 15:37:02,629 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.463e+03 1.230e+04 1.765e+04 2.410e+04 1.110e+05, threshold=3.530e+04, percent-clipped=10.0 2024-06-19 15:37:11,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13530.0, ans=0.16469999999999999 2024-06-19 15:37:25,774 INFO [train.py:1028] (0/2) Epoch 1, batch 7400, loss[loss=1.139, simple_loss=0.7298, pruned_loss=0.7743, over 13245.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6531, pruned_loss=0.6929, over 2584894.52 frames. ], batch size: 63, lr: 2.95e-02, grad_scale: 0.25 2024-06-19 15:37:27,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=13566.666666666666, ans=0.4251666666666667 2024-06-19 15:37:30,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.67 vs. limit=12.5875 2024-06-19 15:37:32,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=13566.666666666666, ans=8.391666666666666 2024-06-19 15:37:45,543 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.75 vs. limit=12.60125 2024-06-19 15:38:16,476 INFO [train.py:1028] (0/2) Epoch 1, batch 7450, loss[loss=1.014, simple_loss=0.631, pruned_loss=0.6985, over 12759.00 frames. ], tot_loss[loss=1.019, simple_loss=0.6525, pruned_loss=0.6931, over 2577275.27 frames. ], batch size: 29, lr: 2.95e-02, grad_scale: 0.25 2024-06-19 15:38:21,438 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=47.60 vs. limit=12.621875 2024-06-19 15:38:28,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=13676.666666666666, ans=0.025 2024-06-19 15:38:39,052 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.142e+03 1.233e+04 1.526e+04 1.898e+04 6.914e+04, threshold=3.052e+04, percent-clipped=5.0 2024-06-19 15:38:53,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=13713.333333333334, ans=0.4200333333333333 2024-06-19 15:39:06,936 INFO [train.py:1028] (0/2) Epoch 1, batch 7500, loss[loss=0.9332, simple_loss=0.6296, pruned_loss=0.6185, over 10646.00 frames. ], tot_loss[loss=1.026, simple_loss=0.6563, pruned_loss=0.6981, over 2575628.97 frames. 
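The three numbers inside each `loss[...]` and `tot_loss[...]` are related: throughout this section the combined `loss` equals `0.5 * simple_loss + pruned_loss`, consistent with a pruned-transducer objective whose pruned-loss weight has finished its warmup ramp by this point. A small check, with the scales inferred from the logged values rather than taken from the code:

```python
# Hypothetical reconstruction, consistent with every tot_loss entry here.
def combined_loss(simple_loss, pruned_loss,
                  simple_scale=0.5, pruned_scale=1.0):
    return simple_scale * simple_loss + pruned_scale * pruned_loss

# Entry above: tot_loss[loss=1.026, simple_loss=0.6563, pruned_loss=0.6981]
assert abs(combined_loss(0.6563, 0.6981) - 1.026) < 1e-3
```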
], batch size: 303, lr: 2.94e-02, grad_scale: 0.25 2024-06-19 15:39:09,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=13750.0, ans=0.125 2024-06-19 15:39:15,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13768.333333333334, ans=0.0 2024-06-19 15:39:16,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13768.333333333334, ans=0.16231666666666666 2024-06-19 15:39:17,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=13768.333333333334, ans=10.0 2024-06-19 15:39:19,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=13768.333333333334, ans=0.125 2024-06-19 15:39:22,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.29 vs. limit=12.67 2024-06-19 15:39:29,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13786.666666666666, ans=0.16213333333333332 2024-06-19 15:39:35,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=12.676874999999999 2024-06-19 15:39:47,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13823.333333333334, ans=0.16176666666666667 2024-06-19 15:39:49,270 INFO [train.py:1028] (0/2) Epoch 1, batch 7550, loss[loss=1.01, simple_loss=0.6655, pruned_loss=0.6768, over 12944.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6569, pruned_loss=0.6968, over 2575936.14 frames. ], batch size: 158, lr: 2.94e-02, grad_scale: 0.125 2024-06-19 15:39:51,282 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=6.953e-02 2024-06-19 15:40:01,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.64 vs. limit=17.895 2024-06-19 15:40:04,269 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=19.87 vs. limit=12.6975 2024-06-19 15:40:11,783 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.004e+03 8.419e+03 1.080e+04 1.610e+04 7.338e+04, threshold=2.161e+04, percent-clipped=5.0 2024-06-19 15:40:15,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=9.558666666666667 2024-06-19 15:40:27,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=13915.0, ans=0.125 2024-06-19 15:40:32,637 INFO [train.py:1028] (0/2) Epoch 1, batch 7600, loss[loss=0.9968, simple_loss=0.633, pruned_loss=0.6803, over 13210.00 frames. ], tot_loss[loss=1.029, simple_loss=0.6588, pruned_loss=0.6994, over 2577383.57 frames. 
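The `lr` values decay smoothly with the batch index: a formula of the Eden-schedule form, with constants fitted to the logged values, reproduces them, e.g. 2.94e-02 at batch 7500 as logged just above. A sketch under those assumptions (the finished-epoch count is taken as 0, which is what the values imply during this first epoch):

```python
def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Inverse-fourth-root decay in both the step and the epoch counts.
    batch_factor = ((step**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Matches the log: batch 7500 -> 2.94e-02, batch 7800 -> 2.91e-02.
assert abs(eden_lr(0.035, 7500, 0) - 2.94e-2) < 5e-5
assert abs(eden_lr(0.035, 7800, 0) - 2.91e-2) < 5e-5
```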
], batch size: 83, lr: 2.93e-02, grad_scale: 0.25 2024-06-19 15:40:33,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=13933.333333333334, ans=0.025 2024-06-19 15:40:37,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.00 vs. limit=17.95 2024-06-19 15:40:42,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=13951.666666666666, ans=0.008534722222222228 2024-06-19 15:40:58,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13970.0, ans=0.0 2024-06-19 15:41:01,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.92 vs. limit=17.9775 2024-06-19 15:41:03,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.21 vs. limit=12.73875 2024-06-19 15:41:11,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=13988.333333333334, ans=0.008381944444444442 2024-06-19 15:41:13,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=13988.333333333334, ans=0.125 2024-06-19 15:41:14,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=55.11 vs. limit=12.752500000000001 2024-06-19 15:41:27,803 INFO [train.py:1028] (0/2) Epoch 1, batch 7650, loss[loss=1.04, simple_loss=0.6472, pruned_loss=0.7164, over 12856.00 frames. ], tot_loss[loss=1.031, simple_loss=0.6598, pruned_loss=0.7011, over 2572156.65 frames. ], batch size: 33, lr: 2.93e-02, grad_scale: 0.25 2024-06-19 15:41:28,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=14025.0, ans=0.125 2024-06-19 15:41:33,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=32.38 vs. limit=12.759375 2024-06-19 15:41:33,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=14025.0, ans=0.125 2024-06-19 15:41:39,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=14043.333333333334, ans=0.4084833333333333 2024-06-19 15:41:51,884 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.008e+03 8.799e+03 1.196e+04 1.710e+04 8.040e+04, threshold=2.393e+04, percent-clipped=12.0 2024-06-19 15:41:55,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=70.40 vs. limit=12.780000000000001 2024-06-19 15:42:00,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=49.89 vs. 
limit=12.04 2024-06-19 15:42:02,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=14080.0, ans=0.125 2024-06-19 15:42:06,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=14098.333333333334, ans=0.025 2024-06-19 15:42:10,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=12.786875 2024-06-19 15:42:11,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=22.02 vs. limit=12.786875 2024-06-19 15:42:13,039 INFO [train.py:1028] (0/2) Epoch 1, batch 7700, loss[loss=1.12, simple_loss=0.72, pruned_loss=0.7596, over 13266.00 frames. ], tot_loss[loss=1.03, simple_loss=0.6594, pruned_loss=0.7001, over 2568885.62 frames. ], batch size: 63, lr: 2.92e-02, grad_scale: 0.5 2024-06-19 15:42:37,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.68 vs. limit=12.807500000000001 2024-06-19 15:42:45,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14171.666666666666, ans=0.15828333333333333 2024-06-19 15:42:55,677 INFO [train.py:1028] (0/2) Epoch 1, batch 7750, loss[loss=1.048, simple_loss=0.6749, pruned_loss=0.7102, over 13265.00 frames. ], tot_loss[loss=1.027, simple_loss=0.6596, pruned_loss=0.6976, over 2573602.44 frames. ], batch size: 72, lr: 2.92e-02, grad_scale: 0.25 2024-06-19 15:43:09,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=14226.666666666666, ans=0.007388888888888896 2024-06-19 15:43:17,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=14226.666666666666, ans=0.40206666666666674 2024-06-19 15:43:18,895 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.228e+03 2024-06-19 15:43:25,846 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.996e+03 1.334e+04 1.762e+04 2.386e+04 7.682e+04, threshold=3.523e+04, percent-clipped=23.0 2024-06-19 15:43:29,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=14263.333333333334, ans=0.00723611111111111 2024-06-19 15:43:36,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=14281.666666666666, ans=0.40014166666666673 2024-06-19 15:43:37,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=14281.666666666666, ans=0.40014166666666673 2024-06-19 15:43:39,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.31 vs. limit=18.21125 2024-06-19 15:43:41,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=14281.666666666666, ans=0.07 2024-06-19 15:43:52,012 INFO [train.py:1028] (0/2) Epoch 1, batch 7800, loss[loss=1.08, simple_loss=0.708, pruned_loss=0.7256, over 13108.00 frames. ], tot_loss[loss=1.032, simple_loss=0.6622, pruned_loss=0.7009, over 2579161.54 frames. 
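The `ScheduledFloat` entries are hyperparameters that vary with `batch_count`. The `dropout_p` values in this section are consistent with a piecewise-linear schedule from 0.3 at batch_count 0 to 0.1 at 20000; e.g. batch_count=14171.67 gives ans=0.15828 as logged above. A sketch under that inferred schedule (the endpoints are fitted, not taken from the code):

```python
def scheduled_float(batch_count, points):
    """Piecewise-linear schedule over batch_count; clamps at the ends.
    points: ((x0, y0), (x1, y1), ...) with increasing x."""
    (x0, y0) = points[0]
    if batch_count <= x0:
        return y0
    for (x1, y1) in points[1:]:
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
        x0, y0 = x1, y1
    return y0

dropout_p = scheduled_float(14171.666, ((0, 0.3), (20000, 0.1)))
print(round(dropout_p, 5))  # 0.15828, matching the log entry above
```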
], batch size: 95, lr: 2.91e-02, grad_scale: 0.25 2024-06-19 15:43:52,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.56 vs. limit=12.8625 2024-06-19 15:43:53,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=14300.0, ans=0.09899494936611666 2024-06-19 15:43:59,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=14318.333333333334, ans=0.007756884057971015 2024-06-19 15:44:01,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=14318.333333333334, ans=0.15681666666666666 2024-06-19 15:44:17,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.61 vs. limit=8.588750000000001 2024-06-19 15:44:20,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=24.17 vs. limit=12.883125 2024-06-19 15:44:28,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=14373.333333333334, ans=12.89 2024-06-19 15:44:32,211 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.712e+00 2024-06-19 15:44:36,057 INFO [train.py:1028] (0/2) Epoch 1, batch 7850, loss[loss=1.091, simple_loss=0.664, pruned_loss=0.7591, over 11307.00 frames. ], tot_loss[loss=1.037, simple_loss=0.6652, pruned_loss=0.7049, over 2572605.89 frames. ], batch size: 17, lr: 2.91e-02, grad_scale: 0.125 2024-06-19 15:44:39,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=14391.666666666666, ans=0.006701388888888889 2024-06-19 15:44:44,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=14410.0, ans=12.903749999999999 2024-06-19 15:44:49,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=14410.0, ans=0.125 2024-06-19 15:44:56,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=14428.333333333334, ans=0.3950083333333333 2024-06-19 15:45:00,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=12.910625 2024-06-19 15:45:01,346 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.179e+03 9.341e+03 1.250e+04 1.829e+04 9.521e+04, threshold=2.499e+04, percent-clipped=9.0 2024-06-19 15:45:01,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=14446.666666666666, ans=0.125 2024-06-19 15:45:02,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.36 vs. limit=12.9175 2024-06-19 15:45:03,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.46 vs. 
limit=18.335 2024-06-19 15:45:10,042 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=22.40 vs. limit=12.9175 2024-06-19 15:45:12,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=33.76 vs. limit=12.924375000000001 2024-06-19 15:45:14,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=14465.0, ans=0.125 2024-06-19 15:45:17,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=14465.0, ans=0.393725 2024-06-19 15:45:20,054 INFO [train.py:1028] (0/2) Epoch 1, batch 7900, loss[loss=1.05, simple_loss=0.6692, pruned_loss=0.7152, over 13185.00 frames. ], tot_loss[loss=1.038, simple_loss=0.6658, pruned_loss=0.7047, over 2571204.31 frames. ], batch size: 77, lr: 2.90e-02, grad_scale: 0.25 2024-06-19 15:45:24,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=14483.333333333334, ans=0.035 2024-06-19 15:45:55,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=14538.333333333334, ans=0.0 2024-06-19 15:45:56,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14538.333333333334, ans=0.15461666666666665 2024-06-19 15:46:07,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.15 vs. limit=18.4175 2024-06-19 15:46:11,422 INFO [train.py:1028] (0/2) Epoch 1, batch 7950, loss[loss=0.9692, simple_loss=0.6594, pruned_loss=0.6395, over 10571.00 frames. ], tot_loss[loss=1.039, simple_loss=0.6668, pruned_loss=0.7053, over 2574273.23 frames. ], batch size: 304, lr: 2.90e-02, grad_scale: 0.125 2024-06-19 15:46:25,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=14593.333333333334, ans=0.3892333333333333 2024-06-19 15:46:34,212 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.534e-02 2024-06-19 15:46:40,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.85 vs. limit=18.458750000000002 2024-06-19 15:46:41,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=44.10 vs. limit=12.315000000000001 2024-06-19 15:46:43,109 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.329e+03 1.257e+04 1.902e+04 2.657e+04 1.034e+05, threshold=3.803e+04, percent-clipped=27.0 2024-06-19 15:46:48,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=7.37 vs. limit=5.0 2024-06-19 15:46:49,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.39 vs. 
limit=18.4725 2024-06-19 15:46:50,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=14648.333333333334, ans=0.025 2024-06-19 15:46:51,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.91 vs. limit=8.662083333333333 2024-06-19 15:46:54,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.64 vs. limit=12.324166666666667 2024-06-19 15:46:56,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.31 vs. limit=18.48625 2024-06-19 15:46:59,308 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-8000.pt 2024-06-19 15:47:05,810 INFO [train.py:1028] (0/2) Epoch 1, batch 8000, loss[loss=1.016, simple_loss=0.6263, pruned_loss=0.7028, over 12742.00 frames. ], tot_loss[loss=1.042, simple_loss=0.6677, pruned_loss=0.7084, over 2572188.43 frames. ], batch size: 29, lr: 2.89e-02, grad_scale: 0.125 2024-06-19 15:47:09,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=91.19 vs. limit=13.0 2024-06-19 15:47:15,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.52 vs. limit=18.51375 2024-06-19 15:47:18,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=14685.0, ans=0.007677173913043478 2024-06-19 15:47:20,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=14685.0, ans=0.125 2024-06-19 15:47:27,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=13.01375 2024-06-19 15:47:29,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=14721.666666666666, ans=0.0053263888888888875 2024-06-19 15:47:33,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=40.65 vs. limit=12.360833333333332 2024-06-19 15:47:41,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.57 vs. limit=8.685 2024-06-19 15:47:48,030 INFO [train.py:1028] (0/2) Epoch 1, batch 8050, loss[loss=1.06, simple_loss=0.6834, pruned_loss=0.7179, over 13217.00 frames. ], tot_loss[loss=1.04, simple_loss=0.6664, pruned_loss=0.7069, over 2572686.78 frames. ], batch size: 83, lr: 2.89e-02, grad_scale: 0.125 2024-06-19 15:47:52,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14758.333333333334, ans=0.15241666666666667 2024-06-19 15:47:56,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.07 vs. limit=12.388333333333332 2024-06-19 15:47:58,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.87 vs. 
2024-06-19 15:48:01,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=13.04125
2024-06-19 15:48:01,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=68.55 vs. limit=13.04125
2024-06-19 15:48:06,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=14795.0, ans=0.0
2024-06-19 15:48:15,980 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.894e+03 1.004e+04 1.467e+04 1.959e+04 6.956e+04, threshold=2.934e+04, percent-clipped=6.0
2024-06-19 15:48:25,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=83.33 vs. limit=13.061875
2024-06-19 15:48:30,706 INFO [train.py:1028] (0/2) Epoch 1, batch 8100, loss[loss=1.046, simple_loss=0.6751, pruned_loss=0.7088, over 13203.00 frames. ], tot_loss[loss=1.042, simple_loss=0.6685, pruned_loss=0.7082, over 2577361.45 frames. ], batch size: 112, lr: 2.88e-02, grad_scale: 0.125
2024-06-19 15:48:47,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.41 vs. limit=18.651249999999997
2024-06-19 15:49:00,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.47 vs. limit=13.0825
2024-06-19 15:49:18,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=92.93 vs. limit=13.096250000000001
2024-06-19 15:49:19,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=14923.333333333334, ans=0.004486111111111107
2024-06-19 15:49:22,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=14923.333333333334, ans=0.125
2024-06-19 15:49:27,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.81 vs. limit=12.470833333333333
2024-06-19 15:49:27,825 INFO [train.py:1028] (0/2) Epoch 1, batch 8150, loss[loss=0.9986, simple_loss=0.6501, pruned_loss=0.6736, over 13055.00 frames. ], tot_loss[loss=1.046, simple_loss=0.6693, pruned_loss=0.7115, over 2580041.60 frames. ], batch size: 121, lr: 2.88e-02, grad_scale: 0.0625
2024-06-19 15:49:31,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.97 vs. limit=13.103125
2024-06-19 15:49:40,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.33 vs. limit=13.11
2024-06-19 15:49:43,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=13.11
2024-06-19 15:49:51,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.96 vs. limit=13.116875
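The Clipping_scale warnings list five numbers (min, 25%, median, 75%, max of recent gradient norms) and a clipping threshold that, in every instance logged here, equals 2.0 times the median (e.g. 2 x 1.467e+04 = 2.934e+04 above). A sketch of that bookkeeping, assuming a sliding window of recent norms; this is not the optim.py code itself:

```python
# Track recent gradient norms, clip at clipping_scale x running median,
# and expose the quartiles that the WARNING lines report.
from collections import deque
import torch

class GradNormClipper:
    def __init__(self, clipping_scale=2.0, window=50):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_clipped = 0

    def clip_factor(self, grad_norm: float) -> float:
        self.norms.append(grad_norm)
        t = torch.tensor(list(self.norms))
        q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * q[2].item()  # 2.0 x median, as in the log
        if grad_norm > threshold:
            self.num_clipped += 1
            return threshold / grad_norm  # multiply gradients by this
        return 1.0
```

percent-clipped would then be 100 * num_clipped / batches since the last report.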
2024-06-19 15:49:56,345 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.247e+03 8.756e+03 1.130e+04 1.576e+04 8.630e+04, threshold=2.259e+04, percent-clipped=4.0
2024-06-19 15:50:12,053 INFO [train.py:1028] (0/2) Epoch 1, batch 8200, loss[loss=1.045, simple_loss=0.6801, pruned_loss=0.7045, over 13163.00 frames. ], tot_loss[loss=1.048, simple_loss=0.6707, pruned_loss=0.7125, over 2583124.50 frames. ], batch size: 112, lr: 2.88e-02, grad_scale: 0.125
2024-06-19 15:50:14,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=15033.333333333334, ans=0.125
2024-06-19 15:50:17,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=15033.333333333334, ans=0.004027777777777776
2024-06-19 15:50:23,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=15051.666666666666, ans=0.125
2024-06-19 15:50:28,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=13.144375
2024-06-19 15:50:39,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.34 vs. limit=13.158125
2024-06-19 15:50:50,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=15106.666666666666, ans=0.125
2024-06-19 15:50:53,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=15106.666666666666, ans=0.125
2024-06-19 15:50:55,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=15106.666666666666, ans=0.125
2024-06-19 15:50:56,548 INFO [train.py:1028] (0/2) Epoch 1, batch 8250, loss[loss=1.096, simple_loss=0.7038, pruned_loss=0.7443, over 13190.00 frames. ], tot_loss[loss=1.048, simple_loss=0.671, pruned_loss=0.7126, over 2583121.72 frames. ], batch size: 52, lr: 2.87e-02, grad_scale: 0.125
2024-06-19 15:50:56,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=15125.0, ans=0.007581521739130435
2024-06-19 15:51:23,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=15161.666666666666, ans=0.125
2024-06-19 15:51:26,671 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.37 vs. limit=18.884999999999998
2024-06-19 15:51:29,221 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.859e+03 1.031e+04 1.279e+04 1.778e+04 6.785e+04, threshold=2.558e+04, percent-clipped=11.0
2024-06-19 15:51:34,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.08 vs. limit=18.89875
2024-06-19 15:51:40,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=15198.333333333334, ans=0.0
2024-06-19 15:51:42,072 INFO [train.py:1028] (0/2) Epoch 1, batch 8300, loss[loss=0.9678, simple_loss=0.6243, pruned_loss=0.6556, over 13100.00 frames. ], tot_loss[loss=1.043, simple_loss=0.6681, pruned_loss=0.709, over 2580586.28 frames. ], batch size: 103, lr: 2.87e-02, grad_scale: 0.25
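The Whitening lines compare a per-module statistic ("metric") against a scheduled limit (the limits grow slowly over training, cf. the whitening_limit ScheduledFloat entries below) and penalize the module when it is exceeded. One plausible whiteness statistic is sketched here; it equals 1.0 for a perfectly isotropic (white) feature covariance and grows as energy concentrates in few directions. This is an illustrative stand-in, not a transcription of scaling.py:

```python
# metric = d * trace(C @ C) / trace(C)**2 for feature covariance C of
# dimension d: 1.0 when C is a multiple of the identity, up to d when
# all variance sits in one direction.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    c = (x.t() @ x) / x.shape[0]            # covariance, (d, d)
    d = c.shape[0]
    return d * (c @ c).diagonal().sum() / c.diagonal().sum() ** 2

x = torch.randn(1000, 384)
print(whitening_metric(x))   # ~1 + d/n for white noise (sampling bias)
```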
2024-06-19 15:51:43,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.69 vs. limit=18.9125
2024-06-19 15:52:05,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=13.219999999999999
2024-06-19 15:52:08,682 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=5.054e-03
2024-06-19 15:52:21,006 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=4.820e+00
2024-06-19 15:52:30,322 INFO [train.py:1028] (0/2) Epoch 1, batch 8350, loss[loss=1.031, simple_loss=0.678, pruned_loss=0.6924, over 13161.00 frames. ], tot_loss[loss=1.045, simple_loss=0.6682, pruned_loss=0.7105, over 2581980.26 frames. ], batch size: 112, lr: 2.86e-02, grad_scale: 0.125
2024-06-19 15:52:38,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=68.45 vs. limit=12.663333333333334
2024-06-19 15:52:47,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=15345.0, ans=0.007533695652173914
2024-06-19 15:52:57,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=15363.333333333334, ans=0.125
2024-06-19 15:52:58,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.861e+03 1.233e+04 1.610e+04 2.127e+04 1.379e+05, threshold=3.220e+04, percent-clipped=19.0
2024-06-19 15:52:58,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.41 vs. limit=19.0225
2024-06-19 15:52:59,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=7.23 vs. limit=5.3045
2024-06-19 15:53:03,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.29 vs. limit=13.268125000000001
2024-06-19 15:53:07,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.14 vs. limit=19.036250000000003
2024-06-19 15:53:09,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=15381.666666666666, ans=0.002576388888888892
2024-06-19 15:53:11,676 INFO [train.py:1028] (0/2) Epoch 1, batch 8400, loss[loss=1.039, simple_loss=0.6458, pruned_loss=0.7162, over 12972.00 frames. ], tot_loss[loss=1.045, simple_loss=0.6683, pruned_loss=0.7106, over 2577673.26 frames. ], batch size: 39, lr: 2.86e-02, grad_scale: 0.25
2024-06-19 15:53:13,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.33 vs. limit=19.05
2024-06-19 15:53:15,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=13.275
2024-06-19 15:53:27,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.57 vs. limit=19.06375
2024-06-19 15:53:42,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=15455.0, ans=0.35907500000000003
2024-06-19 15:53:45,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.50 vs. limit=8.868333333333334
2024-06-19 15:53:45,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=15473.333333333334, ans=0.125
2024-06-19 15:53:56,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.25 vs. limit=5.32375
2024-06-19 15:53:57,083 INFO [train.py:1028] (0/2) Epoch 1, batch 8450, loss[loss=1.019, simple_loss=0.6608, pruned_loss=0.6884, over 13201.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6704, pruned_loss=0.7118, over 2579103.52 frames. ], batch size: 112, lr: 2.85e-02, grad_scale: 0.125
2024-06-19 15:54:05,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=35.36 vs. limit=13.31625
2024-06-19 15:54:25,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=15528.333333333334, ans=0.125
2024-06-19 15:54:25,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=15528.333333333334, ans=0.3565083333333333
2024-06-19 15:54:27,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.29 vs. limit=13.323125000000001
2024-06-19 15:54:35,741 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.624e+03 1.005e+04 1.413e+04 1.774e+04 6.103e+04, threshold=2.826e+04, percent-clipped=13.0
2024-06-19 15:54:43,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15565.0, ans=0.14435
2024-06-19 15:54:45,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=15565.0, ans=0.125
2024-06-19 15:54:47,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=15565.0, ans=0.09899494936611666
2024-06-19 15:54:48,442 INFO [train.py:1028] (0/2) Epoch 1, batch 8500, loss[loss=1.078, simple_loss=0.6678, pruned_loss=0.7442, over 12795.00 frames. ], tot_loss[loss=1.049, simple_loss=0.6716, pruned_loss=0.7129, over 2577487.24 frames. ], batch size: 29, lr: 2.85e-02, grad_scale: 0.125
2024-06-19 15:54:48,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=15583.333333333334, ans=0.125
2024-06-19 15:55:02,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=15601.666666666666, ans=0.0016597222222222222
2024-06-19 15:55:10,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.77 vs. limit=13.3575
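The grad_scale column in the batch lines (0.25, 0.125, 0.0625, then back up) behaves like a mixed-precision loss scale under fp16: halved when gradients overflow, periodically doubled when training is stable. A standard way to get this behaviour is torch.cuda.amp.GradScaler; this is an assumption about the mechanism, shown as a sketch rather than the actual train.py loop:

```python
# fp16 loss scaling: scale the loss before backward, unscale before the
# optimizer step, and let the scaler adjust the scale over time.
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0,      # starting scale
    growth_factor=2.0,   # doubled after growth_interval clean steps
    backoff_factor=0.5,  # halved whenever inf/nan gradients appear
    growth_interval=2000,
)

def training_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # requires a CUDA device
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)               # skipped if grads overflowed
    scaler.update()                      # adjusts grad_scale, as logged
```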
2024-06-19 15:55:11,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=15620.0, ans=0.125
2024-06-19 15:55:22,449 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=34.94 vs. limit=13.37125
2024-06-19 15:55:30,706 INFO [train.py:1028] (0/2) Epoch 1, batch 8550, loss[loss=1.068, simple_loss=0.6519, pruned_loss=0.7423, over 12601.00 frames. ], tot_loss[loss=1.051, simple_loss=0.6716, pruned_loss=0.7148, over 2575437.49 frames. ], batch size: 22, lr: 2.84e-02, grad_scale: 0.125
2024-06-19 15:55:31,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=15675.0, ans=0.125
2024-06-19 15:55:33,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=15675.0, ans=0.025
2024-06-19 15:55:34,354 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=37.02 vs. limit=13.378125
2024-06-19 15:55:44,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=15693.333333333334, ans=0.14306666666666668
2024-06-19 15:55:52,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=13.391874999999999
2024-06-19 15:55:59,815 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.75 vs. limit=12.865
2024-06-19 15:56:01,519 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.252e+03 1.303e+04 1.733e+04 2.383e+04 9.526e+04, threshold=3.465e+04, percent-clipped=13.0
2024-06-19 15:56:06,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=15748.333333333334, ans=0.025
2024-06-19 15:56:06,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=15748.333333333334, ans=13.405625
2024-06-19 15:56:13,364 INFO [train.py:1028] (0/2) Epoch 1, batch 8600, loss[loss=1.013, simple_loss=0.6521, pruned_loss=0.6869, over 13151.00 frames. ], tot_loss[loss=1.053, simple_loss=0.6726, pruned_loss=0.7165, over 2573337.57 frames. ], batch size: 112, lr: 2.84e-02, grad_scale: 0.25
2024-06-19 15:56:20,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=15785.0, ans=0.125
2024-06-19 15:56:20,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=15785.0, ans=0.125
2024-06-19 15:56:22,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=15785.0, ans=0.00743804347826087
2024-06-19 15:56:24,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=13.419374999999999
2024-06-19 15:56:25,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=15785.0, ans=0.347525
2024-06-19 15:56:27,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=15785.0, ans=0.00743804347826087
2024-06-19 15:56:34,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=15803.333333333334, ans=0.3468833333333333
2024-06-19 15:56:41,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=15821.666666666666, ans=0.125
2024-06-19 15:56:44,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=15821.666666666666, ans=0.0
2024-06-19 15:56:47,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15840.0, ans=0.1416
2024-06-19 15:56:48,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=15840.0, ans=0.007426086956521739
2024-06-19 15:56:53,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.50 vs. limit=19.380000000000003
2024-06-19 15:56:59,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=74.38 vs. limit=13.446875
2024-06-19 15:56:59,799 INFO [train.py:1028] (0/2) Epoch 1, batch 8650, loss[loss=0.9868, simple_loss=0.6427, pruned_loss=0.6655, over 13014.00 frames. ], tot_loss[loss=1.057, simple_loss=0.6748, pruned_loss=0.7194, over 2576629.15 frames. ], batch size: 102, lr: 2.83e-02, grad_scale: 0.0625
2024-06-19 15:57:02,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=15858.333333333334, ans=0.125
2024-06-19 15:57:03,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=8.52 vs. limit=10.343333333333334
2024-06-19 15:57:11,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=24.80 vs. limit=13.45375
2024-06-19 15:57:17,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=15895.0, ans=0.0
2024-06-19 15:57:27,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=10.365333333333334
2024-06-19 15:57:32,475 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.893e+03 1.161e+04 1.602e+04 2.552e+04 2.716e+05, threshold=3.204e+04, percent-clipped=15.0
2024-06-19 15:57:41,670 INFO [train.py:1028] (0/2) Epoch 1, batch 8700, loss[loss=1.104, simple_loss=0.7044, pruned_loss=0.7518, over 13129.00 frames. ], tot_loss[loss=1.052, simple_loss=0.6735, pruned_loss=0.7153, over 2572946.75 frames. ], batch size: 59, lr: 2.83e-02, grad_scale: 0.125
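How the three numbers in each loss[...] block relate: throughout this section, loss = 0.5 * simple_loss + pruned_loss, the usual pruned-RNN-T combination with a simple-loss scale of 0.5 (e.g. batch 8700 above: 0.5 * 0.7044 + 0.7518 = 1.1040, matching loss=1.104; the same identity holds for the tot_loss running averages). A one-line check:

```python
# Combination of the k2 pruned transducer losses as reflected in the log;
# the 0.5 corresponds to a simple_loss_scale-style parameter.
def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    return simple_loss_scale * simple_loss + pruned_loss

# batch 8700 from the log:
assert abs(combined_loss(0.7044, 0.7518) - 1.104) < 1e-3
```

loss[...] is the current batch (weighted by its frame count), while tot_loss[...] is a running per-frame average over the last ~2.57M frames.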
2024-06-19 15:57:42,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=15950.0, ans=0.00020833333333333814
2024-06-19 15:57:48,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=29.90 vs. limit=19.47625
2024-06-19 15:57:52,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=15968.333333333334, ans=0.125
2024-06-19 15:57:52,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.49 vs. limit=12.984166666666667
2024-06-19 15:58:01,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.22 vs. limit=13.495000000000001
2024-06-19 15:58:02,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=25.61 vs. limit=13.495000000000001
2024-06-19 15:58:06,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=16005.0, ans=0.0
2024-06-19 15:58:08,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=16005.0, ans=0.125
2024-06-19 15:58:12,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=16005.0, ans=0.125
2024-06-19 15:58:21,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=16023.333333333334, ans=0.025
2024-06-19 15:58:22,699 INFO [train.py:1028] (0/2) Epoch 1, batch 8750, loss[loss=1.001, simple_loss=0.6556, pruned_loss=0.6729, over 13091.00 frames. ], tot_loss[loss=1.052, simple_loss=0.6739, pruned_loss=0.7149, over 2569048.07 frames. ], batch size: 121, lr: 2.82e-02, grad_scale: 0.0625
2024-06-19 15:58:25,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=16041.666666666666, ans=0.13958333333333334
2024-06-19 15:58:28,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=16041.666666666666, ans=0.04949747468305833
2024-06-19 15:58:34,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.01 vs. limit=19.545
2024-06-19 15:58:35,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.64 vs. limit=19.545
2024-06-19 15:58:40,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=21.75 vs. limit=13.529375
2024-06-19 15:59:01,936 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.830e+03 1.054e+04 1.419e+04 1.760e+04 1.152e+05, threshold=2.839e+04, percent-clipped=3.0
2024-06-19 15:59:05,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=16115.0, ans=0.13885
2024-06-19 15:59:08,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=16115.0, ans=0.05
2024-06-19 15:59:12,454 INFO [train.py:1028] (0/2) Epoch 1, batch 8800, loss[loss=1.099, simple_loss=0.7029, pruned_loss=0.7471, over 13213.00 frames. ], tot_loss[loss=1.052, simple_loss=0.6748, pruned_loss=0.7145, over 2573866.19 frames. ], batch size: 72, lr: 2.82e-02, grad_scale: 0.125
2024-06-19 15:59:16,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=16133.333333333334, ans=0.125
2024-06-19 15:59:16,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=16133.333333333334, ans=0.125
2024-06-19 15:59:25,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=16151.666666666666, ans=0.05059708333333335
2024-06-19 15:59:26,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=16151.666666666666, ans=0.0
2024-06-19 15:59:33,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=16170.0, ans=0.125
2024-06-19 15:59:34,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.81 vs. limit=19.627499999999998
2024-06-19 15:59:42,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=28.72 vs. limit=19.64125
2024-06-19 15:59:46,081 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.35 vs. limit=5.42825
2024-06-19 15:59:48,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=16206.666666666666, ans=0.0
2024-06-19 15:59:50,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=16206.666666666666, ans=0.0
2024-06-19 15:59:58,021 INFO [train.py:1028] (0/2) Epoch 1, batch 8850, loss[loss=1.034, simple_loss=0.6896, pruned_loss=0.6891, over 12489.00 frames. ], tot_loss[loss=1.05, simple_loss=0.6739, pruned_loss=0.7128, over 2561980.39 frames. ], batch size: 202, lr: 2.81e-02, grad_scale: 0.03125
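The slowly decaying lr column is consistent with an Eden-style schedule using base_lr=0.035 and lr_batches=7500 (an assumption about the scheduler, but one that reproduces the logged values: the batch factor alone gives about 2.90e-02 at batch 7900 and about 2.75e-02 at batch 9600, and the epoch factor is ~1.0 this early in training). A sketch:

```python
# Eden-style learning rate: two inverse-fourth-root decay factors, one in
# batches and one in epochs. Here the epoch counter is taken to start at 0,
# so epoch_factor == 1.0 inside the first epoch.
def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=3.5):
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(round(eden_lr(0.035, 7900, 0), 4))   # ~0.029, matching lr: 2.90e-02
print(round(eden_lr(0.035, 9600, 0), 4))   # ~0.0275, matching lr: 2.75e-02
```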
2024-06-19 16:00:05,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=16225.0, ans=0.007342391304347827
2024-06-19 16:00:10,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=16243.333333333334, ans=0.125
2024-06-19 16:00:15,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=16261.666666666666, ans=0.0
2024-06-19 16:00:19,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=16261.666666666666, ans=0.08738333333333331
2024-06-19 16:00:22,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=16280.0, ans=0.125
2024-06-19 16:00:27,387 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=8.574e-03
2024-06-19 16:00:28,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=16280.0, ans=0.0
2024-06-19 16:00:29,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=16280.0, ans=0.33020000000000005
2024-06-19 16:00:32,243 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.916e+03 1.185e+04 1.706e+04 3.037e+04 2.228e+05, threshold=3.412e+04, percent-clipped=29.0
2024-06-19 16:00:32,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.23 vs. limit=13.611875000000001
2024-06-19 16:00:34,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=16298.333333333334, ans=0.3295583333333333
2024-06-19 16:00:35,409 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.079e-03
2024-06-19 16:00:37,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=16298.333333333334, ans=0.07
2024-06-19 16:00:39,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=16316.666666666666, ans=0.125
2024-06-19 16:00:40,631 INFO [train.py:1028] (0/2) Epoch 1, batch 8900, loss[loss=1.091, simple_loss=0.6784, pruned_loss=0.7519, over 12823.00 frames. ], tot_loss[loss=1.05, simple_loss=0.6742, pruned_loss=0.7131, over 2560471.69 frames. ], batch size: 33, lr: 2.81e-02, grad_scale: 0.0625
2024-06-19 16:01:00,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.62 vs. limit=10.541333333333334
2024-06-19 16:01:12,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=16371.666666666666, ans=0.3269916666666667
2024-06-19 16:01:12,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.20 vs. limit=10.548666666666666
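The WithLoss lines report small auxiliary penalties attached to the attention weights of individual layers, with loss-sum giving the recently accumulated value. A minimal sketch of the pattern, assuming the penalty is recorded during the forward pass and added to the main loss by the trainer; the actual penalty in scaling.py may be different:

```python
# Pass attention weights through unchanged but record an auxiliary loss.
import torch
import torch.nn as nn

class AttnWeightsWithLoss(nn.Module):
    def __init__(self, weight=1e-4):
        super().__init__()
        self.weight = weight
        self.aux_loss = None

    def forward(self, attn_weights):
        # Illustrative penalty (an assumption): discourage attention rows
        # from collapsing onto a single key by penalizing low entropy.
        p = attn_weights.clamp(min=1e-8)
        entropy = -(p * p.log()).sum(dim=-1).mean()
        self.aux_loss = self.weight * (-entropy)
        return attn_weights

# trainer side: total_loss = main_loss + sum of m.aux_loss over such modules
```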
2024-06-19 16:01:15,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16390.0, ans=0.1361
2024-06-19 16:01:15,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=16390.0, ans=0.125
2024-06-19 16:01:19,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=16390.0, ans=0.007306521739130435
2024-06-19 16:01:27,471 INFO [train.py:1028] (0/2) Epoch 1, batch 8950, loss[loss=1.052, simple_loss=0.6946, pruned_loss=0.7051, over 12593.00 frames. ], tot_loss[loss=1.059, simple_loss=0.6778, pruned_loss=0.7205, over 2561245.89 frames. ], batch size: 202, lr: 2.80e-02, grad_scale: 0.0625
2024-06-19 16:01:30,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=21.09 vs. limit=13.653125
2024-06-19 16:01:31,468 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=21.67 vs. limit=13.653125
2024-06-19 16:01:47,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16445.0, ans=0.13555
2024-06-19 16:01:53,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=16463.333333333332, ans=0.125
2024-06-19 16:01:54,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=10.585333333333333
2024-06-19 16:01:55,071 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=46.34 vs. limit=13.673749999999998
2024-06-19 16:02:03,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=16463.333333333332, ans=0.1353666666666667
2024-06-19 16:02:05,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.47 vs. limit=19.847499999999997
2024-06-19 16:02:06,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=29.58 vs. limit=13.673749999999998
2024-06-19 16:02:06,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.15 vs. limit=19.847499999999997
2024-06-19 16:02:09,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.56 vs. limit=19.861250000000002
2024-06-19 16:02:10,232 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.428e+03 6.767e+03 9.739e+03 1.176e+04 2.110e+04, threshold=1.948e+04, percent-clipped=0.0
2024-06-19 16:02:14,431 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.33 vs. limit=5.47225
2024-06-19 16:02:18,140 INFO [train.py:1028] (0/2) Epoch 1, batch 9000, loss[loss=1.047, simple_loss=0.659, pruned_loss=0.7177, over 13330.00 frames. ], tot_loss[loss=1.061, simple_loss=0.6776, pruned_loss=0.7224, over 2567989.30 frames. ], batch size: 46, lr: 2.80e-02, grad_scale: 0.125
2024-06-19 16:02:18,143 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-19 16:02:26,803 INFO [train.py:1060] (0/2) Epoch 1, validation: loss=0.9889, simple_loss=0.6323, pruned_loss=0.6727, over 351949.00 frames.
2024-06-19 16:02:26,804 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB
2024-06-19 16:02:30,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=32.36 vs. limit=13.6875
2024-06-19 16:02:32,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=16500.0, ans=0.135
2024-06-19 16:02:44,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=16536.666666666668, ans=0.07
2024-06-19 16:02:52,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=16555.0, ans=0.125
2024-06-19 16:03:00,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=26.71 vs. limit=13.715
2024-06-19 16:03:01,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=30.52 vs. limit=19.93
2024-06-19 16:03:03,489 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.258e+01
2024-06-19 16:03:04,186 INFO [train.py:1028] (0/2) Epoch 1, batch 9050, loss[loss=1.09, simple_loss=0.665, pruned_loss=0.7572, over 11414.00 frames. ], tot_loss[loss=1.062, simple_loss=0.6775, pruned_loss=0.7231, over 2567450.60 frames. ], batch size: 17, lr: 2.80e-02, grad_scale: 0.125
2024-06-19 16:03:05,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=16591.666666666668, ans=0.31929166666666675
2024-06-19 16:03:13,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=16610.0, ans=0.0
2024-06-19 16:03:14,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=16610.0, ans=0.125
2024-06-19 16:03:17,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=16610.0, ans=0.125
2024-06-19 16:03:21,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=16628.333333333332, ans=0.125
2024-06-19 16:03:26,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=69.95 vs. limit=13.735624999999999
2024-06-19 16:03:30,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=16646.666666666668, ans=0.025
2024-06-19 16:03:34,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.84 vs. limit=19.985
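Validation is interleaved with training at a fixed batch interval (it lands on batch 9000 here, consistent with a valid_interval of 3000 batches) and reports a single per-frame loss over the whole dev set (351949 frames). A sketch of that loop, with hypothetical helper names (loss_fn returning a frame count is an assumption):

```python
# Run the model over the dev loader without gradients and return the
# frame-weighted average loss, as in the "validation: loss=..." line.
import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, loss_fn):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        loss, num_frames = loss_fn(model, batch)  # loss summed over frames
        tot_loss += float(loss)
        tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames

def maybe_validate(model, valid_loader, loss_fn, batch_idx,
                   valid_interval=3000):
    if batch_idx % valid_interval == 0:
        return compute_validation_loss(model, valid_loader, loss_fn)
```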
2024-06-19 16:03:36,083 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.640e+03 6.648e+03 9.842e+03 1.562e+04 7.535e+04, threshold=1.968e+04, percent-clipped=14.0
2024-06-19 16:03:40,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=16665.0, ans=0.007246739130434783
2024-06-19 16:03:42,674 INFO [train.py:1028] (0/2) Epoch 1, batch 9100, loss[loss=1.124, simple_loss=0.7094, pruned_loss=0.7693, over 13037.00 frames. ], tot_loss[loss=1.065, simple_loss=0.6768, pruned_loss=0.7265, over 2566923.36 frames. ], batch size: 71, lr: 2.79e-02, grad_scale: 0.25
2024-06-19 16:03:42,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=16683.333333333332, ans=0.125
2024-06-19 16:03:53,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.94 vs. limit=13.763125
2024-06-19 16:03:58,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=16720.0, ans=0.31479999999999997
2024-06-19 16:03:58,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=69.81 vs. limit=13.77
2024-06-19 16:04:03,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.31 vs. limit=13.77
2024-06-19 16:04:17,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=16756.666666666668, ans=0.0
2024-06-19 16:04:19,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=5.5135000000000005
2024-06-19 16:04:20,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=16756.666666666668, ans=0.125
2024-06-19 16:04:20,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16756.666666666668, ans=0.13243333333333332
2024-06-19 16:04:21,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=16756.666666666668, ans=0.007226811594202898
2024-06-19 16:04:22,741 INFO [train.py:1028] (0/2) Epoch 1, batch 9150, loss[loss=1.049, simple_loss=0.6595, pruned_loss=0.719, over 13148.00 frames. ], tot_loss[loss=1.066, simple_loss=0.6773, pruned_loss=0.7278, over 2568802.13 frames. ], batch size: 77, lr: 2.79e-02, grad_scale: 0.0625
2024-06-19 16:04:25,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.45 vs. limit=9.19375
2024-06-19 16:04:29,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=29.37 vs. limit=13.790625
2024-06-19 16:04:31,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.80 vs. limit=13.396666666666667
2024-06-19 16:04:31,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=55.83 vs. limit=20.095
2024-06-19 16:04:39,503 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=35.48 vs. limit=20.10875
2024-06-19 16:04:40,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=16811.666666666668, ans=0.0
2024-06-19 16:04:43,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16811.666666666668, ans=0.13188333333333332
2024-06-19 16:04:46,162 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.14 vs. limit=20.10875
2024-06-19 16:04:50,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16830.0, ans=0.1317
2024-06-19 16:04:53,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=16830.0, ans=0.0
2024-06-19 16:04:56,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=16848.333333333332, ans=0.125
2024-06-19 16:04:59,132 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.461e+03 1.191e+04 1.599e+04 3.022e+04 1.422e+05, threshold=3.198e+04, percent-clipped=43.0
2024-06-19 16:05:04,978 INFO [train.py:1028] (0/2) Epoch 1, batch 9200, loss[loss=1.009, simple_loss=0.6225, pruned_loss=0.6978, over 12923.00 frames. ], tot_loss[loss=1.069, simple_loss=0.6776, pruned_loss=0.7304, over 2572885.38 frames. ], batch size: 36, lr: 2.78e-02, grad_scale: 0.125
2024-06-19 16:05:08,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.81 vs. limit=9.216666666666667
2024-06-19 16:05:17,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=16885.0, ans=20.16375
2024-06-19 16:05:19,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=16903.333333333332, ans=0.125
2024-06-19 16:05:24,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=16903.333333333332, ans=0.0
2024-06-19 16:05:38,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=13.845625000000002
2024-06-19 16:05:39,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=16921.666666666668, ans=0.025
2024-06-19 16:05:40,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=16940.0, ans=0.125
2024-06-19 16:05:46,822 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.13 vs. limit=13.8525
2024-06-19 16:05:49,382 INFO [train.py:1028] (0/2) Epoch 1, batch 9250, loss[loss=1.082, simple_loss=0.6863, pruned_loss=0.739, over 13208.00 frames. ], tot_loss[loss=1.071, simple_loss=0.6776, pruned_loss=0.7326, over 2576368.81 frames. ], batch size: 67, lr: 2.78e-02, grad_scale: 0.125
2024-06-19 16:05:59,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=16976.666666666668, ans=0.0071789855072463766
2024-06-19 16:06:28,629 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.331e+03 7.856e+03 9.272e+03 1.375e+04 4.263e+04, threshold=1.854e+04, percent-clipped=4.0
2024-06-19 16:06:33,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=17031.666666666668, ans=0.30389166666666667
2024-06-19 16:06:34,419 INFO [train.py:1028] (0/2) Epoch 1, batch 9300, loss[loss=1.066, simple_loss=0.6626, pruned_loss=0.7348, over 12873.00 frames. ], tot_loss[loss=1.072, simple_loss=0.6771, pruned_loss=0.7338, over 2573702.64 frames. ], batch size: 39, lr: 2.77e-02, grad_scale: 0.25
2024-06-19 16:06:36,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=17050.0, ans=20.2875
2024-06-19 16:06:42,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17068.333333333332, ans=0.0
2024-06-19 16:06:44,000 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. limit=5.56025
2024-06-19 16:06:45,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=17068.333333333332, ans=0.0
2024-06-19 16:06:50,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17086.666666666668, ans=0.125
2024-06-19 16:06:50,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.11 vs. limit=13.9075
2024-06-19 16:06:55,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=17086.666666666668, ans=13.9075
2024-06-19 16:06:55,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.92 vs. limit=13.9075
2024-06-19 16:06:56,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.71 vs. limit=10.834666666666667
2024-06-19 16:06:57,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=26.09 vs. limit=13.9075
2024-06-19 16:06:59,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=21.79 vs. limit=13.914375
2024-06-19 16:07:08,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=17123.333333333332, ans=0.0
2024-06-19 16:07:09,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=17123.333333333332, ans=0.125
2024-06-19 16:07:12,910 INFO [train.py:1028] (0/2) Epoch 1, batch 9350, loss[loss=1.161, simple_loss=0.7091, pruned_loss=0.8062, over 12528.00 frames. ], tot_loss[loss=1.071, simple_loss=0.6764, pruned_loss=0.7325, over 2571495.40 frames. ], batch size: 22, lr: 2.77e-02, grad_scale: 0.25
], tot_loss[loss=1.071, simple_loss=0.6764, pruned_loss=0.7325, over 2571495.40 frames. ], batch size: 22, lr: 2.77e-02, grad_scale: 0.25 2024-06-19 16:07:18,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.05 vs. limit=13.928125000000001 2024-06-19 16:07:21,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=17160.0, ans=0.125 2024-06-19 16:07:21,896 WARNING [optim.py:503] (0/2) Scaling gradients by 0.08550368994474411, model_norm_threshold=18544.259765625 2024-06-19 16:07:22,062 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.4.weight with proportion 0.38, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.794e+10, grad_sumsq=5.411e+09, orig_rms_sq=3.316e+00 2024-06-19 16:07:22,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.09 vs. limit=13.58 2024-06-19 16:07:25,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=40.57 vs. limit=13.934999999999999 2024-06-19 16:07:26,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=17178.333333333332, ans=0.125 2024-06-19 16:07:26,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=13.941875 2024-06-19 16:07:28,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=28.12 vs. limit=20.38375 2024-06-19 16:07:30,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17178.333333333332, ans=0.12821666666666667 2024-06-19 16:07:34,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=17196.666666666668, ans=0.125 2024-06-19 16:07:36,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=17196.666666666668, ans=0.007131159420289855 2024-06-19 16:07:45,653 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.189e+03 1.189e+04 1.502e+04 2.131e+04 2.169e+05, threshold=3.004e+04, percent-clipped=36.0 2024-06-19 16:07:49,844 INFO [train.py:1028] (0/2) Epoch 1, batch 9400, loss[loss=1.083, simple_loss=0.6772, pruned_loss=0.7445, over 13265.00 frames. ], tot_loss[loss=1.069, simple_loss=0.6759, pruned_loss=0.7306, over 2570445.18 frames. ], batch size: 52, lr: 2.76e-02, grad_scale: 0.125 2024-06-19 16:07:51,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.37 vs. limit=20.424999999999997 2024-06-19 16:07:56,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.09 vs. 
2024-06-19 16:08:02,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=17251.666666666668, ans=0.0
2024-06-19 16:08:03,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=17251.666666666668, ans=0.125
2024-06-19 16:08:11,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=17270.0, ans=0.0
2024-06-19 16:08:13,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=17288.333333333332, ans=0.12711666666666668
2024-06-19 16:08:16,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.93 vs. limit=20.46625
2024-06-19 16:08:29,389 INFO [train.py:1028] (0/2) Epoch 1, batch 9450, loss[loss=1.055, simple_loss=0.6492, pruned_loss=0.7307, over 12486.00 frames. ], tot_loss[loss=1.068, simple_loss=0.6767, pruned_loss=0.7294, over 2570237.05 frames. ], batch size: 22, lr: 2.76e-02, grad_scale: 0.125
2024-06-19 16:08:30,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.54 vs. limit=13.996875
2024-06-19 16:08:31,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=18.43 vs. limit=13.996875
2024-06-19 16:08:34,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17325.0, ans=0.12675
2024-06-19 16:08:35,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=17343.333333333332, ans=0.0
2024-06-19 16:08:51,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=17380.0, ans=0.0
2024-06-19 16:08:52,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=17380.0, ans=0.125
2024-06-19 16:08:52,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=98.62 vs. limit=14.0175
2024-06-19 16:08:54,127 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=3.422e-02
2024-06-19 16:09:01,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=28.68 vs. limit=14.024375
2024-06-19 16:09:03,343 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.630e+03 1.501e+04 1.785e+04 2.284e+04 7.471e+04, threshold=3.570e+04, percent-clipped=8.0
2024-06-19 16:09:10,763 INFO [train.py:1028] (0/2) Epoch 1, batch 9500, loss[loss=1.067, simple_loss=0.6622, pruned_loss=0.7358, over 13271.00 frames. ], tot_loss[loss=1.067, simple_loss=0.6765, pruned_loss=0.7288, over 2579415.98 frames. ], batch size: 43, lr: 2.76e-02, grad_scale: 0.125
2024-06-19 16:09:13,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=17416.666666666668, ans=10.966666666666667
2024-06-19 16:09:16,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=33.85 vs. limit=14.03125
2024-06-19 16:09:18,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.96 vs. limit=14.038125
2024-06-19 16:09:28,816 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=47.59 vs. limit=20.589999999999996
2024-06-19 16:09:32,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=51.98 vs. limit=14.051875
2024-06-19 16:09:32,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=17471.666666666668, ans=0.125
2024-06-19 16:09:34,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=17471.666666666668, ans=0.125
2024-06-19 16:09:38,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=17471.666666666668, ans=0.125
2024-06-19 16:09:42,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=275.39 vs. limit=14.05875
2024-06-19 16:09:46,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=17490.0, ans=0.125
2024-06-19 16:09:48,218 INFO [train.py:1028] (0/2) Epoch 1, batch 9550, loss[loss=1.053, simple_loss=0.6513, pruned_loss=0.7271, over 13187.00 frames. ], tot_loss[loss=1.067, simple_loss=0.6764, pruned_loss=0.7288, over 2573947.86 frames. ], batch size: 40, lr: 2.75e-02, grad_scale: 0.0625
2024-06-19 16:10:02,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=82.40 vs. limit=14.0725
2024-06-19 16:10:04,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=28.41 vs. limit=13.763333333333334
2024-06-19 16:10:05,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=17526.666666666668, ans=0.125
2024-06-19 16:10:12,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.47 vs. limit=20.658749999999998
2024-06-19 16:10:13,434 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=94.43 vs. limit=14.079374999999999
2024-06-19 16:10:16,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=17563.333333333332, ans=0.0
2024-06-19 16:10:22,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=17563.333333333332, ans=0.125
2024-06-19 16:10:26,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17581.666666666668, ans=0.12418333333333331
2024-06-19 16:10:27,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17581.666666666668, ans=0.12418333333333331
2024-06-19 16:10:28,881 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.303e+03 1.002e+04 1.452e+04 2.832e+04 1.209e+05, threshold=2.904e+04, percent-clipped=14.0
2024-06-19 16:10:30,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=14.093125
2024-06-19 16:10:31,187 INFO [train.py:1028] (0/2) Epoch 1, batch 9600, loss[loss=1.025, simple_loss=0.6631, pruned_loss=0.6937, over 10481.00 frames. ], tot_loss[loss=1.065, simple_loss=0.6749, pruned_loss=0.7271, over 2573080.53 frames. ], batch size: 303, lr: 2.75e-02, grad_scale: 0.125
2024-06-19 16:10:34,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=14.1
2024-06-19 16:10:42,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=32.11 vs. limit=14.106874999999999
2024-06-19 16:10:46,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=17636.666666666668, ans=0.125
2024-06-19 16:10:51,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17636.666666666668, ans=0.125
2024-06-19 16:10:55,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=14.120625
2024-06-19 16:10:59,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.38 vs. limit=5.64825
2024-06-19 16:11:02,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.18 vs. limit=20.755000000000003
2024-06-19 16:11:04,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=17673.333333333332, ans=0.0
2024-06-19 16:11:10,019 INFO [train.py:1028] (0/2) Epoch 1, batch 9650, loss[loss=1.031, simple_loss=0.6612, pruned_loss=0.7002, over 13030.00 frames. ], tot_loss[loss=1.064, simple_loss=0.6746, pruned_loss=0.7263, over 2562712.74 frames. ], batch size: 132, lr: 2.74e-02, grad_scale: 0.125
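Why "batch size" swings between roughly 17 and 304 in this section: batches are packed from duration-sorted buckets up to a fixed total audio duration, so buckets of short cuts yield many cuts per batch and buckets of long cuts yield few (batch 9600 above packs 303 cuts into only 10481 frames). A toy sketch of duration-bucketed packing; this is not the lhotse DynamicBucketingSampler, just the idea:

```python
# Pack utterances into batches of at most max_duration total seconds,
# bucketing by duration first so batch sizes vary widely.
import random

def pack_batches(durations, max_duration=550.0, num_buckets=30):
    durations = sorted(durations)
    bucket_size = max(1, len(durations) // num_buckets)
    buckets = [durations[i:i + bucket_size]
               for i in range(0, len(durations), bucket_size)]
    batches = []
    for bucket in buckets:
        random.shuffle(bucket)
        cur, cur_dur = [], 0.0
        for d in bucket:
            if cur and cur_dur + d > max_duration:
                batches.append(cur)
                cur, cur_dur = [], 0.0
            cur.append(d)
            cur_dur += d
        if cur:
            batches.append(cur)
    random.shuffle(batches)
    return batches

sizes = [len(b) for b in
         pack_batches([random.uniform(1, 30) for _ in range(10000)])]
print(min(sizes), max(sizes))   # wide spread, like the logged batch sizes
```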
], batch size: 132, lr: 2.74e-02, grad_scale: 0.125 2024-06-19 16:11:15,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17691.666666666668, ans=0.0 2024-06-19 16:11:16,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=17691.666666666668, ans=0.125 2024-06-19 16:11:19,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=17710.0, ans=0.12290000000000001 2024-06-19 16:11:24,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.10 vs. limit=20.7825 2024-06-19 16:11:25,974 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=97.24 vs. limit=13.864166666666666 2024-06-19 16:11:26,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=17728.333333333332, ans=0.125 2024-06-19 16:11:39,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.76 vs. limit=14.155000000000001 2024-06-19 16:11:40,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17765.0, ans=0.125 2024-06-19 16:11:41,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.88 vs. limit=20.82375 2024-06-19 16:11:43,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=17765.0, ans=0.125 2024-06-19 16:11:45,774 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.317e+03 5.779e+03 6.944e+03 9.194e+03 6.573e+04, threshold=1.389e+04, percent-clipped=2.0 2024-06-19 16:11:47,780 INFO [train.py:1028] (0/2) Epoch 1, batch 9700, loss[loss=1.006, simple_loss=0.6451, pruned_loss=0.6838, over 12992.00 frames. ], tot_loss[loss=1.062, simple_loss=0.6734, pruned_loss=0.725, over 2556422.62 frames. ], batch size: 144, lr: 2.74e-02, grad_scale: 0.25 2024-06-19 16:11:52,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.73 vs. 
limit=14.16875 2024-06-19 16:11:52,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=17783.333333333332, ans=0.0 2024-06-19 16:11:58,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=17801.666666666668, ans=0.125 2024-06-19 16:12:06,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=17820.0, ans=0.09899494936611666 2024-06-19 16:12:19,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=17856.666666666668, ans=0.006987681159420289 2024-06-19 16:12:24,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=17856.666666666668, ans=0.125 2024-06-19 16:12:28,666 INFO [train.py:1028] (0/2) Epoch 1, batch 9750, loss[loss=1.014, simple_loss=0.6465, pruned_loss=0.6904, over 13125.00 frames. ], tot_loss[loss=1.061, simple_loss=0.6725, pruned_loss=0.7244, over 2552806.47 frames. ], batch size: 132, lr: 2.73e-02, grad_scale: 0.25 2024-06-19 16:12:38,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=14.21 2024-06-19 16:13:02,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=17948.333333333332, ans=0.27180833333333343 2024-06-19 16:13:05,880 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.560e+03 7.216e+03 9.938e+03 1.489e+04 7.523e+04, threshold=1.988e+04, percent-clipped=30.0 2024-06-19 16:13:07,525 INFO [train.py:1028] (0/2) Epoch 1, batch 9800, loss[loss=1.067, simple_loss=0.6764, pruned_loss=0.7287, over 12966.00 frames. ], tot_loss[loss=1.061, simple_loss=0.6723, pruned_loss=0.7251, over 2545696.89 frames. ], batch size: 39, lr: 2.73e-02, grad_scale: 0.25 2024-06-19 16:13:07,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=9.09 vs. limit=5.695 2024-06-19 16:13:08,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=17966.666666666668, ans=0.006963768115942029 2024-06-19 16:13:21,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=22.78 vs. limit=14.244375 2024-06-19 16:13:22,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=21.58 vs. limit=14.244375 2024-06-19 16:13:32,953 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.44 vs. 
limit=14.258125 2024-06-19 16:13:36,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18021.666666666668, ans=0.11978333333333332 2024-06-19 16:13:38,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=18040.0, ans=0.125 2024-06-19 16:13:40,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18040.0, ans=0.0 2024-06-19 16:13:42,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.32 vs. limit=14.265 2024-06-19 16:13:45,290 INFO [train.py:1028] (0/2) Epoch 1, batch 9850, loss[loss=0.9898, simple_loss=0.6345, pruned_loss=0.6725, over 13059.00 frames. ], tot_loss[loss=1.059, simple_loss=0.6716, pruned_loss=0.723, over 2538964.48 frames. ], batch size: 102, lr: 2.72e-02, grad_scale: 0.25 2024-06-19 16:13:57,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=18076.666666666668, ans=0.0 2024-06-19 16:14:06,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=18095.0, ans=0.266675 2024-06-19 16:14:21,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=18131.666666666668, ans=21.098750000000003 2024-06-19 16:14:25,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.54 vs. limit=9.532916666666667 2024-06-19 16:14:26,474 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.144e+03 8.179e+03 1.169e+04 1.634e+04 5.486e+04, threshold=2.339e+04, percent-clipped=12.0 2024-06-19 16:14:27,351 INFO [train.py:1028] (0/2) Epoch 1, batch 9900, loss[loss=1.032, simple_loss=0.641, pruned_loss=0.7113, over 12906.00 frames. ], tot_loss[loss=1.051, simple_loss=0.6688, pruned_loss=0.7166, over 2531383.25 frames. ], batch size: 39, lr: 2.72e-02, grad_scale: 0.25 2024-06-19 16:14:28,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=18150.0, ans=0.125 2024-06-19 16:14:28,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.99 vs. limit=14.30625 2024-06-19 16:14:32,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=18150.0, ans=0.006923913043478261 2024-06-19 16:14:35,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=17.38 vs. limit=14.313125 2024-06-19 16:14:44,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=18186.666666666668, ans=0.006915942028985507 2024-06-19 16:14:45,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.26 vs. 
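
Note: in the optim.py clipping warnings, the five numbers read naturally as min/25%/median/75%/max of recent per-batch gradient norms, and each logged threshold is consistent with Clipping_scale times the median (here 2.339e+04 vs. 2.0 x 1.169e+04). A sketch of that bookkeeping under those assumptions:

import torch

def clip_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Quartiles of recent per-batch gradient norms, a threshold derived from
    # the median, and the share of batches whose norm exceeded it.
    qs = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * qs[2]
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return qs, float(threshold), float(percent_clipped)

norms = torch.tensor([3.1e3, 8.2e3, 1.17e4, 1.63e4, 5.49e4])  # shaped like the log
print(clip_stats(norms))
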
limit=14.32 2024-06-19 16:14:51,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=18205.0, ans=0.0 2024-06-19 16:14:55,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=18223.333333333332, ans=9.555833333333332 2024-06-19 16:15:03,401 INFO [train.py:1028] (0/2) Epoch 1, batch 9950, loss[loss=1.114, simple_loss=0.6817, pruned_loss=0.7729, over 12646.00 frames. ], tot_loss[loss=1.043, simple_loss=0.6651, pruned_loss=0.7103, over 2526020.09 frames. ], batch size: 29, lr: 2.72e-02, grad_scale: 0.25 2024-06-19 16:15:03,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=18241.666666666668, ans=0.125 2024-06-19 16:15:05,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=18241.666666666668, ans=0.025 2024-06-19 16:15:20,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=18278.333333333332, ans=0.125 2024-06-19 16:15:20,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18278.333333333332, ans=0.11721666666666669 2024-06-19 16:15:21,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=25.40 vs. limit=14.139166666666666 2024-06-19 16:15:26,994 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=9.574166666666667 2024-06-19 16:15:27,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=89.62 vs. limit=14.36125 2024-06-19 16:15:38,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=18315.0, ans=21.23625 2024-06-19 16:15:38,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.51 vs. limit=14.368125 2024-06-19 16:15:39,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=5.74725 2024-06-19 16:15:40,489 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.487e+03 9.253e+03 1.124e+04 1.542e+04 5.730e+04, threshold=2.249e+04, percent-clipped=7.0 2024-06-19 16:15:41,583 INFO [train.py:1028] (0/2) Epoch 1, batch 10000, loss[loss=1.159, simple_loss=0.7146, pruned_loss=0.8014, over 12705.00 frames. ], tot_loss[loss=1.045, simple_loss=0.6665, pruned_loss=0.712, over 2487145.98 frames. ], batch size: 22, lr: 2.71e-02, grad_scale: 0.5 2024-06-19 16:15:43,108 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=2.660e+03 2024-06-19 16:15:43,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.69 vs. 
limit=21.25 2024-06-19 16:15:49,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=18351.666666666668, ans=0.0 2024-06-19 16:15:49,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.42 vs. limit=14.381875 2024-06-19 16:15:49,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=18351.666666666668, ans=0.125 2024-06-19 16:15:54,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=27.97 vs. limit=14.381875 2024-06-19 16:15:55,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=18351.666666666668, ans=0.125 2024-06-19 16:16:00,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.67 vs. limit=14.38875 2024-06-19 16:16:06,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=18388.333333333332, ans=0.125 2024-06-19 16:16:12,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=18406.666666666668, ans=10.0 2024-06-19 16:16:12,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=18406.666666666668, ans=0.125 2024-06-19 16:16:12,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.49 vs. limit=5.761 2024-06-19 16:16:13,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.24 vs. limit=14.4025 2024-06-19 16:16:13,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=18406.666666666668, ans=21.305 2024-06-19 16:16:14,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=18406.666666666668, ans=14.4025 2024-06-19 16:16:17,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=34.41 vs. limit=21.305 2024-06-19 16:16:18,747 INFO [train.py:1028] (0/2) Epoch 1, batch 10050, loss[loss=1.11, simple_loss=0.6856, pruned_loss=0.7667, over 12377.00 frames. ], tot_loss[loss=1.041, simple_loss=0.665, pruned_loss=0.7086, over 2444598.41 frames. ], batch size: 22, lr: 2.71e-02, grad_scale: 0.125 2024-06-19 16:16:25,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=18443.333333333332, ans=0.2544833333333334 2024-06-19 16:16:27,472 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.029e-01 2024-06-19 16:16:54,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=42.26 vs. 
limit=21.37375 2024-06-19 16:16:56,354 INFO [train.py:1028] (0/2) Epoch 1, batch 10100, loss[loss=0.9407, simple_loss=0.5748, pruned_loss=0.6533, over 11097.00 frames. ], tot_loss[loss=1.042, simple_loss=0.663, pruned_loss=0.711, over 2425140.86 frames. ], batch size: 16, lr: 2.70e-02, grad_scale: 0.25 2024-06-19 16:16:56,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.86 vs. limit=9.629166666666666 2024-06-19 16:16:56,968 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.105e+03 8.445e+03 1.116e+04 1.531e+04 6.312e+04, threshold=2.232e+04, percent-clipped=15.0 2024-06-19 16:17:00,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.33 vs. limit=21.3875 2024-06-19 16:17:02,778 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.046e-01 2024-06-19 16:17:06,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.67 vs. limit=14.450624999999999 2024-06-19 16:17:07,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=18535.0, ans=0.251275 2024-06-19 16:17:11,943 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-1.pt 2024-06-19 16:19:49,378 INFO [train.py:1028] (0/2) Epoch 2, batch 0, loss[loss=1.015, simple_loss=0.6366, pruned_loss=0.6969, over 12946.00 frames. ], tot_loss[loss=1.015, simple_loss=0.6366, pruned_loss=0.6969, over 12946.00 frames. ], batch size: 36, lr: 2.65e-02, grad_scale: 0.5 2024-06-19 16:19:49,379 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 16:19:56,765 INFO [train.py:1060] (0/2) Epoch 2, validation: loss=1.017, simple_loss=0.6453, pruned_loss=0.694, over 351949.00 frames. 2024-06-19 16:19:56,765 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB 2024-06-19 16:20:00,250 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.675e-01 2024-06-19 16:20:00,449 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=66.48 vs. limit=21.41225 2024-06-19 16:20:12,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=14.463000000000001 2024-06-19 16:20:14,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=18586.333333333332, ans=0.125 2024-06-19 16:20:14,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=18586.333333333332, ans=0.24947833333333336 2024-06-19 16:20:15,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=30.82 vs. limit=21.43975 2024-06-19 16:20:15,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=18586.333333333332, ans=0.0 2024-06-19 16:20:23,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.08 vs. 
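
Note: at the epoch boundary the trainer writes zipformer/exp/epoch-1.pt (the roughly two-and-a-half-minute gap in the timestamps is the save) and then recomputes validation loss at the start of epoch 2. A hedged sketch of inspecting such a checkpoint offline; the dictionary keys are assumptions, not a confirmed icefall layout.

import torch

ckpt = torch.load("zipformer/exp/epoch-1.pt", map_location="cpu")
print(sorted(ckpt.keys()))                 # model / optimizer / sampler state, etc.
state = ckpt.get("model", ckpt)            # fall back if the file is a bare state_dict
n_params = sum(v.numel() for v in state.values() if torch.is_tensor(v))
print(f"tensors hold {n_params} parameters")
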
limit=9.651166666666668 2024-06-19 16:20:39,844 INFO [train.py:1028] (0/2) Epoch 2, batch 50, loss[loss=1.076, simple_loss=0.6614, pruned_loss=0.7457, over 12634.00 frames. ], tot_loss[loss=0.9897, simple_loss=0.6346, pruned_loss=0.6724, over 574466.03 frames. ], batch size: 29, lr: 2.64e-02, grad_scale: 0.125 2024-06-19 16:20:41,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=79.40 vs. limit=14.4905 2024-06-19 16:20:47,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. limit=5.79895 2024-06-19 16:20:49,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.06 vs. limit=21.49475 2024-06-19 16:20:52,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=18659.666666666668, ans=0.07 2024-06-19 16:20:59,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=18.35 vs. limit=14.339 2024-06-19 16:21:01,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=18678.0, ans=0.006809130434782609 2024-06-19 16:21:07,022 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.452e+03 7.112e+03 9.601e+03 1.429e+04 7.351e+04, threshold=1.920e+04, percent-clipped=7.0 2024-06-19 16:21:21,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=18714.666666666668, ans=0.0 2024-06-19 16:21:22,939 INFO [train.py:1028] (0/2) Epoch 2, batch 100, loss[loss=1.082, simple_loss=0.6825, pruned_loss=0.7409, over 13346.00 frames. ], tot_loss[loss=0.9754, simple_loss=0.6272, pruned_loss=0.6618, over 1018066.16 frames. ], batch size: 46, lr: 2.64e-02, grad_scale: 0.25 2024-06-19 16:21:25,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=18733.0, ans=0.125 2024-06-19 16:21:26,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=18733.0, ans=0.006797173913043478 2024-06-19 16:21:29,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=18751.333333333332, ans=0.125 2024-06-19 16:21:36,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.54 vs. limit=14.538625 2024-06-19 16:21:40,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=18769.666666666668, ans=0.125 2024-06-19 16:21:42,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=11.507866666666668 2024-06-19 16:21:43,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=18788.0, ans=0.125 2024-06-19 16:21:43,789 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. 
limit=11.5152 2024-06-19 16:21:45,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=18788.0, ans=0.025 2024-06-19 16:21:46,817 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.40 vs. limit=21.591 2024-06-19 16:21:53,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.18 vs. limit=11.522533333333332 2024-06-19 16:21:56,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=18806.333333333332, ans=0.006781231884057971 2024-06-19 16:21:58,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.29 vs. limit=14.552375 2024-06-19 16:21:59,883 INFO [train.py:1028] (0/2) Epoch 2, batch 150, loss[loss=0.9859, simple_loss=0.6151, pruned_loss=0.6784, over 12595.00 frames. ], tot_loss[loss=0.9779, simple_loss=0.6264, pruned_loss=0.6647, over 1366062.61 frames. ], batch size: 29, lr: 2.64e-02, grad_scale: 0.25 2024-06-19 16:21:59,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=18824.666666666668, ans=0.025871833333333344 2024-06-19 16:22:01,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=68.14 vs. limit=14.559249999999999 2024-06-19 16:22:01,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18824.666666666668, ans=0.11175333333333332 2024-06-19 16:22:10,247 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=33.13 vs. limit=21.63225 2024-06-19 16:22:13,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=18843.0, ans=0.125 2024-06-19 16:22:17,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=18861.333333333332, ans=0.0067692753623188415 2024-06-19 16:22:19,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.98 vs. limit=14.573 2024-06-19 16:22:20,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.78 vs. limit=14.430666666666665 2024-06-19 16:22:24,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=29.86 vs. limit=14.579875000000001 2024-06-19 16:22:27,831 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.659e+03 6.918e+03 9.642e+03 1.244e+04 3.866e+04, threshold=1.928e+04, percent-clipped=10.0 2024-06-19 16:22:28,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=33.01 vs. limit=21.659750000000003 2024-06-19 16:22:36,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=28.01 vs. 
limit=14.58675 2024-06-19 16:22:38,042 INFO [train.py:1028] (0/2) Epoch 2, batch 200, loss[loss=0.9467, simple_loss=0.6302, pruned_loss=0.6316, over 12534.00 frames. ], tot_loss[loss=0.9811, simple_loss=0.6273, pruned_loss=0.6674, over 1636017.25 frames. ], batch size: 202, lr: 2.63e-02, grad_scale: 0.25 2024-06-19 16:22:42,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=18916.333333333332, ans=0.1108366666666667 2024-06-19 16:22:50,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=5.840199999999999 2024-06-19 16:23:03,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=18971.333333333332, ans=0.125 2024-06-19 16:23:07,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.80 vs. limit=14.621125 2024-06-19 16:23:08,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=14.621125 2024-06-19 16:23:15,324 INFO [train.py:1028] (0/2) Epoch 2, batch 250, loss[loss=0.8684, simple_loss=0.5638, pruned_loss=0.5865, over 13002.00 frames. ], tot_loss[loss=0.9811, simple_loss=0.6272, pruned_loss=0.6675, over 1847245.19 frames. ], batch size: 144, lr: 2.63e-02, grad_scale: 0.25 2024-06-19 16:23:22,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=19008.0, ans=0.0 2024-06-19 16:23:25,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=19026.333333333332, ans=0.2340783333333334 2024-06-19 16:23:25,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=19026.333333333332, ans=0.0 2024-06-19 16:23:34,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19026.333333333332, ans=0.125 2024-06-19 16:23:36,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=19026.333333333332, ans=0.125 2024-06-19 16:23:39,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=19044.666666666668, ans=0.125 2024-06-19 16:23:55,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=19063.0, ans=0.125 2024-06-19 16:23:56,813 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.674e+03 1.038e+04 1.241e+04 1.675e+04 5.031e+04, threshold=2.482e+04, percent-clipped=14.0 2024-06-19 16:24:01,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=19081.333333333332, ans=0.125 2024-06-19 16:24:05,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=19081.333333333332, ans=0.125 2024-06-19 16:24:06,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19081.333333333332, ans=0.10918666666666668 2024-06-19 16:24:08,139 INFO 
[train.py:1028] (0/2) Epoch 2, batch 300, loss[loss=0.964, simple_loss=0.6262, pruned_loss=0.651, over 13174.00 frames. ], tot_loss[loss=0.9783, simple_loss=0.6261, pruned_loss=0.6652, over 2010970.84 frames. ], batch size: 112, lr: 2.62e-02, grad_scale: 0.5 2024-06-19 16:24:09,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=19099.666666666668, ans=0.10900333333333331 2024-06-19 16:24:10,247 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=30.08 vs. limit=21.82475 2024-06-19 16:24:14,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.59 vs. limit=11.639866666666666 2024-06-19 16:24:24,469 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.22 vs. limit=9.779499999999999 2024-06-19 16:24:25,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=19136.333333333332, ans=0.006709492753623189 2024-06-19 16:24:31,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=19136.333333333332, ans=0.23022833333333337 2024-06-19 16:24:38,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=19154.666666666668, ans=0.0584533333333333 2024-06-19 16:24:46,014 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=2.703e-03 2024-06-19 16:24:47,607 INFO [train.py:1028] (0/2) Epoch 2, batch 350, loss[loss=1.006, simple_loss=0.6143, pruned_loss=0.6985, over 13002.00 frames. ], tot_loss[loss=0.9794, simple_loss=0.626, pruned_loss=0.6664, over 2139458.36 frames. ], batch size: 33, lr: 2.62e-02, grad_scale: 0.5 2024-06-19 16:24:47,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=19191.333333333332, ans=0.0 2024-06-19 16:24:50,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.03 vs. limit=5.8787 2024-06-19 16:24:51,991 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=9.092e+01 2024-06-19 16:25:06,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.15 vs. limit=9.807 2024-06-19 16:25:11,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=19228.0, ans=0.035 2024-06-19 16:25:18,839 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.923e+03 6.364e+03 9.500e+03 1.188e+04 3.866e+04, threshold=1.900e+04, percent-clipped=4.0 2024-06-19 16:25:22,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=19264.666666666668, ans=0.125 2024-06-19 16:25:28,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.11 vs. 
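
Note: each train.py:1028 line pairs the current batch's loss over its own frames with tot_loss over the frames accumulated so far (here 2139458.36 frames by epoch 2, batch 350). A sketch of that frame-weighted aggregation, assuming a plain running average; the real bookkeeping may down-weight older batches.

class RunningLoss:
    # Frame-weighted running average, as in "tot_loss[... over N frames]".
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float):
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tot = RunningLoss()
tot.update(0.964, 13174.0)    # numbers in the style of the batch-300 line
tot.update(1.006, 13002.0)    # ... and the batch-350 line
print(round(tot.value, 4))
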
limit=21.9485 2024-06-19 16:25:29,881 INFO [train.py:1028] (0/2) Epoch 2, batch 400, loss[loss=0.9747, simple_loss=0.6235, pruned_loss=0.6629, over 13249.00 frames. ], tot_loss[loss=0.9814, simple_loss=0.6276, pruned_loss=0.6676, over 2240351.42 frames. ], batch size: 63, lr: 2.61e-02, grad_scale: 0.5 2024-06-19 16:25:31,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.12 vs. limit=21.962249999999997 2024-06-19 16:25:34,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.35 vs. limit=9.82075 2024-06-19 16:25:35,348 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.048e-02 2024-06-19 16:25:36,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=19301.333333333332, ans=0.125 2024-06-19 16:25:48,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=19319.666666666668, ans=0.1068033333333333 2024-06-19 16:25:55,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.09 vs. limit=14.751750000000001 2024-06-19 16:26:02,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.25 vs. limit=14.758625 2024-06-19 16:26:05,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=19356.333333333332, ans=0.125 2024-06-19 16:26:06,061 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=47.67 vs. limit=22.01725 2024-06-19 16:26:06,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=19356.333333333332, ans=0.125 2024-06-19 16:26:06,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=26.46 vs. limit=14.758625 2024-06-19 16:26:16,443 INFO [train.py:1028] (0/2) Epoch 2, batch 450, loss[loss=0.9761, simple_loss=0.6264, pruned_loss=0.6629, over 13160.00 frames. ], tot_loss[loss=0.9776, simple_loss=0.6264, pruned_loss=0.6644, over 2313581.35 frames. ], batch size: 67, lr: 2.61e-02, grad_scale: 0.25 2024-06-19 16:26:18,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19374.666666666668, ans=0.0 2024-06-19 16:26:24,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.06 vs. 
limit=14.772375 2024-06-19 16:26:24,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=19393.0, ans=0.07 2024-06-19 16:26:39,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=19411.333333333332, ans=0.22060333333333337 2024-06-19 16:26:43,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=19411.333333333332, ans=0.125 2024-06-19 16:26:51,268 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.475e+03 7.250e+03 8.612e+03 1.163e+04 2.828e+04, threshold=1.722e+04, percent-clipped=4.0 2024-06-19 16:26:51,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.33 vs. limit=22.07225 2024-06-19 16:27:00,064 INFO [train.py:1028] (0/2) Epoch 2, batch 500, loss[loss=0.9483, simple_loss=0.6006, pruned_loss=0.648, over 13076.00 frames. ], tot_loss[loss=0.9828, simple_loss=0.6279, pruned_loss=0.6689, over 2375275.20 frames. ], batch size: 121, lr: 2.61e-02, grad_scale: 0.5 2024-06-19 16:27:03,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=19466.333333333332, ans=0.125 2024-06-19 16:27:06,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=25.72 vs. limit=14.733166666666666 2024-06-19 16:27:06,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.51 vs. limit=14.806750000000001 2024-06-19 16:27:11,964 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.68 vs. limit=14.806750000000001 2024-06-19 16:27:12,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=19484.666666666668, ans=0.0066337681159420285 2024-06-19 16:27:25,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=15.37 vs. limit=11.808533333333333 2024-06-19 16:27:25,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.22 vs. limit=9.880333333333333 2024-06-19 16:27:26,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=14.8205 2024-06-19 16:27:29,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.21 vs. limit=14.827375 2024-06-19 16:27:37,829 INFO [train.py:1028] (0/2) Epoch 2, batch 550, loss[loss=0.9504, simple_loss=0.6249, pruned_loss=0.6379, over 12990.00 frames. ], tot_loss[loss=0.9852, simple_loss=0.6279, pruned_loss=0.6713, over 2420505.75 frames. ], batch size: 158, lr: 2.60e-02, grad_scale: 0.25 2024-06-19 16:27:38,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.43 vs. 
limit=14.83425 2024-06-19 16:27:39,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=43.80 vs. limit=14.83425 2024-06-19 16:27:49,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=35.60 vs. limit=22.18225 2024-06-19 16:27:51,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=19594.666666666668, ans=0.125 2024-06-19 16:27:57,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=5.9392 2024-06-19 16:28:06,213 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+03 4.820e+03 7.996e+03 1.122e+04 3.675e+04, threshold=1.599e+04, percent-clipped=8.0 2024-06-19 16:28:06,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=19631.333333333332, ans=0.10368666666666668 2024-06-19 16:28:10,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.62 vs. limit=14.86175 2024-06-19 16:28:13,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.47 vs. limit=11.852533333333334 2024-06-19 16:28:14,169 INFO [train.py:1028] (0/2) Epoch 2, batch 600, loss[loss=0.9033, simple_loss=0.6027, pruned_loss=0.602, over 12982.00 frames. ], tot_loss[loss=0.9834, simple_loss=0.6265, pruned_loss=0.6701, over 2458793.16 frames. ], batch size: 144, lr: 2.60e-02, grad_scale: 0.5 2024-06-19 16:28:30,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=19686.333333333332, ans=0.035 2024-06-19 16:28:31,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=5.9529499999999995 2024-06-19 16:28:34,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=14.88925 2024-06-19 16:28:39,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=45.94 vs. limit=14.88925 2024-06-19 16:28:41,619 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.77 vs. limit=14.88925 2024-06-19 16:28:42,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=42.28 vs. limit=14.88925 2024-06-19 16:28:44,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=61.42 vs. limit=22.2785 2024-06-19 16:28:44,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19704.666666666668, ans=0.125 2024-06-19 16:28:52,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.93 vs. 
limit=22.29225 2024-06-19 16:28:56,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=19741.333333333332, ans=0.125 2024-06-19 16:28:57,516 INFO [train.py:1028] (0/2) Epoch 2, batch 650, loss[loss=0.98, simple_loss=0.6239, pruned_loss=0.6681, over 13161.00 frames. ], tot_loss[loss=0.9845, simple_loss=0.6266, pruned_loss=0.6712, over 2489748.55 frames. ], batch size: 59, lr: 2.59e-02, grad_scale: 0.5 2024-06-19 16:29:00,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=58.20 vs. limit=22.305999999999997 2024-06-19 16:29:10,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=19759.666666666668, ans=0.20841166666666666 2024-06-19 16:29:12,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=31.76 vs. limit=14.909875 2024-06-19 16:29:21,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=19778.0, ans=0.025 2024-06-19 16:29:23,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=19796.333333333332, ans=0.0 2024-06-19 16:29:24,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=19796.333333333332, ans=0.0 2024-06-19 16:29:25,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.37 vs. limit=22.347250000000003 2024-06-19 16:29:31,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=30.55 vs. limit=22.361 2024-06-19 16:29:32,108 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.048e+03 4.392e+03 5.654e+03 7.717e+03 6.039e+04, threshold=1.131e+04, percent-clipped=2.0 2024-06-19 16:29:35,120 WARNING [optim.py:503] (0/2) Scaling gradients by 0.05225700885057449, model_norm_threshold=11308.90625 2024-06-19 16:29:35,277 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.4.weight with proportion 0.24, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.123e+10, grad_sumsq=2.701e+09, orig_rms_sq=4.158e+00 2024-06-19 16:29:35,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19814.666666666668, ans=0.125 2024-06-19 16:29:39,144 INFO [train.py:1028] (0/2) Epoch 2, batch 700, loss[loss=0.9876, simple_loss=0.6249, pruned_loss=0.6752, over 13271.00 frames. ], tot_loss[loss=0.9824, simple_loss=0.6251, pruned_loss=0.6698, over 2512886.28 frames. ], batch size: 46, lr: 2.59e-02, grad_scale: 0.125 2024-06-19 16:29:45,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19833.0, ans=0.125 2024-06-19 16:29:52,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=19851.333333333332, ans=0.125 2024-06-19 16:29:55,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.52 vs. 
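
Note: the pair of optim.py warnings above shows the hard-clipping path: the whole gradient is rescaled (here by ~0.052 against model_norm_threshold=11308.9), and the parameter contributing the largest grad_sumsq * orig_rms_sq share to tot_sumsq is named (module.encoder_embed.conv.4.weight at proportion 0.24, with 2.701e+09 * 4.158e+00 = 1.123e+10). A sketch reconstructed from those log fields:

import torch

def dominating_parameter(named_params):
    # Each parameter contributes grad_sumsq * orig_rms_sq to tot_sumsq;
    # report the largest contributor and its proportion, as in the warning.
    contrib = {}
    for name, p in named_params:
        if p.grad is None:
            continue
        grad_sumsq = float((p.grad ** 2).sum())
        orig_rms_sq = float((p.detach() ** 2).mean())
        contrib[name] = grad_sumsq * orig_rms_sq
    if not contrib:
        return None, 0.0
    tot = sum(contrib.values())
    name, val = max(contrib.items(), key=lambda kv: kv[1])
    return name, val / tot

# e.g. dominating_parameter(model.named_parameters())
# -> ('module.encoder_embed.conv.4.weight', 0.24) for the step warned about above
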
limit=22.402250000000002 2024-06-19 16:30:00,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=19869.666666666668, ans=0.006550072463768116 2024-06-19 16:30:00,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=29.30 vs. limit=22.402250000000002 2024-06-19 16:30:01,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=46.18 vs. limit=22.402250000000002 2024-06-19 16:30:19,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=19906.333333333332, ans=0.0 2024-06-19 16:30:22,366 INFO [train.py:1028] (0/2) Epoch 2, batch 750, loss[loss=0.9636, simple_loss=0.6101, pruned_loss=0.6586, over 13212.00 frames. ], tot_loss[loss=0.9837, simple_loss=0.6257, pruned_loss=0.6708, over 2528558.20 frames. ], batch size: 63, lr: 2.59e-02, grad_scale: 0.125 2024-06-19 16:30:38,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.94 vs. limit=14.9855 2024-06-19 16:30:45,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=19979.666666666668, ans=0.125 2024-06-19 16:30:50,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=19979.666666666668, ans=0.025 2024-06-19 16:30:57,183 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.384e+03 5.918e+03 7.493e+03 1.144e+04 2.164e+05, threshold=1.499e+04, percent-clipped=25.0 2024-06-19 16:31:03,283 INFO [train.py:1028] (0/2) Epoch 2, batch 800, loss[loss=1.026, simple_loss=0.6393, pruned_loss=0.7063, over 13008.00 frames. ], tot_loss[loss=0.9875, simple_loss=0.6268, pruned_loss=0.6741, over 2540786.81 frames. ], batch size: 36, lr: 2.58e-02, grad_scale: 0.25 2024-06-19 16:31:20,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.81 vs. limit=15.0 2024-06-19 16:31:22,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=47.30 vs. limit=15.0 2024-06-19 16:31:24,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20053.0, ans=0.1 2024-06-19 16:31:29,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=20053.0, ans=0.125 2024-06-19 16:31:48,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.34 vs. limit=22.5 2024-06-19 16:31:49,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=35.56 vs. limit=15.0 2024-06-19 16:31:51,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=20089.666666666668, ans=0.0 2024-06-19 16:31:56,577 INFO [train.py:1028] (0/2) Epoch 2, batch 850, loss[loss=0.9606, simple_loss=0.606, pruned_loss=0.6576, over 13182.00 frames. 
], tot_loss[loss=0.9886, simple_loss=0.6266, pruned_loss=0.6753, over 2551260.78 frames. ], batch size: 95, lr: 2.58e-02, grad_scale: 0.25 2024-06-19 16:31:59,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=31.12 vs. limit=22.5 2024-06-19 16:32:03,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=31.81 vs. limit=22.5 2024-06-19 16:32:04,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=20126.333333333332, ans=0.125 2024-06-19 16:32:04,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2024-06-19 16:32:14,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.22 vs. limit=15.0 2024-06-19 16:32:28,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=20163.0, ans=0.006486304347826087 2024-06-19 16:32:32,202 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.045e+03 4.587e+03 6.105e+03 7.839e+03 2.814e+04, threshold=1.221e+04, percent-clipped=5.0 2024-06-19 16:32:34,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.06 vs. limit=22.5 2024-06-19 16:32:38,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=20199.666666666668, ans=0.125 2024-06-19 16:32:39,644 INFO [train.py:1028] (0/2) Epoch 2, batch 900, loss[loss=1.06, simple_loss=0.6553, pruned_loss=0.7327, over 13000.00 frames. ], tot_loss[loss=0.9874, simple_loss=0.6258, pruned_loss=0.6745, over 2556382.14 frames. ], batch size: 36, lr: 2.57e-02, grad_scale: 0.5 2024-06-19 16:32:42,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.18 vs. limit=15.0 2024-06-19 16:32:43,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=20199.666666666668, ans=0.125 2024-06-19 16:32:51,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=20218.0, ans=0.2 2024-06-19 16:32:58,718 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=6.816e-02 2024-06-19 16:33:04,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.31 vs. limit=15.0 2024-06-19 16:33:07,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=20254.666666666668, ans=0.125 2024-06-19 16:33:13,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=18.64 vs. limit=15.0 2024-06-19 16:33:16,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=28.56 vs. 
limit=15.0 2024-06-19 16:33:17,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=15.0 2024-06-19 16:33:22,657 INFO [train.py:1028] (0/2) Epoch 2, batch 950, loss[loss=1.044, simple_loss=0.66, pruned_loss=0.7136, over 12910.00 frames. ], tot_loss[loss=0.9853, simple_loss=0.6261, pruned_loss=0.6723, over 2559435.65 frames. ], batch size: 39, lr: 2.57e-02, grad_scale: 0.5 2024-06-19 16:33:36,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=20309.666666666668, ans=0.125 2024-06-19 16:33:36,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=20309.666666666668, ans=0.125 2024-06-19 16:33:53,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.46 vs. limit=22.5 2024-06-19 16:33:59,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=20364.666666666668, ans=0.05 2024-06-19 16:34:02,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=20364.666666666668, ans=0.125 2024-06-19 16:34:03,004 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.503e+03 5.671e+03 7.306e+03 1.007e+04 4.704e+04, threshold=1.461e+04, percent-clipped=15.0 2024-06-19 16:34:06,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=20364.666666666668, ans=0.125 2024-06-19 16:34:06,861 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-19 16:34:07,328 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.903e+01 2024-06-19 16:34:08,488 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=3.183e-02 2024-06-19 16:34:09,223 INFO [train.py:1028] (0/2) Epoch 2, batch 1000, loss[loss=1.101, simple_loss=0.703, pruned_loss=0.75, over 13326.00 frames. ], tot_loss[loss=0.9826, simple_loss=0.6255, pruned_loss=0.6698, over 2561568.25 frames. ], batch size: 49, lr: 2.57e-02, grad_scale: 0.5 2024-06-19 16:34:11,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.72 vs. limit=10.0 2024-06-19 16:34:25,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=15.0 2024-06-19 16:34:42,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20438.0, ans=0.1 2024-06-19 16:34:52,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20456.333333333332, ans=0.1 2024-06-19 16:34:54,012 INFO [train.py:1028] (0/2) Epoch 2, batch 1050, loss[loss=0.9906, simple_loss=0.6276, pruned_loss=0.6768, over 13227.00 frames. ], tot_loss[loss=0.9878, simple_loss=0.628, pruned_loss=0.6738, over 2565360.52 frames. 
], batch size: 77, lr: 2.56e-02, grad_scale: 0.5 2024-06-19 16:35:08,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20493.0, ans=0.1 2024-06-19 16:35:09,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=20493.0, ans=0.125 2024-06-19 16:35:22,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=23.07 vs. limit=15.0 2024-06-19 16:35:27,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=18.35 vs. limit=15.0 2024-06-19 16:35:28,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+03 3.650e+03 4.678e+03 6.104e+03 1.817e+04, threshold=9.355e+03, percent-clipped=1.0 2024-06-19 16:35:33,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=20566.333333333332, ans=0.1 2024-06-19 16:35:33,632 INFO [train.py:1028] (0/2) Epoch 2, batch 1100, loss[loss=1.018, simple_loss=0.6524, pruned_loss=0.6917, over 13227.00 frames. ], tot_loss[loss=0.9911, simple_loss=0.6292, pruned_loss=0.6765, over 2569689.32 frames. ], batch size: 52, lr: 2.56e-02, grad_scale: 1.0 2024-06-19 16:35:35,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=15.22 vs. limit=12.0 2024-06-19 16:35:43,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=44.54 vs. limit=15.0 2024-06-19 16:35:44,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=20584.666666666668, ans=0.125 2024-06-19 16:35:48,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20603.0, ans=0.1 2024-06-19 16:35:52,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.47 vs. limit=22.5 2024-06-19 16:36:05,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=20621.333333333332, ans=0.006386666666666667 2024-06-19 16:36:19,467 INFO [train.py:1028] (0/2) Epoch 2, batch 1150, loss[loss=1.019, simple_loss=0.6381, pruned_loss=0.6997, over 13278.00 frames. ], tot_loss[loss=0.9933, simple_loss=0.6298, pruned_loss=0.6784, over 2571429.99 frames. ], batch size: 52, lr: 2.55e-02, grad_scale: 0.5 2024-06-19 16:36:38,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2024-06-19 16:36:48,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.79 vs. limit=22.5 2024-06-19 16:36:49,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=38.68 vs. 
limit=15.0 2024-06-19 16:37:01,180 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.875e+03 3.546e+03 4.246e+03 5.312e+03 1.969e+04, threshold=8.492e+03, percent-clipped=6.0 2024-06-19 16:37:04,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20731.333333333332, ans=0.1 2024-06-19 16:37:04,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=20749.666666666668, ans=0.2 2024-06-19 16:37:05,413 INFO [train.py:1028] (0/2) Epoch 2, batch 1200, loss[loss=0.9369, simple_loss=0.5993, pruned_loss=0.6373, over 13118.00 frames. ], tot_loss[loss=0.9884, simple_loss=0.6277, pruned_loss=0.6745, over 2573559.86 frames. ], batch size: 77, lr: 2.55e-02, grad_scale: 1.0 2024-06-19 16:37:05,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=27.79 vs. limit=22.5 2024-06-19 16:37:16,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=20768.0, ans=0.2 2024-06-19 16:37:20,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=20786.333333333332, ans=0.125 2024-06-19 16:37:22,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=20786.333333333332, ans=0.125 2024-06-19 16:37:28,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=20804.666666666668, ans=0.1 2024-06-19 16:37:37,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=20823.0, ans=0.125 2024-06-19 16:37:38,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20823.0, ans=0.125 2024-06-19 16:37:44,622 INFO [train.py:1028] (0/2) Epoch 2, batch 1250, loss[loss=0.962, simple_loss=0.6208, pruned_loss=0.6516, over 13193.00 frames. ], tot_loss[loss=0.9835, simple_loss=0.6258, pruned_loss=0.6706, over 2583456.91 frames. ], batch size: 112, lr: 2.55e-02, grad_scale: 0.125 2024-06-19 16:37:51,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=20841.333333333332, ans=0.125 2024-06-19 16:38:04,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=20878.0, ans=0.125 2024-06-19 16:38:05,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=20878.0, ans=0.125 2024-06-19 16:38:05,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.87 vs. limit=22.5 2024-06-19 16:38:07,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20878.0, ans=0.125 2024-06-19 16:38:19,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=65.15 vs. 
limit=22.5 2024-06-19 16:38:21,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20914.666666666668, ans=0.125 2024-06-19 16:38:23,463 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.349e+03 4.978e+03 6.251e+03 8.107e+03 6.184e+04, threshold=1.250e+04, percent-clipped=21.0 2024-06-19 16:38:26,053 INFO [train.py:1028] (0/2) Epoch 2, batch 1300, loss[loss=0.9031, simple_loss=0.5948, pruned_loss=0.6057, over 12749.00 frames. ], tot_loss[loss=0.9848, simple_loss=0.6261, pruned_loss=0.6717, over 2584505.84 frames. ], batch size: 176, lr: 2.54e-02, grad_scale: 0.25 2024-06-19 16:38:36,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=20951.333333333332, ans=0.2 2024-06-19 16:38:41,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=15.0 2024-06-19 16:38:48,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0 2024-06-19 16:38:50,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=20988.0, ans=0.125 2024-06-19 16:39:05,527 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=43.55 vs. limit=15.0 2024-06-19 16:39:09,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=21006.333333333332, ans=0.2 2024-06-19 16:39:14,548 INFO [train.py:1028] (0/2) Epoch 2, batch 1350, loss[loss=1.02, simple_loss=0.6453, pruned_loss=0.6972, over 13171.00 frames. ], tot_loss[loss=0.9845, simple_loss=0.6263, pruned_loss=0.6714, over 2586740.52 frames. ], batch size: 59, lr: 2.54e-02, grad_scale: 0.25 2024-06-19 16:39:22,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.96 vs. limit=15.0 2024-06-19 16:39:22,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.98 vs. limit=6.0 2024-06-19 16:39:28,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21043.0, ans=0.1 2024-06-19 16:39:30,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=21043.0, ans=0.006295 2024-06-19 16:39:31,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=21043.0, ans=0.125 2024-06-19 16:39:38,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21061.333333333332, ans=0.0 2024-06-19 16:39:39,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.79 vs. limit=15.0 2024-06-19 16:39:41,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=38.65 vs. 
limit=15.0 2024-06-19 16:39:42,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=21061.333333333332, ans=0.125 2024-06-19 16:39:45,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21079.666666666668, ans=0.125 2024-06-19 16:39:52,099 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.48 vs. limit=15.0 2024-06-19 16:39:54,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=21098.0, ans=0.1 2024-06-19 16:39:56,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2024-06-19 16:39:58,194 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.465e+03 4.270e+03 5.208e+03 6.837e+03 4.593e+04, threshold=1.042e+04, percent-clipped=7.0 2024-06-19 16:40:00,882 INFO [train.py:1028] (0/2) Epoch 2, batch 1400, loss[loss=1.087, simple_loss=0.6787, pruned_loss=0.7478, over 12632.00 frames. ], tot_loss[loss=0.9815, simple_loss=0.6252, pruned_loss=0.6689, over 2588396.03 frames. ], batch size: 26, lr: 2.54e-02, grad_scale: 0.5 2024-06-19 16:40:06,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=28.68 vs. limit=15.0 2024-06-19 16:40:07,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.78 vs. limit=15.0 2024-06-19 16:40:09,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=31.90 vs. limit=22.5 2024-06-19 16:40:14,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=21134.666666666668, ans=0.125 2024-06-19 16:40:16,382 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=3.133e+03 2024-06-19 16:40:22,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=21153.0, ans=0.125 2024-06-19 16:40:22,444 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.16 vs. limit=10.0 2024-06-19 16:40:26,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=21171.333333333332, ans=0.125 2024-06-19 16:40:26,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.73 vs. 
limit=6.0 2024-06-19 16:40:26,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=21171.333333333332, ans=0.2 2024-06-19 16:40:27,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21171.333333333332, ans=0.1 2024-06-19 16:40:35,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=21189.666666666668, ans=0.125 2024-06-19 16:40:36,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=21189.666666666668, ans=0.2 2024-06-19 16:40:39,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0 2024-06-19 16:40:40,963 INFO [train.py:1028] (0/2) Epoch 2, batch 1450, loss[loss=0.903, simple_loss=0.5887, pruned_loss=0.6086, over 13101.00 frames. ], tot_loss[loss=0.9804, simple_loss=0.6252, pruned_loss=0.6678, over 2587771.21 frames. ], batch size: 121, lr: 2.53e-02, grad_scale: 0.5 2024-06-19 16:40:41,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=21208.0, ans=0.5 2024-06-19 16:40:41,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=21208.0, ans=0.006259130434782609 2024-06-19 16:40:41,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=21208.0, ans=0.2 2024-06-19 16:40:42,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=21208.0, ans=0.125 2024-06-19 16:40:43,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21208.0, ans=0.1 2024-06-19 16:40:43,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=11.36 vs. limit=10.0 2024-06-19 16:40:44,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=29.61 vs. limit=15.0 2024-06-19 16:40:52,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=21226.333333333332, ans=0.2 2024-06-19 16:40:53,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21226.333333333332, ans=0.125 2024-06-19 16:40:59,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=21244.666666666668, ans=0.05 2024-06-19 16:41:02,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.18 vs. 
limit=22.5 2024-06-19 16:41:03,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=21244.666666666668, ans=0.2 2024-06-19 16:41:18,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=21281.333333333332, ans=0.05 2024-06-19 16:41:19,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.04 vs. limit=22.5 2024-06-19 16:41:19,733 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.936e+03 5.345e+03 6.900e+03 9.769e+03 2.413e+04, threshold=1.380e+04, percent-clipped=21.0 2024-06-19 16:41:20,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21299.666666666668, ans=0.1 2024-06-19 16:41:21,010 INFO [train.py:1028] (0/2) Epoch 2, batch 1500, loss[loss=0.9875, simple_loss=0.6269, pruned_loss=0.6741, over 13217.00 frames. ], tot_loss[loss=0.9828, simple_loss=0.6258, pruned_loss=0.6699, over 2589115.35 frames. ], batch size: 83, lr: 2.53e-02, grad_scale: 0.5 2024-06-19 16:41:21,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=21299.666666666668, ans=0.0 2024-06-19 16:41:33,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21318.0, ans=0.125 2024-06-19 16:41:37,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=21336.333333333332, ans=0.125 2024-06-19 16:41:47,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21354.666666666668, ans=0.1 2024-06-19 16:41:49,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21354.666666666668, ans=0.0 2024-06-19 16:41:55,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=21373.0, ans=0.125 2024-06-19 16:42:00,131 INFO [train.py:1028] (0/2) Epoch 2, batch 1550, loss[loss=0.9861, simple_loss=0.6284, pruned_loss=0.6719, over 13079.00 frames. ], tot_loss[loss=0.9831, simple_loss=0.6266, pruned_loss=0.6698, over 2584544.76 frames. ], batch size: 102, lr: 2.52e-02, grad_scale: 0.5 2024-06-19 16:42:14,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.09 vs. limit=15.0 2024-06-19 16:42:17,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=21428.0, ans=0.0 2024-06-19 16:42:18,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=21428.0, ans=0.006211304347826087 2024-06-19 16:42:25,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21446.333333333332, ans=0.1 2024-06-19 16:42:26,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=105.19 vs. 
limit=15.0 2024-06-19 16:42:32,105 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.455e+03 4.539e+03 5.393e+03 7.229e+03 3.192e+04, threshold=1.079e+04, percent-clipped=3.0 2024-06-19 16:42:32,760 INFO [train.py:1028] (0/2) Epoch 2, batch 1600, loss[loss=0.9193, simple_loss=0.5863, pruned_loss=0.6261, over 13171.00 frames. ], tot_loss[loss=0.9839, simple_loss=0.6266, pruned_loss=0.6706, over 2579954.90 frames. ], batch size: 77, lr: 2.52e-02, grad_scale: 0.5 2024-06-19 16:42:32,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=21483.0, ans=0.125 2024-06-19 16:42:33,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=21483.0, ans=0.025 2024-06-19 16:42:46,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21519.666666666668, ans=0.125 2024-06-19 16:42:50,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.46 vs. limit=15.0 2024-06-19 16:42:50,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=23.85 vs. limit=22.5 2024-06-19 16:43:05,135 INFO [train.py:1028] (0/2) Epoch 2, batch 1650, loss[loss=0.9683, simple_loss=0.627, pruned_loss=0.6548, over 13209.00 frames. ], tot_loss[loss=0.9826, simple_loss=0.6264, pruned_loss=0.6694, over 2577011.69 frames. ], batch size: 95, lr: 2.52e-02, grad_scale: 0.25 2024-06-19 16:43:09,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21574.666666666668, ans=0.125 2024-06-19 16:43:12,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=21593.0, ans=0.0 2024-06-19 16:43:18,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.88 vs. limit=15.0 2024-06-19 16:43:30,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=21629.666666666668, ans=0.125 2024-06-19 16:43:35,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=21648.0, ans=0.0 2024-06-19 16:43:41,498 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.132e+03 6.157e+03 7.365e+03 8.932e+03 2.808e+04, threshold=1.473e+04, percent-clipped=14.0 2024-06-19 16:43:41,527 INFO [train.py:1028] (0/2) Epoch 2, batch 1700, loss[loss=1.055, simple_loss=0.6585, pruned_loss=0.7262, over 12584.00 frames. ], tot_loss[loss=0.9823, simple_loss=0.6266, pruned_loss=0.669, over 2582783.03 frames. ], batch size: 25, lr: 2.51e-02, grad_scale: 0.5 2024-06-19 16:43:49,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21666.333333333332, ans=0.125 2024-06-19 16:43:52,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.20 vs. 
limit=15.0 2024-06-19 16:43:57,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21703.0, ans=0.1 2024-06-19 16:44:00,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=21703.0, ans=0.125 2024-06-19 16:44:07,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=15.0 2024-06-19 16:44:09,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=22.31 vs. limit=15.0 2024-06-19 16:44:17,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=21758.0, ans=0.125 2024-06-19 16:44:17,684 INFO [train.py:1028] (0/2) Epoch 2, batch 1750, loss[loss=1.113, simple_loss=0.6791, pruned_loss=0.7733, over 12538.00 frames. ], tot_loss[loss=0.9831, simple_loss=0.627, pruned_loss=0.6696, over 2582872.81 frames. ], batch size: 22, lr: 2.51e-02, grad_scale: 0.5 2024-06-19 16:44:17,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=21758.0, ans=0.0 2024-06-19 16:44:21,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=21758.0, ans=0.1 2024-06-19 16:44:30,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21794.666666666668, ans=0.125 2024-06-19 16:44:33,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21794.666666666668, ans=0.0 2024-06-19 16:44:34,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2024-06-19 16:44:40,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.68 vs. limit=15.0 2024-06-19 16:44:50,482 INFO [train.py:1028] (0/2) Epoch 2, batch 1800, loss[loss=1.02, simple_loss=0.647, pruned_loss=0.6962, over 13253.00 frames. ], tot_loss[loss=0.9827, simple_loss=0.6273, pruned_loss=0.6691, over 2582860.74 frames. ], batch size: 67, lr: 2.50e-02, grad_scale: 0.25 2024-06-19 16:44:51,814 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.354e+03 5.226e+03 6.742e+03 8.545e+03 2.402e+04, threshold=1.348e+04, percent-clipped=2.0 2024-06-19 16:44:53,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21849.666666666668, ans=0.125 2024-06-19 16:44:58,433 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.949e+01 2024-06-19 16:44:59,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.28 vs. 
limit=15.0 2024-06-19 16:45:03,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=21886.333333333332, ans=0.125 2024-06-19 16:45:05,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.04 vs. limit=15.0 2024-06-19 16:45:06,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=21886.333333333332, ans=0.025 2024-06-19 16:45:23,278 INFO [train.py:1028] (0/2) Epoch 2, batch 1850, loss[loss=0.9224, simple_loss=0.5881, pruned_loss=0.6283, over 13265.00 frames. ], tot_loss[loss=0.9832, simple_loss=0.6274, pruned_loss=0.6695, over 2584773.10 frames. ], batch size: 83, lr: 2.50e-02, grad_scale: 0.25 2024-06-19 16:45:23,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21941.333333333332, ans=0.0 2024-06-19 16:45:23,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.63 vs. limit=15.0 2024-06-19 16:45:48,972 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-12000.pt 2024-06-19 16:46:06,369 INFO [train.py:1028] (0/2) Epoch 2, batch 1900, loss[loss=0.904, simple_loss=0.5827, pruned_loss=0.6126, over 13167.00 frames. ], tot_loss[loss=0.9771, simple_loss=0.6249, pruned_loss=0.6647, over 2586213.29 frames. ], batch size: 95, lr: 2.50e-02, grad_scale: 0.25 2024-06-19 16:46:08,215 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.676e+03 6.252e+03 7.371e+03 1.003e+04 3.293e+04, threshold=1.474e+04, percent-clipped=10.0 2024-06-19 16:46:11,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=22033.0, ans=0.125 2024-06-19 16:46:23,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=22069.666666666668, ans=0.0 2024-06-19 16:46:24,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=22069.666666666668, ans=0.035 2024-06-19 16:46:32,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=22088.0, ans=0.1 2024-06-19 16:46:34,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=22106.333333333332, ans=0.125 2024-06-19 16:46:35,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=22106.333333333332, ans=0.125 2024-06-19 16:46:39,697 INFO [train.py:1028] (0/2) Epoch 2, batch 1950, loss[loss=0.967, simple_loss=0.6119, pruned_loss=0.6611, over 13272.00 frames. ], tot_loss[loss=0.9744, simple_loss=0.6228, pruned_loss=0.663, over 2592373.55 frames. ], batch size: 52, lr: 2.49e-02, grad_scale: 0.25 2024-06-19 16:46:42,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=19.82 vs. 
limit=15.0 2024-06-19 16:46:43,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=22124.666666666668, ans=0.125 2024-06-19 16:46:43,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=22124.666666666668, ans=0.125 2024-06-19 16:46:44,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=44.87 vs. limit=15.0 2024-06-19 16:46:47,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.72 vs. limit=15.0 2024-06-19 16:46:49,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.08 vs. limit=15.0 2024-06-19 16:46:51,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.88 vs. limit=15.0 2024-06-19 16:46:53,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=22161.333333333332, ans=0.5 2024-06-19 16:47:03,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.56 vs. limit=10.0 2024-06-19 16:47:11,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=15.0 2024-06-19 16:47:12,799 INFO [train.py:1028] (0/2) Epoch 2, batch 2000, loss[loss=1.094, simple_loss=0.6753, pruned_loss=0.7568, over 12713.00 frames. ], tot_loss[loss=0.9738, simple_loss=0.623, pruned_loss=0.6623, over 2588135.87 frames. ], batch size: 22, lr: 2.49e-02, grad_scale: 0.25 2024-06-19 16:47:15,381 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.303e+03 6.102e+03 7.605e+03 1.055e+04 4.573e+04, threshold=1.521e+04, percent-clipped=11.0 2024-06-19 16:47:30,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=21.69 vs. limit=15.0 2024-06-19 16:47:32,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=28.63 vs. limit=22.5 2024-06-19 16:47:35,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=22271.333333333332, ans=0.125 2024-06-19 16:47:35,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=22271.333333333332, ans=0.125 2024-06-19 16:47:46,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=22289.666666666668, ans=22.5 2024-06-19 16:47:49,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.29 vs. 
limit=22.5 2024-06-19 16:47:51,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=22308.0, ans=0.2 2024-06-19 16:47:52,371 INFO [train.py:1028] (0/2) Epoch 2, batch 2050, loss[loss=1.021, simple_loss=0.6263, pruned_loss=0.7079, over 12639.00 frames. ], tot_loss[loss=0.9761, simple_loss=0.6236, pruned_loss=0.6643, over 2583621.94 frames. ], batch size: 29, lr: 2.49e-02, grad_scale: 0.25 2024-06-19 16:47:52,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=22308.0, ans=0.125 2024-06-19 16:47:56,795 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.12 vs. limit=15.0 2024-06-19 16:47:57,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.86 vs. limit=15.0 2024-06-19 16:48:06,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=22344.666666666668, ans=0.2 2024-06-19 16:48:10,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=22344.666666666668, ans=0.125 2024-06-19 16:48:14,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=22363.0, ans=0.0 2024-06-19 16:48:16,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=22363.0, ans=0.2 2024-06-19 16:48:21,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.19 vs. limit=15.0 2024-06-19 16:48:21,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.72 vs. limit=22.5 2024-06-19 16:48:26,322 INFO [train.py:1028] (0/2) Epoch 2, batch 2100, loss[loss=1.048, simple_loss=0.6658, pruned_loss=0.7153, over 13212.00 frames. ], tot_loss[loss=0.9779, simple_loss=0.6243, pruned_loss=0.6657, over 2586048.65 frames. ], batch size: 59, lr: 2.48e-02, grad_scale: 0.5 2024-06-19 16:48:26,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=15.0 2024-06-19 16:48:28,959 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.061e+03 3.980e+03 4.910e+03 5.983e+03 2.832e+04, threshold=9.821e+03, percent-clipped=1.0 2024-06-19 16:48:30,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.60 vs. limit=22.5 2024-06-19 16:48:35,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=22418.0, ans=0.0 2024-06-19 16:48:35,938 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=6.156e+02 2024-06-19 16:48:37,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.09 vs. 
limit=22.5 2024-06-19 16:48:42,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=22436.333333333332, ans=0.0 2024-06-19 16:48:43,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=22436.333333333332, ans=0.125 2024-06-19 16:48:47,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=22454.666666666668, ans=0.125 2024-06-19 16:48:53,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=22473.0, ans=0.0 2024-06-19 16:48:53,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=22473.0, ans=0.0 2024-06-19 16:49:00,121 INFO [train.py:1028] (0/2) Epoch 2, batch 2150, loss[loss=0.9642, simple_loss=0.6071, pruned_loss=0.6607, over 13268.00 frames. ], tot_loss[loss=0.9799, simple_loss=0.6256, pruned_loss=0.6671, over 2589582.98 frames. ], batch size: 52, lr: 2.48e-02, grad_scale: 0.5 2024-06-19 16:49:13,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=22528.0, ans=0.125 2024-06-19 16:49:15,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.89 vs. limit=22.5 2024-06-19 16:49:31,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=22564.666666666668, ans=0.0 2024-06-19 16:49:33,510 INFO [train.py:1028] (0/2) Epoch 2, batch 2200, loss[loss=0.9836, simple_loss=0.6372, pruned_loss=0.6651, over 13265.00 frames. ], tot_loss[loss=0.9809, simple_loss=0.6261, pruned_loss=0.6679, over 2589267.88 frames. ], batch size: 83, lr: 2.47e-02, grad_scale: 0.25 2024-06-19 16:49:34,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.84 vs. limit=10.0 2024-06-19 16:49:40,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=22583.0, ans=0.125 2024-06-19 16:49:41,120 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.031e+03 6.121e+03 7.867e+03 9.684e+03 3.625e+04, threshold=1.573e+04, percent-clipped=24.0 2024-06-19 16:49:41,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=22583.0, ans=0.125 2024-06-19 16:49:42,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.21 vs. limit=22.5 2024-06-19 16:49:44,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=22601.333333333332, ans=0.07 2024-06-19 16:49:45,687 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.356e-01 2024-06-19 16:49:49,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.63 vs. limit=22.5 2024-06-19 16:49:57,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.93 vs. 
limit=22.5 2024-06-19 16:50:02,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.97 vs. limit=6.0 2024-06-19 16:50:11,817 INFO [train.py:1028] (0/2) Epoch 2, batch 2250, loss[loss=0.963, simple_loss=0.616, pruned_loss=0.655, over 13332.00 frames. ], tot_loss[loss=0.9787, simple_loss=0.6251, pruned_loss=0.6662, over 2587636.89 frames. ], batch size: 63, lr: 2.47e-02, grad_scale: 0.125 2024-06-19 16:50:13,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=22674.666666666668, ans=0.125 2024-06-19 16:50:14,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=20.90 vs. limit=15.0 2024-06-19 16:50:15,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=47.58 vs. limit=15.0 2024-06-19 16:50:22,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=22693.0, ans=0.025 2024-06-19 16:50:24,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=22711.333333333332, ans=0.0 2024-06-19 16:50:25,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=22711.333333333332, ans=0.005932318840579711 2024-06-19 16:50:31,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=22729.666666666668, ans=0.09899494936611666 2024-06-19 16:50:35,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=22729.666666666668, ans=0.005928333333333333 2024-06-19 16:50:38,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=22748.0, ans=0.0 2024-06-19 16:50:40,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=22748.0, ans=0.005924347826086957 2024-06-19 16:50:44,344 INFO [train.py:1028] (0/2) Epoch 2, batch 2300, loss[loss=1.044, simple_loss=0.6555, pruned_loss=0.7166, over 12883.00 frames. ], tot_loss[loss=0.9809, simple_loss=0.6259, pruned_loss=0.668, over 2582408.89 frames. ], batch size: 33, lr: 2.47e-02, grad_scale: 0.25 2024-06-19 16:50:44,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.31 vs. limit=22.5 2024-06-19 16:50:49,091 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.948e+03 5.933e+03 8.161e+03 9.991e+03 5.428e+04, threshold=1.632e+04, percent-clipped=5.0 2024-06-19 16:50:59,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=22803.0, ans=15.0 2024-06-19 16:50:59,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.84 vs. limit=15.0 2024-06-19 16:51:02,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.68 vs. 
limit=10.0 2024-06-19 16:51:03,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.17 vs. limit=22.5 2024-06-19 16:51:07,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=22821.333333333332, ans=0.125 2024-06-19 16:51:10,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=22821.333333333332, ans=0.025 2024-06-19 16:51:13,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.54 vs. limit=22.5 2024-06-19 16:51:17,890 INFO [train.py:1028] (0/2) Epoch 2, batch 2350, loss[loss=0.9373, simple_loss=0.5978, pruned_loss=0.6384, over 13277.00 frames. ], tot_loss[loss=0.982, simple_loss=0.6261, pruned_loss=0.6689, over 2585642.14 frames. ], batch size: 67, lr: 2.46e-02, grad_scale: 0.25 2024-06-19 16:51:23,404 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.33 vs. limit=10.0 2024-06-19 16:51:23,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22876.333333333332, ans=0.1 2024-06-19 16:51:23,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=22876.333333333332, ans=0.2 2024-06-19 16:51:24,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22876.333333333332, ans=0.125 2024-06-19 16:51:29,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=22876.333333333332, ans=0.2 2024-06-19 16:51:40,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.44 vs. limit=22.5 2024-06-19 16:51:45,357 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=21.39 vs. limit=15.0 2024-06-19 16:51:49,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=22931.333333333332, ans=0.0 2024-06-19 16:51:49,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-19 16:51:56,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=19.63 vs. limit=15.0 2024-06-19 16:51:56,937 INFO [train.py:1028] (0/2) Epoch 2, batch 2400, loss[loss=0.9987, simple_loss=0.6361, pruned_loss=0.6806, over 13339.00 frames. ], tot_loss[loss=0.9778, simple_loss=0.6234, pruned_loss=0.6661, over 2588083.76 frames. ], batch size: 46, lr: 2.46e-02, grad_scale: 0.5 2024-06-19 16:51:58,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22949.666666666668, ans=0.1 2024-06-19 16:52:01,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.39 vs. 
limit=15.0 2024-06-19 16:52:02,029 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.554e+03 6.217e+03 7.491e+03 9.810e+03 6.010e+04, threshold=1.498e+04, percent-clipped=6.0 2024-06-19 16:52:04,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=22968.0, ans=0.005876521739130435 2024-06-19 16:52:04,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=22968.0, ans=0.005876521739130435 2024-06-19 16:52:06,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=22968.0, ans=0.005876521739130435 2024-06-19 16:52:17,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=26.60 vs. limit=15.0 2024-06-19 16:52:23,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=23023.0, ans=0.125 2024-06-19 16:52:29,441 INFO [train.py:1028] (0/2) Epoch 2, batch 2450, loss[loss=0.9422, simple_loss=0.5947, pruned_loss=0.6449, over 13282.00 frames. ], tot_loss[loss=0.9718, simple_loss=0.6202, pruned_loss=0.6617, over 2583829.07 frames. ], batch size: 63, lr: 2.46e-02, grad_scale: 0.25 2024-06-19 16:52:34,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=22.65 vs. limit=15.0 2024-06-19 16:52:35,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.02 vs. limit=15.0 2024-06-19 16:52:38,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=7.00 vs. limit=6.0 2024-06-19 16:52:43,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=23078.0, ans=0.005852608695652174 2024-06-19 16:52:45,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=23078.0, ans=0.125 2024-06-19 16:52:46,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=23078.0, ans=0.5 2024-06-19 16:52:54,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=23096.333333333332, ans=0.125 2024-06-19 16:52:58,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=23114.666666666668, ans=0.00584463768115942 2024-06-19 16:52:58,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=23114.666666666668, ans=0.125 2024-06-19 16:53:02,376 INFO [train.py:1028] (0/2) Epoch 2, batch 2500, loss[loss=0.9381, simple_loss=0.601, pruned_loss=0.6376, over 13226.00 frames. ], tot_loss[loss=0.9659, simple_loss=0.6172, pruned_loss=0.6573, over 2586880.32 frames. ], batch size: 83, lr: 2.45e-02, grad_scale: 0.125 2024-06-19 16:53:02,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.00 vs. 
limit=10.0 2024-06-19 16:53:05,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23133.0, ans=0.0 2024-06-19 16:53:05,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=23133.0, ans=0.0 2024-06-19 16:53:08,850 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.566e+03 9.530e+03 1.158e+04 1.651e+04 1.128e+05, threshold=2.317e+04, percent-clipped=29.0 2024-06-19 16:53:18,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.44 vs. limit=15.0 2024-06-19 16:53:21,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=23188.0, ans=0.005828695652173913 2024-06-19 16:53:38,891 INFO [train.py:1028] (0/2) Epoch 2, batch 2550, loss[loss=1.098, simple_loss=0.668, pruned_loss=0.7644, over 12554.00 frames. ], tot_loss[loss=0.9636, simple_loss=0.6156, pruned_loss=0.6558, over 2586067.64 frames. ], batch size: 22, lr: 2.45e-02, grad_scale: 0.125 2024-06-19 16:53:47,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.60 vs. limit=22.5 2024-06-19 16:53:57,127 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=8.030e+00 2024-06-19 16:54:00,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=23261.333333333332, ans=0.125 2024-06-19 16:54:03,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=23279.666666666668, ans=0.035 2024-06-19 16:54:06,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.17 vs. limit=10.0 2024-06-19 16:54:12,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=23298.0, ans=0.2 2024-06-19 16:54:14,673 INFO [train.py:1028] (0/2) Epoch 2, batch 2600, loss[loss=0.9832, simple_loss=0.6159, pruned_loss=0.6753, over 13239.00 frames. ], tot_loss[loss=0.9625, simple_loss=0.614, pruned_loss=0.6556, over 2586177.41 frames. ], batch size: 52, lr: 2.45e-02, grad_scale: 0.25 2024-06-19 16:54:17,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=23316.333333333332, ans=22.5 2024-06-19 16:54:17,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.38 vs. limit=15.0 2024-06-19 16:54:18,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=23316.333333333332, ans=0.0 2024-06-19 16:54:21,271 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.178e+03 5.812e+03 7.896e+03 9.891e+03 5.199e+04, threshold=1.579e+04, percent-clipped=4.0 2024-06-19 16:54:26,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=29.55 vs. 
limit=22.5 2024-06-19 16:54:31,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=23353.0, ans=0.125 2024-06-19 16:54:38,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.53 vs. limit=22.5 2024-06-19 16:54:39,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.76 vs. limit=15.0 2024-06-19 16:54:48,408 INFO [train.py:1028] (0/2) Epoch 2, batch 2650, loss[loss=0.8537, simple_loss=0.5637, pruned_loss=0.5718, over 12965.00 frames. ], tot_loss[loss=0.958, simple_loss=0.6111, pruned_loss=0.6524, over 2587601.96 frames. ], batch size: 144, lr: 2.44e-02, grad_scale: 0.25 2024-06-19 16:55:01,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=20.16 vs. limit=15.0 2024-06-19 16:55:02,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=23444.666666666668, ans=15.0 2024-06-19 16:55:04,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=33.98 vs. limit=15.0 2024-06-19 16:55:07,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=23444.666666666668, ans=0.2 2024-06-19 16:55:08,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=23463.0, ans=0.1 2024-06-19 16:55:09,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.05 vs. limit=15.0 2024-06-19 16:55:14,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.27 vs. limit=15.0 2024-06-19 16:55:15,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.49 vs. limit=22.5 2024-06-19 16:55:18,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=23481.333333333332, ans=0.125 2024-06-19 16:55:22,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=23481.333333333332, ans=0.125 2024-06-19 16:55:23,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=23481.333333333332, ans=0.005764927536231884 2024-06-19 16:55:25,219 INFO [train.py:1028] (0/2) Epoch 2, batch 2700, loss[loss=0.9583, simple_loss=0.6251, pruned_loss=0.6458, over 13217.00 frames. ], tot_loss[loss=0.9501, simple_loss=0.6075, pruned_loss=0.6463, over 2585216.40 frames. ], batch size: 89, lr: 2.44e-02, grad_scale: 0.5 2024-06-19 16:55:28,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=49.13 vs. limit=15.0 2024-06-19 16:55:30,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=24.91 vs. 
limit=15.0 2024-06-19 16:55:31,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.46 vs. limit=22.5 2024-06-19 16:55:31,853 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.261e+03 3.882e+03 4.889e+03 6.103e+03 2.908e+04, threshold=9.777e+03, percent-clipped=1.0 2024-06-19 16:55:37,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=23536.333333333332, ans=0.0 2024-06-19 16:55:46,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=23536.333333333332, ans=15.0 2024-06-19 16:55:50,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=23554.666666666668, ans=0.125 2024-06-19 16:56:01,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.14 vs. limit=15.0 2024-06-19 16:56:02,134 INFO [train.py:1028] (0/2) Epoch 2, batch 2750, loss[loss=0.9903, simple_loss=0.6232, pruned_loss=0.6787, over 13356.00 frames. ], tot_loss[loss=0.9451, simple_loss=0.6048, pruned_loss=0.6427, over 2582607.03 frames. ], batch size: 43, lr: 2.44e-02, grad_scale: 0.25 2024-06-19 16:56:05,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=23591.333333333332, ans=0.0 2024-06-19 16:56:07,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=23591.333333333332, ans=0.0 2024-06-19 16:56:16,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=23628.0, ans=0.125 2024-06-19 16:56:22,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.06 vs. limit=15.0 2024-06-19 16:56:27,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=23646.333333333332, ans=0.005729057971014493 2024-06-19 16:56:35,585 INFO [train.py:1028] (0/2) Epoch 2, batch 2800, loss[loss=0.8846, simple_loss=0.6003, pruned_loss=0.5845, over 10697.00 frames. ], tot_loss[loss=0.9436, simple_loss=0.6039, pruned_loss=0.6416, over 2579845.83 frames. ], batch size: 304, lr: 2.43e-02, grad_scale: 0.5 2024-06-19 16:56:36,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2024-06-19 16:56:37,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=23683.0, ans=0.07 2024-06-19 16:56:42,916 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.304e+03 3.671e+03 4.627e+03 6.254e+03 2.311e+04, threshold=9.254e+03, percent-clipped=6.0 2024-06-19 16:56:47,030 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.49 vs. 
limit=15.0 2024-06-19 16:56:50,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=23719.666666666668, ans=0.125 2024-06-19 16:57:11,342 INFO [train.py:1028] (0/2) Epoch 2, batch 2850, loss[loss=0.985, simple_loss=0.6291, pruned_loss=0.6704, over 13240.00 frames. ], tot_loss[loss=0.9383, simple_loss=0.6016, pruned_loss=0.6375, over 2577735.97 frames. ], batch size: 49, lr: 2.43e-02, grad_scale: 0.25 2024-06-19 16:57:12,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=23774.666666666668, ans=0.0 2024-06-19 16:57:16,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=23774.666666666668, ans=0.125 2024-06-19 16:57:21,458 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.78 vs. limit=22.5 2024-06-19 16:57:22,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2024-06-19 16:57:31,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=23811.333333333332, ans=0.125 2024-06-19 16:57:35,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.16 vs. limit=15.0 2024-06-19 16:57:36,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=32.11 vs. limit=15.0 2024-06-19 16:57:43,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.79 vs. limit=15.0 2024-06-19 16:57:45,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23848.0, ans=0.125 2024-06-19 16:57:47,031 INFO [train.py:1028] (0/2) Epoch 2, batch 2900, loss[loss=0.9635, simple_loss=0.6078, pruned_loss=0.6596, over 13103.00 frames. ], tot_loss[loss=0.9321, simple_loss=0.5984, pruned_loss=0.6329, over 2585178.70 frames. ], batch size: 55, lr: 2.42e-02, grad_scale: 0.5 2024-06-19 16:57:55,350 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+03 2.936e+03 3.943e+03 5.282e+03 2.928e+04, threshold=7.887e+03, percent-clipped=7.0 2024-06-19 16:57:59,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=23884.666666666668, ans=0.125 2024-06-19 16:58:02,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.91 vs. limit=6.0 2024-06-19 16:58:07,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=23921.333333333332, ans=0.125 2024-06-19 16:58:21,078 INFO [train.py:1028] (0/2) Epoch 2, batch 2950, loss[loss=0.8908, simple_loss=0.5598, pruned_loss=0.6109, over 13279.00 frames. ], tot_loss[loss=0.932, simple_loss=0.5982, pruned_loss=0.6329, over 2580093.47 frames. 
2024-06-19 16:58:23,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=23958.0, ans=0.125 2024-06-19 16:58:23,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=23958.0, ans=0.125 2024-06-19 16:58:34,739 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.20 vs. limit=10.0 2024-06-19 16:58:35,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=23994.666666666668, ans=0.125 2024-06-19 16:58:41,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=12.0 2024-06-19 16:58:45,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=24013.0, ans=10.0 2024-06-19 16:58:54,955 INFO [train.py:1028] (0/2) Epoch 2, batch 3000, loss[loss=0.9532, simple_loss=0.6053, pruned_loss=0.6506, over 13213.00 frames. ], tot_loss[loss=0.9275, simple_loss=0.5954, pruned_loss=0.6298, over 2579724.05 frames. ], batch size: 59, lr: 2.42e-02, grad_scale: 0.5 2024-06-19 16:58:54,956 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 16:59:05,801 INFO [train.py:1060] (0/2) Epoch 2, validation: loss=0.9755, simple_loss=0.6278, pruned_loss=0.6616, over 351949.00 frames. 2024-06-19 16:59:05,802 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB 2024-06-19 16:59:11,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=24049.666666666668, ans=0.0 2024-06-19 16:59:13,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24068.0, ans=0.1 2024-06-19 16:59:14,396 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+03 3.929e+03 5.019e+03 6.116e+03 2.164e+04, threshold=1.004e+04, percent-clipped=11.0 2024-06-19 16:59:17,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.92 vs. limit=15.0 2024-06-19 16:59:28,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=24104.666666666668, ans=0.0 2024-06-19 16:59:28,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=24104.666666666668, ans=0.2 2024-06-19 16:59:42,158 INFO [train.py:1028] (0/2) Epoch 2, batch 3050, loss[loss=1.027, simple_loss=0.6475, pruned_loss=0.7035, over 13311.00 frames. ], tot_loss[loss=0.9231, simple_loss=0.5936, pruned_loss=0.6264, over 2578710.21 frames. ], batch size: 46, lr: 2.41e-02, grad_scale: 0.5
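
The scaling.py:214 ScheduledFloat entries record regularization hyper-parameters (dropout rates, skip rates, balancer targets, whitening limits) that are not constants but functions of batch_count; ans is the value in effect at that point in training. A sketch of a piecewise-linear schedule of this kind (the breakpoints below are hypothetical, chosen only for illustration):

```python
import bisect

# Sketch of a batch-count-keyed schedule like the ScheduledFloat values
# logged above ('ans' would be schedule(batch_count)). The breakpoints
# here are hypothetical, not taken from the recipe.
class PiecewiseLinearSchedule:
    def __init__(self, points):
        # points: [(batch_count, value), ...], sorted by batch_count
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

# e.g. a skip-rate that decays from 0.5 to 0.0 over the first 4000 batches:
skip_rate = PiecewiseLinearSchedule([(0, 0.5), (4000, 0.0)])
print(skip_rate(2000))  # 0.25
```
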
2024-06-19 16:59:42,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=24141.333333333332, ans=0.125 2024-06-19 16:59:43,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=24141.333333333332, ans=0.125 2024-06-19 16:59:49,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=24159.666666666668, ans=0.0 2024-06-19 17:00:03,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=24196.333333333332, ans=0.025 2024-06-19 17:00:07,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.48 vs. limit=22.5 2024-06-19 17:00:13,980 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=11.14 vs. limit=10.0 2024-06-19 17:00:15,466 INFO [train.py:1028] (0/2) Epoch 2, batch 3100, loss[loss=0.8487, simple_loss=0.5594, pruned_loss=0.569, over 12982.00 frames. ], tot_loss[loss=0.919, simple_loss=0.5911, pruned_loss=0.6235, over 2578299.85 frames. ], batch size: 144, lr: 2.41e-02, grad_scale: 0.5 2024-06-19 17:00:19,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24233.0, ans=0.1 2024-06-19 17:00:25,954 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.912e+03 5.586e+03 7.131e+03 1.055e+04 3.160e+04, threshold=1.426e+04, percent-clipped=25.0 2024-06-19 17:00:26,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=24251.333333333332, ans=22.5 2024-06-19 17:00:30,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.74 vs. limit=10.0 2024-06-19 17:00:32,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0 2024-06-19 17:00:32,639 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.84 vs. limit=10.0 2024-06-19 17:00:36,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24288.0, ans=0.1 2024-06-19 17:00:39,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=24288.0, ans=0.0 2024-06-19 17:00:42,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=24306.333333333332, ans=0.125 2024-06-19 17:00:48,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. limit=6.0 2024-06-19 17:00:50,105 INFO [train.py:1028] (0/2) Epoch 2, batch 3150, loss[loss=0.851, simple_loss=0.5666, pruned_loss=0.5677, over 12855.00 frames. ], tot_loss[loss=0.9156, simple_loss=0.5894, pruned_loss=0.6209, over 2580955.77 frames.
], batch size: 158, lr: 2.41e-02, grad_scale: 0.25 2024-06-19 17:00:54,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=24324.666666666668, ans=0.125 2024-06-19 17:00:57,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24343.0, ans=0.125 2024-06-19 17:01:02,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=29.44 vs. limit=15.0 2024-06-19 17:01:13,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=24379.666666666668, ans=0.125 2024-06-19 17:01:13,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=15.0 2024-06-19 17:01:15,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-19 17:01:26,604 INFO [train.py:1028] (0/2) Epoch 2, batch 3200, loss[loss=0.9201, simple_loss=0.5792, pruned_loss=0.6305, over 13112.00 frames. ], tot_loss[loss=0.92, simple_loss=0.5904, pruned_loss=0.6248, over 2581362.84 frames. ], batch size: 55, lr: 2.40e-02, grad_scale: 0.5 2024-06-19 17:01:26,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24416.333333333332, ans=0.1 2024-06-19 17:01:30,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=24416.333333333332, ans=0.005561666666666667 2024-06-19 17:01:31,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=24416.333333333332, ans=0.125 2024-06-19 17:01:32,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.29 vs. limit=22.5 2024-06-19 17:01:33,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=24434.666666666668, ans=0.09899494936611666 2024-06-19 17:01:37,269 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.665e+03 6.229e+03 7.152e+03 9.416e+03 4.728e+04, threshold=1.430e+04, percent-clipped=6.0 2024-06-19 17:01:39,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=24453.0, ans=0.125 2024-06-19 17:01:44,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=24453.0, ans=0.0 2024-06-19 17:01:50,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=24471.333333333332, ans=0.07 2024-06-19 17:01:54,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.56 vs. limit=15.0 2024-06-19 17:01:56,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.62 vs. 
limit=10.0 2024-06-19 17:01:59,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=24489.666666666668, ans=0.125 2024-06-19 17:02:03,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=26.91 vs. limit=15.0 2024-06-19 17:02:03,377 INFO [train.py:1028] (0/2) Epoch 2, batch 3250, loss[loss=0.9249, simple_loss=0.583, pruned_loss=0.6334, over 13183.00 frames. ], tot_loss[loss=0.9188, simple_loss=0.5887, pruned_loss=0.6244, over 2584952.28 frames. ], batch size: 72, lr: 2.40e-02, grad_scale: 0.125 2024-06-19 17:02:08,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=24508.0, ans=0.0 2024-06-19 17:02:14,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=24526.333333333332, ans=0.2 2024-06-19 17:02:23,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=24544.666666666668, ans=0.005533768115942029 2024-06-19 17:02:24,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.66 vs. limit=15.0 2024-06-19 17:02:38,667 INFO [train.py:1028] (0/2) Epoch 2, batch 3300, loss[loss=0.9195, simple_loss=0.6188, pruned_loss=0.6101, over 12669.00 frames. ], tot_loss[loss=0.9181, simple_loss=0.5876, pruned_loss=0.6243, over 2580465.30 frames. ], batch size: 176, lr: 2.40e-02, grad_scale: 0.25 2024-06-19 17:02:44,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=15.0 2024-06-19 17:02:44,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=24618.0, ans=0.125 2024-06-19 17:02:51,030 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.694e+03 4.848e+03 5.677e+03 7.153e+03 8.373e+04, threshold=1.135e+04, percent-clipped=8.0 2024-06-19 17:03:15,475 INFO [train.py:1028] (0/2) Epoch 2, batch 3350, loss[loss=0.8599, simple_loss=0.5753, pruned_loss=0.5722, over 12907.00 frames. ], tot_loss[loss=0.9127, simple_loss=0.5857, pruned_loss=0.6198, over 2575922.73 frames. ], batch size: 158, lr: 2.39e-02, grad_scale: 0.125 2024-06-19 17:03:19,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.70 vs. limit=22.5 2024-06-19 17:03:24,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=24709.666666666668, ans=0.005497898550724637 2024-06-19 17:03:27,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=24709.666666666668, ans=0.1 2024-06-19 17:03:33,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=24728.0, ans=0.0 2024-06-19 17:03:37,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24746.333333333332, ans=0.1 2024-06-19 17:03:38,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.69 vs. limit=22.5
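
The balancer entries above (balancer1/balancer2 with prob, max_abs, min_positive) refer to constraints on per-channel activation statistics: roughly, the fraction of positive values and the typical magnitude of each channel should stay inside configured bounds, with the logged prob plausibly being the probability that the constraint is enforced on a given step. A diagnostic sketch of the statistics involved (the real module acts on gradients; that enforcement mechanism is not reproduced here, and the bounds below are illustrative assumptions):

```python
import torch

# Sketch of the per-channel statistics a 'balancer' entry refers to.
# The bounds and the plain check below are assumptions for illustration.
def balancer_diagnostic(x: torch.Tensor, channel_dim: int = -1,
                        min_positive: float = 0.05, max_abs: float = 10.0):
    dims = [d for d in range(x.ndim) if d != channel_dim % x.ndim]
    frac_pos = (x > 0).float().mean(dim=dims)  # fraction positive, per channel
    mean_abs = x.abs().mean(dim=dims)          # typical magnitude, per channel
    return frac_pos < min_positive, mean_abs > max_abs

x = torch.randn(4, 100, 256)  # (batch, time, channels)
bad_pos, bad_abs = balancer_diagnostic(x, channel_dim=-1)
print(int(bad_pos.sum()), "channels below min_positive;",
      int(bad_abs.sum()), "channels above max_abs")
```
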
2024-06-19 17:03:40,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=24746.333333333332, ans=0.125 2024-06-19 17:03:44,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.19 vs. limit=15.0 2024-06-19 17:03:46,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.84 vs. limit=15.0 2024-06-19 17:03:51,813 INFO [train.py:1028] (0/2) Epoch 2, batch 3400, loss[loss=1.037, simple_loss=0.6362, pruned_loss=0.7189, over 12612.00 frames. ], tot_loss[loss=0.9105, simple_loss=0.5846, pruned_loss=0.6182, over 2574807.73 frames. ], batch size: 22, lr: 2.39e-02, grad_scale: 0.25 2024-06-19 17:03:55,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=24783.0, ans=0.005481956521739131 2024-06-19 17:03:58,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=24801.333333333332, ans=0.125 2024-06-19 17:04:03,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.04 vs. limit=22.5 2024-06-19 17:04:04,082 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.343e+03 4.329e+03 5.335e+03 7.046e+03 2.045e+04, threshold=1.067e+04, percent-clipped=6.0 2024-06-19 17:04:13,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24838.0, ans=0.125 2024-06-19 17:04:17,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24838.0, ans=0.1 2024-06-19 17:04:25,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=18.40 vs. limit=15.0 2024-06-19 17:04:26,104 INFO [train.py:1028] (0/2) Epoch 2, batch 3450, loss[loss=0.8865, simple_loss=0.5854, pruned_loss=0.5939, over 12740.00 frames. ], tot_loss[loss=0.9102, simple_loss=0.5839, pruned_loss=0.6182, over 2576693.74 frames. ], batch size: 176, lr: 2.39e-02, grad_scale: 0.25 2024-06-19 17:04:27,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.81 vs. limit=15.0 2024-06-19 17:04:28,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.72 vs. limit=12.0 2024-06-19 17:04:29,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.65 vs. limit=22.5 2024-06-19 17:04:42,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=22.5 2024-06-19 17:04:43,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.08 vs. limit=22.5 2024-06-19 17:04:44,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=23.12 vs.
limit=15.0 2024-06-19 17:04:45,520 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.62 vs. limit=15.0 2024-06-19 17:04:47,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=24929.666666666668, ans=0.2 2024-06-19 17:04:48,773 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.65 vs. limit=15.0 2024-06-19 17:04:49,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=15.0 2024-06-19 17:04:55,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=27.51 vs. limit=22.5 2024-06-19 17:04:56,042 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.54 vs. limit=22.5 2024-06-19 17:04:58,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=24948.0, ans=0.025 2024-06-19 17:05:00,606 INFO [train.py:1028] (0/2) Epoch 2, batch 3500, loss[loss=0.8935, simple_loss=0.5576, pruned_loss=0.6147, over 12878.00 frames. ], tot_loss[loss=0.9087, simple_loss=0.5827, pruned_loss=0.6173, over 2575811.82 frames. ], batch size: 33, lr: 2.38e-02, grad_scale: 0.5 2024-06-19 17:05:15,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=24984.666666666668, ans=0.0 2024-06-19 17:05:16,697 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.513e+03 3.727e+03 4.442e+03 5.346e+03 2.556e+04, threshold=8.884e+03, percent-clipped=6.0 2024-06-19 17:05:27,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=25021.333333333332, ans=0.125 2024-06-19 17:05:28,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.48 vs. limit=15.0 2024-06-19 17:05:33,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=25039.666666666668, ans=0.125 2024-06-19 17:05:39,041 INFO [train.py:1028] (0/2) Epoch 2, batch 3550, loss[loss=0.863, simple_loss=0.5567, pruned_loss=0.5846, over 13109.00 frames. ], tot_loss[loss=0.9076, simple_loss=0.5811, pruned_loss=0.617, over 2577062.48 frames. ], batch size: 95, lr: 2.38e-02, grad_scale: 0.5 2024-06-19 17:05:42,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=25058.0, ans=0.125 2024-06-19 17:05:43,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5 2024-06-19 17:05:48,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.03 vs. limit=15.0 2024-06-19 17:06:07,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=29.74 vs. 
limit=15.0 2024-06-19 17:06:15,730 INFO [train.py:1028] (0/2) Epoch 2, batch 3600, loss[loss=0.9503, simple_loss=0.5975, pruned_loss=0.6516, over 13024.00 frames. ], tot_loss[loss=0.9039, simple_loss=0.58, pruned_loss=0.6139, over 2579356.66 frames. ], batch size: 48, lr: 2.38e-02, grad_scale: 0.5 2024-06-19 17:06:18,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=25149.666666666668, ans=0.07 2024-06-19 17:06:20,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.94 vs. limit=15.0 2024-06-19 17:06:20,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=19.76 vs. limit=15.0 2024-06-19 17:06:20,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25149.666666666668, ans=0.1 2024-06-19 17:06:23,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=25168.0, ans=0.04949747468305833 2024-06-19 17:06:24,585 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.87 vs. limit=15.0 2024-06-19 17:06:26,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=25168.0, ans=0.125 2024-06-19 17:06:27,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.65 vs. limit=12.0 2024-06-19 17:06:28,810 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+03 4.574e+03 5.849e+03 8.000e+03 5.142e+04, threshold=1.170e+04, percent-clipped=17.0 2024-06-19 17:06:31,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=25186.333333333332, ans=0.125 2024-06-19 17:06:31,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=25186.333333333332, ans=0.025 2024-06-19 17:06:33,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=32.37 vs. limit=22.5 2024-06-19 17:06:34,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=34.12 vs. limit=22.5 2024-06-19 17:06:37,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=25204.666666666668, ans=0.125 2024-06-19 17:06:44,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.17 vs. limit=10.0 2024-06-19 17:06:50,246 INFO [train.py:1028] (0/2) Epoch 2, batch 3650, loss[loss=0.8262, simple_loss=0.5351, pruned_loss=0.5586, over 13013.00 frames. ], tot_loss[loss=0.9063, simple_loss=0.5812, pruned_loss=0.6156, over 2577740.02 frames. ], batch size: 102, lr: 2.37e-02, grad_scale: 0.5 2024-06-19 17:06:54,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.15 vs. limit=15.0
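
Each scaling.py:1023 Whitening entry compares a measured whitening metric of some intermediate feature against a limit (metric=... vs. limit=...); values well above the limit indicate a channel covariance that is far from isotropic. One plausible formulation of such a metric, stated as an assumption rather than the recipe's exact code:

```python
import torch

# Plausible 'whitening metric': with C the (n x n) feature covariance,
#   metric = mean(eig(C)^2) / mean(eig(C))^2 = trace(C @ C) * n / trace(C)**2,
# which is 1.0 for perfectly white features and grows as the eigenvalue
# spectrum becomes lopsided. This formulation is an assumption.
def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    n = cov.shape[0]
    return (torch.trace(cov @ cov) * n / torch.trace(cov) ** 2).item()

white = torch.randn(10000, 256)
skewed = white * torch.linspace(0.1, 3.0, 256)  # very unequal channel scales
print(whitening_metric(white))   # close to 1.0
print(whitening_metric(skewed))  # noticeably larger
```
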
2024-06-19 17:06:58,788 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=15.0 2024-06-19 17:07:08,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25278.0, ans=0.1 2024-06-19 17:07:14,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=25296.333333333332, ans=0.125 2024-06-19 17:07:19,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=25296.333333333332, ans=0.00537036231884058 2024-06-19 17:07:27,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=25314.666666666668, ans=0.2 2024-06-19 17:07:28,997 INFO [train.py:1028] (0/2) Epoch 2, batch 3700, loss[loss=0.8613, simple_loss=0.5555, pruned_loss=0.5835, over 13274.00 frames. ], tot_loss[loss=0.9015, simple_loss=0.5787, pruned_loss=0.6121, over 2582405.31 frames. ], batch size: 72, lr: 2.37e-02, grad_scale: 1.0 2024-06-19 17:07:31,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.33 vs. limit=6.0 2024-06-19 17:07:36,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=25351.333333333332, ans=0.025 2024-06-19 17:07:43,275 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.525e+03 5.418e+03 6.607e+03 8.945e+03 3.636e+04, threshold=1.321e+04, percent-clipped=15.0 2024-06-19 17:07:43,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=25369.666666666668, ans=0.0 2024-06-19 17:07:45,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25369.666666666668, ans=0.1 2024-06-19 17:07:46,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=25369.666666666668, ans=0.125 2024-06-19 17:07:48,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0 2024-06-19 17:07:58,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=25406.333333333332, ans=0.125 2024-06-19 17:08:01,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.87 vs. limit=15.0 2024-06-19 17:08:03,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.37 vs. limit=22.5 2024-06-19 17:08:05,412 INFO [train.py:1028] (0/2) Epoch 2, batch 3750, loss[loss=0.9613, simple_loss=0.5857, pruned_loss=0.6685, over 12624.00 frames. ], tot_loss[loss=0.9003, simple_loss=0.5779, pruned_loss=0.6113, over 2584956.27 frames.
], batch size: 22, lr: 2.37e-02, grad_scale: 0.125 2024-06-19 17:08:05,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25424.666666666668, ans=0.125 2024-06-19 17:08:06,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=28.67 vs. limit=22.5 2024-06-19 17:08:12,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=25443.0, ans=0.2 2024-06-19 17:08:14,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25443.0, ans=0.125 2024-06-19 17:08:23,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=25461.333333333332, ans=0.2 2024-06-19 17:08:26,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=25479.666666666668, ans=0.2 2024-06-19 17:08:28,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=25479.666666666668, ans=0.025 2024-06-19 17:08:38,377 INFO [train.py:1028] (0/2) Epoch 2, batch 3800, loss[loss=0.9346, simple_loss=0.5965, pruned_loss=0.6364, over 13230.00 frames. ], tot_loss[loss=0.9032, simple_loss=0.5786, pruned_loss=0.6139, over 2583217.11 frames. ], batch size: 83, lr: 2.36e-02, grad_scale: 0.25 2024-06-19 17:08:41,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=25516.333333333332, ans=0.125 2024-06-19 17:08:42,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=38.67 vs. limit=22.5 2024-06-19 17:08:44,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=38.39 vs. limit=22.5 2024-06-19 17:08:45,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25534.666666666668, ans=0.1 2024-06-19 17:08:53,129 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.455e+03 7.849e+03 9.486e+03 1.301e+04 3.817e+04, threshold=1.897e+04, percent-clipped=21.0 2024-06-19 17:08:54,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.35 vs. limit=22.5 2024-06-19 17:08:58,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.17 vs. limit=6.0 2024-06-19 17:09:00,998 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=35.01 vs. limit=22.5 2024-06-19 17:09:01,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.82 vs. limit=10.0 2024-06-19 17:09:06,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.52 vs. 
limit=15.0 2024-06-19 17:09:11,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=36.61 vs. limit=22.5 2024-06-19 17:09:12,240 INFO [train.py:1028] (0/2) Epoch 2, batch 3850, loss[loss=0.8515, simple_loss=0.5659, pruned_loss=0.5686, over 13032.00 frames. ], tot_loss[loss=0.9071, simple_loss=0.5796, pruned_loss=0.6173, over 2583334.14 frames. ], batch size: 144, lr: 2.36e-02, grad_scale: 0.25 2024-06-19 17:09:13,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=25608.0, ans=0.125 2024-06-19 17:09:24,671 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=4.394e+00 2024-06-19 17:09:28,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=25644.666666666668, ans=0.125 2024-06-19 17:09:30,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=25644.666666666668, ans=0.125 2024-06-19 17:09:43,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25681.333333333332, ans=0.125 2024-06-19 17:09:49,216 INFO [train.py:1028] (0/2) Epoch 2, batch 3900, loss[loss=0.8658, simple_loss=0.5617, pruned_loss=0.585, over 13253.00 frames. ], tot_loss[loss=0.9022, simple_loss=0.5783, pruned_loss=0.613, over 2587161.95 frames. ], batch size: 83, lr: 2.36e-02, grad_scale: 0.25 2024-06-19 17:09:49,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=25699.666666666668, ans=0.0 2024-06-19 17:09:56,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=25718.0, ans=0.0 2024-06-19 17:09:57,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.77 vs. limit=15.0 2024-06-19 17:09:59,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=25718.0, ans=0.125 2024-06-19 17:10:07,966 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.584e+03 6.112e+03 8.278e+03 1.239e+04 5.655e+04, threshold=1.656e+04, percent-clipped=6.0 2024-06-19 17:10:10,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25736.333333333332, ans=0.125 2024-06-19 17:10:12,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=25754.666666666668, ans=0.125 2024-06-19 17:10:14,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=25754.666666666668, ans=0.00527072463768116 2024-06-19 17:10:14,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=25754.666666666668, ans=0.125 2024-06-19 17:10:22,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.29 vs. limit=15.0 2024-06-19 17:10:25,648 INFO [train.py:1028] (0/2) Epoch 2, batch 3950, loss[loss=0.8433, simple_loss=0.5578, pruned_loss=0.5644, over 13122.00 frames. 
], tot_loss[loss=0.8988, simple_loss=0.5769, pruned_loss=0.6103, over 2589234.62 frames. ], batch size: 132, lr: 2.35e-02, grad_scale: 0.125 2024-06-19 17:10:26,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25791.333333333332, ans=0.1 2024-06-19 17:10:30,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=25791.333333333332, ans=0.125 2024-06-19 17:10:30,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=25791.333333333332, ans=0.125 2024-06-19 17:10:35,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=25809.666666666668, ans=15.0 2024-06-19 17:10:38,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.60 vs. limit=22.5 2024-06-19 17:10:40,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.46 vs. limit=22.5 2024-06-19 17:10:40,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=25828.0, ans=0.125 2024-06-19 17:10:42,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.88 vs. limit=15.0 2024-06-19 17:10:43,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.82 vs. limit=15.0 2024-06-19 17:10:47,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-19 17:10:51,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=29.93 vs. limit=15.0 2024-06-19 17:10:52,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25864.666666666668, ans=0.1 2024-06-19 17:10:53,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=41.56 vs. limit=15.0 2024-06-19 17:10:54,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.01 vs. limit=10.0 2024-06-19 17:11:00,026 INFO [train.py:1028] (0/2) Epoch 2, batch 4000, loss[loss=0.9663, simple_loss=0.6147, pruned_loss=0.659, over 12912.00 frames. ], tot_loss[loss=0.8959, simple_loss=0.5756, pruned_loss=0.6081, over 2583396.10 frames. ], batch size: 39, lr: 2.35e-02, grad_scale: 0.25 2024-06-19 17:11:06,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=25901.333333333332, ans=0.125 2024-06-19 17:11:08,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.40 vs. 
limit=6.0 2024-06-19 17:11:16,506 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.331e+03 6.234e+03 8.336e+03 1.220e+04 4.733e+04, threshold=1.667e+04, percent-clipped=10.0 2024-06-19 17:11:28,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.03 vs. limit=22.5 2024-06-19 17:11:31,184 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=7.290e+01 2024-06-19 17:11:34,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2024-06-19 17:11:36,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=25974.666666666668, ans=0.125 2024-06-19 17:11:37,184 INFO [train.py:1028] (0/2) Epoch 2, batch 4050, loss[loss=0.7995, simple_loss=0.5461, pruned_loss=0.5265, over 11060.00 frames. ], tot_loss[loss=0.8936, simple_loss=0.5745, pruned_loss=0.6064, over 2581511.91 frames. ], batch size: 303, lr: 2.35e-02, grad_scale: 0.25 2024-06-19 17:11:44,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.86 vs. limit=15.0 2024-06-19 17:11:45,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=25993.0, ans=0.125 2024-06-19 17:11:51,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=26011.333333333332, ans=0.05 2024-06-19 17:12:05,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0 2024-06-19 17:12:08,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=26048.0, ans=0.025 2024-06-19 17:12:08,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=12.0 2024-06-19 17:12:11,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2024-06-19 17:12:11,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=26048.0, ans=0.025 2024-06-19 17:12:14,328 INFO [train.py:1028] (0/2) Epoch 2, batch 4100, loss[loss=0.8774, simple_loss=0.5741, pruned_loss=0.5904, over 13009.00 frames. ], tot_loss[loss=0.8929, simple_loss=0.5738, pruned_loss=0.606, over 2578704.77 frames. ], batch size: 102, lr: 2.34e-02, grad_scale: 0.5 2024-06-19 17:12:17,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=26066.333333333332, ans=0.125 2024-06-19 17:12:22,922 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.97 vs. limit=15.0
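
The optim.py:487 WARNING entries summarize the gradient-norm statistics driving adaptive clipping: the five numbers are quantiles (min, 25%, median, 75%, max) of recent gradient norms, threshold is the clipping cutoff in effect, and percent-clipped is how often it fired. A sketch of quartile-based clipping consistent with these fields (deriving the threshold as Clipping_scale times the median of a recent window is an assumption about the exact rule):

```python
import collections
import statistics

# Sketch of quartile-based gradient clipping matching the WARNING fields
# above. threshold = clipping_scale * median is an assumed rule; the
# window size is also an assumption.
class QuartileClipper:
    def __init__(self, clipping_scale=2.0, window=128):
        self.scale = clipping_scale
        self.norms = collections.deque(maxlen=window)
        self.clipped = 0
        self.seen = 0

    def clip_(self, params):
        params = [p for p in params if p.grad is not None]
        norm = sum(float((p.grad ** 2).sum()) for p in params) ** 0.5
        self.norms.append(norm)
        # statistics.quantiles(..., n=4) -> [25%, median, 75%] cut points
        q = statistics.quantiles(self.norms, n=4) if len(self.norms) >= 4 else None
        threshold = self.scale * q[1] if q else float("inf")
        self.seen += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)  # rescale in place
        return norm, threshold, 100.0 * self.clipped / self.seen

# usage (model assumed to exist):
#   norm, thr, pct = clipper.clip_(model.parameters())
```
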
2024-06-19 17:12:28,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=26103.0, ans=0.125 2024-06-19 17:12:30,938 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.681e+03 6.194e+03 7.838e+03 1.003e+04 3.680e+04, threshold=1.568e+04, percent-clipped=6.0 2024-06-19 17:12:32,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=26103.0, ans=0.2 2024-06-19 17:12:34,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.71 vs. limit=22.5 2024-06-19 17:12:36,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.51 vs. limit=22.5 2024-06-19 17:12:38,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26121.333333333332, ans=0.1 2024-06-19 17:12:40,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=26121.333333333332, ans=0.0 2024-06-19 17:12:49,103 INFO [train.py:1028] (0/2) Epoch 2, batch 4150, loss[loss=0.9151, simple_loss=0.5822, pruned_loss=0.624, over 13136.00 frames. ], tot_loss[loss=0.8913, simple_loss=0.5721, pruned_loss=0.6052, over 2576154.22 frames. ], batch size: 55, lr: 2.34e-02, grad_scale: 0.125 2024-06-19 17:12:52,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.61 vs. limit=15.0 2024-06-19 17:13:17,791 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=21.09 vs. limit=15.0 2024-06-19 17:13:19,825 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.95 vs. limit=15.0 2024-06-19 17:13:21,976 INFO [train.py:1028] (0/2) Epoch 2, batch 4200, loss[loss=0.9006, simple_loss=0.5845, pruned_loss=0.6083, over 12996.00 frames. ], tot_loss[loss=0.8881, simple_loss=0.5708, pruned_loss=0.6027, over 2578643.05 frames. ], batch size: 102, lr: 2.34e-02, grad_scale: 0.25 2024-06-19 17:13:39,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.29 vs. limit=15.0 2024-06-19 17:13:42,039 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.212e+03 6.757e+03 8.192e+03 1.010e+04 4.087e+04, threshold=1.638e+04, percent-clipped=5.0 2024-06-19 17:13:45,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=26304.666666666668, ans=0.0 2024-06-19 17:13:47,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=26304.666666666668, ans=0.005151159420289855 2024-06-19 17:13:53,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26323.0, ans=0.1 2024-06-19 17:13:58,225 INFO [train.py:1028] (0/2) Epoch 2, batch 4250, loss[loss=0.8947, simple_loss=0.5725, pruned_loss=0.6084, over 13278.00 frames. ], tot_loss[loss=0.8896, simple_loss=0.5718, pruned_loss=0.6037, over 2580922.36 frames.
], batch size: 46, lr: 2.33e-02, grad_scale: 0.25 2024-06-19 17:13:59,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.65 vs. limit=15.0 2024-06-19 17:14:01,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=26341.333333333332, ans=0.005143188405797102 2024-06-19 17:14:02,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=26341.333333333332, ans=0.0 2024-06-19 17:14:03,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-19 17:14:04,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=33.23 vs. limit=15.0 2024-06-19 17:14:04,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=26359.666666666668, ans=0.0 2024-06-19 17:14:04,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=26359.666666666668, ans=0.0 2024-06-19 17:14:04,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=26359.666666666668, ans=0.125 2024-06-19 17:14:05,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=26359.666666666668, ans=0.125 2024-06-19 17:14:05,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=26359.666666666668, ans=0.07 2024-06-19 17:14:11,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=26359.666666666668, ans=0.0 2024-06-19 17:14:16,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=26378.0, ans=0.2 2024-06-19 17:14:20,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=26396.333333333332, ans=0.0 2024-06-19 17:14:25,081 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.71 vs. limit=22.5 2024-06-19 17:14:25,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.32 vs. limit=10.0 2024-06-19 17:14:26,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=26414.666666666668, ans=0.125 2024-06-19 17:14:34,575 INFO [train.py:1028] (0/2) Epoch 2, batch 4300, loss[loss=0.8623, simple_loss=0.5493, pruned_loss=0.5876, over 13173.00 frames. ], tot_loss[loss=0.8884, simple_loss=0.5706, pruned_loss=0.6031, over 2580906.86 frames. ], batch size: 59, lr: 2.33e-02, grad_scale: 0.5 2024-06-19 17:14:41,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26451.333333333332, ans=0.0 2024-06-19 17:14:43,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. 
limit=15.0 2024-06-19 17:14:44,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=26451.333333333332, ans=0.125 2024-06-19 17:14:48,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.92 vs. limit=15.0 2024-06-19 17:14:48,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.85 vs. limit=15.0 2024-06-19 17:14:51,769 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.168e+03 6.667e+03 7.842e+03 1.035e+04 5.575e+04, threshold=1.568e+04, percent-clipped=11.0 2024-06-19 17:15:00,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=15.0 2024-06-19 17:15:04,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.06 vs. limit=10.0 2024-06-19 17:15:07,463 INFO [train.py:1028] (0/2) Epoch 2, batch 4350, loss[loss=0.953, simple_loss=0.6124, pruned_loss=0.6468, over 13219.00 frames. ], tot_loss[loss=0.8845, simple_loss=0.569, pruned_loss=0.6, over 2585248.26 frames. ], batch size: 59, lr: 2.33e-02, grad_scale: 0.125 2024-06-19 17:15:13,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=26524.666666666668, ans=0.2 2024-06-19 17:15:17,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=26543.0, ans=0.125 2024-06-19 17:15:17,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=26543.0, ans=0.125 2024-06-19 17:15:19,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=26543.0, ans=0.025 2024-06-19 17:15:20,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=26561.333333333332, ans=0.125 2024-06-19 17:15:29,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.12 vs. limit=10.0 2024-06-19 17:15:38,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=26598.0, ans=0.005087391304347826 2024-06-19 17:15:39,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=26598.0, ans=0.125 2024-06-19 17:15:41,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=26598.0, ans=0.09899494936611666 2024-06-19 17:15:43,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.21 vs. limit=22.5 2024-06-19 17:15:44,634 INFO [train.py:1028] (0/2) Epoch 2, batch 4400, loss[loss=0.8084, simple_loss=0.5283, pruned_loss=0.5442, over 13228.00 frames. ], tot_loss[loss=0.8791, simple_loss=0.5666, pruned_loss=0.5958, over 2586262.91 frames. 
], batch size: 83, lr: 2.33e-02, grad_scale: 0.25 2024-06-19 17:15:51,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=26634.666666666668, ans=0.0050794202898550725 2024-06-19 17:15:59,233 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=21.10 vs. limit=15.0 2024-06-19 17:15:59,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=26653.0, ans=0.125 2024-06-19 17:16:03,572 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.691e+03 8.130e+03 9.748e+03 1.231e+04 6.746e+04, threshold=1.950e+04, percent-clipped=14.0 2024-06-19 17:16:09,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26671.333333333332, ans=0.1 2024-06-19 17:16:11,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=29.62 vs. limit=22.5 2024-06-19 17:16:12,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=26671.333333333332, ans=0.0 2024-06-19 17:16:21,633 INFO [train.py:1028] (0/2) Epoch 2, batch 4450, loss[loss=0.8973, simple_loss=0.5654, pruned_loss=0.6146, over 12852.00 frames. ], tot_loss[loss=0.878, simple_loss=0.5666, pruned_loss=0.5947, over 2581297.43 frames. ], batch size: 33, lr: 2.32e-02, grad_scale: 0.125 2024-06-19 17:16:21,722 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=6.026e+02 2024-06-19 17:16:24,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=26708.0, ans=0.1 2024-06-19 17:16:27,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=26708.0, ans=0.0 2024-06-19 17:16:27,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=26708.0, ans=0.125 2024-06-19 17:16:27,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.14 vs. limit=15.0 2024-06-19 17:16:31,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2024-06-19 17:16:31,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=26726.333333333332, ans=0.0 2024-06-19 17:16:33,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=26726.333333333332, ans=0.125 2024-06-19 17:16:42,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=26763.0, ans=0.0050515217391304355 2024-06-19 17:16:49,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=26781.333333333332, ans=0.0 2024-06-19 17:16:52,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.72 vs. 
limit=15.0 2024-06-19 17:16:54,992 INFO [train.py:1028] (0/2) Epoch 2, batch 4500, loss[loss=0.84, simple_loss=0.5406, pruned_loss=0.5697, over 13255.00 frames. ], tot_loss[loss=0.8766, simple_loss=0.5657, pruned_loss=0.5938, over 2585537.45 frames. ], batch size: 89, lr: 2.32e-02, grad_scale: 0.25 2024-06-19 17:16:56,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=26799.666666666668, ans=0.1 2024-06-19 17:17:04,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=26818.0, ans=0.125 2024-06-19 17:17:08,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=26836.333333333332, ans=0.0 2024-06-19 17:17:11,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=26836.333333333332, ans=0.125 2024-06-19 17:17:14,731 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.863e+03 9.925e+03 1.187e+04 1.777e+04 1.221e+05, threshold=2.375e+04, percent-clipped=23.0 2024-06-19 17:17:15,233 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0 2024-06-19 17:17:19,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=26854.666666666668, ans=0.0 2024-06-19 17:17:21,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=26854.666666666668, ans=0.125 2024-06-19 17:17:25,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=21.20 vs. limit=15.0 2024-06-19 17:17:29,364 INFO [train.py:1028] (0/2) Epoch 2, batch 4550, loss[loss=0.9593, simple_loss=0.6033, pruned_loss=0.6577, over 13197.00 frames. ], tot_loss[loss=0.8779, simple_loss=0.5657, pruned_loss=0.5951, over 2589035.43 frames. ], batch size: 52, lr: 2.32e-02, grad_scale: 0.25 2024-06-19 17:17:29,651 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2024-06-19 17:17:35,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=26891.333333333332, ans=0.05 2024-06-19 17:17:35,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=26891.333333333332, ans=0.0050236231884057975 2024-06-19 17:17:38,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=26891.333333333332, ans=22.5 2024-06-19 17:17:38,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=15.0 2024-06-19 17:17:39,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26909.666666666668, ans=0.1 2024-06-19 17:17:41,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.92 vs. 
limit=22.5 2024-06-19 17:17:42,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=26909.666666666668, ans=0.025 2024-06-19 17:17:51,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=26928.0, ans=0.2 2024-06-19 17:17:52,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=26946.333333333332, ans=0.05 2024-06-19 17:17:57,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.06 vs. limit=15.0 2024-06-19 17:17:58,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=23.86 vs. limit=15.0 2024-06-19 17:17:59,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=26964.666666666668, ans=0.1 2024-06-19 17:18:03,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=26964.666666666668, ans=0.125 2024-06-19 17:18:06,125 INFO [train.py:1028] (0/2) Epoch 2, batch 4600, loss[loss=0.8958, simple_loss=0.5839, pruned_loss=0.6039, over 12516.00 frames. ], tot_loss[loss=0.8814, simple_loss=0.5667, pruned_loss=0.5981, over 2584208.41 frames. ], batch size: 202, lr: 2.31e-02, grad_scale: 0.5 2024-06-19 17:18:06,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=26983.0, ans=0.025 2024-06-19 17:18:07,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=26983.0, ans=0.125 2024-06-19 17:18:20,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=27001.333333333332, ans=0.0 2024-06-19 17:18:22,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=27001.333333333332, ans=15.0 2024-06-19 17:18:25,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=27019.666666666668, ans=0.125 2024-06-19 17:18:30,214 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.985e+03 6.616e+03 8.756e+03 1.168e+04 5.373e+04, threshold=1.751e+04, percent-clipped=6.0 2024-06-19 17:18:30,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=27038.0, ans=0.125 2024-06-19 17:18:30,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=27038.0, ans=22.5 2024-06-19 17:18:42,191 INFO [train.py:1028] (0/2) Epoch 2, batch 4650, loss[loss=0.878, simple_loss=0.5732, pruned_loss=0.5914, over 13087.00 frames. ], tot_loss[loss=0.879, simple_loss=0.5647, pruned_loss=0.5966, over 2587675.02 frames. ], batch size: 132, lr: 2.31e-02, grad_scale: 0.125 2024-06-19 17:18:46,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=23.18 vs. 
limit=15.0 2024-06-19 17:18:53,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.94 vs. limit=15.0 2024-06-19 17:18:54,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27093.0, ans=0.1 2024-06-19 17:18:54,458 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=15.0 2024-06-19 17:18:57,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=27111.333333333332, ans=0.2 2024-06-19 17:19:00,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=27111.333333333332, ans=0.0 2024-06-19 17:19:03,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=27129.666666666668, ans=0.125 2024-06-19 17:19:08,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=27148.0, ans=0.0 2024-06-19 17:19:15,942 INFO [train.py:1028] (0/2) Epoch 2, batch 4700, loss[loss=0.9271, simple_loss=0.5807, pruned_loss=0.6367, over 12836.00 frames. ], tot_loss[loss=0.8805, simple_loss=0.5658, pruned_loss=0.5976, over 2583231.01 frames. ], batch size: 26, lr: 2.31e-02, grad_scale: 0.25 2024-06-19 17:19:26,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=17.74 vs. limit=15.0 2024-06-19 17:19:33,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27203.0, ans=0.1 2024-06-19 17:19:36,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=27203.0, ans=0.5 2024-06-19 17:19:37,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=27203.0, ans=0.125 2024-06-19 17:19:38,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.43 vs. limit=10.0 2024-06-19 17:19:39,763 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.452e+03 7.474e+03 9.848e+03 1.235e+04 3.821e+04, threshold=1.970e+04, percent-clipped=6.0 2024-06-19 17:19:44,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.92 vs. limit=15.0 2024-06-19 17:19:49,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=27239.666666666668, ans=0.125 2024-06-19 17:19:52,363 INFO [train.py:1028] (0/2) Epoch 2, batch 4750, loss[loss=0.8729, simple_loss=0.586, pruned_loss=0.5799, over 12493.00 frames. ], tot_loss[loss=0.8755, simple_loss=0.5636, pruned_loss=0.5937, over 2579026.01 frames. 
], batch size: 202, lr: 2.30e-02, grad_scale: 0.0625 2024-06-19 17:20:00,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=27276.333333333332, ans=15.0 2024-06-19 17:20:05,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27294.666666666668, ans=0.1 2024-06-19 17:20:11,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=34.81 vs. limit=15.0 2024-06-19 17:20:18,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.81 vs. limit=10.0 2024-06-19 17:20:24,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=27331.333333333332, ans=0.025 2024-06-19 17:20:25,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=27331.333333333332, ans=0.125 2024-06-19 17:20:25,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=33.58 vs. limit=15.0 2024-06-19 17:20:29,246 INFO [train.py:1028] (0/2) Epoch 2, batch 4800, loss[loss=0.8789, simple_loss=0.5623, pruned_loss=0.5978, over 13276.00 frames. ], tot_loss[loss=0.8755, simple_loss=0.5636, pruned_loss=0.5937, over 2576006.24 frames. ], batch size: 63, lr: 2.30e-02, grad_scale: 0.125 2024-06-19 17:20:38,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=27368.0, ans=0.0 2024-06-19 17:20:48,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=27386.333333333332, ans=0.125 2024-06-19 17:20:50,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=27404.666666666668, ans=0.125 2024-06-19 17:20:51,173 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.322e+03 9.685e+03 1.241e+04 1.513e+04 6.193e+04, threshold=2.482e+04, percent-clipped=12.0 2024-06-19 17:20:53,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=27404.666666666668, ans=0.125 2024-06-19 17:20:55,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=15.0 2024-06-19 17:21:02,369 INFO [train.py:1028] (0/2) Epoch 2, batch 4850, loss[loss=0.8436, simple_loss=0.5486, pruned_loss=0.5693, over 13266.00 frames. ], tot_loss[loss=0.8738, simple_loss=0.563, pruned_loss=0.5923, over 2574248.35 frames. ], batch size: 89, lr: 2.30e-02, grad_scale: 0.125 2024-06-19 17:21:10,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.91 vs. 
limit=15.0 2024-06-19 17:21:11,383 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.292e-01 2024-06-19 17:21:11,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=27459.666666666668, ans=15.0 2024-06-19 17:21:22,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=27496.333333333332, ans=0.125 2024-06-19 17:21:24,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.75 vs. limit=22.5 2024-06-19 17:21:31,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27496.333333333332, ans=0.125 2024-06-19 17:21:39,885 INFO [train.py:1028] (0/2) Epoch 2, batch 4900, loss[loss=0.9409, simple_loss=0.5925, pruned_loss=0.6446, over 13246.00 frames. ], tot_loss[loss=0.8772, simple_loss=0.5645, pruned_loss=0.595, over 2574952.13 frames. ], batch size: 59, lr: 2.29e-02, grad_scale: 0.125 2024-06-19 17:21:44,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=27533.0, ans=0.125 2024-06-19 17:21:49,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=27551.333333333332, ans=0.025 2024-06-19 17:21:51,788 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.775e+00 2024-06-19 17:21:52,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=27551.333333333332, ans=0.0 2024-06-19 17:21:52,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.69 vs. limit=15.0 2024-06-19 17:22:03,570 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.661e+03 7.110e+03 9.637e+03 1.218e+04 7.815e+04, threshold=1.927e+04, percent-clipped=6.0 2024-06-19 17:22:03,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=27588.0, ans=0.025 2024-06-19 17:22:07,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.88 vs. limit=15.0 2024-06-19 17:22:17,350 INFO [train.py:1028] (0/2) Epoch 2, batch 4950, loss[loss=0.8297, simple_loss=0.5586, pruned_loss=0.5504, over 10998.00 frames. ], tot_loss[loss=0.8742, simple_loss=0.5637, pruned_loss=0.5924, over 2568545.56 frames. ], batch size: 303, lr: 2.29e-02, grad_scale: 0.125 2024-06-19 17:22:27,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=27643.0, ans=10.0 2024-06-19 17:22:38,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=27679.666666666668, ans=0.0 2024-06-19 17:22:44,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=27698.0, ans=0.125 2024-06-19 17:22:50,966 INFO [train.py:1028] (0/2) Epoch 2, batch 5000, loss[loss=0.8534, simple_loss=0.5379, pruned_loss=0.5844, over 13176.00 frames. 
], tot_loss[loss=0.8749, simple_loss=0.5632, pruned_loss=0.5933, over 2572389.12 frames. ], batch size: 95, lr: 2.29e-02, grad_scale: 0.25 2024-06-19 17:22:51,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=27716.333333333332, ans=0.0 2024-06-19 17:22:57,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=27734.666666666668, ans=0.07 2024-06-19 17:23:07,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.21 vs. limit=15.0 2024-06-19 17:23:10,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=27753.0, ans=0.125 2024-06-19 17:23:10,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=27753.0, ans=0.125 2024-06-19 17:23:14,579 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.824e+03 8.419e+03 1.067e+04 1.248e+04 2.932e+04, threshold=2.134e+04, percent-clipped=2.0 2024-06-19 17:23:20,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=27789.666666666668, ans=0.004828333333333334 2024-06-19 17:23:20,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27789.666666666668, ans=0.1 2024-06-19 17:23:21,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.57 vs. limit=15.0 2024-06-19 17:23:22,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=27789.666666666668, ans=0.125 2024-06-19 17:23:24,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=27789.666666666668, ans=0.125 2024-06-19 17:23:25,278 INFO [train.py:1028] (0/2) Epoch 2, batch 5050, loss[loss=0.8892, simple_loss=0.5571, pruned_loss=0.6107, over 12977.00 frames. ], tot_loss[loss=0.8797, simple_loss=0.564, pruned_loss=0.5977, over 2573379.65 frames. ], batch size: 36, lr: 2.28e-02, grad_scale: 0.125 2024-06-19 17:23:26,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=31.34 vs. limit=22.5 2024-06-19 17:23:27,289 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-19 17:23:51,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=27863.0, ans=0.1 2024-06-19 17:23:56,250 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=8.725e+00 2024-06-19 17:23:58,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=27881.333333333332, ans=0.2 2024-06-19 17:24:02,892 INFO [train.py:1028] (0/2) Epoch 2, batch 5100, loss[loss=0.9406, simple_loss=0.596, pruned_loss=0.6426, over 13235.00 frames. ], tot_loss[loss=0.876, simple_loss=0.5632, pruned_loss=0.5944, over 2570064.08 frames. 
], batch size: 40, lr: 2.28e-02, grad_scale: 0.125 2024-06-19 17:24:04,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.53 vs. limit=15.0 2024-06-19 17:24:19,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=27918.0, ans=0.125 2024-06-19 17:24:19,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=27936.333333333332, ans=0.025 2024-06-19 17:24:23,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.37 vs. limit=15.0 2024-06-19 17:24:25,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=27936.333333333332, ans=0.004796449275362319 2024-06-19 17:24:30,717 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.514e+03 8.316e+03 1.075e+04 1.479e+04 1.402e+05, threshold=2.150e+04, percent-clipped=15.0 2024-06-19 17:24:33,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=27973.0, ans=0.015 2024-06-19 17:24:39,861 INFO [train.py:1028] (0/2) Epoch 2, batch 5150, loss[loss=0.7912, simple_loss=0.5227, pruned_loss=0.5298, over 13109.00 frames. ], tot_loss[loss=0.8741, simple_loss=0.5627, pruned_loss=0.5928, over 2572555.07 frames. ], batch size: 132, lr: 2.28e-02, grad_scale: 0.125 2024-06-19 17:24:40,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=27991.333333333332, ans=0.125 2024-06-19 17:24:49,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28009.666666666668, ans=0.125 2024-06-19 17:24:49,431 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=15.0 2024-06-19 17:24:50,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.57 vs. limit=15.0 2024-06-19 17:24:58,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.80 vs. limit=10.0 2024-06-19 17:25:00,943 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.707e+02 2024-06-19 17:25:01,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.94 vs. limit=15.0 2024-06-19 17:25:03,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=28.46 vs. limit=22.5 2024-06-19 17:25:04,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28046.333333333332, ans=0.1 2024-06-19 17:25:05,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=28046.333333333332, ans=0.09899494936611666 2024-06-19 17:25:13,552 INFO [train.py:1028] (0/2) Epoch 2, batch 5200, loss[loss=0.8428, simple_loss=0.5508, pruned_loss=0.5674, over 13118.00 frames. 
], tot_loss[loss=0.8732, simple_loss=0.5621, pruned_loss=0.5922, over 2575394.02 frames. ], batch size: 95, lr: 2.28e-02, grad_scale: 0.25 2024-06-19 17:25:15,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=28083.0, ans=0.0 2024-06-19 17:25:18,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=28083.0, ans=0.2 2024-06-19 17:25:20,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=28101.333333333332, ans=0.5 2024-06-19 17:25:21,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.02 vs. limit=15.0 2024-06-19 17:25:28,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.01 vs. limit=22.5 2024-06-19 17:25:31,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=12.0 2024-06-19 17:25:41,073 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.391e+03 7.121e+03 9.053e+03 1.078e+04 2.821e+04, threshold=1.811e+04, percent-clipped=3.0 2024-06-19 17:25:42,435 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.85 vs. limit=8.0 2024-06-19 17:25:50,408 INFO [train.py:1028] (0/2) Epoch 2, batch 5250, loss[loss=0.9247, simple_loss=0.582, pruned_loss=0.6338, over 13303.00 frames. ], tot_loss[loss=0.877, simple_loss=0.5635, pruned_loss=0.5952, over 2569769.18 frames. ], batch size: 52, lr: 2.27e-02, grad_scale: 0.25 2024-06-19 17:25:50,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.42 vs. limit=15.0 2024-06-19 17:25:51,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=28174.666666666668, ans=0.2 2024-06-19 17:25:58,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.75 vs. limit=15.0 2024-06-19 17:25:59,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28193.0, ans=0.125 2024-06-19 17:26:09,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.25 vs. limit=15.0 2024-06-19 17:26:23,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28248.0, ans=0.1 2024-06-19 17:26:27,458 INFO [train.py:1028] (0/2) Epoch 2, batch 5300, loss[loss=0.8295, simple_loss=0.5488, pruned_loss=0.5551, over 13013.00 frames. ], tot_loss[loss=0.8769, simple_loss=0.5632, pruned_loss=0.5953, over 2566413.19 frames. ], batch size: 144, lr: 2.27e-02, grad_scale: 0.5 2024-06-19 17:26:29,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.32 vs. limit=15.0 2024-06-19 17:26:47,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. 
limit=15.0 2024-06-19 17:26:47,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.25 vs. limit=10.0 2024-06-19 17:26:53,285 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.446e+03 7.835e+03 9.752e+03 1.158e+04 3.315e+04, threshold=1.950e+04, percent-clipped=8.0 2024-06-19 17:26:56,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=28339.666666666668, ans=0.2 2024-06-19 17:26:58,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=28339.666666666668, ans=0.125 2024-06-19 17:27:00,044 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=22.12 vs. limit=15.0 2024-06-19 17:27:01,699 INFO [train.py:1028] (0/2) Epoch 2, batch 5350, loss[loss=0.974, simple_loss=0.6015, pruned_loss=0.6732, over 11489.00 frames. ], tot_loss[loss=0.8733, simple_loss=0.5616, pruned_loss=0.5925, over 2573426.30 frames. ], batch size: 16, lr: 2.27e-02, grad_scale: 0.0625 2024-06-19 17:27:10,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=28376.333333333332, ans=0.125 2024-06-19 17:27:20,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.39 vs. limit=22.5 2024-06-19 17:27:33,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=28431.333333333332, ans=0.125 2024-06-19 17:27:37,832 INFO [train.py:1028] (0/2) Epoch 2, batch 5400, loss[loss=0.7911, simple_loss=0.5355, pruned_loss=0.5233, over 12233.00 frames. ], tot_loss[loss=0.8699, simple_loss=0.5611, pruned_loss=0.5894, over 2567377.94 frames. ], batch size: 240, lr: 2.26e-02, grad_scale: 0.125 2024-06-19 17:27:40,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=28449.666666666668, ans=0.125 2024-06-19 17:27:50,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=28468.0, ans=0.2 2024-06-19 17:27:50,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.92 vs. limit=15.0 2024-06-19 17:27:54,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=15.0 2024-06-19 17:28:04,689 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.551e+03 7.963e+03 9.925e+03 1.254e+04 6.192e+04, threshold=1.985e+04, percent-clipped=7.0 2024-06-19 17:28:13,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=15.0 2024-06-19 17:28:15,409 INFO [train.py:1028] (0/2) Epoch 2, batch 5450, loss[loss=1.006, simple_loss=0.6347, pruned_loss=0.6888, over 12573.00 frames. ], tot_loss[loss=0.8711, simple_loss=0.5621, pruned_loss=0.59, over 2571349.11 frames. 
], batch size: 25, lr: 2.26e-02, grad_scale: 0.125 2024-06-19 17:28:15,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=28541.333333333332, ans=0.0 2024-06-19 17:28:16,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=28541.333333333332, ans=0.035 2024-06-19 17:28:16,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=15.0 2024-06-19 17:28:18,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=28541.333333333332, ans=0.2 2024-06-19 17:28:24,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=12.0 2024-06-19 17:28:35,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=28596.333333333332, ans=0.125 2024-06-19 17:28:41,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0 2024-06-19 17:28:43,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=28614.666666666668, ans=0.125 2024-06-19 17:28:44,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28614.666666666668, ans=0.1 2024-06-19 17:28:44,408 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.52 vs. limit=15.0 2024-06-19 17:28:46,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28614.666666666668, ans=0.1 2024-06-19 17:28:49,382 INFO [train.py:1028] (0/2) Epoch 2, batch 5500, loss[loss=0.8588, simple_loss=0.5697, pruned_loss=0.574, over 12240.00 frames. ], tot_loss[loss=0.8708, simple_loss=0.5622, pruned_loss=0.5897, over 2565201.36 frames. ], batch size: 240, lr: 2.26e-02, grad_scale: 0.25 2024-06-19 17:28:50,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28633.0, ans=0.1 2024-06-19 17:28:50,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28633.0, ans=0.125 2024-06-19 17:28:57,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.26 vs. limit=10.0 2024-06-19 17:28:58,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28651.333333333332, ans=0.125 2024-06-19 17:29:02,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.38 vs. 
limit=15.0 2024-06-19 17:29:03,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=28669.666666666668, ans=0.0 2024-06-19 17:29:07,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=28669.666666666668, ans=0.0 2024-06-19 17:29:13,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=28688.0, ans=0.00463304347826087 2024-06-19 17:29:15,960 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.573e+03 9.456e+03 1.166e+04 1.441e+04 6.858e+04, threshold=2.333e+04, percent-clipped=12.0 2024-06-19 17:29:17,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=28706.333333333332, ans=0.125 2024-06-19 17:29:21,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.79 vs. limit=15.0 2024-06-19 17:29:26,846 INFO [train.py:1028] (0/2) Epoch 2, batch 5550, loss[loss=0.987, simple_loss=0.6166, pruned_loss=0.6786, over 13343.00 frames. ], tot_loss[loss=0.8724, simple_loss=0.5622, pruned_loss=0.5913, over 2569589.61 frames. ], batch size: 43, lr: 2.26e-02, grad_scale: 0.25 2024-06-19 17:29:27,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=28724.666666666668, ans=0.125 2024-06-19 17:29:34,323 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.836e-02 2024-06-19 17:29:40,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=28761.333333333332, ans=0.125 2024-06-19 17:29:41,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=28761.333333333332, ans=0.125 2024-06-19 17:29:49,644 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-19 17:29:51,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.57 vs. limit=15.0 2024-06-19 17:29:56,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=28798.0, ans=0.125 2024-06-19 17:29:59,930 INFO [train.py:1028] (0/2) Epoch 2, batch 5600, loss[loss=0.8188, simple_loss=0.531, pruned_loss=0.5533, over 13225.00 frames. ], tot_loss[loss=0.87, simple_loss=0.5613, pruned_loss=0.5894, over 2572335.58 frames. ], batch size: 89, lr: 2.25e-02, grad_scale: 0.5 2024-06-19 17:30:00,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28816.333333333332, ans=0.1 2024-06-19 17:30:04,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=28816.333333333332, ans=0.125 2024-06-19 17:30:07,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.47 vs. 
limit=22.5 2024-06-19 17:30:12,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28834.666666666668, ans=0.1 2024-06-19 17:30:19,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=28853.0, ans=0.125 2024-06-19 17:30:23,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=28871.333333333332, ans=0.2 2024-06-19 17:30:32,012 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.180e+03 1.028e+04 1.357e+04 2.041e+04 6.376e+04, threshold=2.714e+04, percent-clipped=17.0 2024-06-19 17:30:36,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.58 vs. limit=15.0 2024-06-19 17:30:38,511 INFO [train.py:1028] (0/2) Epoch 2, batch 5650, loss[loss=0.8657, simple_loss=0.5811, pruned_loss=0.5751, over 12549.00 frames. ], tot_loss[loss=0.8715, simple_loss=0.5618, pruned_loss=0.5906, over 2578061.43 frames. ], batch size: 203, lr: 2.25e-02, grad_scale: 0.125 2024-06-19 17:30:49,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=28926.333333333332, ans=0.2 2024-06-19 17:30:49,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.78 vs. limit=15.0 2024-06-19 17:30:50,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28926.333333333332, ans=0.1 2024-06-19 17:30:50,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.71 vs. limit=15.0 2024-06-19 17:30:51,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.80 vs. limit=15.0 2024-06-19 17:30:55,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=28944.666666666668, ans=0.0 2024-06-19 17:30:59,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=28963.0, ans=0.0 2024-06-19 17:31:03,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=28963.0, ans=0.0 2024-06-19 17:31:07,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2024-06-19 17:31:13,263 INFO [train.py:1028] (0/2) Epoch 2, batch 5700, loss[loss=0.9423, simple_loss=0.5972, pruned_loss=0.6437, over 13284.00 frames. ], tot_loss[loss=0.8697, simple_loss=0.5608, pruned_loss=0.5893, over 2581900.67 frames. ], batch size: 63, lr: 2.25e-02, grad_scale: 0.25 2024-06-19 17:31:24,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29018.0, ans=0.0 2024-06-19 17:31:26,353 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.69 vs. 
limit=10.0 2024-06-19 17:31:36,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.03 vs. limit=22.5 2024-06-19 17:31:36,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.38 vs. limit=15.0 2024-06-19 17:31:44,364 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.609e+03 5.248e+03 7.904e+03 1.265e+04 5.551e+04, threshold=1.581e+04, percent-clipped=5.0 2024-06-19 17:31:48,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.87 vs. limit=22.5 2024-06-19 17:31:49,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=29073.0, ans=0.125 2024-06-19 17:31:49,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=53.47 vs. limit=15.0 2024-06-19 17:31:50,187 INFO [train.py:1028] (0/2) Epoch 2, batch 5750, loss[loss=0.8729, simple_loss=0.5733, pruned_loss=0.5862, over 12815.00 frames. ], tot_loss[loss=0.8734, simple_loss=0.5623, pruned_loss=0.5923, over 2582120.87 frames. ], batch size: 177, lr: 2.24e-02, grad_scale: 0.25 2024-06-19 17:32:23,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=27.12 vs. limit=15.0 2024-06-19 17:32:25,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=29164.666666666668, ans=0.125 2024-06-19 17:32:28,799 INFO [train.py:1028] (0/2) Epoch 2, batch 5800, loss[loss=0.8569, simple_loss=0.5707, pruned_loss=0.5716, over 12751.00 frames. ], tot_loss[loss=0.8766, simple_loss=0.5644, pruned_loss=0.5944, over 2581053.94 frames. ], batch size: 176, lr: 2.24e-02, grad_scale: 0.5 2024-06-19 17:32:29,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=32.13 vs. limit=15.0 2024-06-19 17:32:35,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=29201.333333333332, ans=0.125 2024-06-19 17:32:41,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.36 vs. limit=10.0 2024-06-19 17:32:42,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=29219.666666666668, ans=0.2 2024-06-19 17:32:42,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.97 vs. limit=22.5 2024-06-19 17:32:47,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.68 vs. limit=10.0 2024-06-19 17:32:56,835 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.321e+03 5.380e+03 6.532e+03 7.733e+03 3.613e+04, threshold=1.306e+04, percent-clipped=3.0 2024-06-19 17:33:02,238 INFO [train.py:1028] (0/2) Epoch 2, batch 5850, loss[loss=0.8289, simple_loss=0.5538, pruned_loss=0.552, over 12513.00 frames. 
], tot_loss[loss=0.8821, simple_loss=0.5679, pruned_loss=0.5981, over 2577535.98 frames. ], batch size: 202, lr: 2.24e-02, grad_scale: 0.25 2024-06-19 17:33:04,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=19.89 vs. limit=15.0 2024-06-19 17:33:11,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29293.0, ans=0.1 2024-06-19 17:33:13,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.19 vs. limit=15.0 2024-06-19 17:33:17,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.04 vs. limit=22.5 2024-06-19 17:33:23,298 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-16000.pt 2024-06-19 17:33:34,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=29329.666666666668, ans=0.125 2024-06-19 17:33:34,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=29329.666666666668, ans=0.125 2024-06-19 17:33:45,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=29348.0, ans=0.1 2024-06-19 17:33:47,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.whiten.whitening_limit, batch_count=29348.0, ans=12.0 2024-06-19 17:33:49,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.48 vs. limit=22.5 2024-06-19 17:33:49,540 INFO [train.py:1028] (0/2) Epoch 2, batch 5900, loss[loss=0.8507, simple_loss=0.5627, pruned_loss=0.5694, over 13100.00 frames. ], tot_loss[loss=0.8894, simple_loss=0.5725, pruned_loss=0.6032, over 2577783.91 frames. ], batch size: 121, lr: 2.23e-02, grad_scale: 0.5 2024-06-19 17:33:50,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=29366.333333333332, ans=0.125 2024-06-19 17:33:53,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.80 vs. limit=15.0 2024-06-19 17:33:56,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2024-06-19 17:34:00,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.10 vs. limit=15.0 2024-06-19 17:34:12,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=29421.333333333332, ans=0.004473623188405797 2024-06-19 17:34:19,051 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.046e+03 7.162e+03 8.402e+03 1.029e+04 3.384e+04, threshold=1.680e+04, percent-clipped=13.0 2024-06-19 17:34:23,184 INFO [train.py:1028] (0/2) Epoch 2, batch 5950, loss[loss=0.853, simple_loss=0.555, pruned_loss=0.5755, over 13055.00 frames. ], tot_loss[loss=0.8948, simple_loss=0.5764, pruned_loss=0.6066, over 2581926.81 frames. 
], batch size: 121, lr: 2.23e-02, grad_scale: 0.125 2024-06-19 17:34:29,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=29458.0, ans=0.1 2024-06-19 17:34:30,494 WARNING [optim.py:503] (0/2) Scaling gradients by 0.08075393736362457, model_norm_threshold=16804.384765625 2024-06-19 17:34:30,661 WARNING [optim.py:575] (0/2) Parameter dominating tot_sumsq module.encoder_embed.conv.0.weight with proportion 0.29, where dominant_sumsq=(grad_sumsq*orig_rms_sq)=1.248e+10, grad_sumsq=3.936e+12, orig_rms_sq=3.171e-03 2024-06-19 17:34:31,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.41 vs. limit=6.0 2024-06-19 17:34:33,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.34 vs. limit=15.0 2024-06-19 17:34:44,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=29494.666666666668, ans=0.125 2024-06-19 17:34:45,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=29494.666666666668, ans=0.025 2024-06-19 17:34:50,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=29513.0, ans=0.125 2024-06-19 17:34:57,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=29531.333333333332, ans=0.125 2024-06-19 17:34:57,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=29531.333333333332, ans=0.2 2024-06-19 17:34:58,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=29531.333333333332, ans=0.125 2024-06-19 17:35:00,318 INFO [train.py:1028] (0/2) Epoch 2, batch 6000, loss[loss=0.8929, simple_loss=0.6053, pruned_loss=0.5903, over 12222.00 frames. ], tot_loss[loss=0.8979, simple_loss=0.5784, pruned_loss=0.6087, over 2574370.38 frames. ], batch size: 241, lr: 2.23e-02, grad_scale: 0.0625 2024-06-19 17:35:00,319 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 17:35:05,088 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.5560, 1.5594, 1.8513, 1.6128], device='cuda:0') 2024-06-19 17:35:08,182 INFO [train.py:1060] (0/2) Epoch 2, validation: loss=0.9699, simple_loss=0.6265, pruned_loss=0.6567, over 351949.00 frames. 2024-06-19 17:35:08,183 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB 2024-06-19 17:35:09,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=29549.666666666668, ans=0.0 2024-06-19 17:35:16,145 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=31.01 vs. limit=15.0 2024-06-19 17:35:21,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.19 vs. limit=15.0 2024-06-19 17:35:21,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.54 vs. 
limit=12.0 2024-06-19 17:35:24,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=29586.333333333332, ans=0.0 2024-06-19 17:35:35,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=29623.0, ans=0.05 2024-06-19 17:35:36,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=29623.0, ans=0.95 2024-06-19 17:35:40,095 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.962e+03 4.634e+03 5.657e+03 7.630e+03 2.081e+05, threshold=1.131e+04, percent-clipped=6.0 2024-06-19 17:35:42,728 INFO [train.py:1028] (0/2) Epoch 2, batch 6050, loss[loss=0.946, simple_loss=0.6024, pruned_loss=0.6448, over 12960.00 frames. ], tot_loss[loss=0.9032, simple_loss=0.5816, pruned_loss=0.6124, over 2577254.13 frames. ], batch size: 39, lr: 2.23e-02, grad_scale: 0.0625 2024-06-19 17:35:49,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=29641.333333333332, ans=0.0 2024-06-19 17:35:49,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=15.0 2024-06-19 17:35:50,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=29641.333333333332, ans=0.05 2024-06-19 17:36:19,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29733.0, ans=0.1 2024-06-19 17:36:19,727 INFO [train.py:1028] (0/2) Epoch 2, batch 6100, loss[loss=0.8421, simple_loss=0.5522, pruned_loss=0.566, over 13150.00 frames. ], tot_loss[loss=0.9056, simple_loss=0.583, pruned_loss=0.6141, over 2580413.12 frames. ], batch size: 121, lr: 2.22e-02, grad_scale: 0.125 2024-06-19 17:36:23,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=29733.0, ans=0.07 2024-06-19 17:36:29,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=29751.333333333332, ans=0.0 2024-06-19 17:36:32,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.80 vs. limit=22.5 2024-06-19 17:36:38,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=29769.666666666668, ans=0.004397898550724638 2024-06-19 17:36:39,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.71 vs. limit=22.5 2024-06-19 17:36:44,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=29788.0, ans=0.125 2024-06-19 17:36:44,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=19.21 vs. limit=15.0 2024-06-19 17:36:45,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.90 vs. 
limit=15.0 2024-06-19 17:36:50,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29806.333333333332, ans=0.1 2024-06-19 17:36:51,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.53 vs. limit=10.0 2024-06-19 17:36:54,750 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.092e+03 5.735e+03 7.517e+03 9.589e+03 5.845e+04, threshold=1.503e+04, percent-clipped=17.0 2024-06-19 17:36:56,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=12.0 2024-06-19 17:36:56,670 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.71 vs. limit=15.0 2024-06-19 17:36:57,060 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=2.264e+02 2024-06-19 17:36:57,690 INFO [train.py:1028] (0/2) Epoch 2, batch 6150, loss[loss=0.8456, simple_loss=0.5726, pruned_loss=0.5593, over 11001.00 frames. ], tot_loss[loss=0.9098, simple_loss=0.5853, pruned_loss=0.6171, over 2578958.26 frames. ], batch size: 304, lr: 2.22e-02, grad_scale: 0.125 2024-06-19 17:37:06,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=29843.0, ans=0.125 2024-06-19 17:37:13,261 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=5.037e-02 2024-06-19 17:37:13,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=29861.333333333332, ans=0.0 2024-06-19 17:37:21,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=29879.666666666668, ans=0.125 2024-06-19 17:37:27,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=29898.0, ans=0.125 2024-06-19 17:37:27,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.30 vs. limit=15.0 2024-06-19 17:37:32,177 INFO [train.py:1028] (0/2) Epoch 2, batch 6200, loss[loss=1.013, simple_loss=0.6631, pruned_loss=0.6817, over 13229.00 frames. ], tot_loss[loss=0.9151, simple_loss=0.5891, pruned_loss=0.6206, over 2576583.42 frames. ], batch size: 89, lr: 2.22e-02, grad_scale: 0.25 2024-06-19 17:37:32,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=29916.333333333332, ans=0.125 2024-06-19 17:37:34,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.96 vs. limit=22.5 2024-06-19 17:37:35,998 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=27.10 vs. 
limit=15.0 2024-06-19 17:37:38,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=29934.666666666668, ans=0.95 2024-06-19 17:37:48,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=29953.0, ans=0.0 2024-06-19 17:37:50,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.53 vs. limit=15.0 2024-06-19 17:38:01,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=29971.333333333332, ans=0.125 2024-06-19 17:38:07,237 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.721e+03 5.123e+03 6.015e+03 7.826e+03 3.365e+04, threshold=1.203e+04, percent-clipped=1.0 2024-06-19 17:38:10,180 INFO [train.py:1028] (0/2) Epoch 2, batch 6250, loss[loss=0.9075, simple_loss=0.5873, pruned_loss=0.6138, over 13252.00 frames. ], tot_loss[loss=0.9197, simple_loss=0.5923, pruned_loss=0.6236, over 2569671.00 frames. ], batch size: 83, lr: 2.22e-02, grad_scale: 0.25 2024-06-19 17:38:11,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.15 vs. limit=15.0 2024-06-19 17:38:12,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=30008.0, ans=0.0 2024-06-19 17:38:16,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=23.12 vs. limit=15.0 2024-06-19 17:38:18,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=30026.333333333332, ans=0.0 2024-06-19 17:38:25,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30044.666666666668, ans=0.1 2024-06-19 17:38:36,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.08 vs. limit=6.0 2024-06-19 17:38:48,301 INFO [train.py:1028] (0/2) Epoch 2, batch 6300, loss[loss=0.9982, simple_loss=0.6083, pruned_loss=0.694, over 11455.00 frames. ], tot_loss[loss=0.9229, simple_loss=0.5943, pruned_loss=0.6258, over 2564502.53 frames. ], batch size: 16, lr: 2.21e-02, grad_scale: 0.5 2024-06-19 17:38:56,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=23.39 vs. limit=15.0 2024-06-19 17:39:12,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=27.58 vs. limit=22.5 2024-06-19 17:39:16,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30173.0, ans=0.1 2024-06-19 17:39:19,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=21.24 vs. limit=15.0 2024-06-19 17:39:19,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.39 vs. 
2024-06-19 17:39:20,167 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.025e+03 6.058e+03 8.330e+03 1.068e+04 3.495e+04, threshold=1.666e+04, percent-clipped=15.0
2024-06-19 17:39:22,214 INFO [train.py:1028] (0/2) Epoch 2, batch 6350, loss[loss=0.9785, simple_loss=0.6551, pruned_loss=0.6509, over 12693.00 frames. ], tot_loss[loss=0.9343, simple_loss=0.5999, pruned_loss=0.6344, over 2574756.98 frames. ], batch size: 202, lr: 2.21e-02, grad_scale: 0.25
2024-06-19 17:39:28,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=15.0
2024-06-19 17:39:36,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=30228.0, ans=0.004298260869565217
2024-06-19 17:39:38,109 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0
2024-06-19 17:39:51,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.59 vs. limit=15.0
2024-06-19 17:39:55,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=30264.666666666668, ans=0.125
2024-06-19 17:39:56,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=30264.666666666668, ans=0.125
2024-06-19 17:39:58,192 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=7.342e+02
2024-06-19 17:39:58,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30283.0, ans=0.1
2024-06-19 17:39:59,429 INFO [train.py:1028] (0/2) Epoch 2, batch 6400, loss[loss=0.9414, simple_loss=0.5993, pruned_loss=0.6417, over 13211.00 frames. ], tot_loss[loss=0.941, simple_loss=0.6034, pruned_loss=0.6393, over 2575332.92 frames. ], batch size: 67, lr: 2.21e-02, grad_scale: 0.5
2024-06-19 17:40:10,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=30301.333333333332, ans=0.00428231884057971
2024-06-19 17:40:15,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=15.0
2024-06-19 17:40:19,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.37 vs. limit=15.0
2024-06-19 17:40:20,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=30338.0, ans=0.004274347826086957
2024-06-19 17:40:21,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=15.0
2024-06-19 17:40:24,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=30338.0, ans=0.125
2024-06-19 17:40:25,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.64 vs. limit=10.0
2024-06-19 17:40:25,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.30 vs. limit=15.0
2024-06-19 17:40:27,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=30356.333333333332, ans=0.2
2024-06-19 17:40:29,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=30356.333333333332, ans=0.125
2024-06-19 17:40:31,026 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.789e+03 5.308e+03 6.291e+03 7.965e+03 3.591e+04, threshold=1.258e+04, percent-clipped=3.0
2024-06-19 17:40:32,249 INFO [train.py:1028] (0/2) Epoch 2, batch 6450, loss[loss=0.9702, simple_loss=0.6421, pruned_loss=0.6491, over 12584.00 frames. ], tot_loss[loss=0.9488, simple_loss=0.6082, pruned_loss=0.6447, over 2580647.12 frames. ], batch size: 202, lr: 2.20e-02, grad_scale: 0.25
2024-06-19 17:40:32,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=30374.666666666668, ans=0.2
2024-06-19 17:40:47,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=30393.0, ans=0.0
2024-06-19 17:40:50,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.73 vs. limit=15.0
2024-06-19 17:40:51,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=21.71 vs. limit=15.0
2024-06-19 17:40:54,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=30411.333333333332, ans=0.125
2024-06-19 17:41:08,781 INFO [train.py:1028] (0/2) Epoch 2, batch 6500, loss[loss=0.8827, simple_loss=0.6025, pruned_loss=0.5815, over 10982.00 frames. ], tot_loss[loss=0.9557, simple_loss=0.6122, pruned_loss=0.6495, over 2584353.15 frames. ], batch size: 303, lr: 2.20e-02, grad_scale: 0.25
2024-06-19 17:41:09,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=28.48 vs. limit=15.0
2024-06-19 17:41:14,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=30484.666666666668, ans=0.07
2024-06-19 17:41:27,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=23.12 vs. limit=22.5
2024-06-19 17:41:28,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=30521.333333333332, ans=0.125
2024-06-19 17:41:29,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.62 vs. limit=15.0
2024-06-19 17:41:31,656 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.47 vs. limit=15.0
2024-06-19 17:41:36,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=30539.666666666668, ans=0.004230507246376811
2024-06-19 17:41:40,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.795e+03 5.977e+03 7.488e+03 1.002e+04 4.931e+04, threshold=1.498e+04, percent-clipped=13.0
2024-06-19 17:41:41,642 INFO [train.py:1028] (0/2) Epoch 2, batch 6550, loss[loss=1.023, simple_loss=0.6354, pruned_loss=0.7056, over 12614.00 frames. ], tot_loss[loss=0.96, simple_loss=0.6151, pruned_loss=0.6525, over 2587322.47 frames. ], batch size: 22, lr: 2.20e-02, grad_scale: 0.25
2024-06-19 17:41:47,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=30576.333333333332, ans=0.125
2024-06-19 17:41:48,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=30576.333333333332, ans=0.125
2024-06-19 17:41:50,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=30576.333333333332, ans=0.125
2024-06-19 17:41:52,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=30576.333333333332, ans=0.0
2024-06-19 17:41:59,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=30594.666666666668, ans=0.125
2024-06-19 17:42:10,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.90 vs. limit=15.0
2024-06-19 17:42:13,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=30631.333333333332, ans=0.125
2024-06-19 17:42:16,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=30631.333333333332, ans=0.025
2024-06-19 17:42:17,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=30631.333333333332, ans=0.125
2024-06-19 17:42:18,350 INFO [train.py:1028] (0/2) Epoch 2, batch 6600, loss[loss=0.9382, simple_loss=0.5986, pruned_loss=0.6389, over 13098.00 frames. ], tot_loss[loss=0.9614, simple_loss=0.6158, pruned_loss=0.6535, over 2589799.95 frames. ], batch size: 71, lr: 2.20e-02, grad_scale: 0.25
2024-06-19 17:42:22,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.69 vs. limit=15.0
2024-06-19 17:42:24,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=30649.666666666668, ans=0.0
2024-06-19 17:42:33,875 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0
2024-06-19 17:42:34,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.62 vs. limit=22.5
2024-06-19 17:42:35,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=30686.333333333332, ans=0.2
2024-06-19 17:42:40,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=30704.666666666668, ans=0.125
2024-06-19 17:42:42,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=30704.666666666668, ans=0.2
2024-06-19 17:42:47,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=15.0
2024-06-19 17:42:51,918 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.25 vs. limit=22.5
2024-06-19 17:42:52,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.407e+03 5.833e+03 6.857e+03 8.492e+03 7.723e+04, threshold=1.371e+04, percent-clipped=6.0
2024-06-19 17:42:52,822 INFO [train.py:1028] (0/2) Epoch 2, batch 6650, loss[loss=0.961, simple_loss=0.6257, pruned_loss=0.6482, over 12953.00 frames. ], tot_loss[loss=0.9655, simple_loss=0.6185, pruned_loss=0.6563, over 2583797.47 frames. ], batch size: 158, lr: 2.19e-02, grad_scale: 0.25
2024-06-19 17:43:03,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=30759.666666666668, ans=0.004182681159420289
2024-06-19 17:43:04,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=30759.666666666668, ans=0.5
2024-06-19 17:43:05,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=27.75 vs. limit=15.0
2024-06-19 17:43:11,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30778.0, ans=0.1
2024-06-19 17:43:14,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=19.02 vs. limit=15.0
2024-06-19 17:43:17,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.45 vs. limit=15.0
2024-06-19 17:43:26,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=30814.666666666668, ans=0.125
2024-06-19 17:43:29,903 INFO [train.py:1028] (0/2) Epoch 2, batch 6700, loss[loss=0.895, simple_loss=0.5936, pruned_loss=0.5982, over 12683.00 frames. ], tot_loss[loss=0.9677, simple_loss=0.6195, pruned_loss=0.6579, over 2583633.29 frames. ], batch size: 176, lr: 2.19e-02, grad_scale: 0.125
2024-06-19 17:43:31,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=30833.0, ans=0.125
2024-06-19 17:43:32,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=30.99 vs. limit=22.5
2024-06-19 17:43:34,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=30833.0, ans=0.004166739130434783
2024-06-19 17:43:37,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=23.46 vs. limit=15.0
2024-06-19 17:43:44,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=25.18 vs. limit=22.5
2024-06-19 17:43:49,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=30888.0, ans=0.125
2024-06-19 17:43:51,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=30888.0, ans=0.125
2024-06-19 17:43:53,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.70 vs. limit=22.5
2024-06-19 17:43:54,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=12.0
2024-06-19 17:43:55,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30888.0, ans=0.1
2024-06-19 17:43:56,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=30906.333333333332, ans=0.125
2024-06-19 17:44:01,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.07 vs. limit=15.0
2024-06-19 17:44:02,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.37 vs. limit=22.5
2024-06-19 17:44:06,690 INFO [train.py:1028] (0/2) Epoch 2, batch 6750, loss[loss=1.001, simple_loss=0.669, pruned_loss=0.6668, over 12172.00 frames. ], tot_loss[loss=0.9689, simple_loss=0.6206, pruned_loss=0.6586, over 2577446.95 frames. ], batch size: 241, lr: 2.19e-02, grad_scale: 0.125
2024-06-19 17:44:06,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=30924.666666666668, ans=0.125
2024-06-19 17:44:07,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=30924.666666666668, ans=0.004146811594202899
2024-06-19 17:44:07,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=15.0
2024-06-19 17:44:07,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=30924.666666666668, ans=22.5
2024-06-19 17:44:08,028 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.388e+03 5.453e+03 7.214e+03 8.831e+03 9.283e+04, threshold=1.443e+04, percent-clipped=6.0
2024-06-19 17:44:08,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=30924.666666666668, ans=0.125
2024-06-19 17:44:10,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=30924.666666666668, ans=0.125
2024-06-19 17:44:11,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=30924.666666666668, ans=0.2
2024-06-19 17:44:14,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.56 vs. limit=15.0
2024-06-19 17:44:24,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=30961.333333333332, ans=0.125
2024-06-19 17:44:28,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.33 vs. limit=22.5
2024-06-19 17:44:35,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=30998.0, ans=0.2
2024-06-19 17:44:40,270 INFO [train.py:1028] (0/2) Epoch 2, batch 6800, loss[loss=1.047, simple_loss=0.6544, pruned_loss=0.7196, over 13195.00 frames. ], tot_loss[loss=0.9735, simple_loss=0.6228, pruned_loss=0.662, over 2578616.72 frames. ], batch size: 67, lr: 2.18e-02, grad_scale: 0.25
2024-06-19 17:44:42,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=31016.333333333332, ans=0.125
2024-06-19 17:44:42,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=31016.333333333332, ans=0.125
2024-06-19 17:44:45,930 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=15.0
2024-06-19 17:44:46,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=31034.666666666668, ans=0.0
2024-06-19 17:45:03,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=31071.333333333332, ans=0.125
2024-06-19 17:45:10,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31089.666666666668, ans=0.125
2024-06-19 17:45:17,391 INFO [train.py:1028] (0/2) Epoch 2, batch 6850, loss[loss=1.065, simple_loss=0.6797, pruned_loss=0.7247, over 13243.00 frames. ], tot_loss[loss=0.9806, simple_loss=0.6257, pruned_loss=0.6677, over 2583118.46 frames. ], batch size: 63, lr: 2.18e-02, grad_scale: 0.125
2024-06-19 17:45:19,545 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.351e+03 6.614e+03 8.113e+03 1.195e+04 4.472e+04, threshold=1.623e+04, percent-clipped=12.0
2024-06-19 17:45:20,788 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.49 vs. limit=6.0
2024-06-19 17:45:45,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=31181.333333333332, ans=0.1
2024-06-19 17:45:50,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=25.78 vs. limit=15.0
2024-06-19 17:45:51,698 INFO [train.py:1028] (0/2) Epoch 2, batch 6900, loss[loss=1.044, simple_loss=0.6556, pruned_loss=0.7163, over 13336.00 frames. ], tot_loss[loss=0.9828, simple_loss=0.6279, pruned_loss=0.6689, over 2585704.97 frames. ], batch size: 49, lr: 2.18e-02, grad_scale: 0.25
2024-06-19 17:45:55,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.05 vs. limit=22.5
2024-06-19 17:45:57,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.40 vs. limit=22.5
2024-06-19 17:46:00,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.85 vs. limit=22.5
2024-06-19 17:46:07,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=31236.333333333332, ans=15.0
2024-06-19 17:46:20,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=31254.666666666668, ans=0.004075072463768116
2024-06-19 17:46:23,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=15.0
2024-06-19 17:46:26,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.10 vs. limit=10.0
2024-06-19 17:46:28,850 INFO [train.py:1028] (0/2) Epoch 2, batch 6950, loss[loss=0.9291, simple_loss=0.5729, pruned_loss=0.6426, over 11185.00 frames. ], tot_loss[loss=0.9839, simple_loss=0.6283, pruned_loss=0.6698, over 2581151.81 frames. ], batch size: 16, lr: 2.18e-02, grad_scale: 0.25
2024-06-19 17:46:30,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.936e+03 8.290e+03 9.447e+03 1.194e+04 4.529e+04, threshold=1.889e+04, percent-clipped=17.0
2024-06-19 17:46:40,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=31309.666666666668, ans=0.125
2024-06-19 17:46:44,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.84 vs. limit=15.0
2024-06-19 17:46:46,524 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.055e-02
2024-06-19 17:47:00,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=26.65 vs. limit=15.0
2024-06-19 17:47:01,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=31383.0, ans=0.125
2024-06-19 17:47:02,442 INFO [train.py:1028] (0/2) Epoch 2, batch 7000, loss[loss=0.9885, simple_loss=0.6471, pruned_loss=0.665, over 12896.00 frames. ], tot_loss[loss=0.9858, simple_loss=0.6294, pruned_loss=0.6711, over 2576634.36 frames. ], batch size: 158, lr: 2.17e-02, grad_scale: 0.5
2024-06-19 17:47:03,249 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=3.848e-02
2024-06-19 17:47:22,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31419.666666666668, ans=0.125
2024-06-19 17:47:24,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=31419.666666666668, ans=0.004039202898550725
2024-06-19 17:47:30,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0
2024-06-19 17:47:39,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=24.04 vs. limit=15.0
2024-06-19 17:47:40,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=31456.333333333332, ans=0.125
2024-06-19 17:47:40,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.11 vs. limit=15.0
2024-06-19 17:47:41,383 INFO [train.py:1028] (0/2) Epoch 2, batch 7050, loss[loss=0.9919, simple_loss=0.659, pruned_loss=0.6624, over 12801.00 frames. ], tot_loss[loss=0.9926, simple_loss=0.6329, pruned_loss=0.6761, over 2582985.43 frames. ], batch size: 177, lr: 2.17e-02, grad_scale: 0.25
2024-06-19 17:47:44,122 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.072e+03 6.924e+03 8.022e+03 9.641e+03 3.370e+04, threshold=1.604e+04, percent-clipped=2.0
2024-06-19 17:47:46,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31474.666666666668, ans=0.125
2024-06-19 17:47:56,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=31511.333333333332, ans=0.125
2024-06-19 17:48:05,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=31529.666666666668, ans=0.05
2024-06-19 17:48:05,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=31529.666666666668, ans=0.125
2024-06-19 17:48:14,571 INFO [train.py:1028] (0/2) Epoch 2, batch 7100, loss[loss=1.035, simple_loss=0.6712, pruned_loss=0.6991, over 13142.00 frames. ], tot_loss[loss=0.9922, simple_loss=0.6335, pruned_loss=0.6754, over 2574891.75 frames. ], batch size: 112, lr: 2.17e-02, grad_scale: 0.5
2024-06-19 17:48:24,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=12.0
2024-06-19 17:48:28,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=29.97 vs. limit=22.5
2024-06-19 17:48:32,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.75 vs. limit=15.0
2024-06-19 17:48:38,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=31621.333333333332, ans=0.0
2024-06-19 17:48:38,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=31621.333333333332, ans=0.125
2024-06-19 17:48:40,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=31621.333333333332, ans=0.125
2024-06-19 17:48:45,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=20.05 vs. limit=15.0
2024-06-19 17:48:47,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.25 vs. limit=6.0
2024-06-19 17:48:51,413 INFO [train.py:1028] (0/2) Epoch 2, batch 7150, loss[loss=1.038, simple_loss=0.6917, pruned_loss=0.692, over 12557.00 frames. ], tot_loss[loss=0.9976, simple_loss=0.6366, pruned_loss=0.6793, over 2573903.83 frames. ], batch size: 202, lr: 2.17e-02, grad_scale: 0.125
2024-06-19 17:48:55,360 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.249e+03 6.596e+03 7.927e+03 9.449e+03 3.032e+04, threshold=1.585e+04, percent-clipped=5.0
2024-06-19 17:48:57,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.32 vs. limit=10.0
2024-06-19 17:48:59,604 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.12 vs. limit=22.5
2024-06-19 17:49:06,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31694.666666666668, ans=0.1
2024-06-19 17:49:16,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=31713.0, ans=10.0
2024-06-19 17:49:16,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.12 vs. limit=10.0
2024-06-19 17:49:23,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=31731.333333333332, ans=0.125
2024-06-19 17:49:27,194 INFO [train.py:1028] (0/2) Epoch 2, batch 7200, loss[loss=1.036, simple_loss=0.6837, pruned_loss=0.694, over 13172.00 frames. ], tot_loss[loss=0.9979, simple_loss=0.6381, pruned_loss=0.6788, over 2578710.91 frames. ], batch size: 112, lr: 2.16e-02, grad_scale: 0.25
2024-06-19 17:49:38,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.29 vs. limit=10.0
2024-06-19 17:49:40,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.32 vs. limit=22.5
2024-06-19 17:49:41,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=31786.333333333332, ans=0.0
2024-06-19 17:49:45,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.43 vs. limit=22.5
2024-06-19 17:49:51,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.55 vs. limit=15.0
2024-06-19 17:50:00,919 INFO [train.py:1028] (0/2) Epoch 2, batch 7250, loss[loss=0.9635, simple_loss=0.5921, pruned_loss=0.6674, over 12876.00 frames. ], tot_loss[loss=0.998, simple_loss=0.6387, pruned_loss=0.6787, over 2579817.49 frames. ], batch size: 36, lr: 2.16e-02, grad_scale: 0.125
2024-06-19 17:50:05,708 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.393e+03 5.949e+03 7.039e+03 8.681e+03 3.768e+04, threshold=1.408e+04, percent-clipped=4.0
2024-06-19 17:50:22,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=31896.333333333332, ans=0.2
2024-06-19 17:50:24,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=31896.333333333332, ans=0.025
2024-06-19 17:50:24,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31896.333333333332, ans=0.1
2024-06-19 17:50:33,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.33 vs. limit=15.0
2024-06-19 17:50:33,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.64 vs. limit=5.0
2024-06-19 17:50:37,925 INFO [train.py:1028] (0/2) Epoch 2, batch 7300, loss[loss=0.9459, simple_loss=0.5845, pruned_loss=0.6537, over 12966.00 frames. ], tot_loss[loss=1.005, simple_loss=0.6424, pruned_loss=0.684, over 2579649.81 frames. ], batch size: 36, lr: 2.16e-02, grad_scale: 0.25
2024-06-19 17:51:09,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32006.333333333332, ans=0.125
2024-06-19 17:51:11,422 INFO [train.py:1028] (0/2) Epoch 2, batch 7350, loss[loss=1.067, simple_loss=0.6627, pruned_loss=0.7354, over 13320.00 frames. ], tot_loss[loss=1.005, simple_loss=0.6416, pruned_loss=0.6839, over 2580836.35 frames. ], batch size: 46, lr: 2.16e-02, grad_scale: 0.25
2024-06-19 17:51:16,123 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.148e+03 4.596e+03 5.810e+03 7.109e+03 3.624e+04, threshold=1.162e+04, percent-clipped=3.0
2024-06-19 17:51:17,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=32043.0, ans=0.125
2024-06-19 17:51:21,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=32043.0, ans=0.2
2024-06-19 17:51:32,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=32061.333333333332, ans=0.0
2024-06-19 17:51:32,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32061.333333333332, ans=0.1
2024-06-19 17:51:33,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=32061.333333333332, ans=0.1
2024-06-19 17:51:36,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=32079.666666666668, ans=0.0
2024-06-19 17:51:36,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=32079.666666666668, ans=0.125
2024-06-19 17:51:40,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=28.45 vs. limit=15.0
2024-06-19 17:51:43,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.97 vs. limit=15.0
2024-06-19 17:51:44,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.09 vs. limit=15.0
2024-06-19 17:51:48,258 INFO [train.py:1028] (0/2) Epoch 2, batch 7400, loss[loss=1.102, simple_loss=0.7005, pruned_loss=0.7516, over 13284.00 frames. ], tot_loss[loss=1.005, simple_loss=0.6416, pruned_loss=0.684, over 2586550.85 frames. ], batch size: 63, lr: 2.15e-02, grad_scale: 0.5
2024-06-19 17:51:52,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.25 vs. limit=15.0
2024-06-19 17:51:59,795 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.76 vs. limit=12.0
2024-06-19 17:52:08,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=32171.333333333332, ans=0.0
2024-06-19 17:52:11,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=15.0
2024-06-19 17:52:20,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.32 vs. limit=22.5
2024-06-19 17:52:22,570 INFO [train.py:1028] (0/2) Epoch 2, batch 7450, loss[loss=0.9932, simple_loss=0.6197, pruned_loss=0.6833, over 12630.00 frames. ], tot_loss[loss=1.004, simple_loss=0.641, pruned_loss=0.6833, over 2581744.70 frames. ], batch size: 29, lr: 2.15e-02, grad_scale: 0.125
2024-06-19 17:52:25,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0
2024-06-19 17:52:28,736 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.752e+03 6.467e+03 8.542e+03 1.242e+04 6.905e+04, threshold=1.708e+04, percent-clipped=27.0
2024-06-19 17:52:29,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0
2024-06-19 17:52:29,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. limit=6.0
2024-06-19 17:52:30,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=32226.333333333332, ans=0.0038638405797101457
2024-06-19 17:52:53,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32263.0, ans=0.1
2024-06-19 17:53:01,301 INFO [train.py:1028] (0/2) Epoch 2, batch 7500, loss[loss=0.9169, simple_loss=0.6249, pruned_loss=0.6045, over 10544.00 frames. ], tot_loss[loss=1.007, simple_loss=0.6438, pruned_loss=0.6849, over 2578717.70 frames. ], batch size: 303, lr: 2.15e-02, grad_scale: 0.25
2024-06-19 17:53:02,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=15.0
2024-06-19 17:53:02,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=32299.666666666668, ans=0.125
2024-06-19 17:53:09,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=32318.0, ans=0.0
2024-06-19 17:53:12,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=32318.0, ans=0.125
2024-06-19 17:53:16,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=32336.333333333332, ans=0.0038399275362318843
2024-06-19 17:53:17,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=32336.333333333332, ans=0.125
2024-06-19 17:53:26,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=32354.666666666668, ans=0.07
2024-06-19 17:53:28,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=32354.666666666668, ans=0.125
2024-06-19 17:53:30,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=32373.0, ans=0.125
2024-06-19 17:53:35,438 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=5.011e-02
2024-06-19 17:53:38,036 INFO [train.py:1028] (0/2) Epoch 2, batch 7550, loss[loss=0.893, simple_loss=0.5955, pruned_loss=0.5953, over 12876.00 frames. ], tot_loss[loss=1.004, simple_loss=0.6441, pruned_loss=0.6819, over 2577891.27 frames. ], batch size: 158, lr: 2.15e-02, grad_scale: 0.25
2024-06-19 17:53:44,091 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.234e+03 6.931e+03 8.347e+03 1.009e+04 5.236e+04, threshold=1.669e+04, percent-clipped=6.0
2024-06-19 17:53:53,152 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.087e-02
2024-06-19 17:53:55,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=32428.0, ans=0.2
2024-06-19 17:54:04,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.43 vs. limit=12.0
2024-06-19 17:54:07,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.54 vs. limit=15.0
2024-06-19 17:54:08,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32464.666666666668, ans=0.125
2024-06-19 17:54:10,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=32464.666666666668, ans=0.125
2024-06-19 17:54:11,944 INFO [train.py:1028] (0/2) Epoch 2, batch 7600, loss[loss=1.014, simple_loss=0.6537, pruned_loss=0.6874, over 13192.00 frames. ], tot_loss[loss=1.005, simple_loss=0.6452, pruned_loss=0.6824, over 2577918.18 frames. ], batch size: 83, lr: 2.14e-02, grad_scale: 0.5
2024-06-19 17:54:17,016 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.55 vs. limit=15.0
2024-06-19 17:54:34,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=32519.666666666668, ans=0.0
2024-06-19 17:54:41,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=32538.0, ans=15.0
2024-06-19 17:54:41,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=32538.0, ans=0.125
2024-06-19 17:54:46,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=32556.333333333332, ans=0.025
2024-06-19 17:54:48,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=32556.333333333332, ans=0.125
2024-06-19 17:54:49,739 INFO [train.py:1028] (0/2) Epoch 2, batch 7650, loss[loss=0.9422, simple_loss=0.5847, pruned_loss=0.6499, over 12829.00 frames. ], tot_loss[loss=1.004, simple_loss=0.6448, pruned_loss=0.6813, over 2573818.62 frames. ], batch size: 33, lr: 2.14e-02, grad_scale: 0.125
2024-06-19 17:54:56,077 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.00 vs. limit=10.0
2024-06-19 17:54:57,052 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.570e+03 5.053e+03 6.423e+03 8.367e+03 3.575e+04, threshold=1.285e+04, percent-clipped=4.0
2024-06-19 17:55:06,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32611.333333333332, ans=0.125
2024-06-19 17:55:09,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=32629.666666666668, ans=0.0
2024-06-19 17:55:11,747 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=22.78 vs. limit=15.0
2024-06-19 17:55:26,861 INFO [train.py:1028] (0/2) Epoch 2, batch 7700, loss[loss=1.069, simple_loss=0.6758, pruned_loss=0.7306, over 13231.00 frames. ], tot_loss[loss=1.006, simple_loss=0.6465, pruned_loss=0.683, over 2570489.24 frames. ], batch size: 63, lr: 2.14e-02, grad_scale: 0.25
2024-06-19 17:55:35,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=15.0
2024-06-19 17:55:41,171 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.523e-03
2024-06-19 17:55:41,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=32703.0, ans=0.125
2024-06-19 17:55:43,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=32703.0, ans=0.0
2024-06-19 17:55:46,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=32721.333333333332, ans=0.0037562318840579714
2024-06-19 17:55:47,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=32721.333333333332, ans=0.0037562318840579714
2024-06-19 17:55:51,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=32721.333333333332, ans=0.0
2024-06-19 17:56:00,645 INFO [train.py:1028] (0/2) Epoch 2, batch 7750, loss[loss=1.003, simple_loss=0.647, pruned_loss=0.6794, over 13290.00 frames. ], tot_loss[loss=1.005, simple_loss=0.6471, pruned_loss=0.6819, over 2575153.04 frames. ], batch size: 72, lr: 2.14e-02, grad_scale: 0.25
2024-06-19 17:56:02,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=12.0
2024-06-19 17:56:08,339 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.911e+03 4.639e+03 5.370e+03 6.411e+03 1.729e+04, threshold=1.074e+04, percent-clipped=7.0
2024-06-19 17:56:08,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=32776.333333333336, ans=0.125
2024-06-19 17:56:11,550 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.41 vs. limit=10.0
2024-06-19 17:56:15,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.67 vs. limit=22.5
2024-06-19 17:56:17,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.40 vs. limit=15.0
2024-06-19 17:56:20,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=32813.0, ans=0.125
2024-06-19 17:56:21,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=70.61 vs. limit=15.0
2024-06-19 17:56:30,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=15.0
2024-06-19 17:56:31,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=32831.333333333336, ans=0.125
2024-06-19 17:56:36,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.41 vs. limit=22.5
2024-06-19 17:56:38,263 INFO [train.py:1028] (0/2) Epoch 2, batch 7800, loss[loss=1.014, simple_loss=0.6533, pruned_loss=0.6871, over 13128.00 frames. ], tot_loss[loss=1.008, simple_loss=0.6491, pruned_loss=0.6839, over 2578756.85 frames. ], batch size: 95, lr: 2.13e-02, grad_scale: 0.5
2024-06-19 17:56:43,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=32849.666666666664, ans=0.025
2024-06-19 17:56:43,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=28.20 vs. limit=15.0
2024-06-19 17:56:43,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.75 vs. limit=22.5
2024-06-19 17:56:52,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=32886.333333333336, ans=0.025
2024-06-19 17:56:54,540 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=21.32 vs. limit=15.0
2024-06-19 17:56:57,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=32886.333333333336, ans=0.125
2024-06-19 17:57:10,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=32923.0, ans=0.125
2024-06-19 17:57:12,887 INFO [train.py:1028] (0/2) Epoch 2, batch 7850, loss[loss=0.9548, simple_loss=0.5893, pruned_loss=0.6601, over 11628.00 frames. ], tot_loss[loss=1.012, simple_loss=0.6516, pruned_loss=0.6867, over 2572774.70 frames. ], batch size: 17, lr: 2.13e-02, grad_scale: 0.25
2024-06-19 17:57:14,717 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.37 vs. limit=15.0
2024-06-19 17:57:18,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=32941.333333333336, ans=0.125
2024-06-19 17:57:19,390 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.22 vs. limit=15.0
2024-06-19 17:57:23,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.16 vs. limit=10.0
2024-06-19 17:57:24,406 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+03 5.662e+03 6.789e+03 8.185e+03 3.824e+04, threshold=1.358e+04, percent-clipped=6.0
2024-06-19 17:57:35,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=32978.0, ans=0.0
2024-06-19 17:57:45,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=33014.666666666664, ans=0.0
2024-06-19 17:57:46,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=33014.666666666664, ans=10.0
2024-06-19 17:57:49,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=33033.0, ans=0.125
2024-06-19 17:57:49,813 INFO [train.py:1028] (0/2) Epoch 2, batch 7900, loss[loss=1.068, simple_loss=0.6797, pruned_loss=0.728, over 13189.00 frames. ], tot_loss[loss=1.014, simple_loss=0.6526, pruned_loss=0.6879, over 2573026.95 frames. ], batch size: 77, lr: 2.13e-02, grad_scale: 0.5
2024-06-19 17:57:50,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.29 vs. limit=22.5
2024-06-19 17:58:06,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.76 vs. limit=15.0
2024-06-19 17:58:07,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=33069.666666666664, ans=0.0
2024-06-19 17:58:07,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=33069.666666666664, ans=0.125
2024-06-19 17:58:12,382 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.96 vs. limit=15.0
2024-06-19 17:58:16,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=20.15 vs. limit=15.0
2024-06-19 17:58:23,128 INFO [train.py:1028] (0/2) Epoch 2, batch 7950, loss[loss=0.9103, simple_loss=0.6231, pruned_loss=0.5988, over 10670.00 frames. ], tot_loss[loss=1.015, simple_loss=0.6529, pruned_loss=0.6888, over 2575296.86 frames. ], batch size: 303, lr: 2.12e-02, grad_scale: 0.5
2024-06-19 17:58:34,718 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+03 4.370e+03 5.379e+03 6.669e+03 1.454e+04, threshold=1.076e+04, percent-clipped=2.0
2024-06-19 17:58:36,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=33143.0, ans=0.125
2024-06-19 17:58:38,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=33143.0, ans=0.05
2024-06-19 17:58:48,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33179.666666666664, ans=0.1
2024-06-19 17:58:52,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=33179.666666666664, ans=0.125
2024-06-19 17:59:00,334 INFO [train.py:1028] (0/2) Epoch 2, batch 8000, loss[loss=1.024, simple_loss=0.6341, pruned_loss=0.7073, over 12538.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6554, pruned_loss=0.693, over 2572152.84 frames. ], batch size: 29, lr: 2.12e-02, grad_scale: 1.0
2024-06-19 17:59:01,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=33216.333333333336, ans=0.125
2024-06-19 17:59:09,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=33234.666666666664, ans=0.0036446376811594215
2024-06-19 17:59:14,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.09 vs. limit=22.5
2024-06-19 17:59:21,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.91 vs. limit=22.5
2024-06-19 17:59:23,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=33271.333333333336, ans=0.2
2024-06-19 17:59:24,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=33271.333333333336, ans=0.025
2024-06-19 17:59:33,075 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-19 17:59:36,390 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 17:59:36,982 INFO [train.py:1028] (0/2) Epoch 2, batch 8050, loss[loss=1.016, simple_loss=0.6583, pruned_loss=0.6865, over 13254.00 frames. ], tot_loss[loss=1.021, simple_loss=0.655, pruned_loss=0.6934, over 2572309.23 frames. ], batch size: 83, lr: 2.12e-02, grad_scale: 0.25
2024-06-19 17:59:39,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33308.0, ans=0.1
2024-06-19 17:59:43,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=25.26 vs. limit=15.0
2024-06-19 17:59:45,903 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+03 3.672e+03 4.743e+03 5.647e+03 2.569e+04, threshold=9.485e+03, percent-clipped=5.0
2024-06-19 17:59:49,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=33344.666666666664, ans=0.125
2024-06-19 17:59:58,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=33363.0, ans=0.125
2024-06-19 18:00:00,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=33363.0, ans=0.125
2024-06-19 18:00:08,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.62 vs. limit=15.0
2024-06-19 18:00:09,222 INFO [train.py:1028] (0/2) Epoch 2, batch 8100, loss[loss=0.9856, simple_loss=0.6484, pruned_loss=0.6615, over 13116.00 frames. ], tot_loss[loss=1.023, simple_loss=0.6573, pruned_loss=0.6939, over 2576176.17 frames. ], batch size: 112, lr: 2.12e-02, grad_scale: 0.5
2024-06-19 18:00:10,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=33399.666666666664, ans=0.125
2024-06-19 18:00:14,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=33399.666666666664, ans=0.125
2024-06-19 18:00:19,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=21.47 vs. limit=15.0
2024-06-19 18:00:21,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=33418.0, ans=0.2
2024-06-19 18:00:24,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=33436.333333333336, ans=0.0
2024-06-19 18:00:27,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=33436.333333333336, ans=0.0
2024-06-19 18:00:43,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=33473.0, ans=0.2
2024-06-19 18:00:47,698 INFO [train.py:1028] (0/2) Epoch 2, batch 8150, loss[loss=0.9544, simple_loss=0.6319, pruned_loss=0.6384, over 13119.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6576, pruned_loss=0.6947, over 2579288.95 frames. ], batch size: 121, lr: 2.12e-02, grad_scale: 0.25
2024-06-19 18:00:49,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=15.0
limit=15.0 2024-06-19 18:00:54,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=33509.666666666664, ans=0.125 2024-06-19 18:00:54,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=33509.666666666664, ans=0.125 2024-06-19 18:00:58,106 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.772e+03 5.329e+03 6.812e+03 8.456e+03 3.052e+04, threshold=1.362e+04, percent-clipped=10.0 2024-06-19 18:00:58,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=33509.666666666664, ans=0.125 2024-06-19 18:01:05,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=33528.0, ans=0.0035808695652173907 2024-06-19 18:01:10,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=12.0 2024-06-19 18:01:13,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=33546.333333333336, ans=0.2 2024-06-19 18:01:13,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=33546.333333333336, ans=0.2 2024-06-19 18:01:16,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=33564.666666666664, ans=0.5 2024-06-19 18:01:20,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=33564.666666666664, ans=0.0035728985507246383 2024-06-19 18:01:22,116 INFO [train.py:1028] (0/2) Epoch 2, batch 8200, loss[loss=1.003, simple_loss=0.6577, pruned_loss=0.6741, over 13098.00 frames. ], tot_loss[loss=1.023, simple_loss=0.6581, pruned_loss=0.6944, over 2582526.32 frames. ], batch size: 112, lr: 2.11e-02, grad_scale: 0.5 2024-06-19 18:01:28,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-19 18:01:30,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=33583.0, ans=0.125 2024-06-19 18:01:31,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=33601.333333333336, ans=0.125 2024-06-19 18:01:32,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=33601.333333333336, ans=0.125 2024-06-19 18:01:34,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33601.333333333336, ans=0.1 2024-06-19 18:01:40,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.97 vs. limit=22.5 2024-06-19 18:01:56,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.67 vs. limit=15.0 2024-06-19 18:02:00,402 INFO [train.py:1028] (0/2) Epoch 2, batch 8250, loss[loss=1.076, simple_loss=0.6895, pruned_loss=0.7308, over 13273.00 frames. 
], tot_loss[loss=1.023, simple_loss=0.6588, pruned_loss=0.6938, over 2583505.96 frames. ], batch size: 52, lr: 2.11e-02, grad_scale: 0.125 2024-06-19 18:02:01,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=17.90 vs. limit=15.0 2024-06-19 18:02:02,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=33674.666666666664, ans=0.125 2024-06-19 18:02:04,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=33674.666666666664, ans=0.2 2024-06-19 18:02:11,433 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.323e+03 8.435e+03 1.162e+04 1.808e+04 3.960e+04, threshold=2.323e+04, percent-clipped=39.0 2024-06-19 18:02:12,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=33693.0, ans=0.0 2024-06-19 18:02:15,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.14 vs. limit=22.5 2024-06-19 18:02:18,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=33711.333333333336, ans=0.125 2024-06-19 18:02:20,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=25.69 vs. limit=15.0 2024-06-19 18:02:22,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33729.666666666664, ans=0.1 2024-06-19 18:02:23,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=15.0 2024-06-19 18:02:30,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=33748.0, ans=0.0035330434782608697 2024-06-19 18:02:33,704 INFO [train.py:1028] (0/2) Epoch 2, batch 8300, loss[loss=1.003, simple_loss=0.6472, pruned_loss=0.6796, over 13024.00 frames. ], tot_loss[loss=1.022, simple_loss=0.6581, pruned_loss=0.6928, over 2580345.94 frames. ], batch size: 102, lr: 2.11e-02, grad_scale: 0.25 2024-06-19 18:02:34,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33766.333333333336, ans=0.1 2024-06-19 18:02:39,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=33766.333333333336, ans=0.125 2024-06-19 18:02:43,175 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=5.706e-01 2024-06-19 18:02:45,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=33784.666666666664, ans=0.125 2024-06-19 18:02:45,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.48 vs. 
limit=15.0 2024-06-19 18:02:46,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=33784.666666666664, ans=0.125 2024-06-19 18:02:55,262 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=5.367e+02 2024-06-19 18:02:56,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=25.56 vs. limit=15.0 2024-06-19 18:02:58,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.35 vs. limit=22.5 2024-06-19 18:03:11,296 INFO [train.py:1028] (0/2) Epoch 2, batch 8350, loss[loss=0.9838, simple_loss=0.6449, pruned_loss=0.6613, over 13171.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6585, pruned_loss=0.6958, over 2582001.59 frames. ], batch size: 112, lr: 2.11e-02, grad_scale: 0.25 2024-06-19 18:03:20,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=33876.333333333336, ans=0.125 2024-06-19 18:03:23,337 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.977e+03 8.620e+03 1.124e+04 1.451e+04 5.858e+04, threshold=2.248e+04, percent-clipped=6.0 2024-06-19 18:03:31,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=33894.666666666664, ans=0.125 2024-06-19 18:03:33,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=33894.666666666664, ans=6.0 2024-06-19 18:03:39,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=33913.0, ans=0.0 2024-06-19 18:03:48,668 INFO [train.py:1028] (0/2) Epoch 2, batch 8400, loss[loss=1.01, simple_loss=0.6348, pruned_loss=0.6923, over 12883.00 frames. ], tot_loss[loss=1.026, simple_loss=0.6584, pruned_loss=0.6963, over 2578991.49 frames. ], batch size: 39, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:03:49,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=33949.666666666664, ans=0.025 2024-06-19 18:03:57,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.36 vs. limit=12.0 2024-06-19 18:04:14,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=34023.0, ans=0.125 2024-06-19 18:04:20,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=34023.0, ans=0.0034732608695652173 2024-06-19 18:04:20,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=34023.0, ans=0.0 2024-06-19 18:04:21,226 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=12.0 2024-06-19 18:04:22,020 INFO [train.py:1028] (0/2) Epoch 2, batch 8450, loss[loss=1.021, simple_loss=0.664, pruned_loss=0.6892, over 13176.00 frames. ], tot_loss[loss=1.03, simple_loss=0.6616, pruned_loss=0.6993, over 2581271.21 frames. ], batch size: 112, lr: 2.10e-02, grad_scale: 0.25
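The scaling.py:1023 Whitening lines compare a whitening metric against a limit. The metric is close to 1.0 when the covariance of a module's output channels is roughly a multiple of the identity, and grows as channels become correlated or unequal in scale; entries are logged when the metric exceeds its limit (e.g. metric=25.56 vs. limit=15.0 above). Below is a hedged reimplementation of such a metric; the exact computation and the penalty it drives in icefall's scaling.py differ in details, and the contiguous-group channel layout here is an assumption.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels) activations; returns ~1.0 when the
    # per-group channel covariance is ~ c * I, and larger otherwise.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    c = num_channels // num_groups
    xg = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (groups, frames, c)
    cov = xg.transpose(1, 2) @ xg / num_frames                 # (groups, c, c)
    diag_mean = cov.diagonal(dim1=1, dim2=2).mean()            # mean channel variance
    sq_mean = (cov ** 2).sum() / (num_groups * c)              # squared cov mass per channel
    return (sq_mean / (diag_mean ** 2 + 1e-20)).item()

x = torch.randn(2000, 384)
print(whitening_metric(x))                           # near 1 for whitened features
print(whitening_metric(x @ torch.ones(384, 384)))    # ~384: all channels identical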
2024-06-19 18:04:22,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.41 vs. limit=22.5 2024-06-19 18:04:26,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=34041.333333333336, ans=0.125 2024-06-19 18:04:29,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.41 vs. limit=15.0 2024-06-19 18:04:32,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=34059.666666666664, ans=0.003465289855072465 2024-06-19 18:04:34,282 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.056e+03 8.668e+03 1.037e+04 1.252e+04 4.644e+04, threshold=2.074e+04, percent-clipped=5.0 2024-06-19 18:04:34,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=34059.666666666664, ans=0.025 2024-06-19 18:04:41,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.00 vs. limit=10.0 2024-06-19 18:04:43,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. limit=15.0 2024-06-19 18:04:48,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=34096.333333333336, ans=0.00345731884057971 2024-06-19 18:04:48,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=34096.333333333336, ans=0.00345731884057971 2024-06-19 18:05:00,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=34114.666666666664, ans=0.0 2024-06-19 18:05:00,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.58 vs. limit=22.5 2024-06-19 18:05:01,188 INFO [train.py:1028] (0/2) Epoch 2, batch 8500, loss[loss=1.049, simple_loss=0.6569, pruned_loss=0.7201, over 12593.00 frames. ], tot_loss[loss=1.03, simple_loss=0.6619, pruned_loss=0.6988, over 2578690.95 frames.
], batch size: 29, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:05:05,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=34133.0, ans=0.125 2024-06-19 18:05:05,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=34133.0, ans=0.02 2024-06-19 18:05:15,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=34169.666666666664, ans=0.0 2024-06-19 18:05:17,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=34169.666666666664, ans=0.125 2024-06-19 18:05:28,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=34206.333333333336, ans=0.003433405797101448 2024-06-19 18:05:35,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=34206.333333333336, ans=0.2 2024-06-19 18:05:35,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34206.333333333336, ans=0.1 2024-06-19 18:05:37,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34206.333333333336, ans=0.1 2024-06-19 18:05:37,328 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=15.0 2024-06-19 18:05:38,926 INFO [train.py:1028] (0/2) Epoch 2, batch 8550, loss[loss=1.029, simple_loss=0.6392, pruned_loss=0.7096, over 12434.00 frames. ], tot_loss[loss=1.028, simple_loss=0.6609, pruned_loss=0.6972, over 2575031.23 frames. ], batch size: 22, lr: 2.10e-02, grad_scale: 0.25 2024-06-19 18:05:51,493 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.523e+03 8.931e+03 1.134e+04 1.489e+04 5.166e+04, threshold=2.268e+04, percent-clipped=8.0 2024-06-19 18:05:55,960 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=22.5 2024-06-19 18:05:58,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=34279.666666666664, ans=0.2 2024-06-19 18:06:01,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=34279.666666666664, ans=0.125 2024-06-19 18:06:06,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=15.0 2024-06-19 18:06:06,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=34298.0, ans=0.05 2024-06-19 18:06:12,065 INFO [train.py:1028] (0/2) Epoch 2, batch 8600, loss[loss=0.9533, simple_loss=0.6273, pruned_loss=0.6397, over 13116.00 frames. ], tot_loss[loss=1.027, simple_loss=0.6609, pruned_loss=0.6964, over 2573297.17 frames. 
], batch size: 112, lr: 2.09e-02, grad_scale: 0.5 2024-06-19 18:06:13,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=34316.333333333336, ans=0.125 2024-06-19 18:06:22,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=23.34 vs. limit=15.0 2024-06-19 18:06:24,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=25.46 vs. limit=15.0 2024-06-19 18:06:25,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.51 vs. limit=15.0 2024-06-19 18:06:28,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2024-06-19 18:06:33,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=34371.333333333336, ans=0.125 2024-06-19 18:06:35,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=34371.333333333336, ans=0.125 2024-06-19 18:06:39,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=34389.666666666664, ans=0.2 2024-06-19 18:06:40,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.23 vs. limit=15.0 2024-06-19 18:06:42,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=34389.666666666664, ans=0.025 2024-06-19 18:06:49,536 INFO [train.py:1028] (0/2) Epoch 2, batch 8650, loss[loss=0.9743, simple_loss=0.6409, pruned_loss=0.6538, over 13059.00 frames. ], tot_loss[loss=1.027, simple_loss=0.6613, pruned_loss=0.6962, over 2576260.25 frames. ], batch size: 102, lr: 2.09e-02, grad_scale: 0.5 2024-06-19 18:07:00,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.30 vs. limit=15.0 2024-06-19 18:07:02,502 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.826e+03 7.037e+03 8.400e+03 1.020e+04 2.596e+04, threshold=1.680e+04, percent-clipped=2.0 2024-06-19 18:07:05,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.93 vs. limit=15.0 2024-06-19 18:07:22,762 INFO [train.py:1028] (0/2) Epoch 2, batch 8700, loss[loss=1.064, simple_loss=0.6761, pruned_loss=0.7254, over 13164.00 frames. ], tot_loss[loss=1.024, simple_loss=0.6605, pruned_loss=0.6938, over 2572964.42 frames. ], batch size: 59, lr: 2.09e-02, grad_scale: 0.5 2024-06-19 18:07:22,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=34499.666666666664, ans=0.035 2024-06-19 18:07:27,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=34499.666666666664, ans=0.125 2024-06-19 18:07:34,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.51 vs. 
limit=10.0 2024-06-19 18:07:35,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34518.0, ans=0.0 2024-06-19 18:07:36,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=34518.0, ans=0.125 2024-06-19 18:07:44,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=34536.333333333336, ans=0.2 2024-06-19 18:07:46,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=34536.333333333336, ans=0.025 2024-06-19 18:07:49,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.81 vs. limit=15.0 2024-06-19 18:07:52,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.83 vs. limit=22.5 2024-06-19 18:07:57,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34573.0, ans=0.125 2024-06-19 18:08:00,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=34591.333333333336, ans=0.125 2024-06-19 18:08:00,561 INFO [train.py:1028] (0/2) Epoch 2, batch 8750, loss[loss=1.015, simple_loss=0.6651, pruned_loss=0.6823, over 13126.00 frames. ], tot_loss[loss=1.022, simple_loss=0.6594, pruned_loss=0.6922, over 2569900.08 frames. ], batch size: 121, lr: 2.09e-02, grad_scale: 0.125 2024-06-19 18:08:00,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=34591.333333333336, ans=0.2 2024-06-19 18:08:02,660 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.720e+02 2024-06-19 18:08:05,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=34591.333333333336, ans=0.2 2024-06-19 18:08:14,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=23.62 vs. limit=15.0 2024-06-19 18:08:15,166 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.385e+03 6.994e+03 9.550e+03 1.196e+04 3.720e+04, threshold=1.910e+04, percent-clipped=8.0 2024-06-19 18:08:21,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=34646.333333333336, ans=0.125 2024-06-19 18:08:21,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=28.24 vs. limit=22.5 2024-06-19 18:08:34,855 INFO [train.py:1028] (0/2) Epoch 2, batch 8800, loss[loss=1.02, simple_loss=0.6491, pruned_loss=0.6956, over 13239.00 frames. ], tot_loss[loss=1.021, simple_loss=0.659, pruned_loss=0.6919, over 2574889.50 frames. 
], batch size: 72, lr: 2.08e-02, grad_scale: 0.25 2024-06-19 18:08:34,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34683.0, ans=0.0 2024-06-19 18:08:45,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=34701.333333333336, ans=0.0 2024-06-19 18:08:47,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=34701.333333333336, ans=0.125 2024-06-19 18:08:51,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.31 vs. limit=22.5 2024-06-19 18:09:03,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=34738.0, ans=0.125 2024-06-19 18:09:07,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=34756.333333333336, ans=0.003313840579710144 2024-06-19 18:09:14,090 INFO [train.py:1028] (0/2) Epoch 2, batch 8850, loss[loss=0.9879, simple_loss=0.6535, pruned_loss=0.6612, over 12472.00 frames. ], tot_loss[loss=1.022, simple_loss=0.6585, pruned_loss=0.6924, over 2564104.77 frames. ], batch size: 202, lr: 2.08e-02, grad_scale: 0.25 2024-06-19 18:09:14,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=34774.666666666664, ans=0.2 2024-06-19 18:09:15,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=34774.666666666664, ans=0.003309855072463768 2024-06-19 18:09:26,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.69 vs. limit=15.0 2024-06-19 18:09:26,572 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.18 vs. limit=22.5 2024-06-19 18:09:31,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.58 vs. limit=15.0 2024-06-19 18:09:32,454 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.327e+03 4.716e+03 6.151e+03 7.583e+03 1.456e+04, threshold=1.230e+04, percent-clipped=0.0 2024-06-19 18:09:33,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=34811.333333333336, ans=10.0 2024-06-19 18:09:37,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=34811.333333333336, ans=0.0 2024-06-19 18:09:44,297 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=4.587e+01 2024-06-19 18:09:49,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=34848.0, ans=0.125 2024-06-19 18:09:51,481 INFO [train.py:1028] (0/2) Epoch 2, batch 8900, loss[loss=1.011, simple_loss=0.6296, pruned_loss=0.6958, over 12887.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6603, pruned_loss=0.6946, over 2562113.41 frames. 
], batch size: 33, lr: 2.08e-02, grad_scale: 0.5 2024-06-19 18:10:02,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=34884.666666666664, ans=0.0 2024-06-19 18:10:08,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.26 vs. limit=10.0 2024-06-19 18:10:12,960 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.60 vs. limit=15.0 2024-06-19 18:10:13,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=34921.333333333336, ans=0.2 2024-06-19 18:10:15,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=34921.333333333336, ans=0.0 2024-06-19 18:10:18,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=34939.666666666664, ans=0.125 2024-06-19 18:10:19,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.37 vs. limit=15.0 2024-06-19 18:10:24,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=17.32 vs. limit=12.0 2024-06-19 18:10:25,068 INFO [train.py:1028] (0/2) Epoch 2, batch 8950, loss[loss=1.04, simple_loss=0.6972, pruned_loss=0.6913, over 12511.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6601, pruned_loss=0.6945, over 2562288.27 frames. ], batch size: 202, lr: 2.08e-02, grad_scale: 0.25 2024-06-19 18:10:27,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.40 vs. limit=10.0 2024-06-19 18:10:38,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=34994.666666666664, ans=0.07 2024-06-19 18:10:44,247 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.515e+03 5.590e+03 7.381e+03 9.061e+03 3.015e+04, threshold=1.476e+04, percent-clipped=10.0 2024-06-19 18:10:45,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=34994.666666666664, ans=0.07 2024-06-19 18:10:53,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=35013.0, ans=0.125 2024-06-19 18:10:55,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.48 vs. limit=22.5 2024-06-19 18:11:02,888 INFO [train.py:1028] (0/2) Epoch 2, batch 9000, loss[loss=1.073, simple_loss=0.6702, pruned_loss=0.7381, over 13287.00 frames. ], tot_loss[loss=1.03, simple_loss=0.6619, pruned_loss=0.6989, over 2567766.12 frames. ], batch size: 46, lr: 2.07e-02, grad_scale: 0.5 2024-06-19 18:11:02,889 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 18:11:10,680 INFO [train.py:1060] (0/2) Epoch 2, validation: loss=0.9784, simple_loss=0.6238, pruned_loss=0.6665, over 351949.00 frames. 
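Per the train.py:1051/1060/1061 lines just above, validation runs the whole dev set with gradients disabled and reports frame-weighted averages, which is why every validation entry in this log shows the same total of 351949.00 frames. A rough sketch of that pass, with placeholder names (compute_loss and valid_dl are stand-ins, not the exact icefall API):

import torch

def compute_validation_loss(model, valid_dl, compute_loss):
    # Frame-weighted average loss over the dev set, as in the validation lines.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            # compute_loss is assumed to return the batch loss summed over
            # frames, plus the number of frames it covers
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item()
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames, tot_frames  # e.g. loss=0.9784 over 351949 frames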
2024-06-19 18:11:10,681 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB 2024-06-19 18:11:11,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.31 vs. limit=15.0 2024-06-19 18:11:14,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.56 vs. limit=15.0 2024-06-19 18:11:16,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=35049.666666666664, ans=0.0032500724637681165 2024-06-19 18:11:17,494 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.433e+02 2024-06-19 18:11:20,420 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.32 vs. limit=15.0 2024-06-19 18:11:20,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=35068.0, ans=0.003246086956521739 2024-06-19 18:11:23,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.29 vs. limit=15.0 2024-06-19 18:11:37,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=42.22 vs. limit=15.0 2024-06-19 18:11:43,221 INFO [train.py:1028] (0/2) Epoch 2, batch 9050, loss[loss=0.9763, simple_loss=0.6015, pruned_loss=0.6755, over 11847.00 frames. ], tot_loss[loss=1.033, simple_loss=0.6632, pruned_loss=0.7013, over 2567814.80 frames. ], batch size: 17, lr: 2.07e-02, grad_scale: 0.25 2024-06-19 18:11:45,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=35141.333333333336, ans=10.0 2024-06-19 18:11:49,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=35159.666666666664, ans=0.0 2024-06-19 18:11:50,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.85 vs. limit=15.0 2024-06-19 18:12:00,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=35178.0, ans=0.0 2024-06-19 18:12:01,541 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.455e+03 5.302e+03 6.733e+03 8.293e+03 2.331e+04, threshold=1.347e+04, percent-clipped=3.0 2024-06-19 18:12:07,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=35196.333333333336, ans=0.125 2024-06-19 18:12:09,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=35196.333333333336, ans=0.125 2024-06-19 18:12:12,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=35214.666666666664, ans=0.0032142028985507253 2024-06-19 18:12:14,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.86 vs. 
limit=15.0 2024-06-19 18:12:19,238 INFO [train.py:1028] (0/2) Epoch 2, batch 9100, loss[loss=1.064, simple_loss=0.6759, pruned_loss=0.7261, over 13242.00 frames. ], tot_loss[loss=1.035, simple_loss=0.6637, pruned_loss=0.7031, over 2567931.82 frames. ], batch size: 72, lr: 2.07e-02, grad_scale: 0.5 2024-06-19 18:12:19,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.09 vs. limit=10.0 2024-06-19 18:12:24,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.12 vs. limit=10.0 2024-06-19 18:12:34,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.90 vs. limit=15.0 2024-06-19 18:12:38,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.78 vs. limit=15.0 2024-06-19 18:12:41,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=35288.0, ans=0.003198260869565218 2024-06-19 18:12:43,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=35288.0, ans=0.2 2024-06-19 18:12:46,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35306.333333333336, ans=0.1 2024-06-19 18:12:46,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=35306.333333333336, ans=0.5 2024-06-19 18:12:46,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.29 vs. limit=22.5 2024-06-19 18:12:52,033 INFO [train.py:1028] (0/2) Epoch 2, batch 9150, loss[loss=1.077, simple_loss=0.6928, pruned_loss=0.7302, over 13172.00 frames. ], tot_loss[loss=1.035, simple_loss=0.6638, pruned_loss=0.7027, over 2568586.32 frames. ], batch size: 77, lr: 2.07e-02, grad_scale: 0.25 2024-06-19 18:12:59,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=35343.0, ans=0.125 2024-06-19 18:13:01,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.13 vs. limit=22.5 2024-06-19 18:13:08,078 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.406e+03 5.533e+03 6.542e+03 8.494e+03 4.349e+04, threshold=1.308e+04, percent-clipped=9.0 2024-06-19 18:13:14,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=35379.666666666664, ans=0.0 2024-06-19 18:13:24,376 INFO [train.py:1028] (0/2) Epoch 2, batch 9200, loss[loss=1.061, simple_loss=0.6624, pruned_loss=0.7299, over 12951.00 frames. ], tot_loss[loss=1.034, simple_loss=0.663, pruned_loss=0.7024, over 2572775.23 frames. ], batch size: 36, lr: 2.06e-02, grad_scale: 0.5
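The scaling.py:214 ScheduledFloat lines, which dominate this log, print module hyperparameters (dropout probabilities, skip rates, balancer bounds) whose values (the ans=... field) are functions of the global batch_count rather than fixed constants. A simplified re-creation of the idea is below, assuming plain piecewise-linear interpolation between (batch_count, value) breakpoints; the real class carries more machinery (defaults, arithmetic on schedules).

class ScheduledFloat:
    # A float hyperparameter that is piecewise-linear in the training batch count.
    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.1)
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:  # linear interpolation inside [x0, x1]
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A dropout decaying from 0.3 to 0.1 over the first 20k batches would be logged
# with ans=0.1 at the batch counts seen around here:
dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(35416.3))  # -> 0.1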
2024-06-19 18:13:25,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=35416.333333333336, ans=0.125 2024-06-19 18:13:25,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.28 vs. limit=15.0 2024-06-19 18:13:34,758 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.937e+01 2024-06-19 18:13:44,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=35471.333333333336, ans=0.0031584057971014486 2024-06-19 18:13:45,042 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=6.052e-01 2024-06-19 18:13:49,872 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.20 vs. limit=22.5 2024-06-19 18:13:50,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=35489.666666666664, ans=0.07 2024-06-19 18:13:56,302 INFO [train.py:1028] (0/2) Epoch 2, batch 9250, loss[loss=1.082, simple_loss=0.6909, pruned_loss=0.7365, over 13164.00 frames. ], tot_loss[loss=1.038, simple_loss=0.6646, pruned_loss=0.7054, over 2574896.51 frames. ], batch size: 67, lr: 2.06e-02, grad_scale: 0.125 2024-06-19 18:13:57,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=35508.0, ans=0.025 2024-06-19 18:14:02,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=35526.333333333336, ans=0.025 2024-06-19 18:14:17,815 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.650e+03 6.395e+03 7.586e+03 9.001e+03 3.996e+04, threshold=1.517e+04, percent-clipped=3.0 2024-06-19 18:14:20,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=35563.0, ans=0.125 2024-06-19 18:14:21,813 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.685e-01 2024-06-19 18:14:23,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.71 vs. limit=22.5 2024-06-19 18:14:24,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=35563.0, ans=0.025 2024-06-19 18:14:31,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=35581.333333333336, ans=0.2 2024-06-19 18:14:32,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.66 vs. limit=10.0 2024-06-19 18:14:32,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=15.0 2024-06-19 18:14:33,106 INFO [train.py:1028] (0/2) Epoch 2, batch 9300, loss[loss=1.028, simple_loss=0.6291, pruned_loss=0.7139, over 13268.00 frames. ], tot_loss[loss=1.044, simple_loss=0.6658, pruned_loss=0.7112, over 2571397.72 frames.
], batch size: 40, lr: 2.06e-02, grad_scale: 0.25 2024-06-19 18:14:46,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=35636.333333333336, ans=0.125 2024-06-19 18:14:54,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=35654.666666666664, ans=0.09899494936611666 2024-06-19 18:15:05,517 INFO [train.py:1028] (0/2) Epoch 2, batch 9350, loss[loss=1.117, simple_loss=0.6909, pruned_loss=0.7721, over 12685.00 frames. ], tot_loss[loss=1.046, simple_loss=0.6659, pruned_loss=0.7126, over 2569689.46 frames. ], batch size: 22, lr: 2.06e-02, grad_scale: 0.25 2024-06-19 18:15:07,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.69 vs. limit=15.0 2024-06-19 18:15:09,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=20.43 vs. limit=15.0 2024-06-19 18:15:12,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=17.78 vs. limit=15.0 2024-06-19 18:15:13,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=35709.666666666664, ans=0.125 2024-06-19 18:15:22,571 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.668e+03 3.576e+03 4.368e+03 5.539e+03 2.115e+04, threshold=8.737e+03, percent-clipped=2.0 2024-06-19 18:15:24,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=35746.333333333336, ans=15.0 2024-06-19 18:15:34,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=35764.666666666664, ans=0.125 2024-06-19 18:15:37,170 INFO [train.py:1028] (0/2) Epoch 2, batch 9400, loss[loss=1.026, simple_loss=0.6494, pruned_loss=0.7011, over 13281.00 frames. ], tot_loss[loss=1.046, simple_loss=0.666, pruned_loss=0.7128, over 2569643.11 frames. ], batch size: 52, lr: 2.06e-02, grad_scale: 0.5 2024-06-19 18:15:41,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.15 vs. limit=6.0 2024-06-19 18:15:49,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=20.45 vs. limit=15.0 2024-06-19 18:15:51,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=35801.333333333336, ans=0.2 2024-06-19 18:15:51,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=35801.333333333336, ans=0.125 2024-06-19 18:15:55,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=35819.666666666664, ans=0.0 2024-06-19 18:15:57,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.75 vs. 
limit=15.0 2024-06-19 18:16:03,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.36 vs. limit=22.5 2024-06-19 18:16:06,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.74 vs. limit=22.5 2024-06-19 18:16:10,434 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=27.63 vs. limit=15.0 2024-06-19 18:16:10,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=35856.333333333336, ans=15.0 2024-06-19 18:16:11,460 INFO [train.py:1028] (0/2) Epoch 2, batch 9450, loss[loss=1.072, simple_loss=0.6635, pruned_loss=0.7405, over 12523.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6666, pruned_loss=0.7142, over 2570259.20 frames. ], batch size: 22, lr: 2.05e-02, grad_scale: 0.25 2024-06-19 18:16:14,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=15.0 2024-06-19 18:16:16,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.22 vs. limit=15.0 2024-06-19 18:16:21,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=35893.0, ans=0.125 2024-06-19 18:16:21,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.77 vs. limit=15.0 2024-06-19 18:16:26,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=35911.333333333336, ans=0.125 2024-06-19 18:16:28,750 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+03 4.735e+03 5.988e+03 7.794e+03 5.964e+04, threshold=1.198e+04, percent-clipped=17.0 2024-06-19 18:16:43,046 INFO [train.py:1028] (0/2) Epoch 2, batch 9500, loss[loss=0.998, simple_loss=0.6245, pruned_loss=0.6858, over 13282.00 frames. ], tot_loss[loss=1.047, simple_loss=0.6659, pruned_loss=0.7138, over 2578419.81 frames. ], batch size: 43, lr: 2.05e-02, grad_scale: 0.5 2024-06-19 18:16:52,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=29.54 vs. limit=15.0 2024-06-19 18:16:57,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=36003.0, ans=0.125 2024-06-19 18:17:01,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=36021.333333333336, ans=0.125 2024-06-19 18:17:04,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=36021.333333333336, ans=0.2 2024-06-19 18:17:12,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.03 vs. limit=15.0 2024-06-19 18:17:16,340 INFO [train.py:1028] (0/2) Epoch 2, batch 9550, loss[loss=1.008, simple_loss=0.6377, pruned_loss=0.6891, over 12936.00 frames. ], tot_loss[loss=1.043, simple_loss=0.665, pruned_loss=0.7108, over 2572874.26 frames. 
], batch size: 39, lr: 2.05e-02, grad_scale: 0.25 2024-06-19 18:17:16,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.79 vs. limit=15.0 2024-06-19 18:17:19,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=36058.0, ans=0.125 2024-06-19 18:17:25,825 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=18.67 vs. limit=15.0 2024-06-19 18:17:30,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=36094.666666666664, ans=0.125 2024-06-19 18:17:32,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36094.666666666664, ans=0.1 2024-06-19 18:17:34,676 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.583e+03 5.388e+03 6.731e+03 7.845e+03 1.805e+04, threshold=1.346e+04, percent-clipped=6.0 2024-06-19 18:17:40,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=36113.0, ans=0.125 2024-06-19 18:17:45,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=36131.333333333336, ans=0.07 2024-06-19 18:17:47,820 INFO [train.py:1028] (0/2) Epoch 2, batch 9600, loss[loss=0.9592, simple_loss=0.6504, pruned_loss=0.6339, over 10547.00 frames. ], tot_loss[loss=1.038, simple_loss=0.6639, pruned_loss=0.706, over 2572238.33 frames. ], batch size: 304, lr: 2.05e-02, grad_scale: 0.5 2024-06-19 18:17:58,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.33 vs. limit=15.0 2024-06-19 18:18:10,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36204.666666666664, ans=0.1 2024-06-19 18:18:18,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=36223.0, ans=0.025 2024-06-19 18:18:19,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=36223.0, ans=0.05 2024-06-19 18:18:20,678 INFO [train.py:1028] (0/2) Epoch 2, batch 9650, loss[loss=0.9795, simple_loss=0.6411, pruned_loss=0.659, over 13132.00 frames. ], tot_loss[loss=1.032, simple_loss=0.6617, pruned_loss=0.7008, over 2561941.84 frames. ], batch size: 132, lr: 2.04e-02, grad_scale: 0.25 2024-06-19 18:18:25,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.97 vs. 
limit=15.0 2024-06-19 18:18:26,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=36259.666666666664, ans=0.025 2024-06-19 18:18:38,873 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.472e+03 6.088e+03 7.459e+03 8.682e+03 2.157e+04, threshold=1.492e+04, percent-clipped=2.0 2024-06-19 18:18:43,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36296.333333333336, ans=0.1 2024-06-19 18:18:44,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=36296.333333333336, ans=0.125 2024-06-19 18:18:47,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.66 vs. limit=22.5 2024-06-19 18:18:49,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.94 vs. limit=15.0 2024-06-19 18:18:51,764 INFO [train.py:1028] (0/2) Epoch 2, batch 9700, loss[loss=0.959, simple_loss=0.6381, pruned_loss=0.64, over 13042.00 frames. ], tot_loss[loss=1.025, simple_loss=0.6582, pruned_loss=0.6954, over 2556936.05 frames. ], batch size: 144, lr: 2.04e-02, grad_scale: 0.5 2024-06-19 18:18:53,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=36333.0, ans=0.0 2024-06-19 18:18:55,747 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.12 vs. limit=15.0 2024-06-19 18:18:58,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=36351.333333333336, ans=0.2 2024-06-19 18:19:01,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=36351.333333333336, ans=0.125 2024-06-19 18:19:02,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2024-06-19 18:19:03,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=36351.333333333336, ans=0.125 2024-06-19 18:19:09,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36369.666666666664, ans=0.1 2024-06-19 18:19:13,137 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=1.048e-01 2024-06-19 18:19:15,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=36388.0, ans=0.125 2024-06-19 18:19:20,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=36406.333333333336, ans=0.2 2024-06-19 18:19:24,419 INFO [train.py:1028] (0/2) Epoch 2, batch 9750, loss[loss=0.9394, simple_loss=0.6198, pruned_loss=0.6295, over 13094.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6569, pruned_loss=0.6928, over 2553101.33 frames. 
], batch size: 132, lr: 2.04e-02, grad_scale: 0.25 2024-06-19 18:19:28,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=21.02 vs. limit=15.0 2024-06-19 18:19:35,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.30 vs. limit=22.5 2024-06-19 18:19:43,799 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.312e+03 6.708e+03 8.321e+03 9.772e+03 3.624e+04, threshold=1.664e+04, percent-clipped=6.0 2024-06-19 18:19:47,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=36479.666666666664, ans=0.0 2024-06-19 18:19:55,672 INFO [train.py:1028] (0/2) Epoch 2, batch 9800, loss[loss=1.094, simple_loss=0.6799, pruned_loss=0.7543, over 12960.00 frames. ], tot_loss[loss=1.021, simple_loss=0.6558, pruned_loss=0.6929, over 2545610.72 frames. ], batch size: 39, lr: 2.04e-02, grad_scale: 0.5 2024-06-19 18:20:03,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=36534.666666666664, ans=0.125 2024-06-19 18:20:10,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36553.0, ans=0.125 2024-06-19 18:20:10,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=36553.0, ans=0.002923260869565217 2024-06-19 18:20:16,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.64 vs. limit=15.0 2024-06-19 18:20:21,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=36589.666666666664, ans=0.0 2024-06-19 18:20:21,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=24.14 vs. limit=15.0 2024-06-19 18:20:27,251 INFO [train.py:1028] (0/2) Epoch 2, batch 9850, loss[loss=1.027, simple_loss=0.6653, pruned_loss=0.6941, over 13083.00 frames. ], tot_loss[loss=1.018, simple_loss=0.6555, pruned_loss=0.6905, over 2537717.37 frames. ], batch size: 102, lr: 2.04e-02, grad_scale: 0.25 2024-06-19 18:20:27,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=36608.0, ans=0.2 2024-06-19 18:20:48,449 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-20000.pt 2024-06-19 18:20:53,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=36663.0, ans=0.125 2024-06-19 18:20:53,959 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.929e+03 5.250e+03 6.975e+03 8.596e+03 1.760e+04, threshold=1.395e+04, percent-clipped=1.0 2024-06-19 18:20:55,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36663.0, ans=0.1 2024-06-19 18:21:05,077 INFO [train.py:1028] (0/2) Epoch 2, batch 9900, loss[loss=1.016, simple_loss=0.6513, pruned_loss=0.6903, over 12923.00 frames. ], tot_loss[loss=1.011, simple_loss=0.653, pruned_loss=0.6847, over 2529761.46 frames. ], batch size: 39, lr: 2.03e-02, grad_scale: 0.5
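The checkpoint.py:75 entry above shows a batch-indexed checkpoint (zipformer/exp/checkpoint-20000.pt) being written mid-epoch, alongside the per-epoch files such as epoch-2.pt saved further down. A minimal sketch of periodic, batch-count-based checkpointing follows; the names and the save_every_n parameter are illustrative, and the actual icefall helper also stores things like sampler, scheduler, and grad-scaler state.

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, exp_dir: Path,
                          batch_idx_train: int, save_every_n: int = 4000) -> None:
    # Write a batch-indexed checkpoint like zipformer/exp/checkpoint-20000.pt.
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    checkpoint = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    torch.save(checkpoint, exp_dir / f"checkpoint-{batch_idx_train}.pt")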
2024-06-19 18:21:05,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.07 vs. limit=15.0 2024-06-19 18:21:09,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=36699.666666666664, ans=0.125 2024-06-19 18:21:15,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=36718.0, ans=0.2 2024-06-19 18:21:15,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=36718.0, ans=0.0 2024-06-19 18:21:18,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.48 vs. limit=15.0 2024-06-19 18:21:18,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36736.333333333336, ans=0.125 2024-06-19 18:21:19,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=12.0 2024-06-19 18:21:24,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.57 vs. limit=15.0 2024-06-19 18:21:33,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36773.0, ans=0.1 2024-06-19 18:21:37,108 INFO [train.py:1028] (0/2) Epoch 2, batch 9950, loss[loss=1.091, simple_loss=0.6873, pruned_loss=0.7471, over 12929.00 frames. ], tot_loss[loss=1.006, simple_loss=0.6506, pruned_loss=0.6809, over 2525206.47 frames. ], batch size: 30, lr: 2.03e-02, grad_scale: 0.125 2024-06-19 18:21:43,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.09 vs. limit=10.0 2024-06-19 18:21:48,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=36809.666666666664, ans=0.125 2024-06-19 18:21:53,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.92 vs. limit=15.0 2024-06-19 18:21:59,343 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.311e+03 5.742e+03 7.351e+03 9.688e+03 4.180e+04, threshold=1.470e+04, percent-clipped=8.0 2024-06-19 18:22:09,815 INFO [train.py:1028] (0/2) Epoch 2, batch 10000, loss[loss=1.042, simple_loss=0.6632, pruned_loss=0.7105, over 12507.00 frames. ], tot_loss[loss=1.007, simple_loss=0.6504, pruned_loss=0.6815, over 2487118.57 frames. ], batch size: 22, lr: 2.03e-02, grad_scale: 0.25 2024-06-19 18:22:11,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=36883.0, ans=0.95 2024-06-19 18:22:12,061 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=26.22 vs.
limit=15.0
2024-06-19 18:22:12,512 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=5.318e-03
2024-06-19 18:22:16,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.27 vs. limit=15.0
2024-06-19 18:22:19,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=36901.333333333336, ans=0.125
2024-06-19 18:22:29,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=36938.0, ans=0.2
2024-06-19 18:22:30,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=36938.0, ans=0.125
2024-06-19 18:22:34,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=36938.0, ans=0.1
2024-06-19 18:22:40,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36956.333333333336, ans=0.1
2024-06-19 18:22:42,361 INFO [train.py:1028] (0/2) Epoch 2, batch 10050, loss[loss=1.052, simple_loss=0.6534, pruned_loss=0.7256, over 12532.00 frames. ], tot_loss[loss=1.011, simple_loss=0.6517, pruned_loss=0.6849, over 2445416.93 frames. ], batch size: 22, lr: 2.03e-02, grad_scale: 0.25
2024-06-19 18:22:49,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36993.0, ans=0.125
2024-06-19 18:22:51,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=36993.0, ans=0.2
2024-06-19 18:22:57,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=37011.333333333336, ans=0.125
2024-06-19 18:23:02,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=37029.666666666664, ans=0.0
2024-06-19 18:23:03,470 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+03 2.909e+03 4.061e+03 4.962e+03 1.023e+04, threshold=8.122e+03, percent-clipped=0.0
2024-06-19 18:23:13,623 INFO [train.py:1028] (0/2) Epoch 2, batch 10100, loss[loss=1.017, simple_loss=0.6232, pruned_loss=0.7054, over 11963.00 frames. ], tot_loss[loss=1.013, simple_loss=0.6495, pruned_loss=0.6882, over 2425829.08 frames. ], batch size: 18, lr: 2.02e-02, grad_scale: 0.5
2024-06-19 18:23:15,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.25 vs. limit=10.0
2024-06-19 18:23:17,651 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.159e-01
2024-06-19 18:23:19,145 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.99 vs. limit=15.0
2024-06-19 18:23:20,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=37084.666666666664, ans=0.125
2024-06-19 18:23:21,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.11 vs.
limit=15.0
2024-06-19 18:23:26,569 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-2.pt
2024-06-19 18:25:26,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=37095.666666666664, ans=0.07
2024-06-19 18:25:27,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.44 vs. limit=15.0
2024-06-19 18:25:27,253 INFO [train.py:1028] (0/2) Epoch 3, batch 0, loss[loss=1.04, simple_loss=0.6475, pruned_loss=0.7159, over 12944.00 frames. ], tot_loss[loss=1.04, simple_loss=0.6475, pruned_loss=0.7159, over 12944.00 frames. ], batch size: 36, lr: 1.92e-02, grad_scale: 1.0
2024-06-19 18:25:27,254 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-19 18:25:34,563 INFO [train.py:1060] (0/2) Epoch 3, validation: loss=0.9885, simple_loss=0.6309, pruned_loss=0.673, over 351949.00 frames.
2024-06-19 18:25:34,563 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB
2024-06-19 18:25:38,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=37095.666666666664, ans=0.125
2024-06-19 18:25:45,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37114.0, ans=0.125
2024-06-19 18:25:52,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37132.333333333336, ans=0.1
2024-06-19 18:25:55,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=37132.333333333336, ans=0.05
2024-06-19 18:26:01,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37150.666666666664, ans=0.125
2024-06-19 18:26:01,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=37150.666666666664, ans=0.125
2024-06-19 18:26:08,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.09 vs. limit=22.5
2024-06-19 18:26:09,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37169.0, ans=0.1
2024-06-19 18:26:11,997 INFO [train.py:1028] (0/2) Epoch 3, batch 50, loss[loss=0.9928, simple_loss=0.6227, pruned_loss=0.6814, over 12566.00 frames. ], tot_loss[loss=0.9623, simple_loss=0.6229, pruned_loss=0.6509, over 574566.75 frames. ], batch size: 29, lr: 1.92e-02, grad_scale: 0.125
2024-06-19 18:26:14,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37187.333333333336, ans=0.125
2024-06-19 18:26:20,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37205.666666666664, ans=0.1
2024-06-19 18:26:23,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs.
limit=6.0
2024-06-19 18:26:25,843 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+03 2.994e+03 3.984e+03 5.286e+03 1.594e+04, threshold=7.968e+03, percent-clipped=8.0
2024-06-19 18:26:26,230 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=9.98 vs. limit=12.0
2024-06-19 18:26:29,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.09 vs. limit=12.0
2024-06-19 18:26:34,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=37242.333333333336, ans=0.125
2024-06-19 18:26:35,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.53 vs. limit=22.5
2024-06-19 18:26:47,249 INFO [train.py:1028] (0/2) Epoch 3, batch 100, loss[loss=1.005, simple_loss=0.6408, pruned_loss=0.6843, over 13293.00 frames. ], tot_loss[loss=0.9481, simple_loss=0.6165, pruned_loss=0.6398, over 1017999.93 frames. ], batch size: 46, lr: 1.92e-02, grad_scale: 0.25
2024-06-19 18:26:50,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.17 vs. limit=10.0
2024-06-19 18:26:53,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.87 vs. limit=15.0
2024-06-19 18:26:58,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=12.12 vs. limit=10.0
2024-06-19 18:26:59,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=37315.666666666664, ans=0.002757463768115943
2024-06-19 18:27:11,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=37334.0, ans=0.125
2024-06-19 18:27:15,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=15.0
2024-06-19 18:27:19,769 INFO [train.py:1028] (0/2) Epoch 3, batch 150, loss[loss=0.9642, simple_loss=0.6074, pruned_loss=0.6605, over 12642.00 frames. ], tot_loss[loss=0.9459, simple_loss=0.6142, pruned_loss=0.6388, over 1366001.34 frames. ], batch size: 29, lr: 1.92e-02, grad_scale: 0.25
2024-06-19 18:27:20,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.64 vs. limit=15.0
2024-06-19 18:27:26,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.10 vs. limit=6.0
2024-06-19 18:27:27,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=37389.0, ans=0.07
2024-06-19 18:27:30,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=37389.0, ans=0.0
2024-06-19 18:27:31,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.29 vs.
limit=15.0
2024-06-19 18:27:33,775 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.831e+03 4.451e+03 5.431e+03 6.629e+03 3.161e+04, threshold=1.086e+04, percent-clipped=12.0
2024-06-19 18:27:36,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=37407.333333333336, ans=0.125
2024-06-19 18:27:40,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=37425.666666666664, ans=0.04949747468305833
2024-06-19 18:27:54,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0
2024-06-19 18:27:55,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=37444.0, ans=0.125
2024-06-19 18:27:57,883 INFO [train.py:1028] (0/2) Epoch 3, batch 200, loss[loss=0.906, simple_loss=0.6129, pruned_loss=0.5996, over 12555.00 frames. ], tot_loss[loss=0.9473, simple_loss=0.6155, pruned_loss=0.6395, over 1635009.41 frames. ], batch size: 202, lr: 1.91e-02, grad_scale: 0.5
2024-06-19 18:28:12,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=37499.0, ans=0.0
2024-06-19 18:28:13,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.87 vs. limit=15.0
2024-06-19 18:28:17,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=37517.333333333336, ans=0.125
2024-06-19 18:28:17,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=37517.333333333336, ans=0.125
2024-06-19 18:28:18,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.68 vs. limit=10.0
2024-06-19 18:28:23,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=37517.333333333336, ans=0.125
2024-06-19 18:28:27,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=37535.666666666664, ans=0.125
2024-06-19 18:28:30,721 INFO [train.py:1028] (0/2) Epoch 3, batch 250, loss[loss=0.8285, simple_loss=0.5521, pruned_loss=0.5525, over 13027.00 frames. ], tot_loss[loss=0.9429, simple_loss=0.6135, pruned_loss=0.6361, over 1846216.65 frames. ], batch size: 144, lr: 1.91e-02, grad_scale: 0.25
2024-06-19 18:28:41,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=37572.333333333336, ans=0.125
2024-06-19 18:28:41,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=37572.333333333336, ans=0.125
2024-06-19 18:28:41,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.96 vs.
limit=15.0
2024-06-19 18:28:41,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=37572.333333333336, ans=0.125
2024-06-19 18:28:48,282 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.818e+03 5.994e+03 7.404e+03 9.653e+03 2.050e+04, threshold=1.481e+04, percent-clipped=15.0
2024-06-19 18:28:50,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.45 vs. limit=15.0
2024-06-19 18:28:57,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=37609.0, ans=0.025
2024-06-19 18:28:57,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=37609.0, ans=0.2
2024-06-19 18:28:57,758 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-19 18:29:00,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=37627.333333333336, ans=0.125
2024-06-19 18:29:06,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=37645.666666666664, ans=0.07
2024-06-19 18:29:06,626 INFO [train.py:1028] (0/2) Epoch 3, batch 300, loss[loss=0.9178, simple_loss=0.6027, pruned_loss=0.6165, over 13201.00 frames. ], tot_loss[loss=0.9467, simple_loss=0.6153, pruned_loss=0.6391, over 2009597.38 frames. ], batch size: 112, lr: 1.91e-02, grad_scale: 0.5
2024-06-19 18:29:15,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37664.0, ans=0.1
2024-06-19 18:29:21,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=37682.333333333336, ans=0.125
2024-06-19 18:29:23,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=37682.333333333336, ans=0.2
2024-06-19 18:29:23,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=37682.333333333336, ans=15.0
2024-06-19 18:29:28,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=15.0
2024-06-19 18:29:37,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=37719.0, ans=0.025
2024-06-19 18:29:39,140 INFO [train.py:1028] (0/2) Epoch 3, batch 350, loss[loss=0.9842, simple_loss=0.6319, pruned_loss=0.6683, over 12846.00 frames. ], tot_loss[loss=0.9456, simple_loss=0.6134, pruned_loss=0.6389, over 2138568.80 frames.
], batch size: 33, lr: 1.91e-02, grad_scale: 0.25
2024-06-19 18:29:39,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=37737.333333333336, ans=0.125
2024-06-19 18:29:56,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=37774.0, ans=0.002657826086956522
2024-06-19 18:29:56,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.28 vs. limit=10.0
2024-06-19 18:29:57,951 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.380e+03 6.298e+03 8.242e+03 1.107e+04 3.750e+04, threshold=1.648e+04, percent-clipped=9.0
2024-06-19 18:29:59,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=37774.0, ans=0.0
2024-06-19 18:30:03,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=37792.333333333336, ans=0.125
2024-06-19 18:30:07,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37792.333333333336, ans=0.1
2024-06-19 18:30:12,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.04 vs. limit=22.5
2024-06-19 18:30:15,978 INFO [train.py:1028] (0/2) Epoch 3, batch 400, loss[loss=0.9983, simple_loss=0.6315, pruned_loss=0.6825, over 13255.00 frames. ], tot_loss[loss=0.9516, simple_loss=0.6157, pruned_loss=0.6437, over 2239358.33 frames. ], batch size: 63, lr: 1.91e-02, grad_scale: 0.5
2024-06-19 18:30:20,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37829.0, ans=0.1
2024-06-19 18:30:27,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=37847.333333333336, ans=0.0
2024-06-19 18:30:38,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=37884.0, ans=0.2
2024-06-19 18:30:40,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.51 vs. limit=10.0
2024-06-19 18:30:48,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37902.333333333336, ans=0.125
2024-06-19 18:30:53,182 INFO [train.py:1028] (0/2) Epoch 3, batch 450, loss[loss=0.9521, simple_loss=0.609, pruned_loss=0.6475, over 13195.00 frames. ], tot_loss[loss=0.9499, simple_loss=0.6145, pruned_loss=0.6427, over 2313299.43 frames. ], batch size: 67, lr: 1.90e-02, grad_scale: 0.25
2024-06-19 18:30:53,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37920.666666666664, ans=0.125
2024-06-19 18:30:58,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=37920.666666666664, ans=0.125
2024-06-19 18:30:59,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.33 vs.
limit=15.0
2024-06-19 18:31:02,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.49 vs. limit=15.0
2024-06-19 18:31:08,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=37957.333333333336, ans=0.0
2024-06-19 18:31:08,958 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.193e+03 4.137e+03 4.930e+03 5.709e+03 2.950e+04, threshold=9.859e+03, percent-clipped=2.0
2024-06-19 18:31:26,156 INFO [train.py:1028] (0/2) Epoch 3, batch 500, loss[loss=0.9041, simple_loss=0.5949, pruned_loss=0.6066, over 13143.00 frames. ], tot_loss[loss=0.9522, simple_loss=0.6153, pruned_loss=0.6446, over 2375507.67 frames. ], batch size: 121, lr: 1.90e-02, grad_scale: 0.5
2024-06-19 18:31:29,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=12.0
2024-06-19 18:31:33,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=38030.666666666664, ans=0.125
2024-06-19 18:31:33,265 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 18:31:40,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.73 vs. limit=15.0
2024-06-19 18:32:01,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=38104.0, ans=0.5
2024-06-19 18:32:01,601 INFO [train.py:1028] (0/2) Epoch 3, batch 550, loss[loss=0.8704, simple_loss=0.5868, pruned_loss=0.577, over 12946.00 frames. ], tot_loss[loss=0.9503, simple_loss=0.6144, pruned_loss=0.6431, over 2420922.83 frames. ], batch size: 158, lr: 1.90e-02, grad_scale: 0.5
2024-06-19 18:32:01,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=38104.0, ans=0.2
2024-06-19 18:32:04,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=38104.0, ans=0.2
2024-06-19 18:32:10,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38122.333333333336, ans=0.1
2024-06-19 18:32:10,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.26 vs. limit=15.0
2024-06-19 18:32:13,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.70 vs.
limit=10.0
2024-06-19 18:32:17,898 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.783e+03 4.897e+03 5.981e+03 8.007e+03 2.975e+04, threshold=1.196e+04, percent-clipped=16.0
2024-06-19 18:32:23,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=38159.0, ans=0.2
2024-06-19 18:32:25,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38159.0, ans=0.125
2024-06-19 18:32:30,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=38177.333333333336, ans=0.125
2024-06-19 18:32:33,857 INFO [train.py:1028] (0/2) Epoch 3, batch 600, loss[loss=0.8372, simple_loss=0.5622, pruned_loss=0.556, over 13017.00 frames. ], tot_loss[loss=0.9487, simple_loss=0.6138, pruned_loss=0.6418, over 2458622.70 frames. ], batch size: 144, lr: 1.90e-02, grad_scale: 0.5
2024-06-19 18:32:45,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0
2024-06-19 18:32:49,010 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=31.77 vs. limit=15.0
2024-06-19 18:32:56,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38250.666666666664, ans=0.125
2024-06-19 18:32:59,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=38250.666666666664, ans=0.025
2024-06-19 18:33:05,630 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=15.0
2024-06-19 18:33:06,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.22 vs. limit=15.0
2024-06-19 18:33:06,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=38269.0, ans=0.125
2024-06-19 18:33:07,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=38269.0, ans=0.125
2024-06-19 18:33:07,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=34.35 vs. limit=15.0
2024-06-19 18:33:09,655 INFO [train.py:1028] (0/2) Epoch 3, batch 650, loss[loss=0.9235, simple_loss=0.5987, pruned_loss=0.6241, over 13181.00 frames. ], tot_loss[loss=0.9471, simple_loss=0.6135, pruned_loss=0.6403, over 2490018.21 frames.
], batch size: 59, lr: 1.90e-02, grad_scale: 0.5
2024-06-19 18:33:12,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=38287.333333333336, ans=0.05
2024-06-19 18:33:19,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=38305.666666666664, ans=0.2
2024-06-19 18:33:21,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=38305.666666666664, ans=0.025
2024-06-19 18:33:25,887 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.762e+03 5.534e+03 7.156e+03 8.254e+03 5.489e+04, threshold=1.431e+04, percent-clipped=6.0
2024-06-19 18:33:29,016 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.66 vs. limit=22.5
2024-06-19 18:33:29,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=38342.333333333336, ans=0.125
2024-06-19 18:33:32,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.84 vs. limit=22.5
2024-06-19 18:33:40,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.93 vs. limit=15.0
2024-06-19 18:33:41,673 INFO [train.py:1028] (0/2) Epoch 3, batch 700, loss[loss=0.9777, simple_loss=0.6283, pruned_loss=0.6636, over 13233.00 frames. ], tot_loss[loss=0.942, simple_loss=0.6121, pruned_loss=0.636, over 2513145.95 frames. ], batch size: 46, lr: 1.89e-02, grad_scale: 0.5
2024-06-19 18:33:41,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=38379.0, ans=0.125
2024-06-19 18:33:41,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=38379.0, ans=0.125
2024-06-19 18:33:57,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=38415.666666666664, ans=0.2
2024-06-19 18:34:00,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=38434.0, ans=0.0
2024-06-19 18:34:09,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=38452.333333333336, ans=0.0
2024-06-19 18:34:16,170 INFO [train.py:1028] (0/2) Epoch 3, batch 750, loss[loss=1.044, simple_loss=0.6683, pruned_loss=0.7097, over 13280.00 frames. ], tot_loss[loss=0.9456, simple_loss=0.6137, pruned_loss=0.6388, over 2528108.19 frames. ], batch size: 63, lr: 1.89e-02, grad_scale: 0.5
2024-06-19 18:34:17,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=38470.666666666664, ans=0.07
2024-06-19 18:34:19,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=38470.666666666664, ans=0.125
2024-06-19 18:34:20,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.08 vs.
limit=10.0
2024-06-19 18:34:22,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=38489.0, ans=0.002502391304347827
2024-06-19 18:34:23,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38489.0, ans=0.0
2024-06-19 18:34:27,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=38489.0, ans=0.2
2024-06-19 18:34:31,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=38507.333333333336, ans=0.0
2024-06-19 18:34:32,995 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.877e+03 4.119e+03 5.447e+03 6.871e+03 1.880e+04, threshold=1.089e+04, percent-clipped=2.0
2024-06-19 18:34:45,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38544.0, ans=0.1
2024-06-19 18:34:47,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.36 vs. limit=22.5
2024-06-19 18:34:48,706 INFO [train.py:1028] (0/2) Epoch 3, batch 800, loss[loss=0.9479, simple_loss=0.5941, pruned_loss=0.6508, over 12877.00 frames. ], tot_loss[loss=0.9455, simple_loss=0.6127, pruned_loss=0.6392, over 2541912.13 frames. ], batch size: 36, lr: 1.89e-02, grad_scale: 1.0
2024-06-19 18:34:52,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.02 vs. limit=15.0
2024-06-19 18:35:00,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.04 vs. limit=10.0
2024-06-19 18:35:12,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.85 vs. limit=22.5
2024-06-19 18:35:13,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.13 vs. limit=10.0
2024-06-19 18:35:24,409 INFO [train.py:1028] (0/2) Epoch 3, batch 850, loss[loss=0.8828, simple_loss=0.589, pruned_loss=0.5883, over 13191.00 frames. ], tot_loss[loss=0.9457, simple_loss=0.6127, pruned_loss=0.6393, over 2552221.43 frames. ], batch size: 95, lr: 1.89e-02, grad_scale: 0.5
2024-06-19 18:35:35,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=38672.333333333336, ans=0.0
2024-06-19 18:35:37,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0
2024-06-19 18:35:41,676 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+03 3.389e+03 4.557e+03 6.298e+03 1.584e+04, threshold=9.114e+03, percent-clipped=2.0
2024-06-19 18:35:51,531 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.79 vs. limit=22.5
2024-06-19 18:35:56,973 INFO [train.py:1028] (0/2) Epoch 3, batch 900, loss[loss=0.9974, simple_loss=0.6408, pruned_loss=0.677, over 12883.00 frames. ], tot_loss[loss=0.9425, simple_loss=0.6119, pruned_loss=0.6365, over 2556532.44 frames.
], batch size: 36, lr: 1.89e-02, grad_scale: 1.0
2024-06-19 18:35:59,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38745.666666666664, ans=0.1
2024-06-19 18:36:09,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=38764.0, ans=0.2
2024-06-19 18:36:10,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38764.0, ans=0.1
2024-06-19 18:36:17,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=38782.333333333336, ans=0.125
2024-06-19 18:36:33,162 INFO [train.py:1028] (0/2) Epoch 3, batch 950, loss[loss=0.9611, simple_loss=0.6062, pruned_loss=0.658, over 12885.00 frames. ], tot_loss[loss=0.9413, simple_loss=0.611, pruned_loss=0.6358, over 2559964.41 frames. ], batch size: 39, lr: 1.88e-02, grad_scale: 0.25
2024-06-19 18:36:35,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38837.333333333336, ans=0.125
2024-06-19 18:36:42,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38855.666666666664, ans=0.1
2024-06-19 18:36:52,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=12.0
2024-06-19 18:36:52,261 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.494e+03 4.271e+03 5.142e+03 6.220e+03 2.085e+04, threshold=1.028e+04, percent-clipped=9.0
2024-06-19 18:37:03,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.77 vs. limit=22.5
2024-06-19 18:37:08,663 INFO [train.py:1028] (0/2) Epoch 3, batch 1000, loss[loss=1.043, simple_loss=0.6677, pruned_loss=0.7094, over 13367.00 frames. ], tot_loss[loss=0.9372, simple_loss=0.6091, pruned_loss=0.6326, over 2562101.60 frames. ], batch size: 49, lr: 1.88e-02, grad_scale: 0.5
2024-06-19 18:37:17,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38947.333333333336, ans=0.1
2024-06-19 18:37:40,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=39002.333333333336, ans=0.07
2024-06-19 18:37:41,739 INFO [train.py:1028] (0/2) Epoch 3, batch 1050, loss[loss=0.9287, simple_loss=0.6132, pruned_loss=0.6221, over 13192.00 frames. ], tot_loss[loss=0.9411, simple_loss=0.6113, pruned_loss=0.6354, over 2564628.99 frames. ], batch size: 77, lr: 1.88e-02, grad_scale: 0.5
2024-06-19 18:37:49,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.11 vs.
limit=15.0
2024-06-19 18:37:50,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=39039.0, ans=0.04949747468305833
2024-06-19 18:37:57,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39057.333333333336, ans=0.1
2024-06-19 18:37:58,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=39057.333333333336, ans=0.125
2024-06-19 18:38:00,477 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.820e+03 4.352e+03 5.063e+03 6.175e+03 1.714e+04, threshold=1.013e+04, percent-clipped=2.0
2024-06-19 18:38:03,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0
2024-06-19 18:38:10,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=39094.0, ans=0.2
2024-06-19 18:38:12,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=22.59 vs. limit=15.0
2024-06-19 18:38:16,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=39094.0, ans=0.0
2024-06-19 18:38:18,028 INFO [train.py:1028] (0/2) Epoch 3, batch 1100, loss[loss=0.9893, simple_loss=0.6381, pruned_loss=0.6702, over 13321.00 frames. ], tot_loss[loss=0.94, simple_loss=0.6115, pruned_loss=0.6342, over 2569889.21 frames. ], batch size: 52, lr: 1.88e-02, grad_scale: 1.0
2024-06-19 18:38:20,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=39112.333333333336, ans=0.035
2024-06-19 18:38:21,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=39112.333333333336, ans=0.002366884057971013
2024-06-19 18:38:28,726 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.84 vs. limit=10.0
2024-06-19 18:38:33,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=39149.0, ans=0.125
2024-06-19 18:38:33,930 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.65 vs. limit=15.0
2024-06-19 18:38:37,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.24 vs. limit=15.0
2024-06-19 18:38:46,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=39185.666666666664, ans=0.125
2024-06-19 18:38:51,846 INFO [train.py:1028] (0/2) Epoch 3, batch 1150, loss[loss=0.9323, simple_loss=0.604, pruned_loss=0.6303, over 13246.00 frames. ], tot_loss[loss=0.9393, simple_loss=0.612, pruned_loss=0.6333, over 2571061.75 frames. ], batch size: 52, lr: 1.88e-02, grad_scale: 0.25
2024-06-19 18:39:00,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.84 vs.
limit=22.5
2024-06-19 18:39:09,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.79 vs. limit=15.0
2024-06-19 18:39:15,068 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.781e+03 5.650e+03 7.309e+03 9.130e+03 2.030e+04, threshold=1.462e+04, percent-clipped=14.0
2024-06-19 18:39:18,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=39259.0, ans=0.2
2024-06-19 18:39:21,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.24 vs. limit=10.0
2024-06-19 18:39:24,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.35 vs. limit=15.0
2024-06-19 18:39:27,155 INFO [train.py:1028] (0/2) Epoch 3, batch 1200, loss[loss=0.947, simple_loss=0.6241, pruned_loss=0.635, over 13134.00 frames. ], tot_loss[loss=0.9356, simple_loss=0.6111, pruned_loss=0.63, over 2572932.21 frames. ], batch size: 77, lr: 1.87e-02, grad_scale: 0.5
2024-06-19 18:39:36,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.54 vs. limit=15.0
2024-06-19 18:39:42,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=12.73 vs. limit=12.0
2024-06-19 18:39:54,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=39369.0, ans=0.125
2024-06-19 18:39:59,768 INFO [train.py:1028] (0/2) Epoch 3, batch 1250, loss[loss=0.8425, simple_loss=0.5619, pruned_loss=0.5615, over 13123.00 frames. ], tot_loss[loss=0.9346, simple_loss=0.611, pruned_loss=0.6291, over 2582456.10 frames. ], batch size: 112, lr: 1.87e-02, grad_scale: 0.5
2024-06-19 18:40:02,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=39387.333333333336, ans=0.125
2024-06-19 18:40:20,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0
2024-06-19 18:40:23,631 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.725e+03 5.399e+03 6.509e+03 7.810e+03 3.117e+04, threshold=1.302e+04, percent-clipped=4.0
2024-06-19 18:40:28,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.79 vs. limit=15.0
2024-06-19 18:40:31,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=39460.666666666664, ans=0.125
2024-06-19 18:40:35,152 INFO [train.py:1028] (0/2) Epoch 3, batch 1300, loss[loss=0.8695, simple_loss=0.5864, pruned_loss=0.5763, over 12782.00 frames. ], tot_loss[loss=0.9347, simple_loss=0.6111, pruned_loss=0.6292, over 2582476.36 frames. ], batch size: 176, lr: 1.87e-02, grad_scale: 0.5
2024-06-19 18:40:38,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.71 vs.
limit=15.0
2024-06-19 18:40:46,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0
2024-06-19 18:40:47,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39515.666666666664, ans=0.1
2024-06-19 18:40:51,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=39515.666666666664, ans=0.0
2024-06-19 18:40:52,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=39515.666666666664, ans=0.125
2024-06-19 18:40:53,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.97 vs. limit=6.0
2024-06-19 18:41:00,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=39534.0, ans=0.125
2024-06-19 18:41:00,334 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.07 vs. limit=15.0
2024-06-19 18:41:10,749 INFO [train.py:1028] (0/2) Epoch 3, batch 1350, loss[loss=0.9901, simple_loss=0.6509, pruned_loss=0.6647, over 13255.00 frames. ], tot_loss[loss=0.9344, simple_loss=0.6115, pruned_loss=0.6287, over 2585594.66 frames. ], batch size: 59, lr: 1.87e-02, grad_scale: 0.25
2024-06-19 18:41:14,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.09 vs. limit=10.0
2024-06-19 18:41:19,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=39589.0, ans=0.125
2024-06-19 18:41:20,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.95 vs. limit=15.0
2024-06-19 18:41:26,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.67 vs. limit=22.5
2024-06-19 18:41:32,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=39625.666666666664, ans=0.125
2024-06-19 18:41:33,368 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.273e+03 4.394e+03 5.413e+03 6.474e+03 2.657e+04, threshold=1.083e+04, percent-clipped=4.0
2024-06-19 18:41:36,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.14 vs. limit=22.5
2024-06-19 18:41:39,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=39644.0, ans=0.07
2024-06-19 18:41:44,835 INFO [train.py:1028] (0/2) Epoch 3, batch 1400, loss[loss=1.06, simple_loss=0.6795, pruned_loss=0.7198, over 12343.00 frames. ], tot_loss[loss=0.9325, simple_loss=0.6111, pruned_loss=0.6269, over 2587409.70 frames. ], batch size: 25, lr: 1.87e-02, grad_scale: 0.5
2024-06-19 18:41:49,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.65 vs.
limit=15.0
2024-06-19 18:41:52,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=39680.666666666664, ans=0.0022433333333333333
2024-06-19 18:41:57,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.15 vs. limit=15.0
2024-06-19 18:41:57,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=39699.0, ans=0.07
2024-06-19 18:41:58,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=39699.0, ans=0.0022393478260869567
2024-06-19 18:41:59,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=39699.0, ans=0.2
2024-06-19 18:41:59,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.72 vs. limit=22.5
2024-06-19 18:42:20,314 INFO [train.py:1028] (0/2) Epoch 3, batch 1450, loss[loss=0.8786, simple_loss=0.5889, pruned_loss=0.5842, over 13106.00 frames. ], tot_loss[loss=0.9297, simple_loss=0.6105, pruned_loss=0.6244, over 2586717.29 frames. ], batch size: 121, lr: 1.86e-02, grad_scale: 0.5
2024-06-19 18:42:24,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.87 vs. limit=15.0
2024-06-19 18:42:28,469 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.98 vs. limit=15.0
2024-06-19 18:42:37,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=12.0
2024-06-19 18:42:41,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.55 vs. limit=15.0
2024-06-19 18:42:42,113 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.014e+03 3.850e+03 4.974e+03 5.928e+03 1.537e+04, threshold=9.949e+03, percent-clipped=3.0
2024-06-19 18:42:42,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=39809.0, ans=0.125
2024-06-19 18:42:43,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. limit=6.0
2024-06-19 18:42:46,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=20.51 vs. limit=15.0
2024-06-19 18:42:53,283 INFO [train.py:1028] (0/2) Epoch 3, batch 1500, loss[loss=0.8624, simple_loss=0.5767, pruned_loss=0.574, over 13153.00 frames. ], tot_loss[loss=0.9257, simple_loss=0.6091, pruned_loss=0.6212, over 2588963.70 frames. ], batch size: 83, lr: 1.86e-02, grad_scale: 1.0
2024-06-19 18:42:56,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.76 vs.
limit=15.0
2024-06-19 18:42:58,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=39845.666666666664, ans=0.0
2024-06-19 18:42:58,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39845.666666666664, ans=0.1
2024-06-19 18:43:01,123 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.63 vs. limit=10.0
2024-06-19 18:43:06,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.38 vs. limit=15.0
2024-06-19 18:43:15,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=15.0
2024-06-19 18:43:17,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=39900.666666666664, ans=0.125
2024-06-19 18:43:20,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.98 vs. limit=15.0
2024-06-19 18:43:22,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.54 vs. limit=15.0
2024-06-19 18:43:26,312 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.462e+03
2024-06-19 18:43:28,805 INFO [train.py:1028] (0/2) Epoch 3, batch 1550, loss[loss=0.8414, simple_loss=0.5664, pruned_loss=0.5582, over 12987.00 frames. ], tot_loss[loss=0.925, simple_loss=0.6091, pruned_loss=0.6204, over 2583997.98 frames. ], batch size: 102, lr: 1.86e-02, grad_scale: 0.25
2024-06-19 18:43:29,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.55 vs. limit=15.0
2024-06-19 18:43:34,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=39937.333333333336, ans=0.125
2024-06-19 18:43:40,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=12.0
2024-06-19 18:43:52,010 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.633e+03 5.266e+03 5.930e+03 7.301e+03 2.204e+04, threshold=1.186e+04, percent-clipped=10.0
2024-06-19 18:43:53,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.51 vs. limit=22.5
2024-06-19 18:44:01,576 INFO [train.py:1028] (0/2) Epoch 3, batch 1600, loss[loss=0.9754, simple_loss=0.6512, pruned_loss=0.6498, over 13092.00 frames. ], tot_loss[loss=0.9228, simple_loss=0.6088, pruned_loss=0.6184, over 2580222.87 frames.
], batch size: 77, lr: 1.86e-02, grad_scale: 0.5
2024-06-19 18:44:06,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=40029.0, ans=0.2
2024-06-19 18:44:11,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=40047.333333333336, ans=0.125
2024-06-19 18:44:12,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=40047.333333333336, ans=0.0
2024-06-19 18:44:19,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.81 vs. limit=15.0
2024-06-19 18:44:23,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=40084.0, ans=0.0
2024-06-19 18:44:28,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=40084.0, ans=0.125
2024-06-19 18:44:31,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40102.333333333336, ans=0.1
2024-06-19 18:44:36,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=40120.666666666664, ans=0.0021476811594202897
2024-06-19 18:44:36,560 INFO [train.py:1028] (0/2) Epoch 3, batch 1650, loss[loss=0.9088, simple_loss=0.6113, pruned_loss=0.6031, over 13161.00 frames. ], tot_loss[loss=0.9203, simple_loss=0.6084, pruned_loss=0.6161, over 2575748.29 frames. ], batch size: 95, lr: 1.86e-02, grad_scale: 0.5
2024-06-19 18:44:38,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40120.666666666664, ans=0.1
2024-06-19 18:44:39,731 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.32 vs. limit=15.0
2024-06-19 18:44:44,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40139.0, ans=0.125
2024-06-19 18:45:02,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.75 vs. limit=15.0
2024-06-19 18:45:03,781 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.181e+03 5.290e+03 6.209e+03 7.609e+03 2.355e+04, threshold=1.242e+04, percent-clipped=4.0
2024-06-19 18:45:07,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=40194.0, ans=0.125
2024-06-19 18:45:07,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.93 vs. limit=15.0
2024-06-19 18:45:10,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40194.0, ans=0.125
2024-06-19 18:45:13,636 INFO [train.py:1028] (0/2) Epoch 3, batch 1700, loss[loss=0.9533, simple_loss=0.6233, pruned_loss=0.6416, over 12492.00 frames. ], tot_loss[loss=0.9217, simple_loss=0.61, pruned_loss=0.6167, over 2580374.91 frames.
], batch size: 25, lr: 1.86e-02, grad_scale: 1.0
2024-06-19 18:45:16,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=40212.333333333336, ans=0.125
2024-06-19 18:45:18,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=40212.333333333336, ans=0.0
2024-06-19 18:45:27,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.01 vs. limit=22.5
2024-06-19 18:45:31,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=40249.0, ans=0.125
2024-06-19 18:45:39,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40285.666666666664, ans=0.125
2024-06-19 18:45:43,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=40285.666666666664, ans=0.2
2024-06-19 18:45:43,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=40285.666666666664, ans=22.5
2024-06-19 18:45:46,238 INFO [train.py:1028] (0/2) Epoch 3, batch 1750, loss[loss=0.9636, simple_loss=0.6325, pruned_loss=0.6473, over 12561.00 frames. ], tot_loss[loss=0.9197, simple_loss=0.6096, pruned_loss=0.6149, over 2581651.24 frames. ], batch size: 22, lr: 1.85e-02, grad_scale: 0.25
2024-06-19 18:45:50,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.63 vs. limit=22.5
2024-06-19 18:45:51,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.62 vs. limit=22.5
2024-06-19 18:46:01,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=40340.666666666664, ans=0.0
2024-06-19 18:46:05,576 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 18:46:13,343 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.995e+03 5.130e+03 6.218e+03 7.636e+03 3.397e+04, threshold=1.244e+04, percent-clipped=8.0
2024-06-19 18:46:21,805 INFO [train.py:1028] (0/2) Epoch 3, batch 1800, loss[loss=0.8908, simple_loss=0.5954, pruned_loss=0.5931, over 13194.00 frames. ], tot_loss[loss=0.918, simple_loss=0.6099, pruned_loss=0.6131, over 2583107.85 frames. ], batch size: 67, lr: 1.85e-02, grad_scale: 0.5
2024-06-19 18:46:29,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.11 vs. limit=15.0
2024-06-19 18:46:32,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.13 vs. limit=15.0
2024-06-19 18:46:34,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=40414.0, ans=0.0020839130434782607
2024-06-19 18:46:39,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.71 vs.
2024-06-19 18:46:48,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=40469.0, ans=0.125
2024-06-19 18:46:53,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=40469.0, ans=0.002071956521739131
2024-06-19 18:46:55,109 INFO [train.py:1028] (0/2) Epoch 3, batch 1850, loss[loss=0.8929, simple_loss=0.5977, pruned_loss=0.5941, over 13239.00 frames. ], tot_loss[loss=0.9139, simple_loss=0.6086, pruned_loss=0.6096, over 2584751.45 frames. ], batch size: 83, lr: 1.85e-02, grad_scale: 0.5
2024-06-19 18:46:56,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.44 vs. limit=15.0
2024-06-19 18:47:02,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=40487.333333333336, ans=0.2
2024-06-19 18:47:02,527 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.53 vs. limit=10.0
2024-06-19 18:47:05,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=40505.666666666664, ans=0.0
2024-06-19 18:47:06,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.43 vs. limit=6.0
2024-06-19 18:47:11,887 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.66 vs. limit=15.0
2024-06-19 18:47:22,543 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.019e+03 4.777e+03 5.751e+03 7.040e+03 2.086e+04, threshold=1.150e+04, percent-clipped=4.0
2024-06-19 18:47:30,560 INFO [train.py:1028] (0/2) Epoch 3, batch 1900, loss[loss=0.8837, simple_loss=0.5948, pruned_loss=0.5863, over 13129.00 frames. ], tot_loss[loss=0.9093, simple_loss=0.6068, pruned_loss=0.6059, over 2586773.87 frames. ], batch size: 95, lr: 1.85e-02, grad_scale: 0.5
2024-06-19 18:47:31,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=15.0
2024-06-19 18:47:35,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.23 vs. limit=15.0
2024-06-19 18:47:50,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.85 vs. limit=12.0
2024-06-19 18:47:55,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=40634.0, ans=0.125
2024-06-19 18:47:59,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.67 vs. limit=10.0
2024-06-19 18:48:00,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0
2024-06-19 18:48:03,233 INFO [train.py:1028] (0/2) Epoch 3, batch 1950, loss[loss=0.8792, simple_loss=0.5903, pruned_loss=0.5841, over 13243.00 frames. ], tot_loss[loss=0.9038, simple_loss=0.6047, pruned_loss=0.6014, over 2592370.69 frames. ], batch size: 52, lr: 1.85e-02, grad_scale: 0.5
2024-06-19 18:48:25,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=40725.666666666664, ans=0.002016159420289855
2024-06-19 18:48:31,450 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.366e+03 5.194e+03 5.876e+03 7.303e+03 3.239e+04, threshold=1.175e+04, percent-clipped=6.0
2024-06-19 18:48:33,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=12.0
2024-06-19 18:48:35,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.26 vs. limit=15.0
2024-06-19 18:48:39,686 INFO [train.py:1028] (0/2) Epoch 3, batch 2000, loss[loss=0.9597, simple_loss=0.6468, pruned_loss=0.6363, over 12508.00 frames. ], tot_loss[loss=0.9007, simple_loss=0.6044, pruned_loss=0.5985, over 2588349.30 frames. ], batch size: 22, lr: 1.84e-02, grad_scale: 1.0
2024-06-19 18:48:49,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=40780.666666666664, ans=0.2
2024-06-19 18:48:51,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=12.0
2024-06-19 18:48:52,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40799.0, ans=0.125
2024-06-19 18:48:58,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=40799.0, ans=0.2
2024-06-19 18:49:04,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.78 vs. limit=12.0
2024-06-19 18:49:09,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=40835.666666666664, ans=0.025
2024-06-19 18:49:11,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=40835.666666666664, ans=0.04949747468305833
2024-06-19 18:49:15,427 INFO [train.py:1028] (0/2) Epoch 3, batch 2050, loss[loss=0.9686, simple_loss=0.6522, pruned_loss=0.6425, over 12575.00 frames. ], tot_loss[loss=0.8987, simple_loss=0.6044, pruned_loss=0.5965, over 2583037.88 frames. ], batch size: 29, lr: 1.84e-02, grad_scale: 0.5
2024-06-19 18:49:16,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.55 vs. limit=15.0
2024-06-19 18:49:18,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40854.0, ans=0.1
2024-06-19 18:49:23,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=40872.333333333336, ans=0.2
2024-06-19 18:49:35,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=40909.0, ans=0.125
2024-06-19 18:49:40,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40927.333333333336, ans=0.125
2024-06-19 18:49:41,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.988e+03 4.935e+03 6.106e+03 7.334e+03 2.024e+04, threshold=1.221e+04, percent-clipped=5.0
2024-06-19 18:49:46,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=40927.333333333336, ans=0.125
2024-06-19 18:49:48,004 INFO [train.py:1028] (0/2) Epoch 3, batch 2100, loss[loss=0.8783, simple_loss=0.5911, pruned_loss=0.5827, over 13218.00 frames. ], tot_loss[loss=0.901, simple_loss=0.6062, pruned_loss=0.598, over 2585421.48 frames. ], batch size: 59, lr: 1.84e-02, grad_scale: 0.5
2024-06-19 18:49:55,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=20.14 vs. limit=15.0
2024-06-19 18:49:58,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=40964.0, ans=0.0
2024-06-19 18:50:12,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41000.666666666664, ans=0.125
2024-06-19 18:50:17,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.80 vs. limit=15.0
2024-06-19 18:50:19,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=41019.0, ans=0.035
2024-06-19 18:50:23,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=41037.333333333336, ans=0.0
2024-06-19 18:50:23,668 INFO [train.py:1028] (0/2) Epoch 3, batch 2150, loss[loss=0.9522, simple_loss=0.6391, pruned_loss=0.6327, over 13214.00 frames. ], tot_loss[loss=0.902, simple_loss=0.6069, pruned_loss=0.5985, over 2588701.67 frames. ], batch size: 52, lr: 1.84e-02, grad_scale: 0.5
2024-06-19 18:50:38,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41074.0, ans=0.1
2024-06-19 18:50:42,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.63 vs. limit=22.5
2024-06-19 18:50:50,462 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.648e+03 3.926e+03 4.705e+03 5.559e+03 2.632e+04, threshold=9.410e+03, percent-clipped=1.0
2024-06-19 18:50:56,988 INFO [train.py:1028] (0/2) Epoch 3, batch 2200, loss[loss=0.9074, simple_loss=0.617, pruned_loss=0.5989, over 13158.00 frames. ], tot_loss[loss=0.8996, simple_loss=0.6065, pruned_loss=0.5963, over 2588011.80 frames. ], batch size: 83, lr: 1.84e-02, grad_scale: 1.0
2024-06-19 18:51:06,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=41147.333333333336, ans=0.125
2024-06-19 18:51:06,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=41147.333333333336, ans=0.125
2024-06-19 18:51:12,334 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.05 vs. limit=22.5
2024-06-19 18:51:17,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.49 vs. limit=22.5
2024-06-19 18:51:26,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=41202.333333333336, ans=0.07
2024-06-19 18:51:32,299 INFO [train.py:1028] (0/2) Epoch 3, batch 2250, loss[loss=0.9183, simple_loss=0.6229, pruned_loss=0.6068, over 13256.00 frames. ], tot_loss[loss=0.8971, simple_loss=0.6059, pruned_loss=0.5941, over 2586913.50 frames. ], batch size: 63, lr: 1.83e-02, grad_scale: 0.5
2024-06-19 18:51:37,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41220.666666666664, ans=0.1
2024-06-19 18:51:37,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0
2024-06-19 18:51:52,254 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=8.972e-02
2024-06-19 18:51:54,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=41275.666666666664, ans=0.2
2024-06-19 18:51:56,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=41275.666666666664, ans=0.0
2024-06-19 18:51:57,101 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=15.0
2024-06-19 18:51:57,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=15.0
2024-06-19 18:51:59,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=41294.0, ans=0.0
2024-06-19 18:51:59,445 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.665e+03 4.281e+03 4.893e+03 5.711e+03 1.693e+04, threshold=9.786e+03, percent-clipped=6.0
2024-06-19 18:52:05,674 INFO [train.py:1028] (0/2) Epoch 3, batch 2300, loss[loss=0.9195, simple_loss=0.6243, pruned_loss=0.6074, over 12936.00 frames. ], tot_loss[loss=0.8992, simple_loss=0.6077, pruned_loss=0.5953, over 2581451.64 frames. ], batch size: 33, lr: 1.83e-02, grad_scale: 1.0
2024-06-19 18:52:10,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.09 vs. limit=10.0
2024-06-19 18:52:12,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.93 vs. limit=15.0
2024-06-19 18:52:16,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.03 vs. limit=15.0
2024-06-19 18:52:20,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=41330.666666666664, ans=0.0
2024-06-19 18:52:25,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=41349.0, ans=0.125
2024-06-19 18:52:32,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=41367.333333333336, ans=0.0
2024-06-19 18:52:33,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=41367.333333333336, ans=0.001876666666666667
2024-06-19 18:52:33,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=12.0
2024-06-19 18:52:41,866 INFO [train.py:1028] (0/2) Epoch 3, batch 2350, loss[loss=0.906, simple_loss=0.621, pruned_loss=0.5954, over 13281.00 frames. ], tot_loss[loss=0.8962, simple_loss=0.6073, pruned_loss=0.5926, over 2585385.17 frames. ], batch size: 67, lr: 1.83e-02, grad_scale: 0.5
2024-06-19 18:52:45,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.03 vs. limit=10.0
2024-06-19 18:52:46,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.82 vs. limit=15.0
2024-06-19 18:52:50,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.65 vs. limit=15.0
2024-06-19 18:53:01,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=41440.666666666664, ans=0.0
2024-06-19 18:53:02,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0
2024-06-19 18:53:11,718 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=5.683e-02
2024-06-19 18:53:13,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=41477.333333333336, ans=0.125
2024-06-19 18:53:13,513 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.769e+03 4.073e+03 4.804e+03 5.808e+03 2.098e+04, threshold=9.609e+03, percent-clipped=4.0
2024-06-19 18:53:18,288 INFO [train.py:1028] (0/2) Epoch 3, batch 2400, loss[loss=0.8628, simple_loss=0.5908, pruned_loss=0.5674, over 13295.00 frames. ], tot_loss[loss=0.8894, simple_loss=0.6041, pruned_loss=0.5874, over 2587878.92 frames. ], batch size: 46, lr: 1.83e-02, grad_scale: 0.5
2024-06-19 18:53:27,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.68 vs. limit=10.0
2024-06-19 18:53:30,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0
2024-06-19 18:53:39,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=41550.666666666664, ans=0.0
2024-06-19 18:53:45,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=41569.0, ans=0.0
2024-06-19 18:53:49,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=41569.0, ans=0.2
2024-06-19 18:53:50,849 INFO [train.py:1028] (0/2) Epoch 3, batch 2450, loss[loss=0.8165, simple_loss=0.5673, pruned_loss=0.5328, over 13277.00 frames. ], tot_loss[loss=0.8821, simple_loss=0.601, pruned_loss=0.5816, over 2584533.26 frames. ], batch size: 63, lr: 1.83e-02, grad_scale: 0.5
2024-06-19 18:53:53,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41587.333333333336, ans=0.1
2024-06-19 18:53:54,435 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.50 vs. limit=15.0
2024-06-19 18:54:02,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=41605.666666666664, ans=0.1
2024-06-19 18:54:05,457 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.037e-03
2024-06-19 18:54:09,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=41624.0, ans=0.125
2024-06-19 18:54:21,866 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.163e+03 3.963e+03 4.743e+03 5.334e+03 1.602e+04, threshold=9.487e+03, percent-clipped=4.0
2024-06-19 18:54:26,862 INFO [train.py:1028] (0/2) Epoch 3, batch 2500, loss[loss=0.8415, simple_loss=0.5854, pruned_loss=0.5488, over 13249.00 frames. ], tot_loss[loss=0.8735, simple_loss=0.5974, pruned_loss=0.5748, over 2589062.91 frames. ], batch size: 83, lr: 1.83e-02, grad_scale: 1.0
2024-06-19 18:54:27,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.05 vs. limit=10.0
2024-06-19 18:54:31,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41679.0, ans=0.125
2024-06-19 18:54:31,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=41679.0, ans=0.035
2024-06-19 18:54:32,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=41679.0, ans=0.125
2024-06-19 18:54:33,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.01 vs. limit=22.5
2024-06-19 18:54:33,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.36 vs. limit=12.0
2024-06-19 18:54:35,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41697.333333333336, ans=0.125
2024-06-19 18:54:45,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=41715.666666666664, ans=15.0
2024-06-19 18:55:02,862 INFO [train.py:1028] (0/2) Epoch 3, batch 2550, loss[loss=0.8798, simple_loss=0.6167, pruned_loss=0.5714, over 12624.00 frames. ], tot_loss[loss=0.865, simple_loss=0.5939, pruned_loss=0.568, over 2587403.72 frames. ], batch size: 22, lr: 1.82e-02, grad_scale: 1.0
2024-06-19 18:55:07,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=41770.666666666664, ans=0.0
2024-06-19 18:55:08,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0
2024-06-19 18:55:11,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.90 vs. limit=6.0
2024-06-19 18:55:13,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.56 vs. limit=22.5
2024-06-19 18:55:19,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.95 vs. limit=15.0
2024-06-19 18:55:32,222 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.899e+03 4.611e+03 5.346e+03 6.458e+03 2.438e+04, threshold=1.069e+04, percent-clipped=5.0
2024-06-19 18:55:33,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.36 vs. limit=6.0
2024-06-19 18:55:35,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=41862.333333333336, ans=0.125
2024-06-19 18:55:35,445 INFO [train.py:1028] (0/2) Epoch 3, batch 2600, loss[loss=0.8236, simple_loss=0.5771, pruned_loss=0.535, over 13305.00 frames. ], tot_loss[loss=0.8568, simple_loss=0.5905, pruned_loss=0.5616, over 2587745.25 frames. ], batch size: 52, lr: 1.82e-02, grad_scale: 0.5
2024-06-19 18:55:43,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.01 vs. limit=22.5
2024-06-19 18:55:45,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.37 vs. limit=15.0
2024-06-19 18:55:45,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=41880.666666666664, ans=0.0
2024-06-19 18:56:06,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.17 vs. limit=22.5
2024-06-19 18:56:11,525 INFO [train.py:1028] (0/2) Epoch 3, batch 2650, loss[loss=0.7596, simple_loss=0.5363, pruned_loss=0.4914, over 13077.00 frames. ], tot_loss[loss=0.8474, simple_loss=0.5861, pruned_loss=0.5544, over 2588105.43 frames. ], batch size: 144, lr: 1.82e-02, grad_scale: 0.25
2024-06-19 18:56:12,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.21 vs. limit=15.0
2024-06-19 18:56:18,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=41972.333333333336, ans=0.025
2024-06-19 18:56:23,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0
2024-06-19 18:56:28,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.83 vs. limit=15.0
2024-06-19 18:56:30,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.28 vs. limit=22.5
2024-06-19 18:56:32,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=42009.0, ans=0.0017371739130434775
2024-06-19 18:56:32,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42009.0, ans=0.1
2024-06-19 18:56:41,820 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+03 4.192e+03 4.700e+03 5.564e+03 2.083e+04, threshold=9.401e+03, percent-clipped=2.0
2024-06-19 18:56:44,347 INFO [train.py:1028] (0/2) Epoch 3, batch 2700, loss[loss=0.7571, simple_loss=0.5333, pruned_loss=0.4904, over 13216.00 frames. ], tot_loss[loss=0.8379, simple_loss=0.5813, pruned_loss=0.5472, over 2586622.39 frames. ], batch size: 89, lr: 1.82e-02, grad_scale: 0.5
2024-06-19 18:56:52,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=42045.666666666664, ans=0.0
2024-06-19 18:56:55,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=42064.0, ans=0.125
2024-06-19 18:56:57,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.64 vs. limit=15.0
2024-06-19 18:57:01,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=12.68 vs. limit=12.0
2024-06-19 18:57:02,141 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=1.875e-02
2024-06-19 18:57:04,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0
2024-06-19 18:57:20,224 INFO [train.py:1028] (0/2) Epoch 3, batch 2750, loss[loss=0.8441, simple_loss=0.5959, pruned_loss=0.5461, over 13256.00 frames. ], tot_loss[loss=0.8298, simple_loss=0.5783, pruned_loss=0.5407, over 2581790.63 frames. ], batch size: 43, lr: 1.82e-02, grad_scale: 0.5
2024-06-19 18:57:32,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.61 vs. limit=10.0
2024-06-19 18:57:44,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=42192.333333333336, ans=0.125
2024-06-19 18:57:48,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=42210.666666666664, ans=0.125
2024-06-19 18:57:49,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.27 vs. limit=15.0
2024-06-19 18:57:54,473 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.994e+03 4.317e+03 4.963e+03 6.140e+03 1.989e+04, threshold=9.927e+03, percent-clipped=5.0
2024-06-19 18:57:55,543 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=24.04 vs. limit=15.0
2024-06-19 18:57:56,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=42229.0, ans=0.125
2024-06-19 18:57:57,143 INFO [train.py:1028] (0/2) Epoch 3, batch 2800, loss[loss=0.754, simple_loss=0.5375, pruned_loss=0.4853, over 10805.00 frames. ], tot_loss[loss=0.8214, simple_loss=0.5747, pruned_loss=0.5341, over 2578954.84 frames. ], batch size: 304, lr: 1.81e-02, grad_scale: 1.0
2024-06-19 18:58:05,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=42247.333333333336, ans=0.00168536231884058
2024-06-19 18:58:10,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=42265.666666666664, ans=0.0016813768115942034
2024-06-19 18:58:11,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=42265.666666666664, ans=0.125
2024-06-19 18:58:14,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42265.666666666664, ans=0.125
2024-06-19 18:58:16,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=42284.0, ans=0.0
2024-06-19 18:58:17,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0
2024-06-19 18:58:24,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=42302.333333333336, ans=0.125
2024-06-19 18:58:25,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0
2024-06-19 18:58:28,306 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.47 vs. limit=15.0
2024-06-19 18:58:29,877 INFO [train.py:1028] (0/2) Epoch 3, batch 2850, loss[loss=0.843, simple_loss=0.598, pruned_loss=0.5439, over 13287.00 frames. ], tot_loss[loss=0.8138, simple_loss=0.5714, pruned_loss=0.5281, over 2576999.65 frames. ], batch size: 49, lr: 1.81e-02, grad_scale: 0.5
2024-06-19 18:58:30,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=42320.666666666664, ans=0.125
2024-06-19 18:58:38,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=42339.0, ans=0.001665434782608697
2024-06-19 18:59:00,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=42394.0, ans=0.1
2024-06-19 18:59:01,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0
2024-06-19 18:59:02,666 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.537e+03 4.093e+03 4.703e+03 5.436e+03 1.850e+04, threshold=9.407e+03, percent-clipped=4.0
2024-06-19 18:59:04,698 INFO [train.py:1028] (0/2) Epoch 3, batch 2900, loss[loss=0.7807, simple_loss=0.5658, pruned_loss=0.4978, over 13193.00 frames. ], tot_loss[loss=0.801, simple_loss=0.5649, pruned_loss=0.5186, over 2585647.02 frames. ], batch size: 55, lr: 1.81e-02, grad_scale: 1.0
2024-06-19 18:59:07,247 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=27.32 vs. limit=15.0
2024-06-19 18:59:09,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=42412.333333333336, ans=0.125
2024-06-19 18:59:19,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=42449.0, ans=0.07
2024-06-19 18:59:22,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=42449.0, ans=0.125
2024-06-19 18:59:33,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=42485.666666666664, ans=0.0
2024-06-19 18:59:36,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0
2024-06-19 18:59:36,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=42485.666666666664, ans=0.125
2024-06-19 18:59:37,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=42504.0, ans=0.125
2024-06-19 18:59:38,182 INFO [train.py:1028] (0/2) Epoch 3, batch 2950, loss[loss=0.8326, simple_loss=0.6008, pruned_loss=0.5322, over 13215.00 frames. ], tot_loss[loss=0.7966, simple_loss=0.5639, pruned_loss=0.5147, over 2579189.29 frames. ], batch size: 43, lr: 1.81e-02, grad_scale: 0.5
2024-06-19 18:59:49,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.62 vs. limit=15.0
2024-06-19 18:59:55,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=42540.666666666664, ans=0.125
2024-06-19 18:59:57,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.15 vs. limit=15.0
2024-06-19 19:00:04,293 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.50 vs. limit=22.5
2024-06-19 19:00:14,842 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.566e+03 4.148e+03 4.866e+03 5.618e+03 2.094e+04, threshold=9.733e+03, percent-clipped=6.0
2024-06-19 19:00:16,313 INFO [train.py:1028] (0/2) Epoch 3, batch 3000, loss[loss=0.7841, simple_loss=0.5673, pruned_loss=0.5004, over 13177.00 frames. ], tot_loss[loss=0.788, simple_loss=0.5599, pruned_loss=0.5081, over 2578388.19 frames. ], batch size: 59, lr: 1.81e-02, grad_scale: 1.0
2024-06-19 19:00:16,315 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-19 19:00:24,401 INFO [train.py:1060] (0/2) Epoch 3, validation: loss=0.7243, simple_loss=0.5476, pruned_loss=0.4504, over 351949.00 frames.
2024-06-19 19:00:24,401 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB
2024-06-19 19:00:24,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=15.0
2024-06-19 19:00:26,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0
2024-06-19 19:00:27,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42595.666666666664, ans=0.1
2024-06-19 19:00:30,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=42614.0, ans=0.04949747468305833
2024-06-19 19:00:33,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=42614.0, ans=0.2
2024-06-19 19:00:46,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42632.333333333336, ans=0.1
2024-06-19 19:01:03,458 INFO [train.py:1028] (0/2) Epoch 3, batch 3050, loss[loss=0.7798, simple_loss=0.5702, pruned_loss=0.4947, over 13233.00 frames. ], tot_loss[loss=0.7771, simple_loss=0.5549, pruned_loss=0.4996, over 2577543.90 frames. ], batch size: 46, lr: 1.81e-02, grad_scale: 0.5
2024-06-19 19:01:09,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=42705.666666666664, ans=0.0
2024-06-19 19:01:12,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=42705.666666666664, ans=0.125
2024-06-19 19:01:26,330 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=1.240e-01
2024-06-19 19:01:27,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=42742.333333333336, ans=0.125
2024-06-19 19:01:27,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=42742.333333333336, ans=0.0
2024-06-19 19:01:28,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=42742.333333333336, ans=0.2
2024-06-19 19:01:34,314 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 19:01:35,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.13 vs. limit=15.0
2024-06-19 19:01:35,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.49 vs. limit=22.5
2024-06-19 19:01:36,075 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.412e+03 4.195e+03 4.873e+03 5.548e+03 1.971e+04, threshold=9.747e+03, percent-clipped=5.0
2024-06-19 19:01:36,722 INFO [train.py:1028] (0/2) Epoch 3, batch 3100, loss[loss=0.7045, simple_loss=0.5116, pruned_loss=0.4487, over 13053.00 frames. ], tot_loss[loss=0.7703, simple_loss=0.5517, pruned_loss=0.4945, over 2578428.02 frames. ], batch size: 144, lr: 1.80e-02, grad_scale: 1.0
2024-06-19 19:01:39,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0
2024-06-19 19:01:40,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=42779.0, ans=0.0
2024-06-19 19:01:44,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=42797.333333333336, ans=0.001565797101449275
2024-06-19 19:01:46,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=42797.333333333336, ans=0.025
2024-06-19 19:02:06,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=42852.333333333336, ans=0.125
2024-06-19 19:02:12,847 INFO [train.py:1028] (0/2) Epoch 3, batch 3150, loss[loss=0.7122, simple_loss=0.5146, pruned_loss=0.4549, over 12936.00 frames. ], tot_loss[loss=0.7614, simple_loss=0.5471, pruned_loss=0.4879, over 2580534.79 frames. ], batch size: 158, lr: 1.80e-02, grad_scale: 0.5
2024-06-19 19:02:22,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=42889.0, ans=0.125
2024-06-19 19:02:27,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=42907.333333333336, ans=0.125
2024-06-19 19:02:30,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=8.92 vs. limit=8.0
2024-06-19 19:02:41,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=42925.666666666664, ans=0.0
2024-06-19 19:02:48,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.57 vs. limit=22.5
2024-06-19 19:02:49,736 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.793e+03 4.752e+03 5.502e+03 6.391e+03 3.022e+04, threshold=1.100e+04, percent-clipped=7.0
2024-06-19 19:02:49,768 INFO [train.py:1028] (0/2) Epoch 3, batch 3200, loss[loss=0.7279, simple_loss=0.535, pruned_loss=0.4604, over 13175.00 frames. ], tot_loss[loss=0.7548, simple_loss=0.544, pruned_loss=0.4828, over 2581735.03 frames. ], batch size: 55, lr: 1.80e-02, grad_scale: 1.0
2024-06-19 19:02:49,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=42962.333333333336, ans=0.125
2024-06-19 19:02:54,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.91 vs. limit=10.0
2024-06-19 19:02:57,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=42980.666666666664, ans=0.05
2024-06-19 19:03:01,870 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.23 vs. limit=15.0
2024-06-19 19:03:02,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=42999.0, ans=0.2
2024-06-19 19:03:03,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=42999.0, ans=15.0
2024-06-19 19:03:10,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.11 vs. limit=15.0
2024-06-19 19:03:11,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=43017.333333333336, ans=0.95
2024-06-19 19:03:12,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=43017.333333333336, ans=0.2
2024-06-19 19:03:22,510 INFO [train.py:1028] (0/2) Epoch 3, batch 3250, loss[loss=0.6987, simple_loss=0.5191, pruned_loss=0.4392, over 13245.00 frames. ], tot_loss[loss=0.7483, simple_loss=0.5409, pruned_loss=0.4779, over 2585396.19 frames. ], batch size: 72, lr: 1.80e-02, grad_scale: 0.5
2024-06-19 19:03:22,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=43054.0, ans=0.0015099999999999992
2024-06-19 19:03:29,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=43072.333333333336, ans=0.125
2024-06-19 19:03:31,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=12.0
2024-06-19 19:03:32,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43072.333333333336, ans=0.125
2024-06-19 19:03:45,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43109.0, ans=0.125
2024-06-19 19:03:50,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43127.333333333336, ans=0.1
2024-06-19 19:04:00,236 INFO [train.py:1028] (0/2) Epoch 3, batch 3300, loss[loss=0.7314, simple_loss=0.5323, pruned_loss=0.4653, over 12796.00 frames. ], tot_loss[loss=0.741, simple_loss=0.5379, pruned_loss=0.472, over 2581886.67 frames. ], batch size: 177, lr: 1.80e-02, grad_scale: 1.0
2024-06-19 19:04:00,899 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.465e+03 4.309e+03 4.998e+03 5.820e+03 8.440e+03, threshold=9.996e+03, percent-clipped=0.0
2024-06-19 19:04:14,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.59 vs. limit=10.0
2024-06-19 19:04:19,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=43182.333333333336, ans=0.0014821014492753613
2024-06-19 19:04:20,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=43200.666666666664, ans=0.0014781159420289864
2024-06-19 19:04:36,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=43237.333333333336, ans=0.2
2024-06-19 19:04:37,202 INFO [train.py:1028] (0/2) Epoch 3, batch 3350, loss[loss=0.6898, simple_loss=0.5016, pruned_loss=0.439, over 12996.00 frames. ], tot_loss[loss=0.733, simple_loss=0.5338, pruned_loss=0.4661, over 2577719.65 frames. ], batch size: 159, lr: 1.80e-02, grad_scale: 1.0
2024-06-19 19:04:38,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.31 vs. limit=22.5
2024-06-19 19:04:41,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=43237.333333333336, ans=0.0
2024-06-19 19:04:42,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43237.333333333336, ans=0.125
2024-06-19 19:04:46,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.41 vs. limit=15.0
2024-06-19 19:04:53,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43274.0, ans=0.1
2024-06-19 19:05:03,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.65 vs. limit=15.0
2024-06-19 19:05:04,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=43310.666666666664, ans=0.125
2024-06-19 19:05:10,572 INFO [train.py:1028] (0/2) Epoch 3, batch 3400, loss[loss=0.7238, simple_loss=0.5409, pruned_loss=0.4534, over 12389.00 frames. ], tot_loss[loss=0.7256, simple_loss=0.5304, pruned_loss=0.4604, over 2576383.10 frames. ], batch size: 22, lr: 1.79e-02, grad_scale: 1.0
2024-06-19 19:05:10,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=43329.0, ans=0.125
2024-06-19 19:05:11,939 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.396e+03 4.379e+03 4.796e+03 5.450e+03 1.460e+04, threshold=9.591e+03, percent-clipped=2.0
2024-06-19 19:05:14,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43329.0, ans=0.125
2024-06-19 19:05:14,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.95 vs. limit=10.0
2024-06-19 19:05:17,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.52 vs. limit=15.0
2024-06-19 19:05:20,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=43347.333333333336, ans=0.2
2024-06-19 19:05:32,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.96 vs. limit=15.0
2024-06-19 19:05:33,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=43384.0, ans=0.125
2024-06-19 19:05:33,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=43384.0, ans=0.09899494936611666
2024-06-19 19:05:36,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=43384.0, ans=0.125
2024-06-19 19:05:39,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=43402.333333333336, ans=0.025
2024-06-19 19:05:41,332 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-19 19:05:42,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.33 vs. limit=6.0
2024-06-19 19:05:43,986 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=26.68 vs. limit=22.5
2024-06-19 19:05:44,309 INFO [train.py:1028] (0/2) Epoch 3, batch 3450, loss[loss=0.7375, simple_loss=0.5399, pruned_loss=0.4676, over 12838.00 frames. ], tot_loss[loss=0.7167, simple_loss=0.5264, pruned_loss=0.4535, over 2577580.03 frames. ], batch size: 177, lr: 1.79e-02, grad_scale: 0.5
2024-06-19 19:05:45,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43420.666666666664, ans=0.0
2024-06-19 19:05:59,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=43457.333333333336, ans=0.0
2024-06-19 19:06:20,559 INFO [train.py:1028] (0/2) Epoch 3, batch 3500, loss[loss=0.6795, simple_loss=0.5088, pruned_loss=0.4251, over 12962.00 frames. ], tot_loss[loss=0.7132, simple_loss=0.5251, pruned_loss=0.4506, over 2577198.25 frames. ], batch size: 33, lr: 1.79e-02, grad_scale: 0.5
2024-06-19 19:06:22,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.80 vs. limit=10.0
2024-06-19 19:06:23,245 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.338e+03 4.525e+03 5.690e+03 7.285e+03 1.724e+04, threshold=1.138e+04, percent-clipped=9.0
2024-06-19 19:06:27,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=43512.333333333336, ans=0.125
2024-06-19 19:06:29,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=43512.333333333336, ans=0.125
2024-06-19 19:06:31,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=43530.666666666664, ans=0.025
2024-06-19 19:06:31,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=43530.666666666664, ans=0.125
2024-06-19 19:06:34,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.66 vs. limit=15.0
2024-06-19 19:06:44,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.18 vs. limit=15.0
2024-06-19 19:06:54,238 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 19:06:54,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=43585.666666666664, ans=0.1
2024-06-19 19:06:55,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=43585.666666666664, ans=0.125
2024-06-19 19:06:57,354 INFO [train.py:1028] (0/2) Epoch 3, batch 3550, loss[loss=0.6096, simple_loss=0.4559, pruned_loss=0.3817, over 13178.00 frames. ], tot_loss[loss=0.7051, simple_loss=0.5213, pruned_loss=0.4445, over 2578181.12 frames. ], batch size: 95, lr: 1.79e-02, grad_scale: 0.5
2024-06-19 19:07:00,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=43604.0, ans=0.2
2024-06-19 19:07:01,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43604.0, ans=0.0
2024-06-19 19:07:03,808 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.29 vs. limit=15.0
2024-06-19 19:07:05,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43622.333333333336, ans=0.1
2024-06-19 19:07:06,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=43622.333333333336, ans=0.125
2024-06-19 19:07:14,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=13.15 vs. limit=10.0
2024-06-19 19:07:16,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=43659.0, ans=0.125
2024-06-19 19:07:19,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=43659.0, ans=0.04949747468305833
2024-06-19 19:07:25,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=43677.333333333336, ans=0.125
2024-06-19 19:07:30,332 INFO [train.py:1028] (0/2) Epoch 3, batch 3600, loss[loss=0.6269, simple_loss=0.4792, pruned_loss=0.3873, over 13289.00 frames. ], tot_loss[loss=0.6964, simple_loss=0.5173, pruned_loss=0.4378, over 2580629.20 frames. ], batch size: 49, lr: 1.79e-02, grad_scale: 1.0
2024-06-19 19:07:33,090 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+03 3.679e+03 4.215e+03 4.931e+03 1.419e+04, threshold=8.430e+03, percent-clipped=1.0
2024-06-19 19:07:44,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=43732.333333333336, ans=0.125
2024-06-19 19:07:47,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43732.333333333336, ans=0.1
2024-06-19 19:07:48,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.73 vs. limit=15.0
2024-06-19 19:07:51,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.86 vs. limit=15.0
2024-06-19 19:08:00,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43750.666666666664, ans=0.1
2024-06-19 19:08:06,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43769.0, ans=0.1
2024-06-19 19:08:08,463 INFO [train.py:1028] (0/2) Epoch 3, batch 3650, loss[loss=0.6438, simple_loss=0.4863, pruned_loss=0.4006, over 13138.00 frames. ], tot_loss[loss=0.6887, simple_loss=0.5139, pruned_loss=0.4317, over 2579284.05 frames. ], batch size: 103, lr: 1.79e-02, grad_scale: 0.5
2024-06-19 19:08:09,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43787.333333333336, ans=0.125
2024-06-19 19:08:19,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=43805.666666666664, ans=0.125
2024-06-19 19:08:23,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43824.0, ans=0.125
2024-06-19 19:08:23,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=43824.0, ans=0.2
2024-06-19 19:08:23,980 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.11 vs. limit=15.0
2024-06-19 19:08:46,938 INFO [train.py:1028] (0/2) Epoch 3, batch 3700, loss[loss=0.679, simple_loss=0.5172, pruned_loss=0.4204, over 13258.00 frames. ], tot_loss[loss=0.6824, simple_loss=0.5113, pruned_loss=0.4268, over 2584191.79 frames. ], batch size: 72, lr: 1.78e-02, grad_scale: 1.0
2024-06-19 19:08:50,157 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.217e+03 3.217e+03 3.890e+03 4.383e+03 2.165e+04, threshold=7.780e+03, percent-clipped=6.0
2024-06-19 19:08:53,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.25 vs. limit=22.5
2024-06-19 19:08:54,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.19 vs. limit=6.0
2024-06-19 19:08:54,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=43897.333333333336, ans=0.0
2024-06-19 19:09:10,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.70 vs. limit=22.5
2024-06-19 19:09:19,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=43970.666666666664, ans=0.0
2024-06-19 19:09:19,534 INFO [train.py:1028] (0/2) Epoch 3, batch 3750, loss[loss=0.708, simple_loss=0.5375, pruned_loss=0.4392, over 12728.00 frames. ], tot_loss[loss=0.6761, simple_loss=0.5084, pruned_loss=0.4219, over 2585565.32 frames. ], batch size: 22, lr: 1.78e-02, grad_scale: 0.5
2024-06-19 19:09:22,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=43970.666666666664, ans=0.125
2024-06-19 19:09:24,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=43970.666666666664, ans=0.025
2024-06-19 19:09:25,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=43989.0, ans=0.001306739130434784
2024-06-19 19:09:29,299 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-24000.pt
2024-06-19 19:09:50,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.18 vs. limit=15.0
limit=15.0 2024-06-19 19:09:56,789 INFO [train.py:1028] (0/2) Epoch 3, batch 3800, loss[loss=0.5941, simple_loss=0.4599, pruned_loss=0.3642, over 13202.00 frames. ], tot_loss[loss=0.6698, simple_loss=0.5058, pruned_loss=0.4169, over 2583779.42 frames. ], batch size: 83, lr: 1.78e-02, grad_scale: 1.0 2024-06-19 19:10:00,756 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.058e+03 3.287e+03 3.846e+03 4.476e+03 7.364e+03, threshold=7.691e+03, percent-clipped=0.0 2024-06-19 19:10:01,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=44062.333333333336, ans=0.125 2024-06-19 19:10:12,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.96 vs. limit=10.0 2024-06-19 19:10:34,465 INFO [train.py:1028] (0/2) Epoch 3, batch 3850, loss[loss=0.665, simple_loss=0.501, pruned_loss=0.4145, over 13062.00 frames. ], tot_loss[loss=0.6633, simple_loss=0.503, pruned_loss=0.4118, over 2584165.03 frames. ], batch size: 144, lr: 1.78e-02, grad_scale: 0.25 2024-06-19 19:10:40,815 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.14 vs. limit=15.0 2024-06-19 19:10:41,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-19 19:10:56,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=44209.0, ans=0.125 2024-06-19 19:10:58,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=44209.0, ans=0.125 2024-06-19 19:10:59,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44209.0, ans=0.125 2024-06-19 19:11:10,398 INFO [train.py:1028] (0/2) Epoch 3, batch 3900, loss[loss=0.6409, simple_loss=0.4951, pruned_loss=0.3934, over 13186.00 frames. ], tot_loss[loss=0.6594, simple_loss=0.5011, pruned_loss=0.4088, over 2586309.18 frames. ], batch size: 83, lr: 1.78e-02, grad_scale: 0.5 2024-06-19 19:11:15,674 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.750e+03 4.809e+03 5.812e+03 6.715e+03 2.324e+04, threshold=1.162e+04, percent-clipped=11.0 2024-06-19 19:11:22,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=44264.0, ans=0.125 2024-06-19 19:11:25,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.68 vs. limit=15.0 2024-06-19 19:11:26,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.31 vs. 
limit=10.0 2024-06-19 19:11:28,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=44282.333333333336, ans=0.0 2024-06-19 19:11:28,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44282.333333333336, ans=0.1 2024-06-19 19:11:43,853 INFO [train.py:1028] (0/2) Epoch 3, batch 3950, loss[loss=0.6357, simple_loss=0.481, pruned_loss=0.3952, over 13111.00 frames. ], tot_loss[loss=0.6508, simple_loss=0.4969, pruned_loss=0.4023, over 2587452.02 frames. ], batch size: 132, lr: 1.77e-02, grad_scale: 0.5 2024-06-19 19:11:56,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.52 vs. limit=22.5 2024-06-19 19:11:58,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.41 vs. limit=15.0 2024-06-19 19:12:00,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=44374.0, ans=0.125 2024-06-19 19:12:01,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.89 vs. limit=22.5 2024-06-19 19:12:07,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=44392.333333333336, ans=0.0 2024-06-19 19:12:12,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44410.666666666664, ans=0.1 2024-06-19 19:12:20,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=44410.666666666664, ans=0.001215072463768117 2024-06-19 19:12:22,200 INFO [train.py:1028] (0/2) Epoch 3, batch 4000, loss[loss=0.7049, simple_loss=0.5383, pruned_loss=0.4358, over 13016.00 frames. ], tot_loss[loss=0.6454, simple_loss=0.4942, pruned_loss=0.3983, over 2582089.73 frames. ], batch size: 39, lr: 1.77e-02, grad_scale: 0.5 2024-06-19 19:12:25,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=44429.0, ans=0.05 2024-06-19 19:12:28,377 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.353e+03 3.801e+03 4.647e+03 5.271e+03 9.446e+03, threshold=9.294e+03, percent-clipped=0.0 2024-06-19 19:12:29,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44447.333333333336, ans=0.1 2024-06-19 19:12:29,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.88 vs. limit=22.5 2024-06-19 19:12:37,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.68 vs. 
limit=15.0 2024-06-19 19:12:38,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=44465.666666666664, ans=0.125 2024-06-19 19:12:42,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=44484.0, ans=0.125 2024-06-19 19:12:51,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.61 vs. limit=22.5 2024-06-19 19:12:52,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=44502.333333333336, ans=0.0 2024-06-19 19:12:53,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.49 vs. limit=15.0 2024-06-19 19:13:00,045 INFO [train.py:1028] (0/2) Epoch 3, batch 4050, loss[loss=0.6365, simple_loss=0.4811, pruned_loss=0.3959, over 11105.00 frames. ], tot_loss[loss=0.6424, simple_loss=0.4926, pruned_loss=0.3961, over 2579397.24 frames. ], batch size: 304, lr: 1.77e-02, grad_scale: 0.5 2024-06-19 19:13:02,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=44520.666666666664, ans=0.125 2024-06-19 19:13:07,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=44539.0, ans=0.035 2024-06-19 19:13:09,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2024-06-19 19:13:21,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=44575.666666666664, ans=0.125 2024-06-19 19:13:26,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.91 vs. limit=15.0 2024-06-19 19:13:33,497 INFO [train.py:1028] (0/2) Epoch 3, batch 4100, loss[loss=0.6452, simple_loss=0.4961, pruned_loss=0.3971, over 13000.00 frames. ], tot_loss[loss=0.6348, simple_loss=0.4887, pruned_loss=0.3905, over 2575636.42 frames. 
], batch size: 102, lr: 1.77e-02, grad_scale: 1.0 2024-06-19 19:13:39,391 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.499e+03 3.752e+03 4.260e+03 4.556e+03 1.053e+04, threshold=8.519e+03, percent-clipped=1.0 2024-06-19 19:13:39,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=44630.666666666664, ans=0.125 2024-06-19 19:13:47,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=44649.0, ans=0.2 2024-06-19 19:13:50,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=44649.0, ans=0.125 2024-06-19 19:13:57,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=44667.333333333336, ans=0.125 2024-06-19 19:13:59,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=44667.333333333336, ans=0.1 2024-06-19 19:14:04,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=44685.666666666664, ans=0.125 2024-06-19 19:14:07,203 INFO [train.py:1028] (0/2) Epoch 3, batch 4150, loss[loss=0.59, simple_loss=0.4685, pruned_loss=0.3557, over 13083.00 frames. ], tot_loss[loss=0.6285, simple_loss=0.4855, pruned_loss=0.3858, over 2575211.03 frames. ], batch size: 55, lr: 1.77e-02, grad_scale: 1.0 2024-06-19 19:14:13,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=44722.333333333336, ans=0.125 2024-06-19 19:14:14,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=44722.333333333336, ans=0.0 2024-06-19 19:14:15,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=44722.333333333336, ans=0.0011473188405797096 2024-06-19 19:14:45,973 INFO [train.py:1028] (0/2) Epoch 3, batch 4200, loss[loss=0.5703, simple_loss=0.4512, pruned_loss=0.3447, over 13199.00 frames. ], tot_loss[loss=0.6202, simple_loss=0.4813, pruned_loss=0.3796, over 2578516.28 frames. ], batch size: 103, lr: 1.77e-02, grad_scale: 2.0 2024-06-19 19:14:46,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.03 vs. limit=22.5 2024-06-19 19:14:51,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=44795.666666666664, ans=0.125 2024-06-19 19:14:52,471 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+03 2.895e+03 3.419e+03 3.761e+03 1.069e+04, threshold=6.837e+03, percent-clipped=2.0 2024-06-19 19:14:55,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.62 vs. 
limit=15.0 2024-06-19 19:14:58,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=44832.333333333336, ans=0.125 2024-06-19 19:15:13,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=44850.666666666664, ans=0.125 2024-06-19 19:15:19,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=15.0 2024-06-19 19:15:22,912 INFO [train.py:1028] (0/2) Epoch 3, batch 4250, loss[loss=0.5542, simple_loss=0.4482, pruned_loss=0.3301, over 13252.00 frames. ], tot_loss[loss=0.6152, simple_loss=0.479, pruned_loss=0.3757, over 2581212.57 frames. ], batch size: 46, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:15:32,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44905.666666666664, ans=0.1 2024-06-19 19:15:35,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=44924.0, ans=0.0 2024-06-19 19:15:38,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=44924.0, ans=15.0 2024-06-19 19:15:51,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.83 vs. limit=15.0 2024-06-19 19:15:52,245 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=15.0 2024-06-19 19:15:54,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=44960.666666666664, ans=0.025 2024-06-19 19:15:54,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-19 19:15:55,834 INFO [train.py:1028] (0/2) Epoch 3, batch 4300, loss[loss=0.5673, simple_loss=0.4527, pruned_loss=0.341, over 13149.00 frames. ], tot_loss[loss=0.6075, simple_loss=0.475, pruned_loss=0.37, over 2582151.12 frames. ], batch size: 59, lr: 1.76e-02, grad_scale: 2.0 2024-06-19 19:16:02,668 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+03 2.606e+03 2.886e+03 3.311e+03 1.153e+04, threshold=5.771e+03, percent-clipped=1.0 2024-06-19 19:16:07,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=44997.333333333336, ans=0.125 2024-06-19 19:16:09,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=45015.666666666664, ans=0.2 2024-06-19 19:16:12,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=45015.666666666664, ans=0.0 2024-06-19 19:16:12,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.75 vs. limit=22.5 2024-06-19 19:16:34,530 INFO [train.py:1028] (0/2) Epoch 3, batch 4350, loss[loss=0.5745, simple_loss=0.4631, pruned_loss=0.343, over 13233.00 frames. ], tot_loss[loss=0.6017, simple_loss=0.4719, pruned_loss=0.3657, over 2586780.54 frames. 
], batch size: 59, lr: 1.76e-02, grad_scale: 0.5 2024-06-19 19:16:40,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=45089.0, ans=0.2 2024-06-19 19:16:41,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.36 vs. limit=15.0 2024-06-19 19:16:44,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=15.0 2024-06-19 19:16:51,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=45107.333333333336, ans=0.125 2024-06-19 19:16:55,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.95 vs. limit=6.0 2024-06-19 19:17:02,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.72 vs. limit=15.0 2024-06-19 19:17:08,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=45144.0, ans=0.125 2024-06-19 19:17:11,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45162.333333333336, ans=0.125 2024-06-19 19:17:11,549 INFO [train.py:1028] (0/2) Epoch 3, batch 4400, loss[loss=0.5548, simple_loss=0.4438, pruned_loss=0.3329, over 13215.00 frames. ], tot_loss[loss=0.5975, simple_loss=0.4698, pruned_loss=0.3626, over 2587006.22 frames. ], batch size: 83, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:17:12,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45162.333333333336, ans=0.1 2024-06-19 19:17:19,670 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+03 3.555e+03 3.998e+03 4.760e+03 1.239e+04, threshold=7.996e+03, percent-clipped=11.0 2024-06-19 19:17:23,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=45180.666666666664, ans=0.125 2024-06-19 19:17:26,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.71 vs. limit=15.0 2024-06-19 19:17:29,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.14 vs. limit=15.0 2024-06-19 19:17:33,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=45217.333333333336, ans=0.0010397101449275362 2024-06-19 19:17:38,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.29 vs. limit=15.0 2024-06-19 19:17:40,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.45 vs. limit=22.5 2024-06-19 19:17:44,575 INFO [train.py:1028] (0/2) Epoch 3, batch 4450, loss[loss=0.5055, simple_loss=0.4224, pruned_loss=0.2943, over 13035.00 frames. ], tot_loss[loss=0.593, simple_loss=0.4677, pruned_loss=0.3591, over 2582066.92 frames. 
], batch size: 33, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:17:45,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=45254.0, ans=0.0 2024-06-19 19:17:53,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45272.333333333336, ans=0.1 2024-06-19 19:17:57,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.54 vs. limit=15.0 2024-06-19 19:17:59,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=45290.666666666664, ans=0.0010237681159420298 2024-06-19 19:18:00,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.79 vs. limit=22.5 2024-06-19 19:18:01,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2024-06-19 19:18:08,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=45309.0, ans=10.0 2024-06-19 19:18:12,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.25 vs. limit=12.0 2024-06-19 19:18:13,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=45327.333333333336, ans=0.0 2024-06-19 19:18:16,833 INFO [train.py:1028] (0/2) Epoch 3, batch 4500, loss[loss=0.5611, simple_loss=0.444, pruned_loss=0.3391, over 13231.00 frames. ], tot_loss[loss=0.5911, simple_loss=0.4664, pruned_loss=0.3579, over 2586285.73 frames. ], batch size: 89, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:18:17,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.20 vs. limit=22.5 2024-06-19 19:18:23,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5 2024-06-19 19:18:28,582 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+03 3.021e+03 3.472e+03 4.246e+03 1.154e+04, threshold=6.944e+03, percent-clipped=1.0 2024-06-19 19:18:28,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=45364.0, ans=0.2 2024-06-19 19:18:34,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=45382.333333333336, ans=0.1 2024-06-19 19:18:47,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.19 vs. limit=22.5 2024-06-19 19:18:53,181 INFO [train.py:1028] (0/2) Epoch 3, batch 4550, loss[loss=0.5586, simple_loss=0.4638, pruned_loss=0.3267, over 13287.00 frames. ], tot_loss[loss=0.5869, simple_loss=0.4646, pruned_loss=0.3546, over 2590218.73 frames. 
], batch size: 52, lr: 1.76e-02, grad_scale: 1.0 2024-06-19 19:18:55,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=45437.333333333336, ans=10.0 2024-06-19 19:18:59,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.03 vs. limit=22.5 2024-06-19 19:19:13,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45474.0, ans=0.0 2024-06-19 19:19:16,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.89 vs. limit=22.5 2024-06-19 19:19:16,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=45492.333333333336, ans=0.04949747468305833 2024-06-19 19:19:24,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.01 vs. limit=10.0 2024-06-19 19:19:29,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45510.666666666664, ans=0.0 2024-06-19 19:19:30,388 INFO [train.py:1028] (0/2) Epoch 3, batch 4600, loss[loss=0.6033, simple_loss=0.4715, pruned_loss=0.3676, over 12549.00 frames. ], tot_loss[loss=0.5832, simple_loss=0.4629, pruned_loss=0.3517, over 2585630.21 frames. ], batch size: 202, lr: 1.75e-02, grad_scale: 2.0 2024-06-19 19:19:34,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45529.0, ans=0.0 2024-06-19 19:19:35,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=45529.0, ans=0.05 2024-06-19 19:19:39,403 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.360e+03 2.329e+03 2.704e+03 3.038e+03 8.611e+03, threshold=5.407e+03, percent-clipped=1.0 2024-06-19 19:19:40,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.58 vs. limit=22.5 2024-06-19 19:19:41,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.26 vs. limit=22.5 2024-06-19 19:19:47,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=45565.666666666664, ans=0.07 2024-06-19 19:19:53,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.86 vs. limit=10.0 2024-06-19 19:20:04,175 INFO [train.py:1028] (0/2) Epoch 3, batch 4650, loss[loss=0.5381, simple_loss=0.4319, pruned_loss=0.3222, over 13163.00 frames. ], tot_loss[loss=0.577, simple_loss=0.4597, pruned_loss=0.3472, over 2588741.70 frames. 
], batch size: 132, lr: 1.75e-02, grad_scale: 1.0 2024-06-19 19:20:09,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45620.666666666664, ans=0.125 2024-06-19 19:20:10,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=45639.0, ans=0.125 2024-06-19 19:20:13,808 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.81 vs. limit=15.0 2024-06-19 19:20:22,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=45657.333333333336, ans=0.0 2024-06-19 19:20:40,872 INFO [train.py:1028] (0/2) Epoch 3, batch 4700, loss[loss=0.5435, simple_loss=0.4469, pruned_loss=0.3201, over 12340.00 frames. ], tot_loss[loss=0.5718, simple_loss=0.4573, pruned_loss=0.3431, over 2583414.36 frames. ], batch size: 25, lr: 1.75e-02, grad_scale: 1.0 2024-06-19 19:20:46,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=45712.333333333336, ans=0.125 2024-06-19 19:20:46,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0 2024-06-19 19:20:48,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.40 vs. limit=10.0 2024-06-19 19:20:49,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=45730.666666666664, ans=0.125 2024-06-19 19:20:50,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=15.0 2024-06-19 19:20:51,034 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+03 2.530e+03 2.879e+03 3.262e+03 9.392e+03, threshold=5.758e+03, percent-clipped=4.0 2024-06-19 19:20:55,604 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.46 vs. limit=22.5 2024-06-19 19:20:57,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=45749.0, ans=0.125 2024-06-19 19:21:05,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=45767.333333333336, ans=0.0009201449275362313 2024-06-19 19:21:12,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0 2024-06-19 19:21:14,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=45785.666666666664, ans=0.125 2024-06-19 19:21:14,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=45785.666666666664, ans=0.0 2024-06-19 19:21:17,919 INFO [train.py:1028] (0/2) Epoch 3, batch 4750, loss[loss=0.5966, simple_loss=0.4687, pruned_loss=0.3622, over 12595.00 frames. ], tot_loss[loss=0.5668, simple_loss=0.4547, pruned_loss=0.3395, over 2580115.80 frames. 
], batch size: 202, lr: 1.75e-02, grad_scale: 1.0 2024-06-19 19:21:20,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0 2024-06-19 19:21:26,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.21 vs. limit=15.0 2024-06-19 19:21:51,909 INFO [train.py:1028] (0/2) Epoch 3, batch 4800, loss[loss=0.6121, simple_loss=0.4933, pruned_loss=0.3655, over 13277.00 frames. ], tot_loss[loss=0.5641, simple_loss=0.4537, pruned_loss=0.3372, over 2576964.90 frames. ], batch size: 63, lr: 1.75e-02, grad_scale: 2.0 2024-06-19 19:21:58,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.10 vs. limit=15.0 2024-06-19 19:22:00,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.10 vs. limit=6.0 2024-06-19 19:22:01,849 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+03 2.639e+03 2.961e+03 3.417e+03 8.350e+03, threshold=5.921e+03, percent-clipped=3.0 2024-06-19 19:22:02,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45914.0, ans=0.1 2024-06-19 19:22:03,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2024-06-19 19:22:04,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.92 vs. limit=22.5 2024-06-19 19:22:04,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=45932.333333333336, ans=0.04949747468305833 2024-06-19 19:22:07,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=45932.333333333336, ans=0.0 2024-06-19 19:22:12,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45950.666666666664, ans=0.1 2024-06-19 19:22:25,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=45969.0, ans=0.125 2024-06-19 19:22:26,409 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=5.215e-03 2024-06-19 19:22:28,965 INFO [train.py:1028] (0/2) Epoch 3, batch 4850, loss[loss=0.5399, simple_loss=0.4402, pruned_loss=0.3198, over 13281.00 frames. ], tot_loss[loss=0.5607, simple_loss=0.452, pruned_loss=0.3347, over 2574950.58 frames. ], batch size: 89, lr: 1.75e-02, grad_scale: 2.0 2024-06-19 19:22:32,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=45987.333333333336, ans=0.125 2024-06-19 19:22:45,010 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.34 vs. limit=22.5 2024-06-19 19:22:45,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=17.31 vs. 
limit=15.0 2024-06-19 19:22:50,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=46042.333333333336, ans=0.125 2024-06-19 19:23:03,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=46060.666666666664, ans=0.125 2024-06-19 19:23:06,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.72 vs. limit=10.0 2024-06-19 19:23:07,400 INFO [train.py:1028] (0/2) Epoch 3, batch 4900, loss[loss=0.5538, simple_loss=0.4545, pruned_loss=0.3265, over 13163.00 frames. ], tot_loss[loss=0.5571, simple_loss=0.4504, pruned_loss=0.3319, over 2576247.81 frames. ], batch size: 59, lr: 1.74e-02, grad_scale: 2.0 2024-06-19 19:23:08,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=46079.0, ans=0.125 2024-06-19 19:23:18,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=46097.333333333336, ans=0.0008484057971014491 2024-06-19 19:23:19,001 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.211e+03 3.047e+03 3.481e+03 3.892e+03 6.998e+03, threshold=6.963e+03, percent-clipped=3.0 2024-06-19 19:23:19,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=46097.333333333336, ans=0.125 2024-06-19 19:23:23,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=46115.666666666664, ans=10.0 2024-06-19 19:23:24,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=46115.666666666664, ans=0.125 2024-06-19 19:23:34,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.15 vs. limit=6.0 2024-06-19 19:23:39,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=46152.333333333336, ans=0.125 2024-06-19 19:23:40,468 INFO [train.py:1028] (0/2) Epoch 3, batch 4950, loss[loss=0.5692, simple_loss=0.4457, pruned_loss=0.3464, over 11002.00 frames. ], tot_loss[loss=0.5581, simple_loss=0.4512, pruned_loss=0.3325, over 2570342.79 frames. ], batch size: 303, lr: 1.74e-02, grad_scale: 0.5 2024-06-19 19:23:41,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46170.666666666664, ans=0.125 2024-06-19 19:23:59,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=46225.666666666664, ans=0.125 2024-06-19 19:24:06,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46244.0, ans=0.1 2024-06-19 19:24:11,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46244.0, ans=0.125 2024-06-19 19:24:13,535 INFO [train.py:1028] (0/2) Epoch 3, batch 5000, loss[loss=0.4876, simple_loss=0.4028, pruned_loss=0.2862, over 13179.00 frames. ], tot_loss[loss=0.5525, simple_loss=0.4484, pruned_loss=0.3284, over 2574276.45 frames. 
], batch size: 95, lr: 1.74e-02, grad_scale: 1.0 2024-06-19 19:24:22,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46280.666666666664, ans=0.1 2024-06-19 19:24:25,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+03 2.469e+03 2.950e+03 3.535e+03 1.221e+04, threshold=5.900e+03, percent-clipped=5.0 2024-06-19 19:24:28,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=46299.0, ans=0.125 2024-06-19 19:24:37,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=46317.333333333336, ans=0.0 2024-06-19 19:24:38,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46317.333333333336, ans=0.125 2024-06-19 19:24:51,190 INFO [train.py:1028] (0/2) Epoch 3, batch 5050, loss[loss=0.504, simple_loss=0.4206, pruned_loss=0.2937, over 12900.00 frames. ], tot_loss[loss=0.548, simple_loss=0.4464, pruned_loss=0.3248, over 2571411.70 frames. ], batch size: 36, lr: 1.74e-02, grad_scale: 1.0 2024-06-19 19:24:51,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=46354.0, ans=0.0 2024-06-19 19:25:03,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=46372.333333333336, ans=15.0 2024-06-19 19:25:13,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=46409.0, ans=0.0007806521739130434 2024-06-19 19:25:13,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=46409.0, ans=0.0 2024-06-19 19:25:13,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46409.0, ans=0.1 2024-06-19 19:25:18,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=46409.0, ans=10.0 2024-06-19 19:25:27,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.11 vs. limit=15.0 2024-06-19 19:25:28,065 INFO [train.py:1028] (0/2) Epoch 3, batch 5100, loss[loss=0.5147, simple_loss=0.4338, pruned_loss=0.2978, over 12968.00 frames. ], tot_loss[loss=0.5448, simple_loss=0.4442, pruned_loss=0.3227, over 2567915.62 frames. ], batch size: 39, lr: 1.74e-02, grad_scale: 2.0 2024-06-19 19:25:37,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=46464.0, ans=0.0 2024-06-19 19:25:40,550 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.451e+03 2.537e+03 2.908e+03 3.402e+03 6.063e+03, threshold=5.816e+03, percent-clipped=1.0 2024-06-19 19:25:43,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=46482.333333333336, ans=0.0007647101449275353 2024-06-19 19:25:44,030 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.13 vs. 
limit=22.5 2024-06-19 19:25:44,672 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 19:25:46,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.95 vs. limit=15.0 2024-06-19 19:25:48,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.30 vs. limit=10.0 2024-06-19 19:25:50,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46500.666666666664, ans=0.1 2024-06-19 19:26:01,114 INFO [train.py:1028] (0/2) Epoch 3, batch 5150, loss[loss=0.5444, simple_loss=0.4458, pruned_loss=0.3215, over 13153.00 frames. ], tot_loss[loss=0.5439, simple_loss=0.4439, pruned_loss=0.3219, over 2570375.05 frames. ], batch size: 132, lr: 1.74e-02, grad_scale: 1.0 2024-06-19 19:26:01,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=46537.333333333336, ans=0.125 2024-06-19 19:26:09,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=46555.666666666664, ans=0.125 2024-06-19 19:26:14,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=46574.0, ans=0.0007447826086956523 2024-06-19 19:26:15,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=46574.0, ans=0.0007447826086956523 2024-06-19 19:26:18,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=46574.0, ans=0.0 2024-06-19 19:26:20,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=46574.0, ans=0.125 2024-06-19 19:26:27,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=46592.333333333336, ans=0.125 2024-06-19 19:26:27,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=46592.333333333336, ans=0.125 2024-06-19 19:26:30,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=46592.333333333336, ans=0.0007407971014492757 2024-06-19 19:26:38,089 INFO [train.py:1028] (0/2) Epoch 3, batch 5200, loss[loss=0.5505, simple_loss=0.4517, pruned_loss=0.3246, over 13219.00 frames. ], tot_loss[loss=0.541, simple_loss=0.4425, pruned_loss=0.3197, over 2574572.36 frames. ], batch size: 95, lr: 1.73e-02, grad_scale: 2.0 2024-06-19 19:26:39,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=46629.0, ans=0.05 2024-06-19 19:26:39,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.32 vs. 
limit=10.0 2024-06-19 19:26:39,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46629.0, ans=0.1 2024-06-19 19:26:50,537 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.878e+03 2.484e+03 2.722e+03 3.148e+03 8.839e+03, threshold=5.445e+03, percent-clipped=3.0 2024-06-19 19:26:54,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46665.666666666664, ans=0.1 2024-06-19 19:27:14,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=46720.666666666664, ans=0.0007128985507246378 2024-06-19 19:27:14,737 INFO [train.py:1028] (0/2) Epoch 3, batch 5250, loss[loss=0.512, simple_loss=0.4308, pruned_loss=0.2966, over 13268.00 frames. ], tot_loss[loss=0.5385, simple_loss=0.4415, pruned_loss=0.3177, over 2569581.85 frames. ], batch size: 52, lr: 1.73e-02, grad_scale: 1.0 2024-06-19 19:27:16,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=46720.666666666664, ans=0.125 2024-06-19 19:27:21,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=46739.0, ans=0.125 2024-06-19 19:27:23,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=46739.0, ans=0.04949747468305833 2024-06-19 19:27:25,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=46739.0, ans=0.125 2024-06-19 19:27:48,221 INFO [train.py:1028] (0/2) Epoch 3, batch 5300, loss[loss=0.5335, simple_loss=0.4317, pruned_loss=0.3176, over 13009.00 frames. ], tot_loss[loss=0.536, simple_loss=0.4407, pruned_loss=0.3157, over 2567201.41 frames. ], batch size: 144, lr: 1.73e-02, grad_scale: 2.0 2024-06-19 19:28:00,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2024-06-19 19:28:03,222 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.358e+03 2.340e+03 2.676e+03 3.141e+03 8.225e+03, threshold=5.352e+03, percent-clipped=3.0 2024-06-19 19:28:03,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46849.0, ans=0.125 2024-06-19 19:28:10,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=46867.333333333336, ans=0.0 2024-06-19 19:28:12,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=46867.333333333336, ans=0.125 2024-06-19 19:28:15,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=46885.666666666664, ans=0.125 2024-06-19 19:28:15,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=46885.666666666664, ans=0.125 2024-06-19 19:28:16,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=46885.666666666664, ans=0.125 2024-06-19 19:28:27,111 INFO [train.py:1028] (0/2) Epoch 3, batch 5350, loss[loss=0.5227, simple_loss=0.4424, pruned_loss=0.3015, over 12228.00 frames. 
], tot_loss[loss=0.5328, simple_loss=0.4393, pruned_loss=0.3132, over 2575148.98 frames. ], batch size: 18, lr: 1.73e-02, grad_scale: 0.5 2024-06-19 19:28:27,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.14 vs. limit=22.5 2024-06-19 19:28:28,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46904.0, ans=0.1 2024-06-19 19:28:34,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-19 19:28:45,313 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=2.389e-01 2024-06-19 19:29:03,521 INFO [train.py:1028] (0/2) Epoch 3, batch 5400, loss[loss=0.5366, simple_loss=0.4249, pruned_loss=0.3242, over 12025.00 frames. ], tot_loss[loss=0.5277, simple_loss=0.436, pruned_loss=0.3097, over 2567275.37 frames. ], batch size: 240, lr: 1.73e-02, grad_scale: 1.0 2024-06-19 19:29:11,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=47014.0, ans=12.0 2024-06-19 19:29:15,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.90 vs. limit=15.0 2024-06-19 19:29:16,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=47032.333333333336, ans=0.125 2024-06-19 19:29:17,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=47032.333333333336, ans=0.05 2024-06-19 19:29:18,336 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.352e+03 2.122e+03 2.365e+03 2.676e+03 5.183e+03, threshold=4.730e+03, percent-clipped=0.0 2024-06-19 19:29:20,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=47032.333333333336, ans=15.0 2024-06-19 19:29:24,846 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.569e+01 2024-06-19 19:29:25,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2024-06-19 19:29:25,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.98 vs. limit=15.0 2024-06-19 19:29:31,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=47069.0, ans=0.2 2024-06-19 19:29:37,080 INFO [train.py:1028] (0/2) Epoch 3, batch 5450, loss[loss=0.5179, simple_loss=0.4428, pruned_loss=0.2965, over 12449.00 frames. ], tot_loss[loss=0.5215, simple_loss=0.4331, pruned_loss=0.3049, over 2571022.41 frames. ], batch size: 25, lr: 1.73e-02, grad_scale: 1.0 2024-06-19 19:29:40,208 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. 
limit=6.0 2024-06-19 19:29:42,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=47087.333333333336, ans=0.0 2024-06-19 19:29:47,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.92 vs. limit=6.0 2024-06-19 19:30:03,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=47160.666666666664, ans=0.0 2024-06-19 19:30:03,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=47160.666666666664, ans=0.125 2024-06-19 19:30:10,590 INFO [train.py:1028] (0/2) Epoch 3, batch 5500, loss[loss=0.5516, simple_loss=0.4435, pruned_loss=0.3298, over 12291.00 frames. ], tot_loss[loss=0.5171, simple_loss=0.4306, pruned_loss=0.3018, over 2564890.13 frames. ], batch size: 241, lr: 1.73e-02, grad_scale: 2.0 2024-06-19 19:30:11,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=47179.0, ans=0.125 2024-06-19 19:30:14,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=47179.0, ans=0.125 2024-06-19 19:30:22,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=47197.333333333336, ans=0.125 2024-06-19 19:30:30,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.47 vs. limit=22.5 2024-06-19 19:30:31,755 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.496e+03 2.155e+03 2.560e+03 3.017e+03 5.450e+03, threshold=5.120e+03, percent-clipped=2.0 2024-06-19 19:30:35,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=47215.666666666664, ans=22.5 2024-06-19 19:30:37,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=47234.0, ans=0.0 2024-06-19 19:30:38,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=47234.0, ans=0.0 2024-06-19 19:30:49,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=47252.333333333336, ans=0.125 2024-06-19 19:30:51,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=47270.666666666664, ans=0.0 2024-06-19 19:30:51,534 INFO [train.py:1028] (0/2) Epoch 3, batch 5550, loss[loss=0.5181, simple_loss=0.4436, pruned_loss=0.2963, over 13185.00 frames. ], tot_loss[loss=0.5137, simple_loss=0.4289, pruned_loss=0.2993, over 2569446.14 frames. ], batch size: 43, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:30:52,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=47270.666666666664, ans=0.125 2024-06-19 19:30:55,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.02 vs. 
limit=10.0 2024-06-19 19:30:55,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.67 vs. limit=22.5 2024-06-19 19:30:55,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.64 vs. limit=15.0 2024-06-19 19:31:13,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=47307.333333333336, ans=0.125 2024-06-19 19:31:14,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.85 vs. limit=22.5 2024-06-19 19:31:21,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=47344.0, ans=0.05 2024-06-19 19:31:25,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=47344.0, ans=0.0 2024-06-19 19:31:26,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.54 vs. limit=10.0 2024-06-19 19:31:27,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.60 vs. limit=12.0 2024-06-19 19:31:28,456 INFO [train.py:1028] (0/2) Epoch 3, batch 5600, loss[loss=0.4803, simple_loss=0.4131, pruned_loss=0.2737, over 13277.00 frames. ], tot_loss[loss=0.5109, simple_loss=0.4274, pruned_loss=0.2972, over 2571517.79 frames. ], batch size: 89, lr: 1.72e-02, grad_scale: 2.0 2024-06-19 19:31:31,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=47362.333333333336, ans=0.1 2024-06-19 19:31:35,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=47380.666666666664, ans=0.125 2024-06-19 19:31:38,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47380.666666666664, ans=0.1 2024-06-19 19:31:38,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47380.666666666664, ans=0.1 2024-06-19 19:31:43,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.72 vs. limit=10.0 2024-06-19 19:31:45,243 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.504e+03 2.337e+03 2.666e+03 3.079e+03 1.375e+04, threshold=5.333e+03, percent-clipped=6.0 2024-06-19 19:31:46,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=47399.0, ans=0.125 2024-06-19 19:31:48,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.83 vs. limit=15.0 2024-06-19 19:31:52,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=47417.333333333336, ans=0.0 2024-06-19 19:32:00,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.72 vs. 
limit=15.0 2024-06-19 19:32:02,690 INFO [train.py:1028] (0/2) Epoch 3, batch 5650, loss[loss=0.5677, simple_loss=0.453, pruned_loss=0.3412, over 12531.00 frames. ], tot_loss[loss=0.5101, simple_loss=0.4274, pruned_loss=0.2964, over 2577052.48 frames. ], batch size: 202, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:32:02,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=47454.0, ans=0.2 2024-06-19 19:32:06,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=47454.0, ans=0.125 2024-06-19 19:32:10,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=47472.333333333336, ans=0.0 2024-06-19 19:32:12,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.38 vs. limit=15.0 2024-06-19 19:32:17,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.40 vs. limit=15.0 2024-06-19 19:32:24,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.86 vs. limit=15.0 2024-06-19 19:32:35,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=47527.333333333336, ans=0.125 2024-06-19 19:32:35,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=47527.333333333336, ans=0.0 2024-06-19 19:32:39,761 INFO [train.py:1028] (0/2) Epoch 3, batch 5700, loss[loss=0.4717, simple_loss=0.4087, pruned_loss=0.2674, over 13269.00 frames. ], tot_loss[loss=0.5057, simple_loss=0.4251, pruned_loss=0.2932, over 2580440.99 frames. ], batch size: 63, lr: 1.72e-02, grad_scale: 2.0 2024-06-19 19:32:42,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=47545.666666666664, ans=0.2 2024-06-19 19:32:44,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=47545.666666666664, ans=0.2 2024-06-19 19:32:53,145 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.26 vs. limit=15.0 2024-06-19 19:32:55,959 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.564e+03 2.274e+03 2.536e+03 2.749e+03 5.319e+03, threshold=5.072e+03, percent-clipped=0.0 2024-06-19 19:33:10,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=47619.0, ans=0.125 2024-06-19 19:33:16,042 INFO [train.py:1028] (0/2) Epoch 3, batch 5750, loss[loss=0.5292, simple_loss=0.4364, pruned_loss=0.311, over 12777.00 frames. ], tot_loss[loss=0.506, simple_loss=0.4258, pruned_loss=0.2931, over 2579873.49 frames. 
], batch size: 176, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:33:18,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=47637.333333333336, ans=0.125 2024-06-19 19:33:23,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47655.666666666664, ans=0.1 2024-06-19 19:33:25,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=47655.666666666664, ans=0.125 2024-06-19 19:33:33,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=47674.0, ans=0.125 2024-06-19 19:33:34,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=47674.0, ans=0.05 2024-06-19 19:33:41,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47692.333333333336, ans=0.1 2024-06-19 19:33:44,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=47710.666666666664, ans=15.0 2024-06-19 19:33:46,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47710.666666666664, ans=0.1 2024-06-19 19:33:49,475 INFO [train.py:1028] (0/2) Epoch 3, batch 5800, loss[loss=0.5234, simple_loss=0.4332, pruned_loss=0.3068, over 12827.00 frames. ], tot_loss[loss=0.509, simple_loss=0.4274, pruned_loss=0.2953, over 2578854.25 frames. ], batch size: 176, lr: 1.72e-02, grad_scale: 1.0 2024-06-19 19:33:54,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=47729.0, ans=10.0 2024-06-19 19:33:54,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2024-06-19 19:33:59,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.09 vs. limit=15.0 2024-06-19 19:34:01,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=47747.333333333336, ans=0.0 2024-06-19 19:34:07,224 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.893e+03 2.831e+03 3.397e+03 4.042e+03 1.204e+04, threshold=6.795e+03, percent-clipped=5.0 2024-06-19 19:34:10,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=47784.0, ans=0.125 2024-06-19 19:34:16,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=47802.333333333336, ans=0.125 2024-06-19 19:34:22,564 INFO [train.py:1028] (0/2) Epoch 3, batch 5850, loss[loss=0.5291, simple_loss=0.4358, pruned_loss=0.3112, over 12496.00 frames. ], tot_loss[loss=0.5107, simple_loss=0.4289, pruned_loss=0.2962, over 2577576.30 frames. ], batch size: 202, lr: 1.71e-02, grad_scale: 0.5 2024-06-19 19:34:22,964 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=18.46 vs. 
limit=15.0 2024-06-19 19:34:31,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.29 vs. limit=6.0 2024-06-19 19:34:38,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=47839.0, ans=0.125 2024-06-19 19:34:42,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=47857.333333333336, ans=0.125 2024-06-19 19:34:59,175 INFO [train.py:1028] (0/2) Epoch 3, batch 5900, loss[loss=0.5172, simple_loss=0.4372, pruned_loss=0.2987, over 13154.00 frames. ], tot_loss[loss=0.5145, simple_loss=0.4323, pruned_loss=0.2984, over 2577142.67 frames. ], batch size: 121, lr: 1.71e-02, grad_scale: 1.0 2024-06-19 19:35:09,321 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.25 vs. limit=6.0 2024-06-19 19:35:15,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=47930.666666666664, ans=0.2 2024-06-19 19:35:22,066 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.802e+03 2.737e+03 2.992e+03 3.452e+03 5.961e+03, threshold=5.985e+03, percent-clipped=0.0 2024-06-19 19:35:23,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.04 vs. limit=15.0 2024-06-19 19:35:25,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=47967.333333333336, ans=0.125 2024-06-19 19:35:28,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=47967.333333333336, ans=0.0 2024-06-19 19:35:32,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=47985.666666666664, ans=0.125 2024-06-19 19:35:37,285 INFO [train.py:1028] (0/2) Epoch 3, batch 5950, loss[loss=0.497, simple_loss=0.419, pruned_loss=0.2875, over 13083.00 frames. ], tot_loss[loss=0.5158, simple_loss=0.4344, pruned_loss=0.2986, over 2581709.01 frames. ], batch size: 121, lr: 1.71e-02, grad_scale: 1.0 2024-06-19 19:35:38,187 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.037e+01 2024-06-19 19:35:45,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=48022.333333333336, ans=0.125 2024-06-19 19:35:47,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=48022.333333333336, ans=0.125 2024-06-19 19:35:56,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48040.666666666664, ans=0.125 2024-06-19 19:35:57,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=48059.0, ans=0.0 2024-06-19 19:35:59,321 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2024-06-19 19:36:00,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.85 vs. 
limit=15.0 2024-06-19 19:36:02,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.92 vs. limit=22.5 2024-06-19 19:36:04,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=48077.333333333336, ans=0.1 2024-06-19 19:36:07,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=48077.333333333336, ans=0.125 2024-06-19 19:36:10,739 INFO [train.py:1028] (0/2) Epoch 3, batch 6000, loss[loss=0.6038, simple_loss=0.482, pruned_loss=0.3628, over 12223.00 frames. ], tot_loss[loss=0.5176, simple_loss=0.4359, pruned_loss=0.2996, over 2574775.81 frames. ], batch size: 240, lr: 1.71e-02, grad_scale: 2.0 2024-06-19 19:36:10,740 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 19:36:18,518 INFO [train.py:1060] (0/2) Epoch 3, validation: loss=0.4102, simple_loss=0.3952, pruned_loss=0.2126, over 351949.00 frames. 2024-06-19 19:36:18,518 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16714MB 2024-06-19 19:36:27,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=48114.0, ans=0.125 2024-06-19 19:36:28,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=48114.0, ans=0.00041000000000000064 2024-06-19 19:36:30,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48114.0, ans=0.1 2024-06-19 19:36:32,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.06 vs. limit=6.0 2024-06-19 19:36:33,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48132.333333333336, ans=0.1 2024-06-19 19:36:40,461 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+03 2.626e+03 2.925e+03 3.337e+03 6.181e+03, threshold=5.850e+03, percent-clipped=1.0 2024-06-19 19:36:46,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48150.666666666664, ans=0.125 2024-06-19 19:36:46,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2024-06-19 19:36:47,620 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.42 vs. limit=15.0 2024-06-19 19:36:49,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=48169.0, ans=0.025 2024-06-19 19:36:51,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=43.08 vs. limit=15.0 2024-06-19 19:36:55,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=48187.333333333336, ans=0.0 2024-06-19 19:36:56,019 INFO [train.py:1028] (0/2) Epoch 3, batch 6050, loss[loss=0.5386, simple_loss=0.4576, pruned_loss=0.3098, over 12989.00 frames. ], tot_loss[loss=0.519, simple_loss=0.4381, pruned_loss=0.3, over 2576960.10 frames. 
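], batch size: 39, lr: 1.71e-02, grad_scale: 2.0

Each train.py:1028 entry reports the current batch's loss triple plus a running tot_loss over roughly 2.58M recent frames, and the train.py:1060 entries report the same triple over the full 351949-frame dev set. Throughout this section the logged totals are consistent with the usual pruned-transducer weighting, loss = 0.5 * simple_loss + pruned_loss; treating that weighting as an assumption, it can be checked directly against the numbers above:

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Assumed relation for these logs: loss = scale * simple + pruned.
    return simple_loss_scale * simple_loss + pruned_loss

# Values reproduced from nearby entries in this log:
assert round(combined_loss(0.4576, 0.3098), 4) == 0.5386   # batch 6050 train loss
assert round(combined_loss(0.3952, 0.2126), 4) == 0.4102   # batch 6000 validation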
2024-06-19 19:36:59,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=48187.333333333336, ans=0.125 2024-06-19 19:37:19,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=48224.0, ans=0.0003860869565217393 2024-06-19 19:37:31,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=21.70 vs. limit=15.0 2024-06-19 19:37:34,579 INFO [train.py:1028] (0/2) Epoch 3, batch 6100, loss[loss=0.5355, simple_loss=0.4511, pruned_loss=0.3099, over 13069.00 frames. ], tot_loss[loss=0.5203, simple_loss=0.4399, pruned_loss=0.3003, over 2580307.36 frames. ], batch size: 121, lr: 1.71e-02, grad_scale: 2.0 2024-06-19 19:37:46,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=48297.333333333336, ans=0.2 2024-06-19 19:37:48,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=48315.666666666664, ans=0.125 2024-06-19 19:37:54,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.455e+03 2.147e+03 2.599e+03 2.971e+03 1.154e+04, threshold=5.197e+03, percent-clipped=1.0 2024-06-19 19:38:05,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=48352.333333333336, ans=0.09899494936611666 2024-06-19 19:38:06,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=48352.333333333336, ans=0.95 2024-06-19 19:38:07,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48352.333333333336, ans=0.1 2024-06-19 19:38:07,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=48352.333333333336, ans=0.0 2024-06-19 19:38:09,238 INFO [train.py:1028] (0/2) Epoch 3, batch 6150, loss[loss=0.5575, simple_loss=0.4569, pruned_loss=0.3291, over 10984.00 frames. ], tot_loss[loss=0.5219, simple_loss=0.4415, pruned_loss=0.3011, over 2579372.73 frames. ], batch size: 304, lr: 1.71e-02, grad_scale: 1.0 2024-06-19 19:38:15,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48389.0, ans=0.1 2024-06-19 19:38:17,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=48389.0, ans=0.0003502173913043482 2024-06-19 19:38:18,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.09 vs. limit=22.5 2024-06-19 19:38:28,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48425.666666666664, ans=0.1 2024-06-19 19:38:30,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.58 vs.
limit=15.0 2024-06-19 19:38:35,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48444.0, ans=0.125 2024-06-19 19:38:38,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=48444.0, ans=0.5 2024-06-19 19:38:43,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=48444.0, ans=0.0003382608695652184 2024-06-19 19:38:46,605 INFO [train.py:1028] (0/2) Epoch 3, batch 6200, loss[loss=0.5843, simple_loss=0.4973, pruned_loss=0.3357, over 13242.00 frames. ], tot_loss[loss=0.5237, simple_loss=0.4433, pruned_loss=0.302, over 2575937.08 frames. ], batch size: 89, lr: 1.70e-02, grad_scale: 2.0 2024-06-19 19:38:48,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48462.333333333336, ans=0.125 2024-06-19 19:38:52,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=48480.666666666664, ans=0.09899494936611666 2024-06-19 19:39:02,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=48499.0, ans=0.00032630434782608686 2024-06-19 19:39:06,067 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.250e+03 2.119e+03 2.354e+03 2.707e+03 9.483e+03, threshold=4.709e+03, percent-clipped=2.0 2024-06-19 19:39:07,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.31 vs. limit=10.0 2024-06-19 19:39:22,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2024-06-19 19:39:23,487 INFO [train.py:1028] (0/2) Epoch 3, batch 6250, loss[loss=0.503, simple_loss=0.433, pruned_loss=0.2864, over 13198.00 frames. ], tot_loss[loss=0.5248, simple_loss=0.4448, pruned_loss=0.3024, over 2569117.65 frames. ], batch size: 83, lr: 1.70e-02, grad_scale: 0.5 2024-06-19 19:39:33,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=48572.333333333336, ans=0.0 2024-06-19 19:39:33,434 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.70 vs. limit=22.5 2024-06-19 19:39:36,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48590.666666666664, ans=0.1 2024-06-19 19:39:48,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.00 vs. limit=12.0 2024-06-19 19:39:50,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2024-06-19 19:39:50,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.49 vs. 
limit=15.0 2024-06-19 19:39:51,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=48627.333333333336, ans=0.0 2024-06-19 19:39:52,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=48627.333333333336, ans=0.0 2024-06-19 19:39:56,538 INFO [train.py:1028] (0/2) Epoch 3, batch 6300, loss[loss=0.5194, simple_loss=0.4443, pruned_loss=0.2972, over 11173.00 frames. ], tot_loss[loss=0.5251, simple_loss=0.4458, pruned_loss=0.3022, over 2564285.35 frames. ], batch size: 16, lr: 1.70e-02, grad_scale: 1.0 2024-06-19 19:40:01,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.91 vs. limit=22.5 2024-06-19 19:40:02,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48664.0, ans=0.1 2024-06-19 19:40:07,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48664.0, ans=0.125 2024-06-19 19:40:10,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.14 vs. limit=15.0 2024-06-19 19:40:16,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.15 vs. limit=15.0 2024-06-19 19:40:17,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48700.666666666664, ans=0.0 2024-06-19 19:40:18,257 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+03 2.445e+03 3.043e+03 3.791e+03 8.866e+03, threshold=6.086e+03, percent-clipped=9.0 2024-06-19 19:40:19,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=48700.666666666664, ans=0.0 2024-06-19 19:40:25,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48719.0, ans=0.125 2024-06-19 19:40:30,107 INFO [train.py:1028] (0/2) Epoch 3, batch 6350, loss[loss=0.589, simple_loss=0.4945, pruned_loss=0.3418, over 12637.00 frames. ], tot_loss[loss=0.5244, simple_loss=0.4467, pruned_loss=0.301, over 2574099.62 frames. ], batch size: 202, lr: 1.70e-02, grad_scale: 0.5 2024-06-19 19:40:46,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=48774.0, ans=0.125 2024-06-19 19:40:59,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=48792.333333333336, ans=0.0 2024-06-19 19:41:07,202 INFO [train.py:1028] (0/2) Epoch 3, batch 6400, loss[loss=0.5455, simple_loss=0.4647, pruned_loss=0.3131, over 13263.00 frames. ], tot_loss[loss=0.5272, simple_loss=0.4494, pruned_loss=0.3025, over 2576022.60 frames. 
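], batch size: 67, lr: 1.70e-02, grad_scale: 1.0

The WARNING [optim.py:487] entries summarize adaptive gradient clipping: the five numbers after "grad-norm quartiles" are the 0/25/50/75/100th percentiles of recently observed gradient norms, the printed threshold tracks Clipping_scale times the median (2.0 * 2.756e+03 = 5.512e+03 in the warning just below), and percent-clipped is the share of recent steps whose norm exceeded it. A loose sketch of that bookkeeping (window size and update details are assumptions, not icefall's optim.py):

from collections import deque

import torch

class GradNormClipper:
    def __init__(self, window: int = 200, clipping_scale: float = 2.0):
        self.norms = deque(maxlen=window)    # recent pre-clip gradient norms
        self.clipping_scale = clipping_scale
        self.num_clipped = 0

    def step(self, params) -> None:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        self.norms.append(float(norm))
        hist = torch.tensor(list(self.norms))
        pct = [torch.quantile(hist, q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * pct[2]    # scale * median, as in the log
        if norm > threshold:
            for g in grads:
                g.mul_(threshold / norm)            # shrink norm to the threshold
            self.num_clipped += 1
        print(f"grad-norm quartiles {pct}, threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * self.num_clipped / len(self.norms):.1f}")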
2024-06-19 19:41:10,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=48829.0, ans=0.2 2024-06-19 19:41:13,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48847.333333333336, ans=0.1 2024-06-19 19:41:22,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=48847.333333333336, ans=0.95 2024-06-19 19:41:32,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.39 vs. limit=15.0 2024-06-19 19:41:33,201 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.413e+03 2.468e+03 2.756e+03 3.196e+03 9.569e+03, threshold=5.512e+03, percent-clipped=2.0 2024-06-19 19:41:33,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48884.0, ans=0.0 2024-06-19 19:41:38,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=48902.333333333336, ans=0.125 2024-06-19 19:41:40,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=48902.333333333336, ans=0.1 2024-06-19 19:41:45,163 INFO [train.py:1028] (0/2) Epoch 3, batch 6450, loss[loss=0.607, simple_loss=0.5027, pruned_loss=0.3557, over 12580.00 frames. ], tot_loss[loss=0.53, simple_loss=0.4516, pruned_loss=0.3042, over 2581983.63 frames. ], batch size: 202, lr: 1.70e-02, grad_scale: 1.0 2024-06-19 19:41:50,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2024-06-19 19:41:52,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48939.0, ans=0.1 2024-06-19 19:41:53,672 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-19 19:41:54,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=48939.0, ans=0.00023065217391304328 2024-06-19 19:42:00,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=48957.333333333336, ans=0.125 2024-06-19 19:42:02,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48957.333333333336, ans=0.125 2024-06-19 19:42:03,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=48957.333333333336, ans=0.0 2024-06-19 19:42:07,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.65 vs.
limit=15.0 2024-06-19 19:42:17,772 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-19 19:42:17,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=49012.333333333336, ans=0.00021471014492753515 2024-06-19 19:42:18,264 INFO [train.py:1028] (0/2) Epoch 3, batch 6500, loss[loss=0.5498, simple_loss=0.4583, pruned_loss=0.3207, over 10719.00 frames. ], tot_loss[loss=0.5305, simple_loss=0.4531, pruned_loss=0.304, over 2586596.67 frames. ], batch size: 303, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:42:18,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=49012.333333333336, ans=0.1 2024-06-19 19:42:23,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=49012.333333333336, ans=0.125 2024-06-19 19:42:28,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=49030.666666666664, ans=0.2 2024-06-19 19:42:28,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49030.666666666664, ans=0.1 2024-06-19 19:42:33,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=49049.0, ans=0.125 2024-06-19 19:42:39,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.445e+03 2.653e+03 3.062e+03 3.365e+03 9.178e+03, threshold=6.123e+03, percent-clipped=3.0 2024-06-19 19:42:51,148 INFO [train.py:1028] (0/2) Epoch 3, batch 6550, loss[loss=0.554, simple_loss=0.4783, pruned_loss=0.3149, over 12511.00 frames. ], tot_loss[loss=0.5321, simple_loss=0.4547, pruned_loss=0.3047, over 2590406.07 frames. ], batch size: 22, lr: 1.69e-02, grad_scale: 1.0 2024-06-19 19:42:51,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.21 vs. limit=10.0 2024-06-19 19:42:51,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=49104.0, ans=0.125 2024-06-19 19:42:53,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=49104.0, ans=0.125 2024-06-19 19:43:09,246 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=4.912e+02 2024-06-19 19:43:12,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.29 vs. limit=22.5 2024-06-19 19:43:14,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2024-06-19 19:43:20,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.19 vs. 
limit=22.5 2024-06-19 19:43:23,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49177.333333333336, ans=0.1 2024-06-19 19:43:29,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=49177.333333333336, ans=0.125 2024-06-19 19:43:30,222 INFO [train.py:1028] (0/2) Epoch 3, batch 6600, loss[loss=0.5056, simple_loss=0.4401, pruned_loss=0.2855, over 13211.00 frames. ], tot_loss[loss=0.5324, simple_loss=0.4553, pruned_loss=0.3048, over 2591836.27 frames. ], batch size: 72, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:43:32,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=49195.666666666664, ans=0.125 2024-06-19 19:43:32,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49195.666666666664, ans=0.1 2024-06-19 19:43:35,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=49195.666666666664, ans=0.5 2024-06-19 19:43:40,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=49214.0, ans=0.125 2024-06-19 19:43:41,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2024-06-19 19:43:42,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=49214.0, ans=0.125 2024-06-19 19:43:45,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.04 vs. limit=22.5 2024-06-19 19:43:49,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=49250.666666666664, ans=0.125 2024-06-19 19:43:53,023 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.571e+03 2.337e+03 2.747e+03 3.259e+03 5.199e+03, threshold=5.494e+03, percent-clipped=0.0 2024-06-19 19:44:00,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=49269.0, ans=0.0 2024-06-19 19:44:03,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49287.333333333336, ans=0.125 2024-06-19 19:44:03,439 INFO [train.py:1028] (0/2) Epoch 3, batch 6650, loss[loss=0.5679, simple_loss=0.4806, pruned_loss=0.3276, over 12917.00 frames. ], tot_loss[loss=0.5333, simple_loss=0.4567, pruned_loss=0.3049, over 2584402.44 frames. 
], batch size: 158, lr: 1.69e-02, grad_scale: 1.0 2024-06-19 19:44:03,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=49287.333333333336, ans=0.2 2024-06-19 19:44:11,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=49305.666666666664, ans=0.025 2024-06-19 19:44:13,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=49305.666666666664, ans=12.0 2024-06-19 19:44:14,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=49305.666666666664, ans=0.1 2024-06-19 19:44:23,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49342.333333333336, ans=0.1 2024-06-19 19:44:24,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=49342.333333333336, ans=0.025 2024-06-19 19:44:29,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=49342.333333333336, ans=0.025 2024-06-19 19:44:33,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.37 vs. limit=15.0 2024-06-19 19:44:37,264 INFO [train.py:1028] (0/2) Epoch 3, batch 6700, loss[loss=0.5816, simple_loss=0.4848, pruned_loss=0.3392, over 12796.00 frames. ], tot_loss[loss=0.5327, simple_loss=0.4569, pruned_loss=0.3043, over 2584205.97 frames. ], batch size: 176, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:44:42,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49379.0, ans=0.125 2024-06-19 19:44:47,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=49397.333333333336, ans=0.0001310144927536231 2024-06-19 19:44:53,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=49415.666666666664, ans=0.0001270289855072465 2024-06-19 19:45:00,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=49434.0, ans=0.0001230434782608699 2024-06-19 19:45:04,095 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+03 2.205e+03 2.587e+03 2.916e+03 9.266e+03, threshold=5.174e+03, percent-clipped=2.0 2024-06-19 19:45:07,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=49452.333333333336, ans=0.125 2024-06-19 19:45:14,486 INFO [train.py:1028] (0/2) Epoch 3, batch 6750, loss[loss=0.6432, simple_loss=0.5207, pruned_loss=0.3828, over 12109.00 frames. ], tot_loss[loss=0.5325, simple_loss=0.4572, pruned_loss=0.3039, over 2578320.98 frames. ], batch size: 240, lr: 1.69e-02, grad_scale: 1.0 2024-06-19 19:45:17,368 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.10 vs. 
limit=22.5 2024-06-19 19:45:29,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=49489.0, ans=0.125 2024-06-19 19:45:35,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=49507.333333333336, ans=0.2 2024-06-19 19:45:36,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=49525.666666666664, ans=0.0 2024-06-19 19:45:37,075 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2024-06-19 19:45:39,043 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.21 vs. limit=22.5 2024-06-19 19:45:39,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=15.0 2024-06-19 19:45:50,691 INFO [train.py:1028] (0/2) Epoch 3, batch 6800, loss[loss=0.5092, simple_loss=0.4364, pruned_loss=0.291, over 13231.00 frames. ], tot_loss[loss=0.5332, simple_loss=0.4587, pruned_loss=0.3039, over 2579537.01 frames. ], batch size: 67, lr: 1.69e-02, grad_scale: 2.0 2024-06-19 19:45:53,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.85 vs. limit=15.0 2024-06-19 19:45:55,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49562.333333333336, ans=0.125 2024-06-19 19:46:03,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=49599.0, ans=0.1 2024-06-19 19:46:05,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=49599.0, ans=0.125 2024-06-19 19:46:10,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49617.333333333336, ans=0.1 2024-06-19 19:46:14,851 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.661e+03 2.718e+03 3.103e+03 3.677e+03 1.008e+04, threshold=6.206e+03, percent-clipped=6.0 2024-06-19 19:46:19,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=49635.666666666664, ans=0.015 2024-06-19 19:46:19,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=49635.666666666664, ans=0.125 2024-06-19 19:46:23,652 INFO [train.py:1028] (0/2) Epoch 3, batch 6850, loss[loss=0.5687, simple_loss=0.4957, pruned_loss=0.3209, over 13273.00 frames. ], tot_loss[loss=0.533, simple_loss=0.4591, pruned_loss=0.3034, over 2583350.42 frames. ], batch size: 63, lr: 1.68e-02, grad_scale: 0.5 2024-06-19 19:46:35,983 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=3.864e+00 2024-06-19 19:46:40,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. 
limit=6.0 2024-06-19 19:46:52,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.88 vs. limit=15.0 2024-06-19 19:46:57,475 INFO [train.py:1028] (0/2) Epoch 3, batch 6900, loss[loss=0.4984, simple_loss=0.4383, pruned_loss=0.2793, over 13006.00 frames. ], tot_loss[loss=0.5326, simple_loss=0.4595, pruned_loss=0.3028, over 2585200.45 frames. ], batch size: 48, lr: 1.68e-02, grad_scale: 1.0 2024-06-19 19:46:58,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=49745.666666666664, ans=0.1 2024-06-19 19:47:02,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=49745.666666666664, ans=0.2 2024-06-19 19:47:19,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=49782.333333333336, ans=22.5 2024-06-19 19:47:25,318 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+03 2.568e+03 2.922e+03 3.526e+03 1.182e+04, threshold=5.843e+03, percent-clipped=5.0 2024-06-19 19:47:25,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=49800.666666666664, ans=0.125 2024-06-19 19:47:38,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.82 vs. limit=5.0 2024-06-19 19:47:38,582 INFO [train.py:1028] (0/2) Epoch 3, batch 6950, loss[loss=0.51, simple_loss=0.4446, pruned_loss=0.2877, over 11393.00 frames. ], tot_loss[loss=0.5305, simple_loss=0.4586, pruned_loss=0.3012, over 2578105.03 frames. ], batch size: 16, lr: 1.68e-02, grad_scale: 1.0 2024-06-19 19:47:39,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.08 vs. limit=5.0 2024-06-19 19:48:00,952 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.82 vs. limit=22.5 2024-06-19 19:48:11,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=49929.0, ans=0.125 2024-06-19 19:48:11,851 INFO [train.py:1028] (0/2) Epoch 3, batch 7000, loss[loss=0.5969, simple_loss=0.4936, pruned_loss=0.3501, over 12910.00 frames. ], tot_loss[loss=0.5306, simple_loss=0.4587, pruned_loss=0.3012, over 2576160.37 frames. 
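], batch size: 158, lr: 1.68e-02, grad_scale: 2.0

The ScheduledFloat entries, by far the most frequent line type in this section, record per-module hyperparameters (dropout probabilities, skip rates, balancer and attention limits) whose values are functions of batch_count; by this point in training most have settled on their final plateaus (ans=0.1, ans=0.125, ans=0.0, and so on). A minimal sketch of such a piecewise-linear schedule (the breakpoints below are invented for illustration, not the recipe's actual numbers):

import bisect

class ScheduledValue:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count.
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        # Linear interpolation between breakpoints, clamped at both ends.
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a dropout decaying from 0.3 to 0.1 over the first 20k batches
dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(50332.3))   # -> 0.1, matching encoder_embed.dropout.p nearby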
2024-06-19 19:48:11,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=49929.0, ans=0.125 2024-06-19 19:48:18,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49947.333333333336, ans=0.125 2024-06-19 19:48:28,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49965.666666666664, ans=0.125 2024-06-19 19:48:37,289 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.217e+03 1.891e+03 2.256e+03 2.678e+03 7.422e+03, threshold=4.512e+03, percent-clipped=1.0 2024-06-19 19:48:40,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=50002.333333333336, ans=0.125 2024-06-19 19:48:45,704 INFO [train.py:1028] (0/2) Epoch 3, batch 7050, loss[loss=0.5573, simple_loss=0.4748, pruned_loss=0.3199, over 12830.00 frames. ], tot_loss[loss=0.5314, simple_loss=0.4603, pruned_loss=0.3013, over 2583153.53 frames. ], batch size: 176, lr: 1.68e-02, grad_scale: 1.0 2024-06-19 19:48:45,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50020.666666666664, ans=0.125 2024-06-19 19:48:46,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=50020.666666666664, ans=0.125 2024-06-19 19:48:50,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.81 vs. limit=15.0 2024-06-19 19:49:00,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=50057.333333333336, ans=0.025 2024-06-19 19:49:01,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=50057.333333333336, ans=0.0 2024-06-19 19:49:04,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=50075.666666666664, ans=0.125 2024-06-19 19:49:21,162 INFO [train.py:1028] (0/2) Epoch 3, batch 7100, loss[loss=0.5206, simple_loss=0.4537, pruned_loss=0.2937, over 13166.00 frames. ], tot_loss[loss=0.5294, simple_loss=0.4595, pruned_loss=0.2997, over 2575056.29 frames. ], batch size: 112, lr: 1.68e-02, grad_scale: 2.0 2024-06-19 19:49:24,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=50112.333333333336, ans=0.2 2024-06-19 19:49:26,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.28 vs. limit=12.0 2024-06-19 19:49:29,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=50130.666666666664, ans=0.2 2024-06-19 19:49:31,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.22 vs.
limit=22.5 2024-06-19 19:49:37,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=50130.666666666664, ans=0.07 2024-06-19 19:49:50,853 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.390e+03 1.978e+03 2.223e+03 2.581e+03 4.326e+03, threshold=4.446e+03, percent-clipped=0.0 2024-06-19 19:49:51,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2024-06-19 19:49:58,802 INFO [train.py:1028] (0/2) Epoch 3, batch 7150, loss[loss=0.6041, simple_loss=0.507, pruned_loss=0.3506, over 12559.00 frames. ], tot_loss[loss=0.5278, simple_loss=0.4592, pruned_loss=0.2982, over 2573783.22 frames. ], batch size: 202, lr: 1.68e-02, grad_scale: 2.0 2024-06-19 19:50:04,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.65 vs. limit=22.5 2024-06-19 19:50:07,099 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2024-06-19 19:50:08,136 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=6.086e+01 2024-06-19 19:50:09,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=50222.333333333336, ans=15.0 2024-06-19 19:50:10,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=50222.333333333336, ans=0.1 2024-06-19 19:50:16,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.39 vs. limit=10.0 2024-06-19 19:50:21,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.64 vs. limit=15.0 2024-06-19 19:50:21,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=50259.0, ans=15.0 2024-06-19 19:50:25,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=50277.333333333336, ans=0.125 2024-06-19 19:50:29,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=50277.333333333336, ans=0.0 2024-06-19 19:50:31,533 INFO [train.py:1028] (0/2) Epoch 3, batch 7200, loss[loss=0.5635, simple_loss=0.488, pruned_loss=0.3195, over 13164.00 frames. ], tot_loss[loss=0.5276, simple_loss=0.4595, pruned_loss=0.2978, over 2578616.42 frames. ], batch size: 112, lr: 1.67e-02, grad_scale: 2.0 2024-06-19 19:50:32,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.78 vs. limit=15.0 2024-06-19 19:50:32,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50295.666666666664, ans=0.1 2024-06-19 19:50:33,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=13.23 vs. 
limit=12.0 2024-06-19 19:50:35,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=50295.666666666664, ans=0.2 2024-06-19 19:50:38,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50314.0, ans=0.1 2024-06-19 19:50:45,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=50332.333333333336, ans=0.025 2024-06-19 19:50:47,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=50332.333333333336, ans=0.1 2024-06-19 19:50:49,648 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.49 vs. limit=15.0 2024-06-19 19:50:51,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.82 vs. limit=15.0 2024-06-19 19:50:53,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=50350.666666666664, ans=0.125 2024-06-19 19:50:57,897 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+03 2.198e+03 2.572e+03 2.959e+03 1.036e+04, threshold=5.145e+03, percent-clipped=4.0 2024-06-19 19:51:04,811 INFO [train.py:1028] (0/2) Epoch 3, batch 7250, loss[loss=0.5044, simple_loss=0.4537, pruned_loss=0.2776, over 12932.00 frames. ], tot_loss[loss=0.5251, simple_loss=0.4589, pruned_loss=0.2956, over 2579622.57 frames. ], batch size: 36, lr: 1.67e-02, grad_scale: 1.0 2024-06-19 19:51:06,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=50387.333333333336, ans=0.125 2024-06-19 19:51:35,714 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.023e+01 2024-06-19 19:51:38,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=50460.666666666664, ans=0.0 2024-06-19 19:51:38,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=50460.666666666664, ans=0.125 2024-06-19 19:51:38,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50460.666666666664, ans=0.125 2024-06-19 19:51:40,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.03 vs. limit=22.5 2024-06-19 19:51:41,572 INFO [train.py:1028] (0/2) Epoch 3, batch 7300, loss[loss=0.5461, simple_loss=0.4803, pruned_loss=0.306, over 12910.00 frames. ], tot_loss[loss=0.5246, simple_loss=0.4589, pruned_loss=0.2951, over 2579131.94 frames. 
], batch size: 36, lr: 1.67e-02, grad_scale: 2.0 2024-06-19 19:51:42,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=50479.0, ans=0.0 2024-06-19 19:51:58,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=50515.666666666664, ans=0.2 2024-06-19 19:52:00,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50515.666666666664, ans=0.125 2024-06-19 19:52:01,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=50515.666666666664, ans=0.125 2024-06-19 19:52:07,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=50534.0, ans=0.95 2024-06-19 19:52:12,625 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.381e+03 2.035e+03 2.430e+03 2.938e+03 7.032e+03, threshold=4.859e+03, percent-clipped=2.0 2024-06-19 19:52:17,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=15.0 2024-06-19 19:52:17,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.65 vs. limit=22.5 2024-06-19 19:52:18,155 INFO [train.py:1028] (0/2) Epoch 3, batch 7350, loss[loss=0.4845, simple_loss=0.4399, pruned_loss=0.2646, over 13321.00 frames. ], tot_loss[loss=0.5234, simple_loss=0.4586, pruned_loss=0.2941, over 2580569.36 frames. ], batch size: 46, lr: 1.67e-02, grad_scale: 0.5 2024-06-19 19:52:27,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.07 vs. limit=10.0 2024-06-19 19:52:43,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.79 vs. limit=10.0 2024-06-19 19:52:45,457 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.14 vs. limit=15.0 2024-06-19 19:52:45,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.21 vs. limit=15.0 2024-06-19 19:52:45,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=50644.0, ans=0.0 2024-06-19 19:52:51,395 INFO [train.py:1028] (0/2) Epoch 3, batch 7400, loss[loss=0.5984, simple_loss=0.5212, pruned_loss=0.3378, over 13265.00 frames. ], tot_loss[loss=0.5242, simple_loss=0.4598, pruned_loss=0.2943, over 2586595.22 frames. 
], batch size: 63, lr: 1.67e-02, grad_scale: 1.0 2024-06-19 19:53:02,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=50680.666666666664, ans=0.0 2024-06-19 19:53:04,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=50699.0, ans=0.0 2024-06-19 19:53:07,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=50699.0, ans=0.0 2024-06-19 19:53:07,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=50699.0, ans=0.125 2024-06-19 19:53:14,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50717.333333333336, ans=0.1 2024-06-19 19:53:15,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=50717.333333333336, ans=0.125 2024-06-19 19:53:18,810 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.18 vs. limit=10.0 2024-06-19 19:53:19,806 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.169e+03 1.896e+03 2.230e+03 2.793e+03 9.441e+03, threshold=4.460e+03, percent-clipped=1.0 2024-06-19 19:53:24,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=50754.0, ans=0.2 2024-06-19 19:53:25,311 INFO [train.py:1028] (0/2) Epoch 3, batch 7450, loss[loss=0.4898, simple_loss=0.4378, pruned_loss=0.2709, over 12720.00 frames. ], tot_loss[loss=0.5233, simple_loss=0.4597, pruned_loss=0.2934, over 2579920.82 frames. ], batch size: 29, lr: 1.67e-02, grad_scale: 1.0 2024-06-19 19:53:37,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.35 vs. limit=15.0 2024-06-19 19:53:56,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.95 vs. limit=22.5 2024-06-19 19:53:56,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=50809.0, ans=0.2 2024-06-19 19:54:05,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.34 vs. limit=15.0 2024-06-19 19:54:06,481 INFO [train.py:1028] (0/2) Epoch 3, batch 7500, loss[loss=0.5465, simple_loss=0.4602, pruned_loss=0.3163, over 10556.00 frames. ], tot_loss[loss=0.5233, simple_loss=0.4603, pruned_loss=0.2932, over 2577756.79 frames. 
], batch size: 303, lr: 1.67e-02, grad_scale: 2.0 2024-06-19 19:54:08,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=50845.666666666664, ans=0.0 2024-06-19 19:54:10,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=50845.666666666664, ans=0.0 2024-06-19 19:54:11,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=50845.666666666664, ans=0.0 2024-06-19 19:54:16,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=50864.0, ans=0.125 2024-06-19 19:54:19,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=50882.333333333336, ans=0.125 2024-06-19 19:54:24,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=50882.333333333336, ans=0.125 2024-06-19 19:54:30,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50900.666666666664, ans=0.125 2024-06-19 19:54:34,402 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.688e+02 1.675e+03 1.923e+03 2.263e+03 4.119e+03, threshold=3.846e+03, percent-clipped=0.0 2024-06-19 19:54:35,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.46 vs. limit=15.0 2024-06-19 19:54:39,023 INFO [train.py:1028] (0/2) Epoch 3, batch 7550, loss[loss=0.5432, simple_loss=0.4676, pruned_loss=0.3094, over 12975.00 frames. ], tot_loss[loss=0.5225, simple_loss=0.4598, pruned_loss=0.2926, over 2577488.26 frames. ], batch size: 158, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:54:40,651 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=25.94 vs. limit=22.5 2024-06-19 19:54:40,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=50937.333333333336, ans=0.125 2024-06-19 19:54:47,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50955.666666666664, ans=0.1 2024-06-19 19:54:51,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=50974.0, ans=0.07 2024-06-19 19:55:00,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50992.333333333336, ans=0.1 2024-06-19 19:55:12,576 INFO [train.py:1028] (0/2) Epoch 3, batch 7600, loss[loss=0.4753, simple_loss=0.4276, pruned_loss=0.2615, over 13185.00 frames. ], tot_loss[loss=0.5235, simple_loss=0.4607, pruned_loss=0.2931, over 2577011.06 frames. ], batch size: 83, lr: 1.66e-02, grad_scale: 2.0 2024-06-19 19:55:20,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=51047.333333333336, ans=0.125 2024-06-19 19:55:23,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.76 vs. 
limit=5.0 2024-06-19 19:55:34,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=51065.666666666664, ans=0.125 2024-06-19 19:55:35,816 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0 2024-06-19 19:55:36,907 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 19:55:41,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-19 19:55:47,105 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.118e+03 2.116e+03 2.487e+03 3.103e+03 7.993e+03, threshold=4.974e+03, percent-clipped=9.0 2024-06-19 19:55:52,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51102.333333333336, ans=0.1 2024-06-19 19:55:53,822 INFO [train.py:1028] (0/2) Epoch 3, batch 7650, loss[loss=0.4979, simple_loss=0.4474, pruned_loss=0.2742, over 12866.00 frames. ], tot_loss[loss=0.522, simple_loss=0.46, pruned_loss=0.292, over 2573583.85 frames. ], batch size: 33, lr: 1.66e-02, grad_scale: 0.5 2024-06-19 19:55:54,043 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.910e+01 2024-06-19 19:55:57,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=15.0 2024-06-19 19:55:57,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51120.666666666664, ans=0.1 2024-06-19 19:56:01,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.85 vs. limit=22.5 2024-06-19 19:56:03,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=51139.0, ans=0.2 2024-06-19 19:56:07,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=51157.333333333336, ans=0.025 2024-06-19 19:56:07,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2024-06-19 19:56:08,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.49 vs. 
limit=15.0 2024-06-19 19:56:10,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=51157.333333333336, ans=22.5 2024-06-19 19:56:10,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51157.333333333336, ans=0.125 2024-06-19 19:56:18,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=51175.666666666664, ans=0.0 2024-06-19 19:56:20,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51175.666666666664, ans=0.1 2024-06-19 19:56:20,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=51194.0, ans=0.125 2024-06-19 19:56:21,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=51194.0, ans=0.125 2024-06-19 19:56:25,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51194.0, ans=0.125 2024-06-19 19:56:27,803 INFO [train.py:1028] (0/2) Epoch 3, batch 7700, loss[loss=0.5296, simple_loss=0.4708, pruned_loss=0.2942, over 13296.00 frames. ], tot_loss[loss=0.521, simple_loss=0.4596, pruned_loss=0.2911, over 2570293.69 frames. ], batch size: 63, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:56:40,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=51249.0, ans=0.125 2024-06-19 19:56:42,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=51249.0, ans=0.0 2024-06-19 19:56:43,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51249.0, ans=0.125 2024-06-19 19:56:56,463 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.292e+03 2.060e+03 2.328e+03 2.775e+03 5.827e+03, threshold=4.656e+03, percent-clipped=1.0 2024-06-19 19:56:57,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=51285.666666666664, ans=0.0 2024-06-19 19:56:58,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=51285.666666666664, ans=0.2 2024-06-19 19:56:59,599 INFO [train.py:1028] (0/2) Epoch 3, batch 7750, loss[loss=0.5045, simple_loss=0.4572, pruned_loss=0.2759, over 13268.00 frames. ], tot_loss[loss=0.5208, simple_loss=0.4599, pruned_loss=0.2908, over 2574224.37 frames. ], batch size: 72, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:57:02,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=51304.0, ans=0.125 2024-06-19 19:57:04,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.36 vs. 
limit=10.0 2024-06-19 19:57:09,868 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-28000.pt 2024-06-19 19:57:31,544 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 19:57:38,024 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=9.473e+00 2024-06-19 19:57:41,371 INFO [train.py:1028] (0/2) Epoch 3, batch 7800, loss[loss=0.5344, simple_loss=0.4703, pruned_loss=0.2992, over 13128.00 frames. ], tot_loss[loss=0.5209, simple_loss=0.4605, pruned_loss=0.2906, over 2578113.96 frames. ], batch size: 95, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:57:44,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=51395.666666666664, ans=0.0 2024-06-19 19:57:51,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.95 vs. limit=15.0 2024-06-19 19:57:51,644 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.32 vs. limit=15.0 2024-06-19 19:57:52,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=51414.0, ans=0.125 2024-06-19 19:57:55,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2024-06-19 19:58:15,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=51469.0, ans=0.0 2024-06-19 19:58:16,581 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.471e+03 2.260e+03 2.550e+03 2.903e+03 7.779e+03, threshold=5.101e+03, percent-clipped=7.0 2024-06-19 19:58:17,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=51469.0, ans=0.025 2024-06-19 19:58:18,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51487.333333333336, ans=0.1 2024-06-19 19:58:18,625 INFO [train.py:1028] (0/2) Epoch 3, batch 7850, loss[loss=0.3851, simple_loss=0.3664, pruned_loss=0.2019, over 11775.00 frames. ], tot_loss[loss=0.5226, simple_loss=0.4622, pruned_loss=0.2915, over 2572531.99 frames. ], batch size: 17, lr: 1.66e-02, grad_scale: 0.5 2024-06-19 19:58:20,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=51487.333333333336, ans=0.125 2024-06-19 19:58:27,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=51505.666666666664, ans=0.025 2024-06-19 19:58:41,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=51542.333333333336, ans=0.025 2024-06-19 19:58:48,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.20 vs. limit=15.0 2024-06-19 19:58:50,550 INFO [train.py:1028] (0/2) Epoch 3, batch 7900, loss[loss=0.488, simple_loss=0.4479, pruned_loss=0.264, over 13138.00 frames. ], tot_loss[loss=0.5221, simple_loss=0.4621, pruned_loss=0.2911, over 2571870.60 frames. 
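[editor's note] The `checkpoint.py:75` line above records a periodic save at global batch 28000 (`zipformer/exp/checkpoint-28000.pt`). A minimal sketch of that kind of cadence-based checkpointing follows — the helper name, dict layout, and cadence handling are assumptions, not icefall's actual checkpoint code:

```python
# Sketch of periodic checkpointing, as suggested by the
# "Saving checkpoint to zipformer/exp/checkpoint-28000.pt" line above.
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, every_n: int) -> None:
    """Save model/optimizer state every `every_n` training batches."""
    if batch_idx_train == 0 or batch_idx_train % every_n != 0:
        return
    ckpt = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    # filenames keyed on the global batch index, matching the log line
    torch.save(ckpt, exp_dir / f"checkpoint-{batch_idx_train}.pt")
```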
], batch size: 77, lr: 1.66e-02, grad_scale: 1.0 2024-06-19 19:58:57,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.45 vs. limit=10.0 2024-06-19 19:58:59,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=51597.333333333336, ans=0.125 2024-06-19 19:59:10,096 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.94 vs. limit=10.0 2024-06-19 19:59:16,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51652.333333333336, ans=0.1 2024-06-19 19:59:25,364 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.213e+03 2.039e+03 2.484e+03 2.855e+03 6.061e+03, threshold=4.968e+03, percent-clipped=2.0 2024-06-19 19:59:26,780 INFO [train.py:1028] (0/2) Epoch 3, batch 7950, loss[loss=0.535, simple_loss=0.456, pruned_loss=0.3069, over 10585.00 frames. ], tot_loss[loss=0.5201, simple_loss=0.461, pruned_loss=0.2896, over 2575465.78 frames. ], batch size: 304, lr: 1.65e-02, grad_scale: 0.5 2024-06-19 19:59:30,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=51670.666666666664, ans=0.125 2024-06-19 19:59:34,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=51689.0, ans=0.125 2024-06-19 19:59:37,263 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=6.105e+02 2024-06-19 19:59:38,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51689.0, ans=0.125 2024-06-19 19:59:39,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.26 vs. limit=15.0 2024-06-19 19:59:42,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.30 vs. limit=15.0 2024-06-19 19:59:52,634 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=4.108e+00 2024-06-19 19:59:58,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=51744.0, ans=0.025 2024-06-19 20:00:04,101 INFO [train.py:1028] (0/2) Epoch 3, batch 8000, loss[loss=0.5587, simple_loss=0.4882, pruned_loss=0.3146, over 12536.00 frames. ], tot_loss[loss=0.5206, simple_loss=0.4615, pruned_loss=0.2899, over 2572285.64 frames. 
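[editor's note] On the recurring `Whitening: name=..., metric=X vs. limit=Y` lines: the metric measures how far a module's feature covariance is from a multiple of the identity (it is 1.0 for perfectly whitened features), and a line is logged when it exceeds the configured limit. A hedged sketch of one such metric — this mirrors the general idea behind the logged values, not necessarily scaling.py's exact formula:

```python
# Sketch of a whitening metric like the one logged above: for feature
# covariance C, mean(diag(C @ C)) / mean(diag(C))**2 equals 1.0 when C is
# proportional to the identity and grows with the eigenvalue spread.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) feature matrix."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                # (C, C) covariance
    mean_diag = cov.diagonal().mean()             # mean eigenvalue
    mean_diag_sq = (cov @ cov).diagonal().mean()  # mean squared eigenvalue
    return mean_diag_sq / (mean_diag ** 2 + 1e-20)

white = torch.randn(1000, 192)                    # near-whitened features
print(whitening_metric(white))                    # close to 1.0
print(whitening_metric(white * torch.rand(192)))  # larger: channels unbalanced
# a check like `if metric > limit: log(...)` yields lines such as
# "metric=10.45 vs. limit=10.0" seen above
```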
], batch size: 29, lr: 1.65e-02, grad_scale: 1.0 2024-06-19 20:00:14,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=51780.666666666664, ans=0.125 2024-06-19 20:00:20,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=51799.0, ans=0.0 2024-06-19 20:00:21,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=51799.0, ans=0.95 2024-06-19 20:00:27,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=51817.333333333336, ans=0.125 2024-06-19 20:00:37,783 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.045e+03 2.037e+03 2.546e+03 3.145e+03 7.740e+03, threshold=5.092e+03, percent-clipped=2.0 2024-06-19 20:00:39,164 INFO [train.py:1028] (0/2) Epoch 3, batch 8050, loss[loss=0.4857, simple_loss=0.445, pruned_loss=0.2632, over 13218.00 frames. ], tot_loss[loss=0.5201, simple_loss=0.4612, pruned_loss=0.2894, over 2572316.99 frames. ], batch size: 83, lr: 1.65e-02, grad_scale: 1.0 2024-06-19 20:00:40,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=12.0 2024-06-19 20:00:43,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=51854.0, ans=0.125 2024-06-19 20:00:47,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=51872.333333333336, ans=0.125 2024-06-19 20:00:53,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=51890.666666666664, ans=10.0 2024-06-19 20:00:54,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=51890.666666666664, ans=0.025 2024-06-19 20:00:56,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=51890.666666666664, ans=0.125 2024-06-19 20:00:58,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=51909.0, ans=0.05 2024-06-19 20:01:03,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=12.21 vs. limit=12.0 2024-06-19 20:01:07,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=51927.333333333336, ans=0.125 2024-06-19 20:01:07,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.37 vs. limit=15.0 2024-06-19 20:01:10,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51927.333333333336, ans=0.125 2024-06-19 20:01:11,822 INFO [train.py:1028] (0/2) Epoch 3, batch 8100, loss[loss=0.5474, simple_loss=0.4789, pruned_loss=0.308, over 13118.00 frames. ], tot_loss[loss=0.5187, simple_loss=0.4606, pruned_loss=0.2884, over 2576398.25 frames. 
], batch size: 112, lr: 1.65e-02, grad_scale: 2.0 2024-06-19 20:01:12,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.28 vs. limit=15.0 2024-06-19 20:01:19,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=51964.0, ans=0.125 2024-06-19 20:01:32,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2024-06-19 20:01:34,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=52000.666666666664, ans=0.125 2024-06-19 20:01:37,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52000.666666666664, ans=0.125 2024-06-19 20:01:44,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=52019.0, ans=0.0 2024-06-19 20:01:51,663 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+03 2.296e+03 2.602e+03 2.906e+03 5.900e+03, threshold=5.204e+03, percent-clipped=2.0 2024-06-19 20:01:52,360 INFO [train.py:1028] (0/2) Epoch 3, batch 8150, loss[loss=0.5287, simple_loss=0.4652, pruned_loss=0.2962, over 13135.00 frames. ], tot_loss[loss=0.5161, simple_loss=0.4593, pruned_loss=0.2865, over 2581281.87 frames. ], batch size: 121, lr: 1.65e-02, grad_scale: 1.0 2024-06-19 20:01:53,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=52037.333333333336, ans=0.125 2024-06-19 20:01:53,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52037.333333333336, ans=0.125 2024-06-19 20:01:57,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. limit=6.0 2024-06-19 20:01:58,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=52055.666666666664, ans=0.0 2024-06-19 20:02:19,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=52110.666666666664, ans=0.125 2024-06-19 20:02:20,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=52110.666666666664, ans=0.025 2024-06-19 20:02:25,954 INFO [train.py:1028] (0/2) Epoch 3, batch 8200, loss[loss=0.5213, simple_loss=0.4663, pruned_loss=0.2882, over 13127.00 frames. ], tot_loss[loss=0.5161, simple_loss=0.46, pruned_loss=0.286, over 2584349.46 frames. 
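[editor's note] The `grad_scale` field in the batch summaries keeps moving among 0.5, 1.0, and 2.0, which is characteristic of dynamic loss scaling under fp16 training: the scale is cut when a step overflows and raised again after a run of stable steps. A rough sketch of that loop using standard `torch.cuda.amp` — the hyperparameter values shown are illustrative assumptions, not this run's actual settings:

```python
# Sketch of fp16 training with dynamic loss scaling, consistent with the
# grad_scale values (0.5 / 1.0 / 2.0) logged in the batch lines above.
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0,      # assumed starting point for illustration
    growth_factor=2.0,   # doubled after a stretch of overflow-free steps
    backoff_factor=0.5,  # halved when a step produces inf/nan grads
)

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if grads overflowed
    scaler.update()          # adjusts the scale for the next step
    return loss.detach(), scaler.get_scale()
```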
], batch size: 112, lr: 1.65e-02, grad_scale: 2.0 2024-06-19 20:02:38,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=52147.333333333336, ans=0.0 2024-06-19 20:02:43,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=52165.666666666664, ans=0.025 2024-06-19 20:02:44,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=52165.666666666664, ans=0.0 2024-06-19 20:02:44,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=52165.666666666664, ans=0.05 2024-06-19 20:03:00,668 INFO [train.py:1028] (0/2) Epoch 3, batch 8250, loss[loss=0.4724, simple_loss=0.4421, pruned_loss=0.2513, over 13258.00 frames. ], tot_loss[loss=0.5148, simple_loss=0.4594, pruned_loss=0.2851, over 2584783.62 frames. ], batch size: 52, lr: 1.65e-02, grad_scale: 0.5 2024-06-19 20:03:01,239 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.432e+03 2.041e+03 2.352e+03 2.928e+03 1.240e+04, threshold=4.704e+03, percent-clipped=5.0 2024-06-19 20:03:12,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52239.0, ans=0.1 2024-06-19 20:03:22,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.22 vs. limit=22.5 2024-06-19 20:03:28,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.11 vs. limit=22.5 2024-06-19 20:03:36,788 INFO [train.py:1028] (0/2) Epoch 3, batch 8300, loss[loss=0.4822, simple_loss=0.4362, pruned_loss=0.2641, over 13134.00 frames. ], tot_loss[loss=0.5111, simple_loss=0.457, pruned_loss=0.2826, over 2582476.99 frames. ], batch size: 103, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:03:37,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.44 vs. limit=15.0 2024-06-19 20:03:38,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=52312.333333333336, ans=0.125 2024-06-19 20:03:43,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=52330.666666666664, ans=0.05 2024-06-19 20:03:43,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52330.666666666664, ans=0.1 2024-06-19 20:03:48,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=52330.666666666664, ans=0.125 2024-06-19 20:03:54,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=52349.0, ans=0.2 2024-06-19 20:03:57,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52349.0, ans=0.1 2024-06-19 20:04:00,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.25 vs. 
limit=22.5 2024-06-19 20:04:09,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52385.666666666664, ans=0.1 2024-06-19 20:04:12,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52385.666666666664, ans=0.1 2024-06-19 20:04:12,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=52404.0, ans=0.2 2024-06-19 20:04:13,354 INFO [train.py:1028] (0/2) Epoch 3, batch 8350, loss[loss=0.5096, simple_loss=0.4612, pruned_loss=0.279, over 13176.00 frames. ], tot_loss[loss=0.5094, simple_loss=0.4563, pruned_loss=0.2812, over 2583168.77 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:04:13,886 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.728e+02 1.698e+03 2.012e+03 2.412e+03 8.628e+03, threshold=4.023e+03, percent-clipped=1.0 2024-06-19 20:04:19,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=15.0 2024-06-19 20:04:24,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5 2024-06-19 20:04:26,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52440.666666666664, ans=0.1 2024-06-19 20:04:26,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52440.666666666664, ans=0.125 2024-06-19 20:04:29,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=52440.666666666664, ans=0.09899494936611666 2024-06-19 20:04:47,156 INFO [train.py:1028] (0/2) Epoch 3, batch 8400, loss[loss=0.4657, simple_loss=0.4232, pruned_loss=0.2541, over 12896.00 frames. ], tot_loss[loss=0.5084, simple_loss=0.4557, pruned_loss=0.2805, over 2578769.05 frames. ], batch size: 39, lr: 1.64e-02, grad_scale: 2.0 2024-06-19 20:04:49,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=52495.666666666664, ans=0.0 2024-06-19 20:04:49,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=52495.666666666664, ans=0.025 2024-06-19 20:05:11,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52550.666666666664, ans=0.1 2024-06-19 20:05:14,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.33 vs. limit=12.0 2024-06-19 20:05:15,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=52569.0, ans=0.2 2024-06-19 20:05:15,435 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.80 vs. limit=22.5 2024-06-19 20:05:20,272 INFO [train.py:1028] (0/2) Epoch 3, batch 8450, loss[loss=0.484, simple_loss=0.4368, pruned_loss=0.2656, over 13172.00 frames. ], tot_loss[loss=0.5082, simple_loss=0.4563, pruned_loss=0.28, over 2580880.08 frames. 
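[editor's note] On the recurring `Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=...` warnings: the optimizer appears to derive its clipping threshold from the recent distribution of gradient norms rather than a fixed constant, and periodically reports that distribution's quartiles along with the fraction of batches clipped. A hedged sketch of such a scheme — buffer size, quantile choice, and bookkeeping are assumptions; the actual optim.py logic differs in detail:

```python
# Illustrative adaptive gradient clipping driven by the recent grad-norm
# distribution, in the spirit of the optim.py warnings above.
from collections import deque
import torch

class AdaptiveClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)   # recent total grad norms
        self.num_clipped = 0
        self.num_steps = 0

    def clip_(self, params) -> float:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g) for g in grads]))
        self.norms.append(norm.item())
        self.num_steps += 1
        # threshold = clipping_scale * median of recent grad norms
        threshold = self.clipping_scale * torch.tensor(
            list(self.norms)).median().item()
        if norm.item() > threshold:
            self.num_clipped += 1
            for g in grads:
                g.mul_(threshold / norm.item())
        return norm.item()

    def report(self) -> str:
        q = torch.tensor(list(self.norms)).quantile(
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        pct = 100.0 * self.num_clipped / max(1, self.num_steps)
        return f"grad-norm quartiles {q.tolist()}, percent-clipped={pct:.1f}"
```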
], batch size: 112, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:05:21,554 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.776e+02 1.637e+03 1.870e+03 2.198e+03 3.721e+03, threshold=3.739e+03, percent-clipped=0.0 2024-06-19 20:05:23,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.92 vs. limit=15.0 2024-06-19 20:05:28,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=52605.666666666664, ans=0.0 2024-06-19 20:06:01,371 INFO [train.py:1028] (0/2) Epoch 3, batch 8500, loss[loss=0.4947, simple_loss=0.4513, pruned_loss=0.269, over 12647.00 frames. ], tot_loss[loss=0.5092, simple_loss=0.4573, pruned_loss=0.2806, over 2579619.39 frames. ], batch size: 29, lr: 1.64e-02, grad_scale: 2.0 2024-06-19 20:06:02,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52679.0, ans=0.1 2024-06-19 20:06:03,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.89 vs. limit=15.0 2024-06-19 20:06:07,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.77 vs. limit=15.0 2024-06-19 20:06:24,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=52734.0, ans=0.2 2024-06-19 20:06:26,653 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.28 vs. limit=15.0 2024-06-19 20:06:36,155 INFO [train.py:1028] (0/2) Epoch 3, batch 8550, loss[loss=0.4455, simple_loss=0.4198, pruned_loss=0.2356, over 12575.00 frames. ], tot_loss[loss=0.5057, simple_loss=0.4554, pruned_loss=0.278, over 2577021.99 frames. ], batch size: 22, lr: 1.64e-02, grad_scale: 2.0 2024-06-19 20:06:37,397 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.013e+03 1.659e+03 2.071e+03 2.530e+03 8.065e+03, threshold=4.143e+03, percent-clipped=4.0 2024-06-19 20:06:47,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.72 vs. limit=22.5 2024-06-19 20:06:53,258 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. limit=15.0 2024-06-19 20:07:03,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=52844.0, ans=0.2 2024-06-19 20:07:05,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=18.18 vs. limit=15.0 2024-06-19 20:07:06,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.05 vs. limit=22.5 2024-06-19 20:07:09,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=52844.0, ans=0.125 2024-06-19 20:07:10,242 INFO [train.py:1028] (0/2) Epoch 3, batch 8600, loss[loss=0.4852, simple_loss=0.4386, pruned_loss=0.2659, over 13127.00 frames. 
], tot_loss[loss=0.5059, simple_loss=0.4555, pruned_loss=0.2781, over 2574344.84 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 1.0 2024-06-19 20:07:12,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.31 vs. limit=12.0 2024-06-19 20:07:28,673 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.77 vs. limit=12.0 2024-06-19 20:07:30,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=52917.333333333336, ans=0.125 2024-06-19 20:07:46,887 INFO [train.py:1028] (0/2) Epoch 3, batch 8650, loss[loss=0.5212, simple_loss=0.4722, pruned_loss=0.2851, over 13036.00 frames. ], tot_loss[loss=0.505, simple_loss=0.4555, pruned_loss=0.2772, over 2576711.00 frames. ], batch size: 102, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:07:49,403 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.340e+03 2.132e+03 2.660e+03 3.051e+03 5.530e+03, threshold=5.321e+03, percent-clipped=7.0 2024-06-19 20:08:13,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=53009.0, ans=0.015 2024-06-19 20:08:23,047 INFO [train.py:1028] (0/2) Epoch 3, batch 8700, loss[loss=0.4925, simple_loss=0.4549, pruned_loss=0.2651, over 13222.00 frames. ], tot_loss[loss=0.5084, simple_loss=0.4578, pruned_loss=0.2795, over 2574928.14 frames. ], batch size: 59, lr: 1.63e-02, grad_scale: 2.0 2024-06-19 20:08:39,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=53082.333333333336, ans=0.0 2024-06-19 20:08:44,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=53100.666666666664, ans=0.1 2024-06-19 20:08:45,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=53100.666666666664, ans=0.125 2024-06-19 20:08:46,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.02 vs. limit=22.5 2024-06-19 20:08:46,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=53100.666666666664, ans=0.05 2024-06-19 20:08:55,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.69 vs. limit=15.0 2024-06-19 20:08:57,110 INFO [train.py:1028] (0/2) Epoch 3, batch 8750, loss[loss=0.5053, simple_loss=0.4518, pruned_loss=0.2794, over 13166.00 frames. ], tot_loss[loss=0.5092, simple_loss=0.4583, pruned_loss=0.2801, over 2570449.70 frames. 
], batch size: 121, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:09:00,507 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+03 2.456e+03 2.786e+03 3.379e+03 7.357e+03, threshold=5.571e+03, percent-clipped=3.0 2024-06-19 20:09:00,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=53137.333333333336, ans=0.125 2024-06-19 20:09:03,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=53155.666666666664, ans=0.125 2024-06-19 20:09:13,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53174.0, ans=0.1 2024-06-19 20:09:13,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0 2024-06-19 20:09:17,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=53192.333333333336, ans=0.0 2024-06-19 20:09:31,033 INFO [train.py:1028] (0/2) Epoch 3, batch 8800, loss[loss=0.4847, simple_loss=0.4425, pruned_loss=0.2634, over 13114.00 frames. ], tot_loss[loss=0.5067, simple_loss=0.4568, pruned_loss=0.2783, over 2574747.56 frames. ], batch size: 71, lr: 1.63e-02, grad_scale: 0.5 2024-06-19 20:09:37,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=53229.0, ans=0.125 2024-06-19 20:09:53,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=53265.666666666664, ans=0.125 2024-06-19 20:09:55,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=53265.666666666664, ans=0.05 2024-06-19 20:09:56,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=53265.666666666664, ans=0.0 2024-06-19 20:09:58,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=53284.0, ans=0.025 2024-06-19 20:10:03,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53284.0, ans=0.1 2024-06-19 20:10:09,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=53302.333333333336, ans=0.0 2024-06-19 20:10:10,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-19 20:10:11,887 INFO [train.py:1028] (0/2) Epoch 3, batch 8850, loss[loss=0.576, simple_loss=0.4936, pruned_loss=0.3292, over 12596.00 frames. ], tot_loss[loss=0.5066, simple_loss=0.4564, pruned_loss=0.2784, over 2563763.23 frames. 
], batch size: 202, lr: 1.63e-02, grad_scale: 0.5 2024-06-19 20:10:16,723 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+03 2.764e+03 3.391e+03 3.865e+03 9.837e+03, threshold=6.782e+03, percent-clipped=6.0 2024-06-19 20:10:25,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=53357.333333333336, ans=0.0 2024-06-19 20:10:32,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.19 vs. limit=15.0 2024-06-19 20:10:35,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=53375.666666666664, ans=0.0 2024-06-19 20:10:41,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=53394.0, ans=0.0 2024-06-19 20:10:45,383 INFO [train.py:1028] (0/2) Epoch 3, batch 8900, loss[loss=0.4714, simple_loss=0.4412, pruned_loss=0.2508, over 12985.00 frames. ], tot_loss[loss=0.5075, simple_loss=0.4574, pruned_loss=0.2788, over 2562018.26 frames. ], batch size: 33, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:10:46,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=53412.333333333336, ans=0.125 2024-06-19 20:10:47,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0 2024-06-19 20:10:58,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.64 vs. limit=15.0 2024-06-19 20:11:13,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=53485.666666666664, ans=0.0 2024-06-19 20:11:18,351 INFO [train.py:1028] (0/2) Epoch 3, batch 8950, loss[loss=0.5424, simple_loss=0.478, pruned_loss=0.3034, over 12482.00 frames. ], tot_loss[loss=0.5062, simple_loss=0.457, pruned_loss=0.2777, over 2561493.95 frames. ], batch size: 202, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:11:19,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.32 vs. 
limit=10.0 2024-06-19 20:11:22,950 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.211e+03 1.959e+03 2.261e+03 2.595e+03 5.998e+03, threshold=4.523e+03, percent-clipped=0.0 2024-06-19 20:11:27,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=53522.333333333336, ans=0.125 2024-06-19 20:11:30,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=53522.333333333336, ans=0.125 2024-06-19 20:11:33,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=53540.666666666664, ans=0.125 2024-06-19 20:11:42,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=53559.0, ans=0.1 2024-06-19 20:11:49,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=53577.333333333336, ans=0.125 2024-06-19 20:11:54,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.46 vs. limit=22.5 2024-06-19 20:11:59,003 INFO [train.py:1028] (0/2) Epoch 3, batch 9000, loss[loss=0.5013, simple_loss=0.4618, pruned_loss=0.2703, over 13282.00 frames. ], tot_loss[loss=0.5045, simple_loss=0.4566, pruned_loss=0.2762, over 2566932.49 frames. ], batch size: 46, lr: 1.63e-02, grad_scale: 1.0 2024-06-19 20:11:59,004 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 20:12:07,112 INFO [train.py:1060] (0/2) Epoch 3, validation: loss=0.328, simple_loss=0.3513, pruned_loss=0.1524, over 351949.00 frames. 2024-06-19 20:12:07,112 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-19 20:12:07,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53595.666666666664, ans=0.1 2024-06-19 20:12:09,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=53595.666666666664, ans=0.0 2024-06-19 20:12:20,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=34.69 vs. limit=22.5 2024-06-19 20:12:26,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53650.666666666664, ans=0.1 2024-06-19 20:12:32,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53669.0, ans=0.1 2024-06-19 20:12:32,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.28 vs. limit=22.5 2024-06-19 20:12:39,588 INFO [train.py:1028] (0/2) Epoch 3, batch 9050, loss[loss=0.4889, simple_loss=0.4381, pruned_loss=0.2699, over 11651.00 frames. ], tot_loss[loss=0.5057, simple_loss=0.4577, pruned_loss=0.2769, over 2566454.05 frames. 
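[editor's note] The validation block above (train.py:1051/1060/1061) interleaves a full pass over the dev set with training and reports a frame-weighted loss ("validation: loss=0.328 ... over 351949.00 frames"). A sketch of a dev-set pass of that shape — the loader fields and model interface here are assumptions, not the actual train.py code:

```python
# Sketch of the kind of dev-set pass behind "Computing validation loss"
# above: per-batch losses are accumulated weighted by frame counts,
# producing a "loss=... over N frames" style summary.
import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, device) -> dict:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        feats = batch["features"].to(device)  # (N, T, 80) fbank features
        loss, num_frames = model(feats)       # assumed interface: mean
        tot_loss += loss.item() * num_frames  # loss per frame + frame count
        tot_frames += num_frames
    model.train()
    return {"loss": tot_loss / tot_frames, "frames": tot_frames}
```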
], batch size: 17, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:12:39,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=53687.333333333336, ans=0.0 2024-06-19 20:12:43,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2024-06-19 20:12:44,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.232e+03 1.899e+03 2.228e+03 2.618e+03 1.177e+04, threshold=4.455e+03, percent-clipped=3.0 2024-06-19 20:12:47,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.89 vs. limit=10.0 2024-06-19 20:12:49,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=53705.666666666664, ans=0.1 2024-06-19 20:12:50,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=53705.666666666664, ans=15.0 2024-06-19 20:12:59,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=53742.333333333336, ans=0.125 2024-06-19 20:13:04,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53742.333333333336, ans=0.1 2024-06-19 20:13:12,081 INFO [train.py:1028] (0/2) Epoch 3, batch 9100, loss[loss=0.4654, simple_loss=0.4289, pruned_loss=0.2509, over 13192.00 frames. ], tot_loss[loss=0.5031, simple_loss=0.456, pruned_loss=0.2751, over 2566755.37 frames. ], batch size: 72, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:13:16,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53779.0, ans=0.1 2024-06-19 20:13:16,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=53779.0, ans=0.125 2024-06-19 20:13:16,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.37 vs. limit=15.0 2024-06-19 20:13:17,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.38 vs. limit=15.0 2024-06-19 20:13:25,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=53815.666666666664, ans=0.0 2024-06-19 20:13:29,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=31.29 vs. limit=15.0 2024-06-19 20:13:32,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=53834.0, ans=0.125 2024-06-19 20:13:32,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.48 vs. 
limit=15.0 2024-06-19 20:13:41,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=53852.333333333336, ans=0.125 2024-06-19 20:13:42,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=53852.333333333336, ans=0.0 2024-06-19 20:13:44,003 INFO [train.py:1028] (0/2) Epoch 3, batch 9150, loss[loss=0.4979, simple_loss=0.4654, pruned_loss=0.2652, over 13142.00 frames. ], tot_loss[loss=0.5031, simple_loss=0.4565, pruned_loss=0.2748, over 2568202.76 frames. ], batch size: 77, lr: 1.62e-02, grad_scale: 0.5 2024-06-19 20:13:50,210 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.324e+03 1.889e+03 2.207e+03 2.487e+03 6.311e+03, threshold=4.415e+03, percent-clipped=3.0 2024-06-19 20:13:52,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=53889.0, ans=0.0 2024-06-19 20:14:04,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=53925.666666666664, ans=0.125 2024-06-19 20:14:05,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=53925.666666666664, ans=0.0 2024-06-19 20:14:10,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=53944.0, ans=0.0 2024-06-19 20:14:16,276 INFO [train.py:1028] (0/2) Epoch 3, batch 9200, loss[loss=0.487, simple_loss=0.4476, pruned_loss=0.2632, over 12962.00 frames. ], tot_loss[loss=0.5005, simple_loss=0.4554, pruned_loss=0.2728, over 2571523.61 frames. ], batch size: 36, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:14:16,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=53962.333333333336, ans=0.0 2024-06-19 20:14:39,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=54017.333333333336, ans=0.0 2024-06-19 20:14:40,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=54017.333333333336, ans=0.125 2024-06-19 20:14:48,000 INFO [train.py:1028] (0/2) Epoch 3, batch 9250, loss[loss=0.5083, simple_loss=0.4683, pruned_loss=0.2741, over 13255.00 frames. ], tot_loss[loss=0.4964, simple_loss=0.453, pruned_loss=0.2699, over 2573552.87 frames. 
], batch size: 67, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:14:54,433 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.053e+03 1.831e+03 2.143e+03 2.534e+03 4.902e+03, threshold=4.286e+03, percent-clipped=2.0 2024-06-19 20:14:56,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=54072.333333333336, ans=0.125 2024-06-19 20:15:00,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54090.666666666664, ans=0.1 2024-06-19 20:15:01,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=54090.666666666664, ans=0.125 2024-06-19 20:15:05,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=54090.666666666664, ans=0.0 2024-06-19 20:15:19,750 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.98 vs. limit=22.5 2024-06-19 20:15:23,043 INFO [train.py:1028] (0/2) Epoch 3, batch 9300, loss[loss=0.4528, simple_loss=0.4187, pruned_loss=0.2434, over 12940.00 frames. ], tot_loss[loss=0.4949, simple_loss=0.452, pruned_loss=0.2688, over 2570883.24 frames. ], batch size: 39, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:15:25,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=22.5 2024-06-19 20:15:30,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=54164.0, ans=0.0 2024-06-19 20:15:31,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=54164.0, ans=0.125 2024-06-19 20:15:32,165 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.63 vs. limit=15.0 2024-06-19 20:15:39,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.15 vs. limit=15.0 2024-06-19 20:15:56,055 INFO [train.py:1028] (0/2) Epoch 3, batch 9350, loss[loss=0.4868, simple_loss=0.4454, pruned_loss=0.2641, over 12572.00 frames. ], tot_loss[loss=0.4945, simple_loss=0.4517, pruned_loss=0.2686, over 2567392.74 frames. ], batch size: 22, lr: 1.62e-02, grad_scale: 1.0 2024-06-19 20:15:56,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=54237.333333333336, ans=0.0 2024-06-19 20:16:01,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.96 vs. limit=15.0 2024-06-19 20:16:01,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=54237.333333333336, ans=22.5 2024-06-19 20:16:02,725 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.164e+03 1.678e+03 2.009e+03 2.319e+03 4.280e+03, threshold=4.019e+03, percent-clipped=0.0 2024-06-19 20:16:06,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.26 vs. 
limit=15.0 2024-06-19 20:16:06,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=54255.666666666664, ans=0.07 2024-06-19 20:16:10,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.14 vs. limit=6.0 2024-06-19 20:16:15,418 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.615e+01 2024-06-19 20:16:17,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5 2024-06-19 20:16:23,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=54310.666666666664, ans=0.125 2024-06-19 20:16:25,977 INFO [train.py:1028] (0/2) Epoch 3, batch 9400, loss[loss=0.5431, simple_loss=0.4895, pruned_loss=0.2983, over 13275.00 frames. ], tot_loss[loss=0.4933, simple_loss=0.451, pruned_loss=0.2678, over 2566830.46 frames. ], batch size: 52, lr: 1.62e-02, grad_scale: 2.0 2024-06-19 20:16:30,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=54329.0, ans=0.95 2024-06-19 20:16:34,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=54347.333333333336, ans=0.125 2024-06-19 20:16:39,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=54365.666666666664, ans=0.09899494936611666 2024-06-19 20:16:41,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=54365.666666666664, ans=0.125 2024-06-19 20:16:48,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=54384.0, ans=0.0 2024-06-19 20:16:53,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54402.333333333336, ans=0.1 2024-06-19 20:16:56,569 INFO [train.py:1028] (0/2) Epoch 3, batch 9450, loss[loss=0.4468, simple_loss=0.4094, pruned_loss=0.2421, over 12514.00 frames. ], tot_loss[loss=0.4941, simple_loss=0.4516, pruned_loss=0.2683, over 2567849.09 frames. ], batch size: 22, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:17:03,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54439.0, ans=0.1 2024-06-19 20:17:04,050 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.163e+03 1.719e+03 2.027e+03 2.480e+03 4.764e+03, threshold=4.054e+03, percent-clipped=4.0 2024-06-19 20:17:05,648 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.76 vs. limit=22.5 2024-06-19 20:17:07,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.54 vs. 
limit=22.5 2024-06-19 20:17:09,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=54457.333333333336, ans=0.015 2024-06-19 20:17:09,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.03 vs. limit=22.5 2024-06-19 20:17:26,801 INFO [train.py:1028] (0/2) Epoch 3, batch 9500, loss[loss=0.5127, simple_loss=0.4688, pruned_loss=0.2783, over 13266.00 frames. ], tot_loss[loss=0.4935, simple_loss=0.4513, pruned_loss=0.2678, over 2576625.84 frames. ], batch size: 43, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:17:29,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.49 vs. limit=22.5 2024-06-19 20:17:37,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=54530.666666666664, ans=0.125 2024-06-19 20:17:39,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.78 vs. limit=22.5 2024-06-19 20:17:56,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=54604.0, ans=0.2 2024-06-19 20:17:57,127 INFO [train.py:1028] (0/2) Epoch 3, batch 9550, loss[loss=0.4838, simple_loss=0.4449, pruned_loss=0.2614, over 12917.00 frames. ], tot_loss[loss=0.4915, simple_loss=0.4499, pruned_loss=0.2666, over 2571681.34 frames. ], batch size: 39, lr: 1.61e-02, grad_scale: 1.0 2024-06-19 20:17:57,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=54604.0, ans=0.0 2024-06-19 20:18:05,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.095e+03 1.587e+03 1.878e+03 2.169e+03 5.493e+03, threshold=3.756e+03, percent-clipped=3.0 2024-06-19 20:18:23,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.00 vs. limit=15.0 2024-06-19 20:18:32,418 INFO [train.py:1028] (0/2) Epoch 3, batch 9600, loss[loss=0.5495, simple_loss=0.4722, pruned_loss=0.3134, over 10553.00 frames. ], tot_loss[loss=0.4894, simple_loss=0.4483, pruned_loss=0.2653, over 2571299.65 frames. 
], batch size: 303, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:18:33,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54695.666666666664, ans=0.1 2024-06-19 20:18:38,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=54714.0, ans=0.0 2024-06-19 20:18:39,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=54714.0, ans=0.125 2024-06-19 20:18:40,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=54714.0, ans=0.0 2024-06-19 20:18:43,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=54714.0, ans=0.125 2024-06-19 20:18:43,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=54714.0, ans=0.125 2024-06-19 20:18:43,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=54714.0, ans=0.125 2024-06-19 20:19:02,291 INFO [train.py:1028] (0/2) Epoch 3, batch 9650, loss[loss=0.5053, simple_loss=0.4536, pruned_loss=0.2785, over 13095.00 frames. ], tot_loss[loss=0.4881, simple_loss=0.4468, pruned_loss=0.2647, over 2561603.62 frames. ], batch size: 132, lr: 1.61e-02, grad_scale: 1.0 2024-06-19 20:19:03,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=54787.333333333336, ans=0.025 2024-06-19 20:19:07,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.55 vs. limit=15.0 2024-06-19 20:19:07,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=54805.666666666664, ans=0.025 2024-06-19 20:19:09,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=54805.666666666664, ans=0.125 2024-06-19 20:19:10,619 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.208e+02 1.450e+03 1.703e+03 2.023e+03 5.489e+03, threshold=3.407e+03, percent-clipped=2.0 2024-06-19 20:19:32,612 INFO [train.py:1028] (0/2) Epoch 3, batch 9700, loss[loss=0.4778, simple_loss=0.4337, pruned_loss=0.261, over 13007.00 frames. ], tot_loss[loss=0.4865, simple_loss=0.4454, pruned_loss=0.2638, over 2555787.19 frames. ], batch size: 144, lr: 1.61e-02, grad_scale: 2.0 2024-06-19 20:19:42,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=54897.333333333336, ans=22.5 2024-06-19 20:19:43,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.73 vs. limit=15.0 2024-06-19 20:19:46,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=30.96 vs. 
2024-06-19 20:19:50,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54934.0, ans=0.1
2024-06-19 20:19:51,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=54934.0, ans=0.2
2024-06-19 20:19:51,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=54934.0, ans=0.1
2024-06-19 20:19:59,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=54952.333333333336, ans=0.0
2024-06-19 20:20:02,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=54952.333333333336, ans=0.125
2024-06-19 20:20:02,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.05 vs. limit=10.0
2024-06-19 20:20:02,967 INFO [train.py:1028] (0/2) Epoch 3, batch 9750, loss[loss=0.4838, simple_loss=0.4459, pruned_loss=0.2609, over 13053.00 frames. ], tot_loss[loss=0.4829, simple_loss=0.4432, pruned_loss=0.2613, over 2552489.99 frames. ], batch size: 132, lr: 1.61e-02, grad_scale: 2.0
2024-06-19 20:20:13,812 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.686e+02 1.765e+03 2.023e+03 2.264e+03 5.206e+03, threshold=4.045e+03, percent-clipped=3.0
2024-06-19 20:20:23,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0
2024-06-19 20:20:28,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=55025.666666666664, ans=0.125
2024-06-19 20:20:35,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.19 vs. limit=15.0
2024-06-19 20:20:37,720 INFO [train.py:1028] (0/2) Epoch 3, batch 9800, loss[loss=0.428, simple_loss=0.4118, pruned_loss=0.2221, over 12914.00 frames. ], tot_loss[loss=0.4803, simple_loss=0.4416, pruned_loss=0.2595, over 2545360.31 frames. ], batch size: 39, lr: 1.61e-02, grad_scale: 4.0
2024-06-19 20:20:40,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.35 vs. limit=15.0
2024-06-19 20:20:42,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=55062.333333333336, ans=10.0
2024-06-19 20:20:47,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.96 vs. limit=15.0
2024-06-19 20:20:48,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=55080.666666666664, ans=0.125
2024-06-19 20:20:48,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=55080.666666666664, ans=0.125
2024-06-19 20:20:50,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.74 vs. limit=15.0
2024-06-19 20:20:52,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=55099.0, ans=0.125
2024-06-19 20:21:08,136 INFO [train.py:1028] (0/2) Epoch 3, batch 9850, loss[loss=0.4826, simple_loss=0.4504, pruned_loss=0.2573, over 13069.00 frames. ], tot_loss[loss=0.4789, simple_loss=0.4405, pruned_loss=0.2587, over 2538178.49 frames. ], batch size: 102, lr: 1.60e-02, grad_scale: 1.0
2024-06-19 20:21:08,555 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=22.5
2024-06-19 20:21:17,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.429e+03 2.086e+03 2.480e+03 2.951e+03 6.745e+03, threshold=4.961e+03, percent-clipped=5.0
2024-06-19 20:21:20,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=55190.666666666664, ans=0.5
2024-06-19 20:21:32,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=55227.333333333336, ans=10.0
2024-06-19 20:21:40,612 INFO [train.py:1028] (0/2) Epoch 3, batch 9900, loss[loss=0.451, simple_loss=0.4263, pruned_loss=0.2378, over 12904.00 frames. ], tot_loss[loss=0.4812, simple_loss=0.4415, pruned_loss=0.2604, over 2530760.93 frames. ], batch size: 39, lr: 1.60e-02, grad_scale: 1.0
2024-06-19 20:21:42,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=55245.666666666664, ans=0.05
2024-06-19 20:21:45,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55245.666666666664, ans=0.1
2024-06-19 20:21:58,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=55300.666666666664, ans=0.0
2024-06-19 20:22:00,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=55300.666666666664, ans=0.0
2024-06-19 20:22:01,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=55300.666666666664, ans=0.0
2024-06-19 20:22:07,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=55319.0, ans=0.125
2024-06-19 20:22:10,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=55337.333333333336, ans=0.025
2024-06-19 20:22:10,757 INFO [train.py:1028] (0/2) Epoch 3, batch 9950, loss[loss=0.4759, simple_loss=0.4193, pruned_loss=0.2662, over 12969.00 frames. ], tot_loss[loss=0.4812, simple_loss=0.4404, pruned_loss=0.261, over 2526735.49 frames. ], batch size: 30, lr: 1.60e-02, grad_scale: 1.0
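The scaling.py:214 lines print the current value (ans) of a scheduled hyperparameter at the module's batch_count; skip rates, dropout probabilities and similar knobs decay as training progresses. A sketch of a piecewise-linear schedule of this kind; the breakpoints below are invented for illustration, since the log only shows the resulting values:

```python
class PiecewiseLinear:
    """Value as a piecewise-linear function of batch_count (illustrative)."""

    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]  # constant after the last breakpoint

# e.g. a skip-rate that has fully decayed by batch ~55000 (breakpoints assumed):
skip_rate = PiecewiseLinear((0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate(55300.666666666664))  # -> 0.0, matching the ans=0.0 lines
```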
2024-06-19 20:22:14,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=55337.333333333336, ans=0.025
2024-06-19 20:22:20,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=55355.666666666664, ans=0.07
2024-06-19 20:22:21,276 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.380e+03 2.908e+03 3.430e+03 4.325e+03 8.261e+03, threshold=6.861e+03, percent-clipped=16.0
2024-06-19 20:22:43,488 INFO [train.py:1028] (0/2) Epoch 3, batch 10000, loss[loss=0.4908, simple_loss=0.4597, pruned_loss=0.261, over 12635.00 frames. ], tot_loss[loss=0.4844, simple_loss=0.4424, pruned_loss=0.2632, over 2487399.33 frames. ], batch size: 22, lr: 1.60e-02, grad_scale: 1.0
2024-06-19 20:22:50,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55447.333333333336, ans=0.125
2024-06-19 20:23:01,727 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 20:23:02,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=55484.0, ans=0.5
2024-06-19 20:23:12,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=55502.333333333336, ans=0.2
2024-06-19 20:23:14,470 INFO [train.py:1028] (0/2) Epoch 3, batch 10050, loss[loss=0.4936, simple_loss=0.4591, pruned_loss=0.264, over 12498.00 frames. ], tot_loss[loss=0.4855, simple_loss=0.4425, pruned_loss=0.2643, over 2446377.48 frames. ], batch size: 22, lr: 1.60e-02, grad_scale: 0.5
2024-06-19 20:23:17,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=55520.666666666664, ans=0.125
2024-06-19 20:23:26,062 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+03 2.953e+03 3.395e+03 3.963e+03 7.510e+03, threshold=6.790e+03, percent-clipped=1.0
2024-06-19 20:23:28,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=55557.333333333336, ans=0.0
2024-06-19 20:23:29,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=55557.333333333336, ans=0.0
2024-06-19 20:23:34,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=55575.666666666664, ans=0.2
2024-06-19 20:23:42,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=20.03 vs. limit=15.0
2024-06-19 20:23:44,770 INFO [train.py:1028] (0/2) Epoch 3, batch 10100, loss[loss=0.4972, simple_loss=0.4462, pruned_loss=0.2741, over 11682.00 frames. ], tot_loss[loss=0.4861, simple_loss=0.4426, pruned_loss=0.2649, over 2427892.38 frames. ], batch size: 17, lr: 1.60e-02, grad_scale: 1.0
2024-06-19 20:23:47,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.89 vs. limit=10.0
2024-06-19 20:23:57,563 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-3.pt
2024-06-19 20:26:02,115 INFO [train.py:1028] (0/2) Epoch 4, batch 0, loss[loss=0.4058, simple_loss=0.3863, pruned_loss=0.2126, over 12901.00 frames. ], tot_loss[loss=0.4058, simple_loss=0.3863, pruned_loss=0.2126, over 12901.00 frames. ], batch size: 36, lr: 1.49e-02, grad_scale: 2.0
2024-06-19 20:26:02,116 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-19 20:26:09,122 INFO [train.py:1060] (0/2) Epoch 4, validation: loss=0.321, simple_loss=0.3487, pruned_loss=0.1466, over 351949.00 frames.
2024-06-19 20:26:09,123 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB
2024-06-19 20:26:28,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=55678.333333333336, ans=0.0
2024-06-19 20:26:28,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=55678.333333333336, ans=0.0
2024-06-19 20:26:36,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=55696.666666666664, ans=0.125
2024-06-19 20:26:37,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.50 vs. limit=6.0
2024-06-19 20:26:40,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=55715.0, ans=0.125
2024-06-19 20:26:43,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55715.0, ans=0.1
2024-06-19 20:26:45,893 INFO [train.py:1028] (0/2) Epoch 4, batch 50, loss[loss=0.4623, simple_loss=0.4285, pruned_loss=0.248, over 12655.00 frames. ], tot_loss[loss=0.4517, simple_loss=0.4154, pruned_loss=0.244, over 575202.72 frames. ], batch size: 29, lr: 1.49e-02, grad_scale: 0.5
2024-06-19 20:26:49,113 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.393e+03 2.352e+03 2.933e+03 3.453e+03 4.888e+03, threshold=5.865e+03, percent-clipped=0.0
2024-06-19 20:26:57,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0
2024-06-19 20:26:57,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=55751.666666666664, ans=0.125
2024-06-19 20:26:59,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.87 vs. limit=15.0
2024-06-19 20:27:01,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=55770.0, ans=0.125
2024-06-19 20:27:05,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55788.333333333336, ans=0.1
2024-06-19 20:27:16,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=55806.666666666664, ans=0.125
2024-06-19 20:27:16,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.26 vs. limit=15.0
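Within epoch 3 the learning rate creeps down (1.61e-02 to 1.60e-02), then drops to 1.49e-02 as soon as epoch 4 starts: a scheduler with a separate per-epoch decay factor. A sketch of an Eden-style schedule, as used in zipformer recipes, whose epoch-factor ratio between epochs 3 and 4, ((4^2 + 3.5^2)/(3^2 + 3.5^2))^-0.25 ~ 0.931, matches the observed drop 1.49/1.60; the constants lr_batches and lr_epochs here are assumptions, not values taken from this section:

```python
def eden_lr(base_lr: float, batch: int, epoch: int,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Eden-style schedule: inverse-fourth-root decay in both the global
    batch index and the epoch index (constants are assumptions)."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```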
2024-06-19 20:27:17,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=55825.0, ans=0.0
2024-06-19 20:27:17,644 INFO [train.py:1028] (0/2) Epoch 4, batch 100, loss[loss=0.4529, simple_loss=0.4318, pruned_loss=0.237, over 13225.00 frames. ], tot_loss[loss=0.4519, simple_loss=0.4145, pruned_loss=0.2447, over 1018610.81 frames. ], batch size: 46, lr: 1.49e-02, grad_scale: 1.0
2024-06-19 20:27:18,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=55825.0, ans=0.0
2024-06-19 20:27:24,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=55843.333333333336, ans=0.0
2024-06-19 20:27:27,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55843.333333333336, ans=0.125
2024-06-19 20:27:44,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=12.0
2024-06-19 20:27:51,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0
2024-06-19 20:27:52,389 INFO [train.py:1028] (0/2) Epoch 4, batch 150, loss[loss=0.3999, simple_loss=0.3858, pruned_loss=0.2069, over 12689.00 frames. ], tot_loss[loss=0.4449, simple_loss=0.4108, pruned_loss=0.2395, over 1365618.41 frames. ], batch size: 29, lr: 1.49e-02, grad_scale: 1.0
2024-06-19 20:27:55,446 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+03 2.915e+03 3.417e+03 3.933e+03 1.034e+04, threshold=6.833e+03, percent-clipped=5.0
2024-06-19 20:27:55,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=55916.666666666664, ans=0.0
2024-06-19 20:27:56,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55916.666666666664, ans=0.1
2024-06-19 20:27:56,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=55916.666666666664, ans=0.125
2024-06-19 20:27:58,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=15.0
2024-06-19 20:28:04,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=55935.0, ans=0.0
2024-06-19 20:28:11,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=55953.333333333336, ans=0.125
2024-06-19 20:28:17,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=55971.666666666664, ans=0.0
2024-06-19 20:28:19,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.32 vs. limit=15.0
2024-06-19 20:28:19,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=55971.666666666664, ans=0.09899494936611666
2024-06-19 20:28:25,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=55990.0, ans=0.125
2024-06-19 20:28:27,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=55990.0, ans=0.1
2024-06-19 20:28:28,758 INFO [train.py:1028] (0/2) Epoch 4, batch 200, loss[loss=0.5166, simple_loss=0.4537, pruned_loss=0.2897, over 12625.00 frames. ], tot_loss[loss=0.4459, simple_loss=0.411, pruned_loss=0.2404, over 1635818.41 frames. ], batch size: 202, lr: 1.49e-02, grad_scale: 2.0
2024-06-19 20:28:37,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56026.666666666664, ans=0.1
2024-06-19 20:28:40,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=56026.666666666664, ans=0.0
2024-06-19 20:28:41,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=56045.0, ans=0.0
2024-06-19 20:29:00,732 INFO [train.py:1028] (0/2) Epoch 4, batch 250, loss[loss=0.4234, simple_loss=0.3829, pruned_loss=0.2319, over 13066.00 frames. ], tot_loss[loss=0.4441, simple_loss=0.4098, pruned_loss=0.2391, over 1847035.37 frames. ], batch size: 144, lr: 1.49e-02, grad_scale: 2.0
2024-06-19 20:29:03,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0
2024-06-19 20:29:03,988 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+03 2.872e+03 3.361e+03 3.701e+03 5.400e+03, threshold=6.722e+03, percent-clipped=0.0
2024-06-19 20:29:05,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.06 vs. limit=15.0
2024-06-19 20:29:05,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=56100.0, ans=0.125
2024-06-19 20:29:09,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=56118.333333333336, ans=0.0
2024-06-19 20:29:22,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=56155.0, ans=0.125
2024-06-19 20:29:25,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.03 vs. limit=15.0
2024-06-19 20:29:26,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=56173.333333333336, ans=0.125
2024-06-19 20:29:28,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=23.53 vs. limit=22.5
2024-06-19 20:29:32,886 INFO [train.py:1028] (0/2) Epoch 4, batch 300, loss[loss=0.4533, simple_loss=0.4155, pruned_loss=0.2456, over 13166.00 frames. ], tot_loss[loss=0.4424, simple_loss=0.4091, pruned_loss=0.2379, over 2009714.17 frames. ], batch size: 112, lr: 1.49e-02, grad_scale: 2.0
2024-06-19 20:29:34,673 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.14 vs. limit=15.0
2024-06-19 20:29:50,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.05 vs. limit=22.5
2024-06-19 20:30:04,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56246.666666666664, ans=0.1
2024-06-19 20:30:10,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=56265.0, ans=0.1
2024-06-19 20:30:10,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=56265.0, ans=0.125
2024-06-19 20:30:12,643 INFO [train.py:1028] (0/2) Epoch 4, batch 350, loss[loss=0.4185, simple_loss=0.3964, pruned_loss=0.2203, over 12943.00 frames. ], tot_loss[loss=0.44, simple_loss=0.4077, pruned_loss=0.2361, over 2139147.35 frames. ], batch size: 33, lr: 1.49e-02, grad_scale: 1.0
2024-06-19 20:30:16,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=56283.333333333336, ans=0.125
2024-06-19 20:30:17,144 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.418e+03 2.108e+03 2.509e+03 2.928e+03 4.711e+03, threshold=5.017e+03, percent-clipped=0.0
2024-06-19 20:30:30,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=56320.0, ans=0.125
2024-06-19 20:30:30,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.36 vs. limit=22.5
2024-06-19 20:30:45,878 INFO [train.py:1028] (0/2) Epoch 4, batch 400, loss[loss=0.4192, simple_loss=0.3969, pruned_loss=0.2207, over 13247.00 frames. ], tot_loss[loss=0.4396, simple_loss=0.4077, pruned_loss=0.2358, over 2239606.74 frames. ], batch size: 63, lr: 1.48e-02, grad_scale: 0.5
2024-06-19 20:30:49,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=56375.0, ans=0.125
2024-06-19 20:31:02,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0
2024-06-19 20:31:18,868 INFO [train.py:1028] (0/2) Epoch 4, batch 450, loss[loss=0.4451, simple_loss=0.4231, pruned_loss=0.2336, over 13170.00 frames. ], tot_loss[loss=0.4385, simple_loss=0.4074, pruned_loss=0.2349, over 2314050.66 frames. ], batch size: 67, lr: 1.48e-02, grad_scale: 0.5
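Each Whitening line compares a per-module covariance statistic (metric) against a limit: the metric is 1.0 when the feature covariance is a multiple of the identity and grows as activations collapse onto fewer directions, so entries like metric=24.05 vs. limit=22.5 flag modules whose outputs have become strongly anisotropic. A simplified, self-contained version of such a metric; the actual module in scaling.py also applies a corrective gradient and supports channel groups, both omitted here:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """Anisotropy of the channel covariance:
    num_channels * sum(eig^2) / sum(eig)^2.

    Equals 1.0 exactly when the covariance is a multiple of the identity,
    and approaches num_channels when a single direction dominates.
    """
    x = x.reshape(-1, x.shape[-1])            # (frames, channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]            # (channels, channels)
    num_channels = cov.shape[0]
    # sum of squared eigenvalues == squared Frobenius norm of cov
    return num_channels * (cov ** 2).sum() / cov.trace() ** 2

x = torch.randn(1000, 384)                    # roughly white input
print(whitening_metric(x))                    # ~1.4, well under a limit like 7.5
```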
2024-06-19 20:31:20,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=56466.666666666664, ans=0.125
2024-06-19 20:31:23,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=56466.666666666664, ans=0.0
2024-06-19 20:31:24,789 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.143e+03 2.272e+03 2.756e+03 3.458e+03 1.257e+04, threshold=5.512e+03, percent-clipped=7.0
2024-06-19 20:31:32,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=56503.333333333336, ans=0.125
2024-06-19 20:31:34,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=56503.333333333336, ans=0.125
2024-06-19 20:31:39,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=56521.666666666664, ans=0.0
2024-06-19 20:31:40,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=56521.666666666664, ans=0.125
2024-06-19 20:31:42,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=56521.666666666664, ans=0.125
2024-06-19 20:31:54,858 INFO [train.py:1028] (0/2) Epoch 4, batch 500, loss[loss=0.4137, simple_loss=0.3804, pruned_loss=0.2235, over 13115.00 frames. ], tot_loss[loss=0.4376, simple_loss=0.4074, pruned_loss=0.2339, over 2375809.24 frames. ], batch size: 121, lr: 1.48e-02, grad_scale: 1.0
2024-06-19 20:31:57,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=56558.333333333336, ans=0.125
2024-06-19 20:32:02,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=56576.666666666664, ans=0.125
2024-06-19 20:32:13,343 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.76 vs. limit=10.0
2024-06-19 20:32:18,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0
2024-06-19 20:32:23,712 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=22.5
2024-06-19 20:32:24,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=56613.333333333336, ans=0.025
2024-06-19 20:32:27,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.22 vs. limit=6.0
2024-06-19 20:32:31,906 INFO [train.py:1028] (0/2) Epoch 4, batch 550, loss[loss=0.4716, simple_loss=0.4258, pruned_loss=0.2587, over 12888.00 frames. ], tot_loss[loss=0.436, simple_loss=0.4064, pruned_loss=0.2328, over 2421039.33 frames. ], batch size: 158, lr: 1.48e-02, grad_scale: 1.0
2024-06-19 20:32:32,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=56650.0, ans=0.2
2024-06-19 20:32:37,775 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.029e+03 1.953e+03 2.313e+03 2.714e+03 8.504e+03, threshold=4.627e+03, percent-clipped=2.0
2024-06-19 20:32:37,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=56668.333333333336, ans=0.125
2024-06-19 20:32:46,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=56686.666666666664, ans=0.09899494936611666
2024-06-19 20:32:48,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=56686.666666666664, ans=0.125
2024-06-19 20:32:48,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.38 vs. limit=15.0
2024-06-19 20:32:53,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.52 vs. limit=22.5
2024-06-19 20:32:54,889 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.460e+02
2024-06-19 20:32:55,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.78 vs. limit=15.0
2024-06-19 20:32:55,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.63 vs. limit=6.0
2024-06-19 20:33:02,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=56723.333333333336, ans=0.125
2024-06-19 20:33:03,765 INFO [train.py:1028] (0/2) Epoch 4, batch 600, loss[loss=0.4197, simple_loss=0.3861, pruned_loss=0.2266, over 13031.00 frames. ], tot_loss[loss=0.4339, simple_loss=0.4051, pruned_loss=0.2313, over 2458010.72 frames. ], batch size: 144, lr: 1.48e-02, grad_scale: 1.0
2024-06-19 20:33:05,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=56741.666666666664, ans=10.0
2024-06-19 20:33:05,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.81 vs. limit=22.5
2024-06-19 20:33:10,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=56760.0, ans=0.125
2024-06-19 20:33:12,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=56760.0, ans=0.2
2024-06-19 20:33:16,883 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.24 vs. limit=15.0
2024-06-19 20:33:20,268 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.74 vs. limit=10.0
2024-06-19 20:33:36,316 INFO [train.py:1028] (0/2) Epoch 4, batch 650, loss[loss=0.42, simple_loss=0.3991, pruned_loss=0.2205, over 13182.00 frames. ], tot_loss[loss=0.431, simple_loss=0.4034, pruned_loss=0.2294, over 2489149.73 frames. ], batch size: 59, lr: 1.48e-02, grad_scale: 0.5
2024-06-19 20:33:43,272 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+03 2.641e+03 3.156e+03 3.707e+03 5.750e+03, threshold=6.312e+03, percent-clipped=6.0
2024-06-19 20:33:53,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=56870.0, ans=0.025
2024-06-19 20:33:54,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=56870.0, ans=0.125
2024-06-19 20:34:02,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=56888.333333333336, ans=0.0
2024-06-19 20:34:06,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5
2024-06-19 20:34:09,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.72 vs. limit=15.0
2024-06-19 20:34:11,345 INFO [train.py:1028] (0/2) Epoch 4, batch 700, loss[loss=0.373, simple_loss=0.3643, pruned_loss=0.1909, over 13307.00 frames. ], tot_loss[loss=0.4301, simple_loss=0.4025, pruned_loss=0.2289, over 2511293.03 frames. ], batch size: 46, lr: 1.48e-02, grad_scale: 1.0
2024-06-19 20:34:24,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=56943.333333333336, ans=0.125
2024-06-19 20:34:31,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=56961.666666666664, ans=0.07
2024-06-19 20:34:35,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=56980.0, ans=0.125
2024-06-19 20:34:35,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=56980.0, ans=0.2
2024-06-19 20:34:39,516 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.824e+00
2024-06-19 20:34:45,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=56998.333333333336, ans=6.0
2024-06-19 20:34:46,577 INFO [train.py:1028] (0/2) Epoch 4, batch 750, loss[loss=0.4504, simple_loss=0.4275, pruned_loss=0.2366, over 13282.00 frames. ], tot_loss[loss=0.4294, simple_loss=0.4026, pruned_loss=0.2281, over 2525972.80 frames. ], batch size: 63, lr: 1.48e-02, grad_scale: 0.5
2024-06-19 20:34:54,073 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.251e+03 2.528e+03 2.905e+03 3.443e+03 8.093e+03, threshold=5.809e+03, percent-clipped=1.0
2024-06-19 20:34:54,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57035.0, ans=0.1
2024-06-19 20:35:05,251 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.773e+02
2024-06-19 20:35:18,795 INFO [train.py:1028] (0/2) Epoch 4, batch 800, loss[loss=0.4484, simple_loss=0.4227, pruned_loss=0.2371, over 12950.00 frames. ], tot_loss[loss=0.4295, simple_loss=0.4028, pruned_loss=0.2281, over 2539276.60 frames. ], batch size: 36, lr: 1.48e-02, grad_scale: 1.0
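In the train.py:1028 lines, loss[...] is the current batch and tot_loss[...] a frame-weighted running average; the fractional cumulative frame counts (e.g. 2539276.60) suggest older batches are geometrically down-weighted rather than summed outright. A sketch of such a tracker, with the decay constant an assumption:

```python
class LossTracker:
    """Frame-weighted running averages with exponential forgetting.

    The fractional cumulative frame counts in the log point to a decaying
    window; the decay value below (1 - 1/200) is an illustrative guess.
    """

    def __init__(self, decay: float = 1.0 - 1.0 / 200):
        self.decay = decay
        self.frames = 0.0
        self.sums = {}

    def update(self, batch_losses: dict, num_frames: float):
        self.frames = self.frames * self.decay + num_frames
        for k, v in batch_losses.items():
            self.sums[k] = self.sums.get(k, 0.0) * self.decay + v * num_frames

    def averages(self) -> dict:
        return {k: s / self.frames for k, s in self.sums.items()}

tracker = LossTracker()
tracker.update({"loss": 0.4484, "simple_loss": 0.4227, "pruned_loss": 0.2371}, 12950.0)
print(tracker.averages())  # after one batch, equals that batch's own losses
```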
2024-06-19 20:35:20,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=57108.333333333336, ans=0.2
2024-06-19 20:35:26,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=57126.666666666664, ans=0.0
2024-06-19 20:35:27,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0
2024-06-19 20:35:27,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.25 vs. limit=15.0
2024-06-19 20:35:32,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2024-06-19 20:35:32,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=57145.0, ans=0.2
2024-06-19 20:35:40,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=57163.333333333336, ans=0.0
2024-06-19 20:35:51,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=57200.0, ans=0.125
2024-06-19 20:35:52,394 INFO [train.py:1028] (0/2) Epoch 4, batch 850, loss[loss=0.4298, simple_loss=0.4083, pruned_loss=0.2256, over 13114.00 frames. ], tot_loss[loss=0.4269, simple_loss=0.401, pruned_loss=0.2264, over 2551168.63 frames. ], batch size: 95, lr: 1.47e-02, grad_scale: 1.0
2024-06-19 20:36:00,589 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.380e+03 2.481e+03 3.066e+03 3.692e+03 4.698e+03, threshold=6.132e+03, percent-clipped=0.0
2024-06-19 20:36:00,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=57218.333333333336, ans=0.0
2024-06-19 20:36:02,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=57218.333333333336, ans=0.125
2024-06-19 20:36:06,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=57218.333333333336, ans=0.125
2024-06-19 20:36:10,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.67 vs. limit=22.5
2024-06-19 20:36:11,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57236.666666666664, ans=0.1
2024-06-19 20:36:11,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=57236.666666666664, ans=0.125
2024-06-19 20:36:12,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=57236.666666666664, ans=0.125
2024-06-19 20:36:24,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.29 vs. limit=15.0
2024-06-19 20:36:24,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.12 vs. limit=10.0
2024-06-19 20:36:27,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.58 vs. limit=12.0
2024-06-19 20:36:30,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=57291.666666666664, ans=0.125
2024-06-19 20:36:30,762 INFO [train.py:1028] (0/2) Epoch 4, batch 900, loss[loss=0.4054, simple_loss=0.3873, pruned_loss=0.2118, over 12954.00 frames. ], tot_loss[loss=0.4278, simple_loss=0.4013, pruned_loss=0.2271, over 2557004.41 frames. ], batch size: 36, lr: 1.47e-02, grad_scale: 1.0
2024-06-19 20:36:35,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=57291.666666666664, ans=0.125
2024-06-19 20:36:42,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.33 vs. limit=12.0
2024-06-19 20:36:44,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.31 vs. limit=15.0
2024-06-19 20:36:50,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=15.50 vs. limit=15.0
2024-06-19 20:37:01,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. limit=6.0
2024-06-19 20:37:03,361 INFO [train.py:1028] (0/2) Epoch 4, batch 950, loss[loss=0.4114, simple_loss=0.3941, pruned_loss=0.2144, over 12858.00 frames. ], tot_loss[loss=0.4299, simple_loss=0.4026, pruned_loss=0.2286, over 2560076.13 frames. ], batch size: 39, lr: 1.47e-02, grad_scale: 0.5
2024-06-19 20:37:12,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=57401.666666666664, ans=0.0
2024-06-19 20:37:12,756 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+03 3.110e+03 3.597e+03 4.266e+03 6.315e+03, threshold=7.195e+03, percent-clipped=2.0
2024-06-19 20:37:14,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=57401.666666666664, ans=0.125
2024-06-19 20:37:15,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.93 vs. limit=22.5
2024-06-19 20:37:21,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=15.0
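grad_scale in the loss lines moves between 0.5, 1.0, 2.0 and 4.0: dynamic fp16 loss scaling, halved on overflowing gradients and grown back after a stretch of stable steps. The standard PyTorch mechanism looks like the sketch below; the constructor arguments are illustrative, not the recipe's settings:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0,       # this log hovers around grad_scale: 0.5-4.0
    growth_factor=2.0,    # double after enough stable steps
    backoff_factor=0.5,   # halve immediately on inf/nan gradients
    growth_interval=200,  # assumption; the recipe's value isn't in the log
)

def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales; skips the step on overflow
    scaler.update()                # adjusts the scale for the next step
    return loss.detach()
```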
2024-06-19 20:38:50,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=57658.333333333336, ans=0.0
2024-06-19 20:38:50,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.79 vs. limit=15.0
2024-06-19 20:38:55,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.06 vs. limit=10.0
2024-06-19 20:38:57,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.09 vs. limit=22.5
2024-06-19 20:38:59,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=57676.666666666664, ans=0.025
2024-06-19 20:39:03,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=22.70 vs. limit=15.0
2024-06-19 20:39:03,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=57695.0, ans=0.0
2024-06-19 20:39:05,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.97 vs. limit=12.0
2024-06-19 20:39:10,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=57713.333333333336, ans=0.125
2024-06-19 20:39:10,944 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-19 20:39:13,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=57713.333333333336, ans=0.025
2024-06-19 20:39:16,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=20.68 vs. limit=15.0
2024-06-19 20:39:17,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=57731.666666666664, ans=0.2
2024-06-19 20:39:21,062 INFO [train.py:1028] (0/2) Epoch 4, batch 1150, loss[loss=0.3944, simple_loss=0.38, pruned_loss=0.2044, over 13231.00 frames. ], tot_loss[loss=0.4323, simple_loss=0.4044, pruned_loss=0.2301, over 2571410.59 frames. ], batch size: 52, lr: 1.47e-02, grad_scale: 0.5
2024-06-19 20:39:21,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=57750.0, ans=0.2
2024-06-19 20:39:31,215 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.567e+03 2.438e+03 2.816e+03 3.261e+03 1.133e+04, threshold=5.631e+03, percent-clipped=1.0
2024-06-19 20:39:44,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=57805.0, ans=0.125
2024-06-19 20:39:46,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0
2024-06-19 20:39:48,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=57823.333333333336, ans=0.0
2024-06-19 20:39:53,046 INFO [train.py:1028] (0/2) Epoch 4, batch 1200, loss[loss=0.3863, simple_loss=0.3777, pruned_loss=0.1974, over 13161.00 frames. ], tot_loss[loss=0.4308, simple_loss=0.4033, pruned_loss=0.2291, over 2573122.05 frames. ], batch size: 77, lr: 1.47e-02, grad_scale: 1.0
2024-06-19 20:39:54,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=57841.666666666664, ans=0.125
2024-06-19 20:39:56,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=57841.666666666664, ans=0.125
2024-06-19 20:40:01,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=57860.0, ans=0.025
2024-06-19 20:40:09,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.56 vs. limit=15.0
2024-06-19 20:40:25,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0
2024-06-19 20:40:28,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=57915.0, ans=0.125
2024-06-19 20:40:30,775 INFO [train.py:1028] (0/2) Epoch 4, batch 1250, loss[loss=0.4276, simple_loss=0.3978, pruned_loss=0.2287, over 13177.00 frames. ], tot_loss[loss=0.4286, simple_loss=0.402, pruned_loss=0.2276, over 2582844.79 frames. ], batch size: 112, lr: 1.47e-02, grad_scale: 1.0
2024-06-19 20:40:32,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=57933.333333333336, ans=0.025
2024-06-19 20:40:36,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=57933.333333333336, ans=0.0
2024-06-19 20:40:36,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=57951.666666666664, ans=0.125
2024-06-19 20:40:39,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0
2024-06-19 20:40:41,026 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+03 2.370e+03 2.802e+03 3.132e+03 7.023e+03, threshold=5.604e+03, percent-clipped=2.0
2024-06-19 20:40:47,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=57970.0, ans=0.0
2024-06-19 20:41:02,770 INFO [train.py:1028] (0/2) Epoch 4, batch 1300, loss[loss=0.4697, simple_loss=0.4205, pruned_loss=0.2594, over 12713.00 frames. ], tot_loss[loss=0.4287, simple_loss=0.4022, pruned_loss=0.2276, over 2582333.79 frames. ], batch size: 176, lr: 1.46e-02, grad_scale: 2.0
2024-06-19 20:41:08,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=58043.333333333336, ans=0.125
2024-06-19 20:41:19,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=58061.666666666664, ans=0.125
2024-06-19 20:41:19,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=58061.666666666664, ans=0.0
2024-06-19 20:41:20,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=58080.0, ans=0.0
2024-06-19 20:41:24,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58080.0, ans=0.1
2024-06-19 20:41:34,223 INFO [train.py:1028] (0/2) Epoch 4, batch 1350, loss[loss=0.428, simple_loss=0.4073, pruned_loss=0.2243, over 13206.00 frames. ], tot_loss[loss=0.4266, simple_loss=0.401, pruned_loss=0.2261, over 2584679.52 frames. ], batch size: 59, lr: 1.46e-02, grad_scale: 1.0
2024-06-19 20:41:37,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.43 vs. limit=15.0
2024-06-19 20:41:40,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58135.0, ans=0.1
2024-06-19 20:41:43,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.61 vs. limit=15.0
2024-06-19 20:41:45,305 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.704e+02 2.022e+03 2.308e+03 2.650e+03 4.836e+03, threshold=4.617e+03, percent-clipped=0.0
2024-06-19 20:41:54,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=58171.666666666664, ans=0.125
2024-06-19 20:41:56,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.78 vs. limit=15.0
2024-06-19 20:42:01,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.83 vs. limit=22.5
2024-06-19 20:42:01,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=58190.0, ans=0.125
2024-06-19 20:42:10,062 INFO [train.py:1028] (0/2) Epoch 4, batch 1400, loss[loss=0.4592, simple_loss=0.4231, pruned_loss=0.2476, over 12598.00 frames. ], tot_loss[loss=0.4286, simple_loss=0.4023, pruned_loss=0.2274, over 2586269.50 frames. ], batch size: 26, lr: 1.46e-02, grad_scale: 1.0
2024-06-19 20:42:24,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=58226.666666666664, ans=0.125
2024-06-19 20:42:30,321 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=12.0
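Throughout this log the three reported numbers are mutually consistent with loss = 0.5 * simple_loss + pruned_loss, e.g. batch 1350 above: 0.5 * 0.4073 + 0.2243 = 0.4280. That is the usual shape of a pruned-transducer objective (a cheap "simple" lattice loss plus a pruned full-lattice loss), but the 0.5 weight here is inferred from the printed values, so treat it as an assumption:

```python
def combined_transducer_loss(simple_loss: float, pruned_loss: float,
                             simple_loss_scale: float = 0.5) -> float:
    """Weighted pruned-transducer objective; the 0.5 weight is inferred
    from the loss triples in this log, not taken from the recipe code."""
    return simple_loss_scale * simple_loss + pruned_loss

# Epoch 4, batch 1350 in the log: loss=0.428, simple_loss=0.4073, pruned_loss=0.2243
assert abs(combined_transducer_loss(0.4073, 0.2243) - 0.428) < 5e-4
```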
2024-06-19 20:42:39,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=58281.666666666664, ans=0.125
2024-06-19 20:42:42,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=12.0
2024-06-19 20:42:45,516 INFO [train.py:1028] (0/2) Epoch 4, batch 1450, loss[loss=0.4095, simple_loss=0.3758, pruned_loss=0.2216, over 13127.00 frames. ], tot_loss[loss=0.4294, simple_loss=0.4024, pruned_loss=0.2282, over 2586087.32 frames. ], batch size: 121, lr: 1.46e-02, grad_scale: 1.0
2024-06-19 20:42:54,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.33 vs. limit=15.0
2024-06-19 20:42:54,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=58318.333333333336, ans=0.0
2024-06-19 20:42:57,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0
2024-06-19 20:42:57,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.23 vs. limit=15.0
2024-06-19 20:42:57,369 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.375e+03 2.488e+03 2.825e+03 3.343e+03 6.478e+03, threshold=5.650e+03, percent-clipped=3.0
2024-06-19 20:43:07,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=58355.0, ans=0.125
2024-06-19 20:43:07,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.29 vs. limit=10.0
2024-06-19 20:43:16,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=58373.333333333336, ans=0.125
2024-06-19 20:43:17,995 INFO [train.py:1028] (0/2) Epoch 4, batch 1500, loss[loss=0.4371, simple_loss=0.4086, pruned_loss=0.2328, over 13216.00 frames. ], tot_loss[loss=0.4293, simple_loss=0.4023, pruned_loss=0.2281, over 2589134.41 frames. ], batch size: 83, lr: 1.46e-02, grad_scale: 1.0
2024-06-19 20:43:18,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=58391.666666666664, ans=0.09899494936611666
2024-06-19 20:43:20,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58391.666666666664, ans=0.1
2024-06-19 20:43:20,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.55 vs. limit=10.0
2024-06-19 20:43:36,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=58446.666666666664, ans=0.0
2024-06-19 20:43:49,835 INFO [train.py:1028] (0/2) Epoch 4, batch 1550, loss[loss=0.4165, simple_loss=0.3963, pruned_loss=0.2183, over 13013.00 frames. ], tot_loss[loss=0.43, simple_loss=0.403, pruned_loss=0.2285, over 2584584.04 frames. ], batch size: 102, lr: 1.46e-02, grad_scale: 0.5
2024-06-19 20:44:01,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=58501.666666666664, ans=0.125
2024-06-19 20:44:03,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=58501.666666666664, ans=0.125
2024-06-19 20:44:05,736 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.447e+03 2.182e+03 2.598e+03 3.060e+03 8.900e+03, threshold=5.197e+03, percent-clipped=2.0
2024-06-19 20:44:06,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58520.0, ans=0.1
2024-06-19 20:44:08,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=58520.0, ans=0.125
2024-06-19 20:44:17,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.50 vs. limit=10.0
2024-06-19 20:44:17,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58538.333333333336, ans=0.1
2024-06-19 20:44:17,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=58538.333333333336, ans=0.0
2024-06-19 20:44:27,926 INFO [train.py:1028] (0/2) Epoch 4, batch 1600, loss[loss=0.3862, simple_loss=0.3719, pruned_loss=0.2003, over 13148.00 frames. ], tot_loss[loss=0.4292, simple_loss=0.403, pruned_loss=0.2277, over 2579634.35 frames. ], batch size: 77, lr: 1.46e-02, grad_scale: 1.0
2024-06-19 20:44:43,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=58611.666666666664, ans=0.125
2024-06-19 20:44:59,031 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-32000.pt
2024-06-19 20:45:04,733 INFO [train.py:1028] (0/2) Epoch 4, batch 1650, loss[loss=0.4076, simple_loss=0.3788, pruned_loss=0.2182, over 13131.00 frames. ], tot_loss[loss=0.4275, simple_loss=0.4018, pruned_loss=0.2266, over 2576456.98 frames. ], batch size: 95, lr: 1.46e-02, grad_scale: 1.0
2024-06-19 20:45:17,413 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.303e+03 2.008e+03 2.319e+03 2.954e+03 7.579e+03, threshold=4.637e+03, percent-clipped=3.0
2024-06-19 20:45:24,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=58721.666666666664, ans=0.125
2024-06-19 20:45:32,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=58740.0, ans=0.125
2024-06-19 20:45:32,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=58740.0, ans=0.1
2024-06-19 20:45:35,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=58740.0, ans=0.125
2024-06-19 20:45:37,504 INFO [train.py:1028] (0/2) Epoch 4, batch 1700, loss[loss=0.3935, simple_loss=0.3943, pruned_loss=0.1963, over 12965.00 frames. ], tot_loss[loss=0.426, simple_loss=0.4013, pruned_loss=0.2253, over 2581922.21 frames. ], batch size: 26, lr: 1.46e-02, grad_scale: 2.0
2024-06-19 20:45:37,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=58758.333333333336, ans=0.1
2024-06-19 20:45:39,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=58758.333333333336, ans=0.0
2024-06-19 20:45:42,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. limit=5.0
2024-06-19 20:45:44,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0
2024-06-19 20:45:45,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=58776.666666666664, ans=0.125
2024-06-19 20:45:46,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=58776.666666666664, ans=0.125
2024-06-19 20:45:47,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=58776.666666666664, ans=0.125
2024-06-19 20:45:48,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5
2024-06-19 20:45:49,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=58776.666666666664, ans=0.0
2024-06-19 20:45:49,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58795.0, ans=0.1
2024-06-19 20:46:04,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=58813.333333333336, ans=0.0
2024-06-19 20:46:04,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.07 vs. limit=10.0
2024-06-19 20:46:09,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.68 vs. limit=15.0
2024-06-19 20:46:12,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.07 vs. limit=6.0
2024-06-19 20:46:12,609 INFO [train.py:1028] (0/2) Epoch 4, batch 1750, loss[loss=0.4198, simple_loss=0.4078, pruned_loss=0.2159, over 12552.00 frames. ], tot_loss[loss=0.4242, simple_loss=0.4, pruned_loss=0.2242, over 2582287.88 frames. ], batch size: 22, lr: 1.45e-02, grad_scale: 1.0
2024-06-19 20:46:20,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=58850.0, ans=0.125
2024-06-19 20:46:24,889 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.05 vs. limit=15.0
2024-06-19 20:46:29,545 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+03 2.292e+03 2.681e+03 3.172e+03 5.645e+03, threshold=5.361e+03, percent-clipped=1.0
2024-06-19 20:46:30,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=58886.666666666664, ans=0.2
2024-06-19 20:46:34,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=58905.0, ans=0.04949747468305833
2024-06-19 20:46:38,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58905.0, ans=0.1
2024-06-19 20:46:39,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.65 vs. limit=15.0
2024-06-19 20:46:47,985 INFO [train.py:1028] (0/2) Epoch 4, batch 1800, loss[loss=0.3636, simple_loss=0.3676, pruned_loss=0.1798, over 13225.00 frames. ], tot_loss[loss=0.4227, simple_loss=0.3993, pruned_loss=0.2231, over 2582485.77 frames. ], batch size: 67, lr: 1.45e-02, grad_scale: 2.0
2024-06-19 20:46:55,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=58960.0, ans=0.04949747468305833
2024-06-19 20:46:59,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=58960.0, ans=0.1
2024-06-19 20:46:59,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0
2024-06-19 20:47:12,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=59015.0, ans=0.0
2024-06-19 20:47:17,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0
2024-06-19 20:47:20,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0
2024-06-19 20:47:20,245 INFO [train.py:1028] (0/2) Epoch 4, batch 1850, loss[loss=0.4146, simple_loss=0.3959, pruned_loss=0.2166, over 13200.00 frames. ], tot_loss[loss=0.4227, simple_loss=0.3993, pruned_loss=0.223, over 2584373.90 frames. ], batch size: 83, lr: 1.45e-02, grad_scale: 0.5
2024-06-19 20:47:22,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.47 vs.
limit=22.5 2024-06-19 20:47:22,985 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.338e+02 2024-06-19 20:47:34,823 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.836e+03 2.723e+03 3.072e+03 3.920e+03 6.542e+03, threshold=6.144e+03, percent-clipped=4.0 2024-06-19 20:47:46,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59106.666666666664, ans=0.1 2024-06-19 20:47:50,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59106.666666666664, ans=0.1 2024-06-19 20:47:50,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=59106.666666666664, ans=0.125 2024-06-19 20:47:52,230 INFO [train.py:1028] (0/2) Epoch 4, batch 1900, loss[loss=0.4424, simple_loss=0.4117, pruned_loss=0.2365, over 13187.00 frames. ], tot_loss[loss=0.4214, simple_loss=0.3982, pruned_loss=0.2223, over 2587196.81 frames. ], batch size: 95, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:47:56,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59125.0, ans=0.1 2024-06-19 20:48:05,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=59143.333333333336, ans=0.125 2024-06-19 20:48:10,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.45 vs. limit=15.0 2024-06-19 20:48:26,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=59198.333333333336, ans=0.0 2024-06-19 20:48:30,727 INFO [train.py:1028] (0/2) Epoch 4, batch 1950, loss[loss=0.4041, simple_loss=0.3954, pruned_loss=0.2064, over 13289.00 frames. ], tot_loss[loss=0.4204, simple_loss=0.3973, pruned_loss=0.2217, over 2592024.79 frames. ], batch size: 52, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:48:38,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2024-06-19 20:48:45,510 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.350e+03 2.221e+03 2.489e+03 2.873e+03 3.745e+03, threshold=4.978e+03, percent-clipped=0.0 2024-06-19 20:48:47,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.45 vs. limit=15.0 2024-06-19 20:48:50,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.06 vs. limit=15.0 2024-06-19 20:48:57,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.78 vs. limit=22.5 2024-06-19 20:49:02,728 INFO [train.py:1028] (0/2) Epoch 4, batch 2000, loss[loss=0.4155, simple_loss=0.4053, pruned_loss=0.2129, over 12522.00 frames. ], tot_loss[loss=0.4193, simple_loss=0.3964, pruned_loss=0.2211, over 2588199.83 frames. 
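
The Whitening lines fire when a module's whitening metric is at or past its limit. The metric measures how far the channel covariance of the activations is from a multiple of the identity: 1.0 means fully white, larger values mean more anisotropy. One natural formulation with exactly that behavior is sketched below; the actual scaling.py metric may differ in detail:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        """Plausible form of the logged metric: for activations x of shape
        (num_frames, num_channels), returns 1.0 when the per-group channel
        covariance C is a multiple of the identity, and grows as C becomes
        more anisotropic (c * trace(C @ C) / trace(C)**2 >= 1 always)."""
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        c = num_channels // num_groups
        xg = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (g, frames, c)
        cov = torch.matmul(xg.transpose(1, 2), xg) / num_frames    # (g, c, c)
        tr_cov = cov.diagonal(dim1=-2, dim2=-1).sum(-1)            # trace(C)
        tr_cov_sq = (cov ** 2).sum(dim=(1, 2))                     # trace(C @ C), C symmetric
        metric = c * tr_cov_sq / tr_cov.clamp(min=1e-20) ** 2
        return metric.mean()

    # White input approaches the minimum of 1.0 as num_frames grows:
    print(whitening_metric(torch.randn(20000, 384)))
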
], batch size: 22, lr: 1.45e-02, grad_scale: 2.0 2024-06-19 20:49:07,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.81 vs. limit=15.0 2024-06-19 20:49:13,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=59326.666666666664, ans=0.2 2024-06-19 20:49:16,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=59345.0, ans=0.125 2024-06-19 20:49:18,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-19 20:49:24,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.35 vs. limit=15.0 2024-06-19 20:49:32,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=59381.666666666664, ans=0.0 2024-06-19 20:49:34,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.47 vs. limit=22.5 2024-06-19 20:49:35,681 INFO [train.py:1028] (0/2) Epoch 4, batch 2050, loss[loss=0.4244, simple_loss=0.4084, pruned_loss=0.2202, over 12670.00 frames. ], tot_loss[loss=0.4211, simple_loss=0.3977, pruned_loss=0.2222, over 2583848.92 frames. ], batch size: 29, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:49:38,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=59400.0, ans=0.2 2024-06-19 20:49:39,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=59400.0, ans=0.125 2024-06-19 20:49:42,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=59418.333333333336, ans=0.125 2024-06-19 20:49:50,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=59436.666666666664, ans=0.125 2024-06-19 20:49:51,322 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.377e+03 2.106e+03 2.613e+03 2.993e+03 5.515e+03, threshold=5.226e+03, percent-clipped=2.0 2024-06-19 20:49:54,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=59436.666666666664, ans=0.2 2024-06-19 20:49:54,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59455.0, ans=0.1 2024-06-19 20:49:56,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59455.0, ans=0.1 2024-06-19 20:50:02,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=59455.0, ans=0.125 2024-06-19 20:50:11,111 INFO [train.py:1028] (0/2) Epoch 4, batch 2100, loss[loss=0.3891, simple_loss=0.3847, pruned_loss=0.1967, over 13199.00 frames. ], tot_loss[loss=0.4191, simple_loss=0.3968, pruned_loss=0.2207, over 2585823.19 frames. 
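
The ScheduledFloat entries record hyperparameters (dropout probabilities, skip rates, balancer probabilities, bypass scales) whose values are scheduled as piecewise-linear functions of batch_count; "ans" is the value in effect at the logged count. A sketch of such a schedule; the breakpoints below are hypothetical examples, not the model's:

    import bisect

    class ScheduledFloat:
        """Piecewise-linear schedule keyed by batch count (sketch)."""

        def __init__(self, *points: tuple[float, float]):
            self.xs = [p[0] for p in points]  # batch counts
            self.ys = [p[1] for p in points]  # values at those counts

        def value(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # Hypothetical breakpoints: dropout annealed from 0.3 to 0.1 over 20k batches,
    # then held flat, matching the constant ans=0.1 values seen at these counts.
    dropout = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    print(dropout.value(59510.0))  # -> 0.1
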
], batch size: 59, lr: 1.45e-02, grad_scale: 2.0 2024-06-19 20:50:11,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59491.666666666664, ans=0.1 2024-06-19 20:50:20,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59491.666666666664, ans=0.0 2024-06-19 20:50:24,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=59510.0, ans=0.0 2024-06-19 20:50:24,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2024-06-19 20:50:25,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59510.0, ans=0.1 2024-06-19 20:50:28,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=59528.333333333336, ans=0.2 2024-06-19 20:50:46,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=59565.0, ans=0.015 2024-06-19 20:50:48,242 INFO [train.py:1028] (0/2) Epoch 4, batch 2150, loss[loss=0.3759, simple_loss=0.3711, pruned_loss=0.1903, over 13360.00 frames. ], tot_loss[loss=0.4173, simple_loss=0.3956, pruned_loss=0.2195, over 2588991.95 frames. ], batch size: 52, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:50:55,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2024-06-19 20:51:02,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2024-06-19 20:51:03,293 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.75 vs. limit=15.0 2024-06-19 20:51:05,242 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+03 2.633e+03 2.956e+03 3.400e+03 4.939e+03, threshold=5.911e+03, percent-clipped=0.0 2024-06-19 20:51:10,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.26 vs. limit=15.0 2024-06-19 20:51:10,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5 2024-06-19 20:51:16,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59656.666666666664, ans=0.1 2024-06-19 20:51:20,654 INFO [train.py:1028] (0/2) Epoch 4, batch 2200, loss[loss=0.4209, simple_loss=0.4036, pruned_loss=0.2191, over 13160.00 frames. ], tot_loss[loss=0.4176, simple_loss=0.3959, pruned_loss=0.2197, over 2589371.04 frames. 
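
The checkpoint.py line above writes a global-batch-numbered snapshot mid-epoch (checkpoint-32000.pt, i.e. 32000 optimizer steps into training), in addition to the usual per-epoch files. A minimal sketch of what such a snapshot plausibly contains; the payload field names here are illustrative assumptions, not checkpoint.py's actual schema:

    import torch

    def save_checkpoint(filename, model, optimizer, scheduler, scaler,
                        batch_idx_train):
        # Bundle everything needed to resume training from this exact step.
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
                "grad_scaler": scaler.state_dict(),
                "batch_idx_train": batch_idx_train,  # 32000 for the file above
            },
            filename,
        )
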
], batch size: 83, lr: 1.45e-02, grad_scale: 1.0 2024-06-19 20:51:23,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=59675.0, ans=0.0 2024-06-19 20:51:39,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=59711.666666666664, ans=0.125 2024-06-19 20:51:40,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=59730.0, ans=0.125 2024-06-19 20:51:41,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=59730.0, ans=0.0 2024-06-19 20:51:42,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.61 vs. limit=15.0 2024-06-19 20:51:43,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=59730.0, ans=0.0 2024-06-19 20:51:44,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.29 vs. limit=15.0 2024-06-19 20:51:45,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=59730.0, ans=0.125 2024-06-19 20:51:47,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=16.61 vs. limit=15.0 2024-06-19 20:51:53,464 INFO [train.py:1028] (0/2) Epoch 4, batch 2250, loss[loss=0.4275, simple_loss=0.4064, pruned_loss=0.2243, over 13241.00 frames. ], tot_loss[loss=0.4178, simple_loss=0.3959, pruned_loss=0.2198, over 2588501.01 frames. ], batch size: 63, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:51:56,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=59766.666666666664, ans=0.0 2024-06-19 20:52:01,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=59785.0, ans=0.125 2024-06-19 20:52:05,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. 
limit=6.0 2024-06-19 20:52:12,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=59803.333333333336, ans=0.07 2024-06-19 20:52:12,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=59803.333333333336, ans=0.0 2024-06-19 20:52:12,889 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+03 2.388e+03 2.751e+03 3.153e+03 1.252e+04, threshold=5.501e+03, percent-clipped=2.0 2024-06-19 20:52:14,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59803.333333333336, ans=0.0 2024-06-19 20:52:15,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59821.666666666664, ans=0.1 2024-06-19 20:52:25,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=59840.0, ans=0.09899494936611666 2024-06-19 20:52:28,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59840.0, ans=0.0 2024-06-19 20:52:31,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.24 vs. limit=22.5 2024-06-19 20:52:31,441 INFO [train.py:1028] (0/2) Epoch 4, batch 2300, loss[loss=0.381, simple_loss=0.3671, pruned_loss=0.1975, over 12885.00 frames. ], tot_loss[loss=0.4171, simple_loss=0.3958, pruned_loss=0.2192, over 2582957.83 frames. ], batch size: 33, lr: 1.44e-02, grad_scale: 2.0 2024-06-19 20:52:36,931 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.245e+01 2024-06-19 20:52:46,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.30 vs. limit=10.0 2024-06-19 20:53:02,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59931.666666666664, ans=0.1 2024-06-19 20:53:04,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.83 vs. limit=22.5 2024-06-19 20:53:04,307 INFO [train.py:1028] (0/2) Epoch 4, batch 2350, loss[loss=0.4305, simple_loss=0.4003, pruned_loss=0.2304, over 13172.00 frames. ], tot_loss[loss=0.4164, simple_loss=0.3951, pruned_loss=0.2188, over 2585926.01 frames. ], batch size: 67, lr: 1.44e-02, grad_scale: 0.5 2024-06-19 20:53:08,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=59950.0, ans=0.125 2024-06-19 20:53:10,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=59968.333333333336, ans=0.125 2024-06-19 20:53:13,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.01 vs. 
limit=15.0 2024-06-19 20:53:19,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=59986.666666666664, ans=0.0 2024-06-19 20:53:22,863 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.333e+03 2.025e+03 2.323e+03 2.891e+03 1.142e+04, threshold=4.646e+03, percent-clipped=3.0 2024-06-19 20:53:37,513 INFO [train.py:1028] (0/2) Epoch 4, batch 2400, loss[loss=0.3936, simple_loss=0.3816, pruned_loss=0.2027, over 13296.00 frames. ], tot_loss[loss=0.4149, simple_loss=0.3937, pruned_loss=0.2181, over 2589129.76 frames. ], batch size: 46, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:53:48,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=60060.0, ans=0.05 2024-06-19 20:53:55,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=60078.333333333336, ans=0.125 2024-06-19 20:53:55,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60078.333333333336, ans=0.125 2024-06-19 20:53:57,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60078.333333333336, ans=0.1 2024-06-19 20:53:58,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=60078.333333333336, ans=0.125 2024-06-19 20:54:01,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=60096.666666666664, ans=0.125 2024-06-19 20:54:05,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=60096.666666666664, ans=0.125 2024-06-19 20:54:13,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=60115.0, ans=0.0 2024-06-19 20:54:16,280 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.90 vs. limit=22.5 2024-06-19 20:54:16,450 INFO [train.py:1028] (0/2) Epoch 4, batch 2450, loss[loss=0.4182, simple_loss=0.3926, pruned_loss=0.2219, over 13250.00 frames. ], tot_loss[loss=0.4148, simple_loss=0.3931, pruned_loss=0.2182, over 2585318.82 frames. ], batch size: 63, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:54:16,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.87 vs. limit=15.0 2024-06-19 20:54:23,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60151.666666666664, ans=0.125 2024-06-19 20:54:30,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=60170.0, ans=0.125 2024-06-19 20:54:31,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.84 vs. 
limit=15.0 2024-06-19 20:54:32,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=60170.0, ans=0.125 2024-06-19 20:54:34,583 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.335e+03 1.966e+03 2.409e+03 2.805e+03 5.249e+03, threshold=4.817e+03, percent-clipped=2.0 2024-06-19 20:54:48,694 INFO [train.py:1028] (0/2) Epoch 4, batch 2500, loss[loss=0.3931, simple_loss=0.3712, pruned_loss=0.2075, over 13293.00 frames. ], tot_loss[loss=0.4139, simple_loss=0.3923, pruned_loss=0.2177, over 2587725.73 frames. ], batch size: 83, lr: 1.44e-02, grad_scale: 2.0 2024-06-19 20:54:51,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0 2024-06-19 20:54:58,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=60243.333333333336, ans=0.125 2024-06-19 20:55:03,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60261.666666666664, ans=0.125 2024-06-19 20:55:19,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=60298.333333333336, ans=0.0 2024-06-19 20:55:20,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=60298.333333333336, ans=10.0 2024-06-19 20:55:21,518 INFO [train.py:1028] (0/2) Epoch 4, batch 2550, loss[loss=0.4445, simple_loss=0.4201, pruned_loss=0.2345, over 12665.00 frames. ], tot_loss[loss=0.4136, simple_loss=0.3916, pruned_loss=0.2178, over 2587566.16 frames. ], batch size: 22, lr: 1.44e-02, grad_scale: 2.0 2024-06-19 20:55:25,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.45 vs. limit=22.5 2024-06-19 20:55:36,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=60353.333333333336, ans=0.2 2024-06-19 20:55:39,530 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.403e+03 1.997e+03 2.318e+03 2.716e+03 3.973e+03, threshold=4.636e+03, percent-clipped=0.0 2024-06-19 20:55:39,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=60353.333333333336, ans=0.125 2024-06-19 20:55:50,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=60390.0, ans=0.0 2024-06-19 20:55:57,003 INFO [train.py:1028] (0/2) Epoch 4, batch 2600, loss[loss=0.363, simple_loss=0.3619, pruned_loss=0.182, over 13239.00 frames. ], tot_loss[loss=0.4116, simple_loss=0.3898, pruned_loss=0.2167, over 2588614.57 frames. ], batch size: 52, lr: 1.44e-02, grad_scale: 4.0 2024-06-19 20:55:59,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.36 vs. limit=12.0 2024-06-19 20:56:18,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60445.0, ans=0.1 2024-06-19 20:56:32,941 INFO [train.py:1028] (0/2) Epoch 4, batch 2650, loss[loss=0.4001, simple_loss=0.3718, pruned_loss=0.2142, over 13015.00 frames. 
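
Note how grad_scale moves among 0.5, 1.0, 2.0 and 4.0 across the batch entries: that is the signature of dynamic loss scaling for fp16 training, where the scale is backed off whenever a step produces inf/nan gradients and grown again after a run of clean steps. A sketch of that control loop, with an assumed growth interval; the actual scaler's constants may differ:

    class DynamicLossScale:
        """GradScaler-style dynamic loss scaling (sketch)."""

        def __init__(self, init_scale: float = 1.0, growth_factor: float = 2.0,
                     backoff_factor: float = 0.5, growth_interval: int = 1000):
            self.scale = init_scale
            self.growth_factor = growth_factor
            self.backoff_factor = backoff_factor
            self.growth_interval = growth_interval
            self.good_steps = 0

        def update(self, found_inf: bool) -> None:
            if found_inf:
                # Overflow: halve the scale and restart the good-step count.
                self.scale *= self.backoff_factor
                self.good_steps = 0
            else:
                # After enough clean steps, grow the scale back up.
                self.good_steps += 1
                if self.good_steps % self.growth_interval == 0:
                    self.scale *= self.growth_factor
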
], tot_loss[loss=0.4094, simple_loss=0.3878, pruned_loss=0.2156, over 2589226.59 frames. ], batch size: 144, lr: 1.44e-02, grad_scale: 1.0 2024-06-19 20:56:43,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=60518.333333333336, ans=0.125 2024-06-19 20:56:46,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.82 vs. limit=15.0 2024-06-19 20:56:50,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.76 vs. limit=10.0 2024-06-19 20:56:51,920 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.362e+02 1.906e+03 2.316e+03 2.609e+03 6.786e+03, threshold=4.632e+03, percent-clipped=2.0 2024-06-19 20:56:52,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2024-06-19 20:56:59,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.03 vs. limit=15.0 2024-06-19 20:57:04,658 INFO [train.py:1028] (0/2) Epoch 4, batch 2700, loss[loss=0.4021, simple_loss=0.379, pruned_loss=0.2126, over 13244.00 frames. ], tot_loss[loss=0.4055, simple_loss=0.3843, pruned_loss=0.2133, over 2585519.88 frames. ], batch size: 89, lr: 1.43e-02, grad_scale: 2.0 2024-06-19 20:57:20,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=60628.333333333336, ans=0.0 2024-06-19 20:57:25,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.47 vs. limit=10.0 2024-06-19 20:57:37,680 INFO [train.py:1028] (0/2) Epoch 4, batch 2750, loss[loss=0.4009, simple_loss=0.3769, pruned_loss=0.2125, over 13316.00 frames. ], tot_loss[loss=0.4026, simple_loss=0.3824, pruned_loss=0.2114, over 2582469.59 frames. ], batch size: 43, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 20:57:49,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=60701.666666666664, ans=0.2 2024-06-19 20:57:56,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=60720.0, ans=0.0 2024-06-19 20:58:02,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.23 vs. limit=6.0 2024-06-19 20:58:04,145 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.049e+03 1.709e+03 1.960e+03 2.292e+03 4.889e+03, threshold=3.920e+03, percent-clipped=2.0 2024-06-19 20:58:16,849 INFO [train.py:1028] (0/2) Epoch 4, batch 2800, loss[loss=0.4481, simple_loss=0.397, pruned_loss=0.2496, over 10692.00 frames. ], tot_loss[loss=0.4036, simple_loss=0.3826, pruned_loss=0.2123, over 2578901.88 frames. 
], batch size: 303, lr: 1.43e-02, grad_scale: 2.0 2024-06-19 20:58:21,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60775.0, ans=0.125 2024-06-19 20:58:22,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=60775.0, ans=0.0 2024-06-19 20:58:26,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=60793.333333333336, ans=0.025 2024-06-19 20:58:38,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.62 vs. limit=22.5 2024-06-19 20:58:48,994 INFO [train.py:1028] (0/2) Epoch 4, batch 2850, loss[loss=0.3533, simple_loss=0.3465, pruned_loss=0.1801, over 13311.00 frames. ], tot_loss[loss=0.4017, simple_loss=0.3806, pruned_loss=0.2114, over 2576770.77 frames. ], batch size: 49, lr: 1.43e-02, grad_scale: 0.5 2024-06-19 20:59:06,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=60903.333333333336, ans=0.0 2024-06-19 20:59:07,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60903.333333333336, ans=0.1 2024-06-19 20:59:08,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60921.666666666664, ans=0.125 2024-06-19 20:59:10,286 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.331e+03 1.890e+03 2.221e+03 2.593e+03 5.081e+03, threshold=4.442e+03, percent-clipped=2.0 2024-06-19 20:59:10,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0 2024-06-19 20:59:12,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60921.666666666664, ans=0.1 2024-06-19 20:59:20,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=60958.333333333336, ans=0.5 2024-06-19 20:59:21,012 INFO [train.py:1028] (0/2) Epoch 4, batch 2900, loss[loss=0.3544, simple_loss=0.3514, pruned_loss=0.1788, over 13126.00 frames. ], tot_loss[loss=0.3971, simple_loss=0.3771, pruned_loss=0.2085, over 2584744.15 frames. ], batch size: 55, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 20:59:21,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60958.333333333336, ans=0.125 2024-06-19 20:59:28,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60976.666666666664, ans=0.125 2024-06-19 20:59:32,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.08 vs. limit=22.5 2024-06-19 20:59:38,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=60995.0, ans=0.025 2024-06-19 20:59:47,620 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.98 vs. 
limit=12.0 2024-06-19 20:59:48,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=61013.333333333336, ans=0.0 2024-06-19 21:00:00,411 INFO [train.py:1028] (0/2) Epoch 4, batch 2950, loss[loss=0.3905, simple_loss=0.3774, pruned_loss=0.2018, over 13267.00 frames. ], tot_loss[loss=0.3973, simple_loss=0.3773, pruned_loss=0.2086, over 2578356.66 frames. ], batch size: 43, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 21:00:01,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=61050.0, ans=0.2 2024-06-19 21:00:10,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.94 vs. limit=22.5 2024-06-19 21:00:15,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.87 vs. limit=22.5 2024-06-19 21:00:23,179 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.460e+03 2.287e+03 2.728e+03 3.410e+03 6.168e+03, threshold=5.455e+03, percent-clipped=6.0 2024-06-19 21:00:28,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=61123.333333333336, ans=0.1 2024-06-19 21:00:30,350 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:00:35,219 INFO [train.py:1028] (0/2) Epoch 4, batch 3000, loss[loss=0.3901, simple_loss=0.3787, pruned_loss=0.2007, over 13215.00 frames. ], tot_loss[loss=0.3956, simple_loss=0.376, pruned_loss=0.2077, over 2578168.21 frames. ], batch size: 59, lr: 1.43e-02, grad_scale: 2.0 2024-06-19 21:00:35,220 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 21:00:43,217 INFO [train.py:1060] (0/2) Epoch 4, validation: loss=0.2904, simple_loss=0.3277, pruned_loss=0.1266, over 351949.00 frames. 2024-06-19 21:00:43,218 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-19 21:00:59,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=61178.333333333336, ans=0.0 2024-06-19 21:01:01,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.80 vs. limit=15.0 2024-06-19 21:01:03,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=61196.666666666664, ans=0.95 2024-06-19 21:01:09,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=61196.666666666664, ans=0.125 2024-06-19 21:01:15,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=61215.0, ans=0.125 2024-06-19 21:01:17,460 INFO [train.py:1028] (0/2) Epoch 4, batch 3050, loss[loss=0.3771, simple_loss=0.3609, pruned_loss=0.1967, over 13282.00 frames. ], tot_loss[loss=0.3974, simple_loss=0.3765, pruned_loss=0.2091, over 2578731.96 frames. 
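
The validation block above (triggered periodically by batch count, here at batch 3000 of epoch 4) runs the model over the fixed dev set, always the same 351949.00 frames, without gradients, and reports the frame-weighted average loss before resuming training. A generic sketch of such a pass; `compute_loss` is a hypothetical stand-in for the recipe's loss function, not its real name:

    import torch

    def validate(model, valid_dl, compute_loss):
        """Frame-weighted validation pass. `compute_loss` is assumed to
        return (summed loss over the batch, number of frames)."""
        was_training = model.training
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss_sum, num_frames = compute_loss(model, batch)
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        if was_training:
            model.train()
        return tot_loss / max(tot_frames, 1.0)
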
], batch size: 46, lr: 1.43e-02, grad_scale: 0.5 2024-06-19 21:01:18,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=61233.333333333336, ans=0.125 2024-06-19 21:01:23,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2024-06-19 21:01:24,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=61251.666666666664, ans=0.0 2024-06-19 21:01:25,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=61251.666666666664, ans=0.125 2024-06-19 21:01:35,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=61270.0, ans=0.125 2024-06-19 21:01:40,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=61288.333333333336, ans=0.125 2024-06-19 21:01:43,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2024-06-19 21:01:44,128 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+03 2.692e+03 3.128e+03 3.646e+03 9.171e+03, threshold=6.257e+03, percent-clipped=5.0 2024-06-19 21:01:54,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61306.666666666664, ans=0.1 2024-06-19 21:01:57,051 INFO [train.py:1028] (0/2) Epoch 4, batch 3100, loss[loss=0.4491, simple_loss=0.4044, pruned_loss=0.2469, over 13009.00 frames. ], tot_loss[loss=0.3952, simple_loss=0.3753, pruned_loss=0.2075, over 2579379.50 frames. ], batch size: 144, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 21:01:58,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=61325.0, ans=0.125 2024-06-19 21:02:10,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=61361.666666666664, ans=0.0 2024-06-19 21:02:24,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.64 vs. limit=15.0 2024-06-19 21:02:25,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=61398.333333333336, ans=0.125 2024-06-19 21:02:27,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=61398.333333333336, ans=0.0 2024-06-19 21:02:31,155 INFO [train.py:1028] (0/2) Epoch 4, batch 3150, loss[loss=0.3718, simple_loss=0.3555, pruned_loss=0.194, over 12935.00 frames. ], tot_loss[loss=0.3917, simple_loss=0.373, pruned_loss=0.2052, over 2581019.70 frames. ], batch size: 158, lr: 1.43e-02, grad_scale: 1.0 2024-06-19 21:02:40,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=61435.0, ans=0.125 2024-06-19 21:02:46,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2024-06-19 21:02:54,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.337e+03 1.865e+03 2.412e+03 2.952e+03 6.529e+03, threshold=4.823e+03, percent-clipped=1.0 2024-06-19 21:02:57,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.05 vs. limit=22.5 2024-06-19 21:02:59,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=61490.0, ans=0.2 2024-06-19 21:03:01,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=61490.0, ans=0.125 2024-06-19 21:03:04,393 INFO [train.py:1028] (0/2) Epoch 4, batch 3200, loss[loss=0.3776, simple_loss=0.3655, pruned_loss=0.1949, over 13174.00 frames. ], tot_loss[loss=0.392, simple_loss=0.3728, pruned_loss=0.2056, over 2581032.38 frames. ], batch size: 55, lr: 1.42e-02, grad_scale: 2.0 2024-06-19 21:03:12,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.87 vs. limit=15.0 2024-06-19 21:03:13,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=61526.666666666664, ans=0.0 2024-06-19 21:03:15,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2024-06-19 21:03:15,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.97 vs. limit=5.0 2024-06-19 21:03:17,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=61545.0, ans=22.5 2024-06-19 21:03:24,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=61563.333333333336, ans=0.125 2024-06-19 21:03:36,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=61581.666666666664, ans=0.125 2024-06-19 21:03:37,312 INFO [train.py:1028] (0/2) Epoch 4, batch 3250, loss[loss=0.3928, simple_loss=0.3771, pruned_loss=0.2042, over 13247.00 frames. ], tot_loss[loss=0.392, simple_loss=0.3725, pruned_loss=0.2058, over 2586045.25 frames. ], batch size: 72, lr: 1.42e-02, grad_scale: 0.5 2024-06-19 21:04:08,499 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+03 2.277e+03 2.851e+03 3.448e+03 1.123e+04, threshold=5.701e+03, percent-clipped=6.0 2024-06-19 21:04:13,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=61673.333333333336, ans=0.0 2024-06-19 21:04:14,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=12.0 2024-06-19 21:04:17,682 INFO [train.py:1028] (0/2) Epoch 4, batch 3300, loss[loss=0.4235, simple_loss=0.387, pruned_loss=0.2299, over 12751.00 frames. ], tot_loss[loss=0.3879, simple_loss=0.3696, pruned_loss=0.2031, over 2581651.28 frames. 
], batch size: 176, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:04:18,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=61691.666666666664, ans=0.5 2024-06-19 21:04:23,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.82 vs. limit=10.0 2024-06-19 21:04:23,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=61710.0, ans=0.0 2024-06-19 21:04:28,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.59 vs. limit=15.0 2024-06-19 21:04:33,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=61728.333333333336, ans=0.0 2024-06-19 21:04:38,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=61746.666666666664, ans=0.125 2024-06-19 21:04:42,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=61746.666666666664, ans=0.025 2024-06-19 21:04:43,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=61765.0, ans=0.125 2024-06-19 21:04:50,371 INFO [train.py:1028] (0/2) Epoch 4, batch 3350, loss[loss=0.417, simple_loss=0.3808, pruned_loss=0.2266, over 12899.00 frames. ], tot_loss[loss=0.3892, simple_loss=0.3701, pruned_loss=0.2042, over 2576218.77 frames. ], batch size: 158, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:04:51,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=61783.333333333336, ans=0.125 2024-06-19 21:04:59,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=61801.666666666664, ans=0.2 2024-06-19 21:05:00,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=61801.666666666664, ans=0.125 2024-06-19 21:05:06,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.62 vs. limit=22.5 2024-06-19 21:05:10,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61838.333333333336, ans=0.1 2024-06-19 21:05:15,273 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.133e+03 2.106e+03 2.428e+03 2.907e+03 4.989e+03, threshold=4.857e+03, percent-clipped=0.0 2024-06-19 21:05:24,015 INFO [train.py:1028] (0/2) Epoch 4, batch 3400, loss[loss=0.4037, simple_loss=0.3891, pruned_loss=0.2092, over 12746.00 frames. ], tot_loss[loss=0.3897, simple_loss=0.3702, pruned_loss=0.2046, over 2574126.97 frames. 
], batch size: 22, lr: 1.42e-02, grad_scale: 2.0 2024-06-19 21:05:24,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=61875.0, ans=0.2 2024-06-19 21:05:29,231 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.708e+00 2024-06-19 21:05:35,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=61893.333333333336, ans=0.125 2024-06-19 21:05:41,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.06 vs. limit=15.0 2024-06-19 21:05:41,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=61893.333333333336, ans=0.125 2024-06-19 21:05:50,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=61930.0, ans=0.025 2024-06-19 21:05:53,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=61930.0, ans=0.125 2024-06-19 21:06:03,709 INFO [train.py:1028] (0/2) Epoch 4, batch 3450, loss[loss=0.4032, simple_loss=0.3744, pruned_loss=0.2159, over 12774.00 frames. ], tot_loss[loss=0.3869, simple_loss=0.3683, pruned_loss=0.2027, over 2575735.36 frames. ], batch size: 176, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:06:10,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=61985.0, ans=0.0 2024-06-19 21:06:10,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.77 vs. limit=15.0 2024-06-19 21:06:12,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=12.0 2024-06-19 21:06:14,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=61985.0, ans=0.0 2024-06-19 21:06:16,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=62003.333333333336, ans=0.0 2024-06-19 21:06:26,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62021.666666666664, ans=0.1 2024-06-19 21:06:28,673 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+03 2.805e+03 3.415e+03 3.858e+03 6.822e+03, threshold=6.829e+03, percent-clipped=6.0 2024-06-19 21:06:29,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62040.0, ans=0.1 2024-06-19 21:06:32,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.47 vs. limit=15.0 2024-06-19 21:06:36,599 INFO [train.py:1028] (0/2) Epoch 4, batch 3500, loss[loss=0.3964, simple_loss=0.3726, pruned_loss=0.2101, over 12912.00 frames. ], tot_loss[loss=0.3863, simple_loss=0.3678, pruned_loss=0.2024, over 2574893.33 frames. ], batch size: 33, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:06:48,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.26 vs. 
limit=15.0 2024-06-19 21:06:48,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62076.666666666664, ans=0.125 2024-06-19 21:06:51,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.25 vs. limit=6.0 2024-06-19 21:06:57,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62113.333333333336, ans=0.1 2024-06-19 21:06:59,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=62113.333333333336, ans=0.0 2024-06-19 21:07:01,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=62113.333333333336, ans=0.125 2024-06-19 21:07:07,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=62131.666666666664, ans=0.035 2024-06-19 21:07:09,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=62150.0, ans=0.0 2024-06-19 21:07:09,732 INFO [train.py:1028] (0/2) Epoch 4, batch 3550, loss[loss=0.3641, simple_loss=0.3481, pruned_loss=0.19, over 13087.00 frames. ], tot_loss[loss=0.3854, simple_loss=0.3672, pruned_loss=0.2018, over 2575952.55 frames. ], batch size: 95, lr: 1.42e-02, grad_scale: 0.5 2024-06-19 21:07:13,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=62150.0, ans=0.0 2024-06-19 21:07:29,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62205.0, ans=0.1 2024-06-19 21:07:39,948 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+03 2.877e+03 3.345e+03 3.975e+03 8.231e+03, threshold=6.690e+03, percent-clipped=2.0 2024-06-19 21:07:40,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2024-06-19 21:07:41,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.81 vs. limit=15.0 2024-06-19 21:07:41,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62223.333333333336, ans=0.125 2024-06-19 21:07:49,560 INFO [train.py:1028] (0/2) Epoch 4, batch 3600, loss[loss=0.3554, simple_loss=0.3444, pruned_loss=0.1832, over 12998.00 frames. ], tot_loss[loss=0.3836, simple_loss=0.3654, pruned_loss=0.2009, over 2579479.65 frames. ], batch size: 48, lr: 1.42e-02, grad_scale: 1.0 2024-06-19 21:07:55,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62241.666666666664, ans=0.1 2024-06-19 21:07:57,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=62260.0, ans=0.04949747468305833 2024-06-19 21:07:58,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.73 vs. 
limit=22.5 2024-06-19 21:07:58,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=62260.0, ans=0.125 2024-06-19 21:07:59,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=33.70 vs. limit=15.0 2024-06-19 21:08:02,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=62260.0, ans=0.125 2024-06-19 21:08:16,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=62315.0, ans=0.05 2024-06-19 21:08:23,007 INFO [train.py:1028] (0/2) Epoch 4, batch 3650, loss[loss=0.3463, simple_loss=0.3458, pruned_loss=0.1734, over 13012.00 frames. ], tot_loss[loss=0.3838, simple_loss=0.3654, pruned_loss=0.2011, over 2579462.21 frames. ], batch size: 102, lr: 1.42e-02, grad_scale: 0.5 2024-06-19 21:08:25,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.34 vs. limit=10.0 2024-06-19 21:08:30,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=62351.666666666664, ans=0.125 2024-06-19 21:08:32,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62351.666666666664, ans=0.1 2024-06-19 21:08:36,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.53 vs. limit=22.5 2024-06-19 21:08:48,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.48 vs. limit=15.0 2024-06-19 21:08:50,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.241e+03 3.186e+03 4.030e+03 5.180e+03 1.373e+04, threshold=8.061e+03, percent-clipped=10.0 2024-06-19 21:08:52,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62406.666666666664, ans=0.125 2024-06-19 21:08:53,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62406.666666666664, ans=0.125 2024-06-19 21:08:55,730 INFO [train.py:1028] (0/2) Epoch 4, batch 3700, loss[loss=0.391, simple_loss=0.3754, pruned_loss=0.2033, over 13222.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.3646, pruned_loss=0.2005, over 2583553.07 frames. ], batch size: 72, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:09:05,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.00 vs. limit=22.5 2024-06-19 21:09:08,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=62461.666666666664, ans=0.0 2024-06-19 21:09:12,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.24 vs. limit=22.5 2024-06-19 21:09:12,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.36 vs. 
limit=22.5 2024-06-19 21:09:15,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=62480.0, ans=0.125 2024-06-19 21:09:17,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62480.0, ans=0.125 2024-06-19 21:09:17,778 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.91 vs. limit=15.0 2024-06-19 21:09:28,043 INFO [train.py:1028] (0/2) Epoch 4, batch 3750, loss[loss=0.413, simple_loss=0.3971, pruned_loss=0.2144, over 12754.00 frames. ], tot_loss[loss=0.3802, simple_loss=0.3629, pruned_loss=0.1987, over 2585263.30 frames. ], batch size: 22, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:09:51,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=62553.333333333336, ans=0.125 2024-06-19 21:09:52,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.61 vs. limit=15.0 2024-06-19 21:09:52,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=62553.333333333336, ans=0.0 2024-06-19 21:09:53,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=62571.666666666664, ans=0.025 2024-06-19 21:09:58,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62571.666666666664, ans=0.1 2024-06-19 21:09:58,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=62571.666666666664, ans=0.2 2024-06-19 21:09:58,989 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=1.787e+02 2024-06-19 21:10:01,308 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.259e+03 2.529e+03 3.002e+03 3.700e+03 1.036e+04, threshold=6.004e+03, percent-clipped=1.0 2024-06-19 21:10:04,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=62590.0, ans=0.125 2024-06-19 21:10:06,491 INFO [train.py:1028] (0/2) Epoch 4, batch 3800, loss[loss=0.3721, simple_loss=0.3459, pruned_loss=0.1992, over 13229.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.3626, pruned_loss=0.1986, over 2583221.92 frames. ], batch size: 83, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:10:07,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=62608.333333333336, ans=0.125 2024-06-19 21:10:20,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=62645.0, ans=0.125 2024-06-19 21:10:30,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=62663.333333333336, ans=0.125 2024-06-19 21:10:39,434 INFO [train.py:1028] (0/2) Epoch 4, batch 3850, loss[loss=0.3791, simple_loss=0.3553, pruned_loss=0.2014, over 13093.00 frames. ], tot_loss[loss=0.3795, simple_loss=0.3625, pruned_loss=0.1983, over 2583652.06 frames. 
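
Note on the train.py loss lines: each pairs a per-batch loss measured "over" that batch's frames with a running tot_loss measured over roughly 2.58M frames, which is consistent with a frame-weighted moving average across recent batches. A sketch of such a tracker follows; the class name and the decay constant are assumptions for illustration, not the recipe's actual tracker.

class FrameWeightedLoss:
    """Accumulate loss weighted by frame counts, then report the average
    'over' the accumulated frames, as the train log lines do."""

    def __init__(self, decay=0.995):
        self.decay = decay   # forget old batches slowly (assumed constant)
        self.loss_sum = 0.0  # decayed sum of loss * frames
        self.frames = 0.0    # decayed total of frames seen

    def update(self, loss, num_frames):
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self):
        return self.loss_sum / max(self.frames, 1.0)

tracker = FrameWeightedLoss()
tracker.update(0.3641, 13087)  # numbers from the 'batch 3550' line above
print(f"tot_loss[loss={tracker.value:.4}, over {tracker.frames:.2f} frames]")
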
], batch size: 144, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:10:42,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=62700.0, ans=0.125 2024-06-19 21:10:50,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=62718.333333333336, ans=0.0 2024-06-19 21:10:52,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=62736.666666666664, ans=0.125 2024-06-19 21:10:53,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=62736.666666666664, ans=0.125 2024-06-19 21:11:02,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=62755.0, ans=0.125 2024-06-19 21:11:07,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.328e+03 2.275e+03 2.699e+03 3.244e+03 1.376e+04, threshold=5.398e+03, percent-clipped=1.0 2024-06-19 21:11:09,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=62773.333333333336, ans=10.0 2024-06-19 21:11:11,581 INFO [train.py:1028] (0/2) Epoch 4, batch 3900, loss[loss=0.3914, simple_loss=0.3743, pruned_loss=0.2042, over 13221.00 frames. ], tot_loss[loss=0.3786, simple_loss=0.3619, pruned_loss=0.1977, over 2587254.84 frames. ], batch size: 83, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:11:15,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.59 vs. limit=12.0 2024-06-19 21:11:20,968 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:11:26,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=62828.333333333336, ans=0.125 2024-06-19 21:11:27,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=62828.333333333336, ans=0.2 2024-06-19 21:11:28,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62828.333333333336, ans=0.1 2024-06-19 21:11:30,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=62846.666666666664, ans=0.0 2024-06-19 21:11:33,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62846.666666666664, ans=0.1 2024-06-19 21:11:44,618 INFO [train.py:1028] (0/2) Epoch 4, batch 3950, loss[loss=0.3724, simple_loss=0.3503, pruned_loss=0.1972, over 13130.00 frames. ], tot_loss[loss=0.3758, simple_loss=0.3601, pruned_loss=0.1957, over 2589003.43 frames. 
], batch size: 132, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:11:51,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=62883.333333333336, ans=0.0 2024-06-19 21:11:53,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=62883.333333333336, ans=0.125 2024-06-19 21:12:03,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=62901.666666666664, ans=0.025 2024-06-19 21:12:14,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=62938.333333333336, ans=0.2 2024-06-19 21:12:18,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.89 vs. limit=15.0 2024-06-19 21:12:21,929 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.046e+03 1.742e+03 2.139e+03 2.640e+03 7.983e+03, threshold=4.279e+03, percent-clipped=3.0 2024-06-19 21:12:22,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2024-06-19 21:12:25,840 INFO [train.py:1028] (0/2) Epoch 4, batch 4000, loss[loss=0.3883, simple_loss=0.3713, pruned_loss=0.2026, over 12939.00 frames. ], tot_loss[loss=0.3753, simple_loss=0.3597, pruned_loss=0.1954, over 2583952.61 frames. ], batch size: 39, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:12:27,897 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.02 vs. limit=15.0 2024-06-19 21:12:28,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=62975.0, ans=0.125 2024-06-19 21:12:30,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=62975.0, ans=0.125 2024-06-19 21:12:35,258 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0 2024-06-19 21:12:39,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=63011.666666666664, ans=0.125 2024-06-19 21:12:42,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=63011.666666666664, ans=0.125 2024-06-19 21:12:43,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=63011.666666666664, ans=0.0 2024-06-19 21:12:50,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=63030.0, ans=0.125 2024-06-19 21:12:52,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.70 vs. 
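
Note on the scaling.py ScheduledFloat lines: they print a value (ans=...) that depends on the current batch_count; in icefall such values are scheduled as piecewise-linear functions of the batch count. A standalone sketch of that interpolation is below; the schedule points in the usage example are invented for illustration.

def scheduled_float(batch_count, points):
    """Piecewise-linear schedule: points is a sorted list of
    (batch_count, value) pairs; values are clamped at both ends."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# e.g. a dropout probability decaying from 0.3 to 0.1 (illustrative numbers):
print(scheduled_float(62113.3, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1
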
limit=15.0 2024-06-19 21:12:53,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=63048.333333333336, ans=0.0 2024-06-19 21:12:56,073 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.068e+00 2024-06-19 21:12:59,729 INFO [train.py:1028] (0/2) Epoch 4, batch 4050, loss[loss=0.4199, simple_loss=0.3693, pruned_loss=0.2353, over 11049.00 frames. ], tot_loss[loss=0.3772, simple_loss=0.3605, pruned_loss=0.1969, over 2581379.15 frames. ], batch size: 304, lr: 1.41e-02, grad_scale: 0.5 2024-06-19 21:13:06,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=63085.0, ans=0.125 2024-06-19 21:13:07,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=63085.0, ans=0.125 2024-06-19 21:13:11,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63085.0, ans=0.1 2024-06-19 21:13:12,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=63103.333333333336, ans=10.0 2024-06-19 21:13:17,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=63103.333333333336, ans=0.125 2024-06-19 21:13:18,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=63121.666666666664, ans=0.0 2024-06-19 21:13:22,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0 2024-06-19 21:13:22,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.63 vs. limit=10.0 2024-06-19 21:13:28,671 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+03 2.471e+03 3.094e+03 3.730e+03 1.174e+04, threshold=6.188e+03, percent-clipped=13.0 2024-06-19 21:13:32,073 INFO [train.py:1028] (0/2) Epoch 4, batch 4100, loss[loss=0.3862, simple_loss=0.3593, pruned_loss=0.2066, over 13136.00 frames. ], tot_loss[loss=0.378, simple_loss=0.3607, pruned_loss=0.1976, over 2577119.06 frames. ], batch size: 103, lr: 1.41e-02, grad_scale: 1.0 2024-06-19 21:13:33,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=63158.333333333336, ans=0.2 2024-06-19 21:13:34,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63158.333333333336, ans=0.125 2024-06-19 21:13:35,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=63158.333333333336, ans=0.05 2024-06-19 21:13:44,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63195.0, ans=0.1 2024-06-19 21:14:05,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. 
limit=10.0 2024-06-19 21:14:06,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=63231.666666666664, ans=0.125 2024-06-19 21:14:11,930 INFO [train.py:1028] (0/2) Epoch 4, batch 4150, loss[loss=0.3443, simple_loss=0.3384, pruned_loss=0.1751, over 13147.00 frames. ], tot_loss[loss=0.3784, simple_loss=0.3611, pruned_loss=0.1978, over 2574511.94 frames. ], batch size: 55, lr: 1.41e-02, grad_scale: 0.125 2024-06-19 21:14:12,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=63250.0, ans=0.0 2024-06-19 21:14:14,295 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:14:29,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=63286.666666666664, ans=0.0 2024-06-19 21:14:31,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=63305.0, ans=0.125 2024-06-19 21:14:35,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=63305.0, ans=0.025 2024-06-19 21:14:44,222 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+03 3.638e+03 4.303e+03 5.195e+03 1.635e+04, threshold=8.606e+03, percent-clipped=14.0 2024-06-19 21:14:45,551 INFO [train.py:1028] (0/2) Epoch 4, batch 4200, loss[loss=0.3482, simple_loss=0.3273, pruned_loss=0.1845, over 13027.00 frames. ], tot_loss[loss=0.3774, simple_loss=0.3602, pruned_loss=0.1973, over 2577538.40 frames. ], batch size: 102, lr: 1.40e-02, grad_scale: 0.25 2024-06-19 21:14:54,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.97 vs. limit=6.0 2024-06-19 21:14:55,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=63360.0, ans=0.2 2024-06-19 21:15:01,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=63378.333333333336, ans=0.0 2024-06-19 21:15:01,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=63378.333333333336, ans=0.0 2024-06-19 21:15:03,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=63378.333333333336, ans=0.025 2024-06-19 21:15:06,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.50 vs. limit=15.0 2024-06-19 21:15:18,078 INFO [train.py:1028] (0/2) Epoch 4, batch 4250, loss[loss=0.3405, simple_loss=0.3444, pruned_loss=0.1683, over 13276.00 frames. ], tot_loss[loss=0.377, simple_loss=0.3599, pruned_loss=0.1971, over 2580422.19 frames. ], batch size: 46, lr: 1.40e-02, grad_scale: 0.25 2024-06-19 21:15:18,998 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:15:20,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=35.28 vs. 
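
Note on grad_scale: it moves between values like 0.125 (batch 4150) and 0.25 (batch 4200) in the batch lines rather than staying constant, the signature of dynamic loss scaling for fp16 training: the scale is cut when non-finite gradients appear and grown back after a run of clean steps. A toy version of that rule follows; PyTorch's torch.cuda.amp.GradScaler implements the real mechanism, and the halving/doubling constants here are assumptions.

class DynamicLossScaler:
    """Toy dynamic loss scaling: halve on overflow, double after
    `growth_interval` consecutive finite-gradient steps."""

    def __init__(self, init_scale=1.0, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= 0.5      # back off; the optimizer step is skipped
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= 2.0  # cautiously grow back
        return self.scale
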
limit=15.0 2024-06-19 21:15:25,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.79 vs. limit=10.0 2024-06-19 21:15:31,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=63470.0, ans=0.125 2024-06-19 21:15:41,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=63488.333333333336, ans=0.0 2024-06-19 21:15:43,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=63488.333333333336, ans=0.125 2024-06-19 21:15:44,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=63506.666666666664, ans=0.125 2024-06-19 21:15:49,964 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.261e+03 3.933e+03 4.476e+03 5.321e+03 1.337e+04, threshold=8.953e+03, percent-clipped=6.0 2024-06-19 21:15:51,294 INFO [train.py:1028] (0/2) Epoch 4, batch 4300, loss[loss=0.3761, simple_loss=0.3644, pruned_loss=0.1939, over 13112.00 frames. ], tot_loss[loss=0.3754, simple_loss=0.3587, pruned_loss=0.196, over 2581130.75 frames. ], batch size: 59, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:16:05,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2024-06-19 21:16:13,889 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.29 vs. limit=10.0 2024-06-19 21:16:31,464 INFO [train.py:1028] (0/2) Epoch 4, batch 4350, loss[loss=0.3573, simple_loss=0.3495, pruned_loss=0.1825, over 13188.00 frames. ], tot_loss[loss=0.3727, simple_loss=0.3568, pruned_loss=0.1943, over 2585369.28 frames. ], batch size: 59, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:16:41,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=63635.0, ans=0.0 2024-06-19 21:16:42,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=63635.0, ans=0.025 2024-06-19 21:16:46,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63653.333333333336, ans=0.1 2024-06-19 21:16:49,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=63653.333333333336, ans=0.125 2024-06-19 21:17:01,145 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.36 vs. limit=15.0 2024-06-19 21:17:02,652 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+03 2.710e+03 3.104e+03 3.775e+03 7.105e+03, threshold=6.208e+03, percent-clipped=0.0 2024-06-19 21:17:03,989 INFO [train.py:1028] (0/2) Epoch 4, batch 4400, loss[loss=0.3492, simple_loss=0.3406, pruned_loss=0.1789, over 13212.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.3563, pruned_loss=0.1936, over 2585843.06 frames. 
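
Note on the Whitening lines: each compares a per-module metric against a limit (e.g. metric=35.28 vs. limit=15.0 above), and entries show up when the activations are far from "white". One plausible metric, sketched below purely as an assumption and not necessarily the exact scaling.py formula, measures how far the channel covariance is from a multiple of the identity: it is near 1.0 for well-whitened features and grows as a few directions dominate.

import torch

def whitening_metric(x, num_groups=1):
    """Rough whitening metric: about 1.0 when the per-group covariance is a
    multiple of the identity, larger when some channels dominate.
    A sketch only; the actual scaling.py computation may differ."""
    (num_frames, num_channels) = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        cov = x[:, g, :].T @ x[:, g, :] / num_frames  # channel covariance
        c = num_channels // num_groups
        # ratio of mean squared eigenvalue to squared mean eigenvalue:
        metrics.append(c * torch.trace(cov @ cov) / torch.trace(cov) ** 2)
    return torch.stack(metrics).mean()

x = torch.randn(1000, 384)  # well-conditioned random activations
print(whitening_metric(x))  # near 1.0, far below a limit like 15.0
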
], batch size: 83, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:17:08,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=63708.333333333336, ans=0.125 2024-06-19 21:17:17,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.83 vs. limit=10.0 2024-06-19 21:17:18,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63745.0, ans=0.1 2024-06-19 21:17:26,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.81 vs. limit=15.0 2024-06-19 21:17:29,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2024-06-19 21:17:33,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=63781.666666666664, ans=0.2 2024-06-19 21:17:37,715 INFO [train.py:1028] (0/2) Epoch 4, batch 4450, loss[loss=0.3086, simple_loss=0.3173, pruned_loss=0.1499, over 12934.00 frames. ], tot_loss[loss=0.3727, simple_loss=0.3568, pruned_loss=0.1943, over 2579762.38 frames. ], batch size: 33, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:17:38,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=63800.0, ans=0.2 2024-06-19 21:17:38,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.54 vs. limit=15.0 2024-06-19 21:17:56,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=63836.666666666664, ans=0.125 2024-06-19 21:18:05,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=63855.0, ans=0.0 2024-06-19 21:18:10,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=63873.333333333336, ans=0.025 2024-06-19 21:18:16,488 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.330e+03 2.603e+03 3.054e+03 3.614e+03 9.678e+03, threshold=6.108e+03, percent-clipped=2.0 2024-06-19 21:18:17,117 INFO [train.py:1028] (0/2) Epoch 4, batch 4500, loss[loss=0.3373, simple_loss=0.3282, pruned_loss=0.1732, over 13222.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.3547, pruned_loss=0.1929, over 2584507.44 frames. ], batch size: 89, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:18:17,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=63891.666666666664, ans=0.0 2024-06-19 21:18:30,474 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:18:32,109 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.08 vs. 
limit=22.5 2024-06-19 21:18:35,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=63928.333333333336, ans=0.125 2024-06-19 21:18:45,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=63965.0, ans=0.0 2024-06-19 21:18:49,710 INFO [train.py:1028] (0/2) Epoch 4, batch 4550, loss[loss=0.3322, simple_loss=0.3361, pruned_loss=0.1641, over 13282.00 frames. ], tot_loss[loss=0.3706, simple_loss=0.3549, pruned_loss=0.1932, over 2588812.05 frames. ], batch size: 52, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:18:53,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=63983.333333333336, ans=0.1 2024-06-19 21:18:56,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.90 vs. limit=15.0 2024-06-19 21:19:14,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=64038.333333333336, ans=12.0 2024-06-19 21:19:18,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=64056.666666666664, ans=0.2 2024-06-19 21:19:22,412 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+03 2.867e+03 3.374e+03 3.870e+03 6.926e+03, threshold=6.747e+03, percent-clipped=1.0 2024-06-19 21:19:22,449 INFO [train.py:1028] (0/2) Epoch 4, batch 4600, loss[loss=0.4257, simple_loss=0.3895, pruned_loss=0.2309, over 12503.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.3563, pruned_loss=0.1944, over 2583995.22 frames. ], batch size: 202, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:19:23,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=64075.0, ans=0.125 2024-06-19 21:19:24,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=12.0 2024-06-19 21:19:26,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=64075.0, ans=0.125 2024-06-19 21:19:28,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=64075.0, ans=0.0 2024-06-19 21:19:35,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=64111.666666666664, ans=0.125 2024-06-19 21:19:47,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=64130.0, ans=0.0 2024-06-19 21:19:50,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=64148.333333333336, ans=0.0 2024-06-19 21:20:03,704 INFO [train.py:1028] (0/2) Epoch 4, batch 4650, loss[loss=0.3672, simple_loss=0.3437, pruned_loss=0.1954, over 13059.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.3547, pruned_loss=0.1931, over 2587429.43 frames. ], batch size: 132, lr: 1.40e-02, grad_scale: 0.5 2024-06-19 21:20:04,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. 
limit=10.0 2024-06-19 21:20:10,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=64185.0, ans=0.07 2024-06-19 21:20:10,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=64185.0, ans=0.125 2024-06-19 21:20:15,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.13 vs. limit=15.0 2024-06-19 21:20:20,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=64203.333333333336, ans=0.125 2024-06-19 21:20:20,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=64203.333333333336, ans=0.125 2024-06-19 21:20:25,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=64221.666666666664, ans=0.0 2024-06-19 21:20:25,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=64221.666666666664, ans=0.0 2024-06-19 21:20:36,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.66 vs. limit=15.0 2024-06-19 21:20:37,071 INFO [train.py:1028] (0/2) Epoch 4, batch 4700, loss[loss=0.3596, simple_loss=0.358, pruned_loss=0.1806, over 12766.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.3551, pruned_loss=0.1932, over 2583201.32 frames. ], batch size: 26, lr: 1.40e-02, grad_scale: 1.0 2024-06-19 21:20:37,693 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+03 2.629e+03 3.060e+03 3.626e+03 6.484e+03, threshold=6.121e+03, percent-clipped=0.0 2024-06-19 21:20:39,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=64258.333333333336, ans=0.07 2024-06-19 21:20:43,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=64276.666666666664, ans=0.2 2024-06-19 21:20:45,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0 2024-06-19 21:20:47,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=12.0 2024-06-19 21:20:48,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64276.666666666664, ans=0.1 2024-06-19 21:20:52,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.99 vs. 
limit=15.0 2024-06-19 21:20:53,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=64295.0, ans=0.125 2024-06-19 21:20:57,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=64313.333333333336, ans=10.0 2024-06-19 21:20:59,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=64313.333333333336, ans=0.125 2024-06-19 21:21:02,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0 2024-06-19 21:21:03,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64331.666666666664, ans=0.125 2024-06-19 21:21:05,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=64331.666666666664, ans=0.125 2024-06-19 21:21:08,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=64331.666666666664, ans=0.125 2024-06-19 21:21:09,690 INFO [train.py:1028] (0/2) Epoch 4, batch 4750, loss[loss=0.4515, simple_loss=0.4015, pruned_loss=0.2507, over 12618.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.3545, pruned_loss=0.1932, over 2580164.83 frames. ], batch size: 202, lr: 1.39e-02, grad_scale: 1.0 2024-06-19 21:21:10,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=64350.0, ans=0.2 2024-06-19 21:21:15,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.84 vs. limit=8.0 2024-06-19 21:21:24,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64386.666666666664, ans=0.1 2024-06-19 21:21:26,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.40 vs. limit=15.0 2024-06-19 21:21:30,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=64405.0, ans=0.0 2024-06-19 21:21:43,781 INFO [train.py:1028] (0/2) Epoch 4, batch 4800, loss[loss=0.3626, simple_loss=0.3501, pruned_loss=0.1875, over 13216.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.3538, pruned_loss=0.1923, over 2576379.56 frames. ], batch size: 63, lr: 1.39e-02, grad_scale: 2.0 2024-06-19 21:21:44,354 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+03 2.492e+03 2.950e+03 3.618e+03 4.767e+03, threshold=5.900e+03, percent-clipped=0.0 2024-06-19 21:21:47,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=64441.666666666664, ans=0.0 2024-06-19 21:21:52,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.62 vs. limit=15.0 2024-06-19 21:22:03,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.59 vs. 
limit=15.0 2024-06-19 21:22:09,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.55 vs. limit=15.0 2024-06-19 21:22:10,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=64496.666666666664, ans=15.0 2024-06-19 21:22:13,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=64496.666666666664, ans=0.025 2024-06-19 21:22:13,834 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2024-06-19 21:22:15,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=64515.0, ans=0.0 2024-06-19 21:22:23,120 INFO [train.py:1028] (0/2) Epoch 4, batch 4850, loss[loss=0.3267, simple_loss=0.319, pruned_loss=0.1672, over 13212.00 frames. ], tot_loss[loss=0.3688, simple_loss=0.3535, pruned_loss=0.1921, over 2575162.77 frames. ], batch size: 89, lr: 1.39e-02, grad_scale: 1.0 2024-06-19 21:22:24,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=64533.333333333336, ans=0.125 2024-06-19 21:22:31,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=64551.666666666664, ans=10.0 2024-06-19 21:22:35,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=64551.666666666664, ans=0.0 2024-06-19 21:22:40,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64570.0, ans=0.1 2024-06-19 21:22:48,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.23 vs. limit=15.0 2024-06-19 21:22:58,038 INFO [train.py:1028] (0/2) Epoch 4, batch 4900, loss[loss=0.3216, simple_loss=0.3189, pruned_loss=0.1622, over 13246.00 frames. ], tot_loss[loss=0.3683, simple_loss=0.3531, pruned_loss=0.1918, over 2577690.57 frames. ], batch size: 59, lr: 1.39e-02, grad_scale: 2.0 2024-06-19 21:22:59,423 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+03 2.923e+03 3.236e+03 3.747e+03 8.319e+03, threshold=6.472e+03, percent-clipped=2.0 2024-06-19 21:23:06,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=64643.333333333336, ans=0.125 2024-06-19 21:23:12,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.00 vs. limit=15.0 2024-06-19 21:23:14,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.91 vs. 
limit=5.0 2024-06-19 21:23:16,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=64661.666666666664, ans=0.125 2024-06-19 21:23:20,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=64680.0, ans=0.125 2024-06-19 21:23:25,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64698.333333333336, ans=0.125 2024-06-19 21:23:31,155 INFO [train.py:1028] (0/2) Epoch 4, batch 4950, loss[loss=0.4199, simple_loss=0.3755, pruned_loss=0.2321, over 11001.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.354, pruned_loss=0.1932, over 2571345.85 frames. ], batch size: 304, lr: 1.39e-02, grad_scale: 0.25 2024-06-19 21:23:31,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=64716.666666666664, ans=0.0 2024-06-19 21:23:32,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.92 vs. limit=15.0 2024-06-19 21:23:33,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=64716.666666666664, ans=0.1 2024-06-19 21:23:37,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=64735.0, ans=0.2 2024-06-19 21:23:39,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=64735.0, ans=0.0 2024-06-19 21:23:51,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64771.666666666664, ans=0.1 2024-06-19 21:23:57,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=15.0 2024-06-19 21:24:10,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=64808.333333333336, ans=0.125 2024-06-19 21:24:10,893 INFO [train.py:1028] (0/2) Epoch 4, batch 5000, loss[loss=0.3565, simple_loss=0.3354, pruned_loss=0.1888, over 13188.00 frames. ], tot_loss[loss=0.37, simple_loss=0.354, pruned_loss=0.193, over 2575136.53 frames. ], batch size: 95, lr: 1.39e-02, grad_scale: 0.5 2024-06-19 21:24:14,180 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+03 3.493e+03 4.281e+03 5.212e+03 1.088e+04, threshold=8.563e+03, percent-clipped=11.0 2024-06-19 21:24:23,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.10 vs. limit=6.0 2024-06-19 21:24:33,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=15.0 2024-06-19 21:24:39,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64881.666666666664, ans=0.125 2024-06-19 21:24:45,345 INFO [train.py:1028] (0/2) Epoch 4, batch 5050, loss[loss=0.3198, simple_loss=0.3239, pruned_loss=0.1579, over 12784.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.3533, pruned_loss=0.1919, over 2573231.28 frames. 
], batch size: 36, lr: 1.39e-02, grad_scale: 0.5 2024-06-19 21:24:46,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64900.0, ans=0.1 2024-06-19 21:24:48,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=64900.0, ans=0.0 2024-06-19 21:24:55,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=64918.333333333336, ans=0.05 2024-06-19 21:25:00,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=64936.666666666664, ans=0.125 2024-06-19 21:25:18,776 INFO [train.py:1028] (0/2) Epoch 4, batch 5100, loss[loss=0.4163, simple_loss=0.3928, pruned_loss=0.2199, over 12914.00 frames. ], tot_loss[loss=0.3687, simple_loss=0.3531, pruned_loss=0.1921, over 2569084.19 frames. ], batch size: 39, lr: 1.39e-02, grad_scale: 1.0 2024-06-19 21:25:20,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64991.666666666664, ans=0.1 2024-06-19 21:25:22,143 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+03 2.934e+03 3.565e+03 4.127e+03 1.136e+04, threshold=7.129e+03, percent-clipped=2.0 2024-06-19 21:25:29,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0 2024-06-19 21:25:40,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=65046.666666666664, ans=0.0 2024-06-19 21:25:44,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.53 vs. limit=15.0 2024-06-19 21:25:59,019 INFO [train.py:1028] (0/2) Epoch 4, batch 5150, loss[loss=0.3538, simple_loss=0.3295, pruned_loss=0.189, over 13108.00 frames. ], tot_loss[loss=0.3697, simple_loss=0.3533, pruned_loss=0.193, over 2571455.51 frames. ], batch size: 132, lr: 1.39e-02, grad_scale: 0.25 2024-06-19 21:25:59,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=65083.333333333336, ans=0.0 2024-06-19 21:25:59,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=65083.333333333336, ans=0.125 2024-06-19 21:26:01,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.59 vs. limit=12.0 2024-06-19 21:26:03,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65083.333333333336, ans=0.1 2024-06-19 21:26:06,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=65101.666666666664, ans=0.0 2024-06-19 21:26:08,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.48 vs. 
limit=10.0 2024-06-19 21:26:12,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=65120.0, ans=0.125 2024-06-19 21:26:20,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.00 vs. limit=22.5 2024-06-19 21:26:32,442 INFO [train.py:1028] (0/2) Epoch 4, batch 5200, loss[loss=0.3415, simple_loss=0.3315, pruned_loss=0.1757, over 13125.00 frames. ], tot_loss[loss=0.371, simple_loss=0.354, pruned_loss=0.194, over 2574542.57 frames. ], batch size: 95, lr: 1.39e-02, grad_scale: 0.5 2024-06-19 21:26:37,109 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.071e+03 3.522e+03 4.311e+03 5.463e+03 1.297e+04, threshold=8.621e+03, percent-clipped=5.0 2024-06-19 21:26:38,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=65193.333333333336, ans=0.0 2024-06-19 21:26:42,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65193.333333333336, ans=0.1 2024-06-19 21:26:48,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=65211.666666666664, ans=0.125 2024-06-19 21:26:55,554 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:26:56,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=65230.0, ans=0.125 2024-06-19 21:26:59,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=65248.333333333336, ans=0.125 2024-06-19 21:27:05,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=65248.333333333336, ans=0.125 2024-06-19 21:27:06,636 INFO [train.py:1028] (0/2) Epoch 4, batch 5250, loss[loss=0.38, simple_loss=0.3703, pruned_loss=0.1949, over 13233.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.3535, pruned_loss=0.1931, over 2570076.59 frames. ], batch size: 52, lr: 1.38e-02, grad_scale: 0.5 2024-06-19 21:27:08,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=65266.666666666664, ans=0.05 2024-06-19 21:27:11,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65266.666666666664, ans=0.1 2024-06-19 21:27:25,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.23 vs. limit=15.0 2024-06-19 21:27:29,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2024-06-19 21:27:29,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.11 vs. limit=15.0 2024-06-19 21:27:40,010 INFO [train.py:1028] (0/2) Epoch 4, batch 5300, loss[loss=0.3658, simple_loss=0.3449, pruned_loss=0.1933, over 13050.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.3532, pruned_loss=0.1926, over 2567152.36 frames. 
], batch size: 144, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:27:44,474 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.307e+03 1.849e+03 2.406e+03 2.853e+03 6.653e+03, threshold=4.811e+03, percent-clipped=0.0 2024-06-19 21:27:50,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=65376.666666666664, ans=0.2 2024-06-19 21:27:51,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=65376.666666666664, ans=0.0 2024-06-19 21:27:54,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=65376.666666666664, ans=0.0 2024-06-19 21:28:03,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=65395.0, ans=0.2 2024-06-19 21:28:04,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65395.0, ans=0.125 2024-06-19 21:28:12,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=65431.666666666664, ans=0.2 2024-06-19 21:28:14,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.02 vs. limit=15.0 2024-06-19 21:28:19,599 INFO [train.py:1028] (0/2) Epoch 4, batch 5350, loss[loss=0.4004, simple_loss=0.3898, pruned_loss=0.2055, over 10877.00 frames. ], tot_loss[loss=0.3672, simple_loss=0.3519, pruned_loss=0.1912, over 2572532.41 frames. ], batch size: 16, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:28:21,240 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=3.666e+00 2024-06-19 21:28:21,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.34 vs. limit=10.0 2024-06-19 21:28:30,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=65468.333333333336, ans=0.2 2024-06-19 21:28:52,269 INFO [train.py:1028] (0/2) Epoch 4, batch 5400, loss[loss=0.3906, simple_loss=0.3612, pruned_loss=0.21, over 12258.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.3518, pruned_loss=0.1915, over 2564749.16 frames. ], batch size: 240, lr: 1.38e-02, grad_scale: 2.0 2024-06-19 21:28:53,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=65541.66666666667, ans=0.0 2024-06-19 21:28:56,914 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.388e+02 1.652e+03 2.033e+03 2.523e+03 6.863e+03, threshold=4.065e+03, percent-clipped=2.0 2024-06-19 21:29:02,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=65560.0, ans=0.125 2024-06-19 21:29:06,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=65578.33333333333, ans=0.0 2024-06-19 21:29:15,387 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-19 21:29:18,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.61 vs. 
limit=15.0 2024-06-19 21:29:26,090 INFO [train.py:1028] (0/2) Epoch 4, batch 5450, loss[loss=0.322, simple_loss=0.3238, pruned_loss=0.1602, over 12475.00 frames. ], tot_loss[loss=0.365, simple_loss=0.3506, pruned_loss=0.1897, over 2568468.59 frames. ], batch size: 25, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:29:27,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=65633.33333333333, ans=15.0 2024-06-19 21:29:37,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65651.66666666667, ans=0.1 2024-06-19 21:29:44,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=65670.0, ans=0.0 2024-06-19 21:29:45,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=65670.0, ans=0.125 2024-06-19 21:29:47,788 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.90 vs. limit=15.0 2024-06-19 21:29:57,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=65688.33333333333, ans=0.95 2024-06-19 21:29:57,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=65688.33333333333, ans=0.125 2024-06-19 21:29:59,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=65706.66666666667, ans=0.04949747468305833 2024-06-19 21:30:06,863 INFO [train.py:1028] (0/2) Epoch 4, batch 5500, loss[loss=0.4294, simple_loss=0.3828, pruned_loss=0.238, over 12175.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.35, pruned_loss=0.1889, over 2561751.08 frames. ], batch size: 241, lr: 1.38e-02, grad_scale: 2.0 2024-06-19 21:30:07,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=65725.0, ans=0.125 2024-06-19 21:30:12,246 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.189e+03 1.515e+03 1.855e+03 2.473e+03 4.819e+03, threshold=3.710e+03, percent-clipped=1.0 2024-06-19 21:30:14,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=65743.33333333333, ans=0.2 2024-06-19 21:30:18,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.65 vs. limit=15.0 2024-06-19 21:30:22,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.04 vs. limit=15.0 2024-06-19 21:30:22,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=65761.66666666667, ans=0.0 2024-06-19 21:30:28,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.78 vs. 
limit=10.0 2024-06-19 21:30:29,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=65780.0, ans=0.025 2024-06-19 21:30:37,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=65798.33333333333, ans=0.125 2024-06-19 21:30:39,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=65798.33333333333, ans=10.0 2024-06-19 21:30:41,467 INFO [train.py:1028] (0/2) Epoch 4, batch 5550, loss[loss=0.3423, simple_loss=0.3437, pruned_loss=0.1704, over 13238.00 frames. ], tot_loss[loss=0.3616, simple_loss=0.3485, pruned_loss=0.1873, over 2565660.73 frames. ], batch size: 43, lr: 1.38e-02, grad_scale: 2.0 2024-06-19 21:30:53,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=65835.0, ans=0.2 2024-06-19 21:30:56,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65853.33333333333, ans=0.1 2024-06-19 21:31:08,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65890.0, ans=0.1 2024-06-19 21:31:13,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0 2024-06-19 21:31:15,279 INFO [train.py:1028] (0/2) Epoch 4, batch 5600, loss[loss=0.3255, simple_loss=0.331, pruned_loss=0.16, over 13250.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.3474, pruned_loss=0.1864, over 2568165.62 frames. ], batch size: 89, lr: 1.38e-02, grad_scale: 4.0 2024-06-19 21:31:22,365 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.243e+03 1.951e+03 2.330e+03 2.751e+03 7.079e+03, threshold=4.660e+03, percent-clipped=6.0 2024-06-19 21:31:40,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=65963.33333333333, ans=0.035 2024-06-19 21:31:52,459 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-36000.pt 2024-06-19 21:31:58,607 INFO [train.py:1028] (0/2) Epoch 4, batch 5650, loss[loss=0.4261, simple_loss=0.3854, pruned_loss=0.2334, over 12557.00 frames. ], tot_loss[loss=0.3614, simple_loss=0.3485, pruned_loss=0.1871, over 2574407.79 frames. ], batch size: 202, lr: 1.38e-02, grad_scale: 0.5 2024-06-19 21:32:00,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=66000.0, ans=0.0 2024-06-19 21:32:28,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=66055.0, ans=0.125 2024-06-19 21:32:36,054 INFO [train.py:1028] (0/2) Epoch 4, batch 5700, loss[loss=0.3497, simple_loss=0.3426, pruned_loss=0.1784, over 13276.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.3478, pruned_loss=0.1865, over 2578472.92 frames. ], batch size: 63, lr: 1.38e-02, grad_scale: 1.0 2024-06-19 21:32:41,484 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.63 vs. 
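
Note on the checkpoint.py line above: it saves zipformer/exp/checkpoint-36000.pt, a batch-indexed checkpoint rather than an epoch-indexed one, which suggests periodic saving every fixed number of training batches. A simplified sketch follows; the interval default and the exact contents of the saved dict are assumptions.

import torch

def maybe_save_checkpoint(model, optimizer, batch_idx, exp_dir,
                          save_every_n=4000):
    """Save a batch-indexed checkpoint like 'checkpoint-36000.pt' every
    save_every_n training batches (interval assumed for illustration)."""
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx,
        },
        f"{exp_dir}/checkpoint-{batch_idx}.pt",
    )
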
limit=15.0 2024-06-19 21:32:43,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=66110.0, ans=0.0 2024-06-19 21:32:43,640 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+03 2.699e+03 3.054e+03 3.650e+03 9.422e+03, threshold=6.107e+03, percent-clipped=5.0 2024-06-19 21:32:51,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=66128.33333333333, ans=0.0 2024-06-19 21:32:55,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66146.66666666667, ans=0.1 2024-06-19 21:32:58,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66146.66666666667, ans=0.1 2024-06-19 21:33:09,116 INFO [train.py:1028] (0/2) Epoch 4, batch 5750, loss[loss=0.3973, simple_loss=0.3636, pruned_loss=0.2155, over 12753.00 frames. ], tot_loss[loss=0.3635, simple_loss=0.3502, pruned_loss=0.1884, over 2580025.73 frames. ], batch size: 176, lr: 1.38e-02, grad_scale: 0.5 2024-06-19 21:33:10,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.41 vs. limit=22.5 2024-06-19 21:33:15,799 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=12.0 2024-06-19 21:33:19,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=66201.66666666667, ans=0.125 2024-06-19 21:33:26,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0 2024-06-19 21:33:32,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=11.10 vs. limit=12.0 2024-06-19 21:33:34,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.86 vs. limit=22.5 2024-06-19 21:33:35,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=66256.66666666667, ans=0.0 2024-06-19 21:33:36,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=66256.66666666667, ans=0.0 2024-06-19 21:33:39,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.05 vs. limit=22.5 2024-06-19 21:33:40,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=15.0 2024-06-19 21:33:42,670 INFO [train.py:1028] (0/2) Epoch 4, batch 5800, loss[loss=0.4347, simple_loss=0.4008, pruned_loss=0.2343, over 12811.00 frames. ], tot_loss[loss=0.3669, simple_loss=0.3526, pruned_loss=0.1906, over 2578197.08 frames. 
2024-06-19 21:33:54,399 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+03 2.716e+03 3.254e+03 3.972e+03 6.133e+03, threshold=6.508e+03, percent-clipped=1.0
2024-06-19 21:33:56,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=66293.33333333333, ans=0.0
2024-06-19 21:34:06,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=66311.66666666667, ans=0.125
2024-06-19 21:34:14,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=66330.0, ans=0.5
2024-06-19 21:34:15,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=66348.33333333333, ans=0.125
2024-06-19 21:34:18,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66348.33333333333, ans=0.1
2024-06-19 21:34:21,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.10 vs. limit=10.0
2024-06-19 21:34:22,981 INFO [train.py:1028] (0/2) Epoch 4, batch 5850, loss[loss=0.4858, simple_loss=0.4338, pruned_loss=0.2689, over 12590.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.3572, pruned_loss=0.1939, over 2576295.68 frames. ], batch size: 202, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:34:29,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=66385.0, ans=0.125
2024-06-19 21:34:49,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0
2024-06-19 21:34:55,847 INFO [train.py:1028] (0/2) Epoch 4, batch 5900, loss[loss=0.3512, simple_loss=0.3341, pruned_loss=0.1842, over 13135.00 frames. ], tot_loss[loss=0.3767, simple_loss=0.3604, pruned_loss=0.1964, over 2577541.01 frames. ], batch size: 121, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:34:55,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66458.33333333333, ans=0.125
2024-06-19 21:34:57,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=66458.33333333333, ans=0.125
2024-06-19 21:35:04,474 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+03 2.944e+03 3.366e+03 4.074e+03 8.277e+03, threshold=6.731e+03, percent-clipped=1.0
2024-06-19 21:35:12,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5
2024-06-19 21:35:13,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=66495.0, ans=0.0
2024-06-19 21:35:26,232 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-19 21:35:28,687 INFO [train.py:1028] (0/2) Epoch 4, batch 5950, loss[loss=0.3641, simple_loss=0.3454, pruned_loss=0.1914, over 13086.00 frames. ], tot_loss[loss=0.3784, simple_loss=0.3622, pruned_loss=0.1973, over 2581813.54 frames. ], batch size: 121, lr: 1.37e-02, grad_scale: 1.0
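The WARNING lines from optim.py:487 that recur through this section summarize the recent distribution of gradient norms (min/25%/50%/75%/max), the clipping threshold derived from it, and the percentage of steps clipped. The snippet below is only a schematic reconstruction of that bookkeeping, not the ScaledAdam implementation; the window size and the median-based threshold are assumptions.

    import torch

    # Schematic sketch of quartile-style gradient-norm tracking and clipping;
    # the real logic lives in icefall's optim.py and differs in detail.
    def clip_and_report(params, history, clipping_scale=2.0, window=128):
        grads = [p.grad for p in params if p.grad is not None]
        total = torch.norm(torch.stack([g.norm() for g in grads]))
        history.append(total.item())
        if len(history) >= window:
            t = torch.tensor(history)
            q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = (clipping_scale * q[2]).item()  # e.g. 2x the median norm
            clipped = (t > threshold).float().mean().item() * 100.0
            print(f"grad-norm quartiles {q.tolist()}, "
                  f"threshold={threshold:.3e}, percent-clipped={clipped:.1f}")
            history.clear()
        return total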
2024-06-19 21:35:29,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66550.0, ans=0.125
2024-06-19 21:35:33,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66550.0, ans=0.1
2024-06-19 21:35:33,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=66550.0, ans=0.025
2024-06-19 21:35:37,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=66568.33333333333, ans=0.0
2024-06-19 21:35:40,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=66568.33333333333, ans=0.1
2024-06-19 21:35:47,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0
2024-06-19 21:35:54,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=66605.0, ans=0.2
2024-06-19 21:35:57,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66605.0, ans=0.1
2024-06-19 21:35:59,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66623.33333333333, ans=0.1
2024-06-19 21:36:05,032 INFO [train.py:1028] (0/2) Epoch 4, batch 6000, loss[loss=0.5314, simple_loss=0.4589, pruned_loss=0.302, over 12170.00 frames. ], tot_loss[loss=0.3798, simple_loss=0.3636, pruned_loss=0.198, over 2575919.91 frames. ], batch size: 241, lr: 1.37e-02, grad_scale: 2.0
2024-06-19 21:36:05,033 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-19 21:36:14,129 INFO [train.py:1060] (0/2) Epoch 4, validation: loss=0.278, simple_loss=0.3202, pruned_loss=0.1178, over 351949.00 frames.
2024-06-19 21:36:14,130 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB
2024-06-19 21:36:17,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=66641.66666666667, ans=0.125
2024-06-19 21:36:20,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=66660.0, ans=0.0
2024-06-19 21:36:23,401 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.266e+03 2.500e+03 3.044e+03 3.496e+03 9.206e+03, threshold=6.087e+03, percent-clipped=1.0
2024-06-19 21:36:31,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=66678.33333333333, ans=0.125
2024-06-19 21:36:37,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=66696.66666666667, ans=0.125
2024-06-19 21:36:39,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=66696.66666666667, ans=0.025
2024-06-19 21:36:43,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66715.0, ans=0.1
2024-06-19 21:36:47,914 INFO [train.py:1028] (0/2) Epoch 4, batch 6050, loss[loss=0.3503, simple_loss=0.3511, pruned_loss=0.1748, over 12948.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.3651, pruned_loss=0.1986, over 2578527.59 frames. ], batch size: 39, lr: 1.37e-02, grad_scale: 0.5
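Batch 6000 triggers the periodic validation pass recorded above: it runs over the full dev set (the same 351949 frames each time, since the dev set is fixed), and the per-frame validation loss (0.278) sits well below the training running average (0.3798) at this point. A generic sketch of such a pass follows; compute_loss is an assumed interface, not a function from train.py.

    import torch

    def validate(model, valid_loader, compute_loss):
        # Generic validation sketch. compute_loss(model, batch) is assumed
        # to return (summed_loss, num_frames) for one batch; no gradients
        # are needed, and training mode is restored afterwards.
        model.eval()
        loss_sum, frame_sum = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, frames = compute_loss(model, batch)
                loss_sum += float(loss)
                frame_sum += frames
        model.train()
        return loss_sum / max(frame_sum, 1.0)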
2024-06-19 21:37:02,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=66770.0, ans=0.0
2024-06-19 21:37:08,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=66788.33333333333, ans=0.125
2024-06-19 21:37:21,164 INFO [train.py:1028] (0/2) Epoch 4, batch 6100, loss[loss=0.3799, simple_loss=0.3623, pruned_loss=0.1988, over 13117.00 frames. ], tot_loss[loss=0.3833, simple_loss=0.3671, pruned_loss=0.1997, over 2580197.55 frames. ], batch size: 121, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:37:29,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=66843.33333333333, ans=0.0
2024-06-19 21:37:31,457 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+03 2.403e+03 3.010e+03 3.642e+03 7.150e+03, threshold=6.021e+03, percent-clipped=3.0
2024-06-19 21:37:34,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=66861.66666666667, ans=0.0
2024-06-19 21:37:37,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=66861.66666666667, ans=0.125
2024-06-19 21:37:37,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0
2024-06-19 21:37:40,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.65 vs. limit=15.0
2024-06-19 21:37:44,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=66880.0, ans=0.0
2024-06-19 21:37:49,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=66898.33333333333, ans=0.0
2024-06-19 21:37:51,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=66898.33333333333, ans=0.2
2024-06-19 21:37:58,317 INFO [train.py:1028] (0/2) Epoch 4, batch 6150, loss[loss=0.4246, simple_loss=0.3828, pruned_loss=0.2332, over 10710.00 frames. ], tot_loss[loss=0.3846, simple_loss=0.3685, pruned_loss=0.2004, over 2578408.00 frames. ], batch size: 303, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:38:00,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.77 vs. limit=15.0
2024-06-19 21:38:09,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=66935.0, ans=0.025
2024-06-19 21:38:14,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66953.33333333333, ans=0.125
2024-06-19 21:38:16,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=12.0
2024-06-19 21:38:19,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=15.0
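The ScheduledFloat lines that dominate the log are not losses: they record regularization hyperparameters (dropout_p, skip rates, balancer probabilities, whitening limits) whose current value ("ans") is annealed as a function of batch_count. A toy piecewise-linear version of the idea follows; the real class lives in icefall's scaling.py, and the breakpoints here are illustrative only.

    class ScheduledFloat:
        # Toy sketch: a float that interpolates linearly between
        # (batch_count, value) breakpoints, clamping outside the range.
        def __init__(self, *points):
            self.points = sorted(points)

        def value(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    frac = (batch_count - x0) / (x1 - x0)
                    return y0 + frac * (y1 - y0)

    # Illustrative: a dropout decaying from 0.3 to 0.1 over the first 20k
    # batches has long reached its floor at the batch_counts (~66-70k) here.
    p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1)).value(66770.0)  # -> 0.1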
2024-06-19 21:38:28,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=30.94 vs. limit=22.5
2024-06-19 21:38:31,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66990.0, ans=0.1
2024-06-19 21:38:32,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=66990.0, ans=0.125
2024-06-19 21:38:35,105 INFO [train.py:1028] (0/2) Epoch 4, batch 6200, loss[loss=0.4204, simple_loss=0.4033, pruned_loss=0.2187, over 13254.00 frames. ], tot_loss[loss=0.3871, simple_loss=0.3707, pruned_loss=0.2017, over 2575267.53 frames. ], batch size: 89, lr: 1.37e-02, grad_scale: 2.0
2024-06-19 21:38:36,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67008.33333333333, ans=0.1
2024-06-19 21:38:38,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=67008.33333333333, ans=0.125
2024-06-19 21:38:46,138 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.474e+03 2.565e+03 2.960e+03 3.317e+03 1.164e+04, threshold=5.920e+03, percent-clipped=3.0
2024-06-19 21:38:49,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=67045.0, ans=0.125
2024-06-19 21:38:50,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.76 vs. limit=6.0
2024-06-19 21:38:50,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=67045.0, ans=0.0
2024-06-19 21:38:57,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=67063.33333333333, ans=0.2
2024-06-19 21:39:06,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0
2024-06-19 21:39:09,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=67100.0, ans=0.025
2024-06-19 21:39:09,601 INFO [train.py:1028] (0/2) Epoch 4, batch 6250, loss[loss=0.4201, simple_loss=0.3973, pruned_loss=0.2215, over 13227.00 frames. ], tot_loss[loss=0.3912, simple_loss=0.3736, pruned_loss=0.2044, over 2569321.78 frames. ], batch size: 83, lr: 1.37e-02, grad_scale: 0.5
2024-06-19 21:39:15,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.27 vs. limit=10.0
2024-06-19 21:39:17,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0
2024-06-19 21:39:19,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67118.33333333333, ans=0.1
2024-06-19 21:39:30,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=67155.0, ans=0.0
2024-06-19 21:39:42,136 INFO [train.py:1028] (0/2) Epoch 4, batch 6300, loss[loss=0.3201, simple_loss=0.332, pruned_loss=0.1541, over 11888.00 frames. ], tot_loss[loss=0.3923, simple_loss=0.3749, pruned_loss=0.2049, over 2564244.64 frames. ], batch size: 17, lr: 1.37e-02, grad_scale: 1.0
2024-06-19 21:39:46,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=67191.66666666667, ans=0.0
2024-06-19 21:39:57,487 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.533e+02 1.884e+03 2.214e+03 2.766e+03 4.435e+03, threshold=4.429e+03, percent-clipped=0.0
2024-06-19 21:39:57,738 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-19 21:40:01,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=67228.33333333333, ans=0.0
2024-06-19 21:40:01,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=67228.33333333333, ans=0.125
2024-06-19 21:40:02,889 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.78 vs. limit=15.0
2024-06-19 21:40:04,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0
2024-06-19 21:40:18,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67265.0, ans=0.1
2024-06-19 21:40:22,129 INFO [train.py:1028] (0/2) Epoch 4, batch 6350, loss[loss=0.4618, simple_loss=0.4252, pruned_loss=0.2492, over 12571.00 frames. ], tot_loss[loss=0.3918, simple_loss=0.3758, pruned_loss=0.2039, over 2573245.97 frames. ], batch size: 202, lr: 1.36e-02, grad_scale: 0.5
2024-06-19 21:40:23,287 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.84 vs. limit=22.5
2024-06-19 21:40:35,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.40 vs. limit=15.0
2024-06-19 21:40:50,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67356.66666666667, ans=0.1
2024-06-19 21:40:55,324 INFO [train.py:1028] (0/2) Epoch 4, batch 6400, loss[loss=0.3664, simple_loss=0.3606, pruned_loss=0.1861, over 13223.00 frames. ], tot_loss[loss=0.3928, simple_loss=0.3772, pruned_loss=0.2042, over 2574669.09 frames. ], batch size: 67, lr: 1.36e-02, grad_scale: 1.0
2024-06-19 21:40:57,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.06 vs. limit=22.5
2024-06-19 21:41:07,361 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.545e+02 1.445e+03 1.729e+03 2.045e+03 4.800e+03, threshold=3.458e+03, percent-clipped=1.0
2024-06-19 21:41:09,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=67411.66666666667, ans=15.0
2024-06-19 21:41:14,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=67411.66666666667, ans=0.0
2024-06-19 21:41:19,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=67430.0, ans=0.125
2024-06-19 21:41:22,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.65 vs. limit=10.0
2024-06-19 21:41:28,502 INFO [train.py:1028] (0/2) Epoch 4, batch 6450, loss[loss=0.4856, simple_loss=0.4476, pruned_loss=0.2618, over 12454.00 frames. ], tot_loss[loss=0.3939, simple_loss=0.3786, pruned_loss=0.2046, over 2580580.72 frames. ], batch size: 202, lr: 1.36e-02, grad_scale: 1.0
2024-06-19 21:41:29,639 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.78 vs. limit=6.0
2024-06-19 21:41:29,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67466.66666666667, ans=0.125
2024-06-19 21:41:53,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=67540.0, ans=0.125
2024-06-19 21:42:00,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0
2024-06-19 21:42:03,772 INFO [train.py:1028] (0/2) Epoch 4, batch 6500, loss[loss=0.4282, simple_loss=0.3851, pruned_loss=0.2356, over 10583.00 frames. ], tot_loss[loss=0.3963, simple_loss=0.3808, pruned_loss=0.206, over 2583859.98 frames. ], batch size: 303, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:42:06,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=67558.33333333333, ans=0.125
2024-06-19 21:42:14,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=67576.66666666667, ans=0.125
2024-06-19 21:42:15,442 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.023e+03 1.514e+03 1.786e+03 2.259e+03 8.139e+03, threshold=3.572e+03, percent-clipped=2.0
2024-06-19 21:42:38,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0
2024-06-19 21:42:40,138 INFO [train.py:1028] (0/2) Epoch 4, batch 6550, loss[loss=0.3187, simple_loss=0.3344, pruned_loss=0.1515, over 12773.00 frames. ], tot_loss[loss=0.3962, simple_loss=0.3814, pruned_loss=0.2055, over 2587645.57 frames. ], batch size: 22, lr: 1.36e-02, grad_scale: 2.0
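The Whitening lines compare a per-module "metric" against a scheduled limit (6.0 for attention keys, up to 22.5 for self-attention outputs); when the metric exceeds the limit, scaling.py applies a corrective gradient penalty, which is why exceedances like 30.94 vs. 22.5 above are worth logging. As an assumption about what the number measures, a natural whiteness statistic is the ratio of the mean squared eigenvalue of the channel covariance to the squared mean eigenvalue: 1.0 for perfectly white features, larger as variance concentrates in a few directions. A toy computation under that assumption:

    import torch

    # Toy whiteness statistic under an assumed definition, not a copy of
    # scaling.py: E[lambda^2] / E[lambda]^2 over the eigenvalues of the
    # channel covariance. Equals 1.0 iff the covariance is a multiple of
    # the identity.
    def whitening_metric(x):
        x = x - x.mean(dim=0, keepdim=True)   # x: (num_frames, num_channels)
        cov = (x.T @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

    white = torch.randn(4000, 256)
    skewed = white * torch.linspace(0.1, 3.0, 256)
    print(whitening_metric(white), whitening_metric(skewed))  # ~1 vs. much larger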
2024-06-19 21:42:42,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=67650.0, ans=0.125
2024-06-19 21:42:57,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=67686.66666666667, ans=0.2
2024-06-19 21:43:04,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=67705.0, ans=0.2
2024-06-19 21:43:09,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0
2024-06-19 21:43:13,281 INFO [train.py:1028] (0/2) Epoch 4, batch 6600, loss[loss=0.4151, simple_loss=0.3987, pruned_loss=0.2157, over 13222.00 frames. ], tot_loss[loss=0.3968, simple_loss=0.3818, pruned_loss=0.2059, over 2590791.84 frames. ], batch size: 72, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:43:20,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=67760.0, ans=0.2
2024-06-19 21:43:21,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=67760.0, ans=0.0
2024-06-19 21:43:26,535 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.249e+03 1.975e+03 2.513e+03 3.181e+03 7.798e+03, threshold=5.026e+03, percent-clipped=18.0
2024-06-19 21:43:32,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67778.33333333333, ans=0.1
2024-06-19 21:43:35,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=67796.66666666667, ans=0.125
2024-06-19 21:43:46,578 INFO [train.py:1028] (0/2) Epoch 4, batch 6650, loss[loss=0.4182, simple_loss=0.3957, pruned_loss=0.2203, over 12937.00 frames. ], tot_loss[loss=0.4, simple_loss=0.3847, pruned_loss=0.2077, over 2585355.15 frames. ], batch size: 158, lr: 1.36e-02, grad_scale: 1.0
2024-06-19 21:44:07,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.82 vs. limit=22.5
2024-06-19 21:44:08,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.22 vs. limit=22.5
2024-06-19 21:44:12,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=67888.33333333333, ans=0.0
2024-06-19 21:44:26,853 INFO [train.py:1028] (0/2) Epoch 4, batch 6700, loss[loss=0.4302, simple_loss=0.4033, pruned_loss=0.2285, over 12820.00 frames. ], tot_loss[loss=0.4019, simple_loss=0.3864, pruned_loss=0.2086, over 2585020.74 frames. ], batch size: 177, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:44:40,256 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.099e+03 1.874e+03 2.204e+03 2.515e+03 4.654e+03, threshold=4.407e+03, percent-clipped=0.0
2024-06-19 21:44:43,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=67961.66666666667, ans=0.125
2024-06-19 21:44:49,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=67980.0, ans=0.125
2024-06-19 21:45:00,107 INFO [train.py:1028] (0/2) Epoch 4, batch 6750, loss[loss=0.5262, simple_loss=0.4684, pruned_loss=0.292, over 12236.00 frames. ], tot_loss[loss=0.4035, simple_loss=0.3877, pruned_loss=0.2097, over 2578951.25 frames. ], batch size: 240, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:45:01,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=68016.66666666667, ans=0.125
2024-06-19 21:45:09,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.07 vs. limit=22.5
2024-06-19 21:45:16,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=68053.33333333333, ans=0.0
2024-06-19 21:45:18,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68071.66666666667, ans=0.1
2024-06-19 21:45:30,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=68090.0, ans=0.125
2024-06-19 21:45:31,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.21 vs. limit=10.0
2024-06-19 21:45:32,716 INFO [train.py:1028] (0/2) Epoch 4, batch 6800, loss[loss=0.4081, simple_loss=0.3931, pruned_loss=0.2115, over 13184.00 frames. ], tot_loss[loss=0.4047, simple_loss=0.389, pruned_loss=0.2102, over 2581853.77 frames. ], batch size: 67, lr: 1.36e-02, grad_scale: 4.0
2024-06-19 21:45:34,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68108.33333333333, ans=0.1
2024-06-19 21:45:41,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.04 vs. limit=15.0
2024-06-19 21:45:46,561 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.154e+03 1.846e+03 2.129e+03 2.592e+03 3.733e+03, threshold=4.258e+03, percent-clipped=0.0
2024-06-19 21:45:46,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=68145.0, ans=0.125
2024-06-19 21:45:52,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=68163.33333333333, ans=0.125
2024-06-19 21:45:53,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=68163.33333333333, ans=0.125
2024-06-19 21:46:09,014 INFO [train.py:1028] (0/2) Epoch 4, batch 6850, loss[loss=0.4032, simple_loss=0.4016, pruned_loss=0.2025, over 13266.00 frames. ], tot_loss[loss=0.4047, simple_loss=0.3898, pruned_loss=0.2098, over 2585894.08 frames. ], batch size: 63, lr: 1.36e-02, grad_scale: 2.0
2024-06-19 21:46:15,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=68218.33333333333, ans=0.1
2024-06-19 21:46:20,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=68218.33333333333, ans=0.2
2024-06-19 21:46:37,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=68273.33333333333, ans=0.125
2024-06-19 21:46:40,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.27 vs. limit=15.0
2024-06-19 21:46:44,932 INFO [train.py:1028] (0/2) Epoch 4, batch 6900, loss[loss=0.4198, simple_loss=0.3971, pruned_loss=0.2212, over 13314.00 frames. ], tot_loss[loss=0.4073, simple_loss=0.3919, pruned_loss=0.2113, over 2588064.41 frames. ], batch size: 49, lr: 1.35e-02, grad_scale: 4.0
2024-06-19 21:46:45,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=68291.66666666667, ans=0.125
2024-06-19 21:46:58,867 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.105e+03 1.822e+03 2.239e+03 2.712e+03 4.035e+03, threshold=4.478e+03, percent-clipped=0.0
2024-06-19 21:47:08,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.80 vs. limit=22.5
2024-06-19 21:47:08,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.98 vs. limit=15.0
2024-06-19 21:47:10,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.16 vs. limit=10.0
2024-06-19 21:47:17,771 INFO [train.py:1028] (0/2) Epoch 4, batch 6950, loss[loss=0.3705, simple_loss=0.3552, pruned_loss=0.1929, over 11379.00 frames. ], tot_loss[loss=0.4059, simple_loss=0.3911, pruned_loss=0.2103, over 2581628.77 frames. ], batch size: 16, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:47:20,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=68383.33333333333, ans=0.125
2024-06-19 21:47:27,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0
2024-06-19 21:47:28,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.19 vs. limit=15.0
2024-06-19 21:47:35,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=68420.0, ans=0.0
2024-06-19 21:47:37,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.86 vs. limit=15.0
2024-06-19 21:47:45,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=68456.66666666667, ans=0.0
2024-06-19 21:47:46,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=68456.66666666667, ans=0.125
2024-06-19 21:47:51,437 INFO [train.py:1028] (0/2) Epoch 4, batch 7000, loss[loss=0.4277, simple_loss=0.4009, pruned_loss=0.2273, over 12958.00 frames. ], tot_loss[loss=0.4057, simple_loss=0.3914, pruned_loss=0.21, over 2578675.33 frames. ], batch size: 158, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:47:51,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=68475.0, ans=0.0
2024-06-19 21:48:00,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=68493.33333333333, ans=0.2
2024-06-19 21:48:05,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68511.66666666667, ans=0.125
2024-06-19 21:48:07,797 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.135e+03 1.680e+03 2.027e+03 2.444e+03 5.504e+03, threshold=4.055e+03, percent-clipped=2.0
2024-06-19 21:48:29,114 INFO [train.py:1028] (0/2) Epoch 4, batch 7050, loss[loss=0.4569, simple_loss=0.4231, pruned_loss=0.2454, over 12836.00 frames. ], tot_loss[loss=0.4065, simple_loss=0.3925, pruned_loss=0.2102, over 2585156.90 frames. ], batch size: 176, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:48:37,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=68566.66666666667, ans=0.125
2024-06-19 21:48:46,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=68603.33333333333, ans=0.0
2024-06-19 21:49:05,189 INFO [train.py:1028] (0/2) Epoch 4, batch 7100, loss[loss=0.4163, simple_loss=0.3998, pruned_loss=0.2164, over 13148.00 frames. ], tot_loss[loss=0.4076, simple_loss=0.3933, pruned_loss=0.211, over 2576469.06 frames. ], batch size: 112, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:49:21,467 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.039e+03 1.808e+03 2.162e+03 2.614e+03 7.204e+03, threshold=4.324e+03, percent-clipped=4.0
2024-06-19 21:49:24,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=12.0
2024-06-19 21:49:25,245 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.10 vs. limit=10.0
2024-06-19 21:49:25,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=68713.33333333333, ans=0.125
2024-06-19 21:49:38,143 INFO [train.py:1028] (0/2) Epoch 4, batch 7150, loss[loss=0.4869, simple_loss=0.4434, pruned_loss=0.2652, over 12547.00 frames. ], tot_loss[loss=0.4099, simple_loss=0.3953, pruned_loss=0.2123, over 2573977.52 frames. ], batch size: 202, lr: 1.35e-02, grad_scale: 0.5
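The learning rate printed with each batch decays slowly (1.38e-02 at batch 5550 down to 1.33e-02 by batch 8100), consistent with the shape of icefall's Eden scheduler. A sketch of Eden's core decay follows; the warmup factor and the ref-duration correction that the real scheduler also applies are omitted here, so these values will not reproduce the log exactly.

    def eden_lr(base_lr, batch, epoch,
                lr_batches=7500.0, lr_epochs=3.5):
        # Sketch of the Eden schedule's core decay: inverse-fourth-root
        # factors in both the batch index and the (fractional) epoch.
        # Warmup and ref_duration scaling are deliberately left out.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor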
2024-06-19 21:49:39,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=68750.0, ans=0.0
2024-06-19 21:49:43,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=68750.0, ans=0.125
2024-06-19 21:49:43,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.05 vs. limit=10.0
2024-06-19 21:49:50,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=68768.33333333333, ans=0.125
2024-06-19 21:49:50,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=68786.66666666667, ans=0.2
2024-06-19 21:49:59,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.58 vs. limit=22.5
2024-06-19 21:50:09,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.88 vs. limit=22.5
2024-06-19 21:50:10,513 INFO [train.py:1028] (0/2) Epoch 4, batch 7200, loss[loss=0.4602, simple_loss=0.4378, pruned_loss=0.2413, over 13157.00 frames. ], tot_loss[loss=0.4097, simple_loss=0.3955, pruned_loss=0.2119, over 2579121.07 frames. ], batch size: 112, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:50:10,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=68841.66666666667, ans=0.0
2024-06-19 21:50:30,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.046e+03 2.089e+03 2.454e+03 3.037e+03 5.608e+03, threshold=4.907e+03, percent-clipped=3.0
2024-06-19 21:50:40,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=68915.0, ans=0.125
2024-06-19 21:50:50,103 INFO [train.py:1028] (0/2) Epoch 4, batch 7250, loss[loss=0.369, simple_loss=0.3688, pruned_loss=0.1846, over 12961.00 frames. ], tot_loss[loss=0.4097, simple_loss=0.3961, pruned_loss=0.2117, over 2579985.31 frames. ], batch size: 36, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:50:56,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=68951.66666666667, ans=15.0
2024-06-19 21:50:58,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=68951.66666666667, ans=0.0
2024-06-19 21:51:09,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=68988.33333333333, ans=0.125
2024-06-19 21:51:16,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=69006.66666666667, ans=0.125
2024-06-19 21:51:19,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=69006.66666666667, ans=0.0
2024-06-19 21:51:22,271 INFO [train.py:1028] (0/2) Epoch 4, batch 7300, loss[loss=0.4114, simple_loss=0.3973, pruned_loss=0.2128, over 12926.00 frames. ], tot_loss[loss=0.4108, simple_loss=0.397, pruned_loss=0.2123, over 2580222.16 frames. ], batch size: 36, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:51:24,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=69025.0, ans=0.125
2024-06-19 21:51:29,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69043.33333333333, ans=0.1
2024-06-19 21:51:30,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.58 vs. limit=10.0
2024-06-19 21:51:40,160 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.500e+03 2.294e+03 2.660e+03 3.164e+03 5.742e+03, threshold=5.319e+03, percent-clipped=2.0
2024-06-19 21:51:47,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=69080.0, ans=0.125
2024-06-19 21:51:48,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69098.33333333333, ans=0.1
2024-06-19 21:51:50,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=69098.33333333333, ans=0.125
2024-06-19 21:51:52,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.09 vs. limit=22.5
2024-06-19 21:51:54,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69098.33333333333, ans=0.1
2024-06-19 21:51:55,325 INFO [train.py:1028] (0/2) Epoch 4, batch 7350, loss[loss=0.4394, simple_loss=0.4223, pruned_loss=0.2283, over 13331.00 frames. ], tot_loss[loss=0.411, simple_loss=0.3978, pruned_loss=0.2121, over 2580710.73 frames. ], batch size: 46, lr: 1.35e-02, grad_scale: 1.0
2024-06-19 21:51:56,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=69116.66666666667, ans=0.125
2024-06-19 21:52:00,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=69116.66666666667, ans=0.125
2024-06-19 21:52:04,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=69135.0, ans=0.125
2024-06-19 21:52:16,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69171.66666666667, ans=0.1
2024-06-19 21:52:28,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0
2024-06-19 21:52:32,049 INFO [train.py:1028] (0/2) Epoch 4, batch 7400, loss[loss=0.4629, simple_loss=0.4479, pruned_loss=0.239, over 13238.00 frames. ], tot_loss[loss=0.4088, simple_loss=0.3965, pruned_loss=0.2106, over 2586964.13 frames. ], batch size: 63, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:52:32,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.33 vs. limit=15.0
2024-06-19 21:52:34,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=69208.33333333333, ans=0.1
2024-06-19 21:52:36,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=69208.33333333333, ans=0.125
2024-06-19 21:52:40,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69226.66666666667, ans=0.1
2024-06-19 21:52:41,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=69226.66666666667, ans=0.125
2024-06-19 21:52:53,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.663e+02 1.704e+03 2.009e+03 2.398e+03 3.760e+03, threshold=4.019e+03, percent-clipped=0.0
2024-06-19 21:53:06,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=69281.66666666667, ans=0.125
2024-06-19 21:53:09,098 INFO [train.py:1028] (0/2) Epoch 4, batch 7450, loss[loss=0.3535, simple_loss=0.3567, pruned_loss=0.1752, over 12738.00 frames. ], tot_loss[loss=0.4077, simple_loss=0.3961, pruned_loss=0.2097, over 2580617.33 frames. ], batch size: 29, lr: 1.35e-02, grad_scale: 2.0
2024-06-19 21:53:10,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=69300.0, ans=0.2
2024-06-19 21:53:11,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69300.0, ans=0.1
2024-06-19 21:53:12,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.10 vs. limit=15.0
2024-06-19 21:53:15,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.66 vs. limit=22.5
2024-06-19 21:53:20,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69318.33333333333, ans=0.125
2024-06-19 21:53:28,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=69355.0, ans=0.0
2024-06-19 21:53:28,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=69355.0, ans=0.2
2024-06-19 21:53:29,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=69355.0, ans=0.0
2024-06-19 21:53:29,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=69355.0, ans=0.125
2024-06-19 21:53:40,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=69373.33333333333, ans=0.0
2024-06-19 21:53:42,143 INFO [train.py:1028] (0/2) Epoch 4, batch 7500, loss[loss=0.4626, simple_loss=0.4225, pruned_loss=0.2513, over 10427.00 frames. ], tot_loss[loss=0.4098, simple_loss=0.3977, pruned_loss=0.2109, over 2578674.57 frames. ], batch size: 303, lr: 1.34e-02, grad_scale: 4.0
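grad_scale, the last field on each progress line, keeps bouncing between 0.5 and 4.0: with fp16 enabled, the loss is multiplied by a dynamic scale that is cut when overflow is detected and grown back after a run of clean steps. A generic sketch of such a policy follows; the run uses its own scaler, so the constants and class name here are illustrative.

    class DynamicGradScale:
        # Generic dynamic loss-scaling sketch, not the run's exact policy:
        # halve on overflow, double after `growth_interval` clean steps,
        # with clamps so the scale stays in a sane range.
        def __init__(self, scale=2.0, growth_interval=2000):
            self.scale = scale
            self.growth_interval = growth_interval
            self.good_steps = 0

        def update(self, found_inf):
            if found_inf:
                self.scale = max(self.scale * 0.5, 2.0 ** -10)
                self.good_steps = 0
            else:
                self.good_steps += 1
                if self.good_steps >= self.growth_interval:
                    self.scale = min(self.scale * 2.0, 2.0 ** 16)
                    self.good_steps = 0
            return self.scale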
2024-06-19 21:53:43,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69391.66666666667, ans=0.125
2024-06-19 21:53:46,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=69391.66666666667, ans=0.025
2024-06-19 21:53:47,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0
2024-06-19 21:53:56,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=69428.33333333333, ans=0.0
2024-06-19 21:54:01,023 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.362e+02 1.565e+03 1.843e+03 2.306e+03 3.859e+03, threshold=3.686e+03, percent-clipped=0.0
2024-06-19 21:54:01,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=69446.66666666667, ans=0.125
2024-06-19 21:54:01,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=69446.66666666667, ans=0.125
2024-06-19 21:54:03,608 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.23 vs. limit=22.5
2024-06-19 21:54:18,354 INFO [train.py:1028] (0/2) Epoch 4, batch 7550, loss[loss=0.4415, simple_loss=0.4114, pruned_loss=0.2358, over 12936.00 frames. ], tot_loss[loss=0.4125, simple_loss=0.3995, pruned_loss=0.2127, over 2578005.95 frames. ], batch size: 158, lr: 1.34e-02, grad_scale: 1.0
2024-06-19 21:54:19,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.58 vs. limit=10.0
2024-06-19 21:54:22,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=69483.33333333333, ans=0.125
2024-06-19 21:54:25,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=12.0
2024-06-19 21:54:26,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=69501.66666666667, ans=0.1
2024-06-19 21:54:28,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=69501.66666666667, ans=0.125
2024-06-19 21:54:38,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69538.33333333333, ans=0.125
2024-06-19 21:54:44,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=69556.66666666667, ans=0.125
2024-06-19 21:54:54,828 INFO [train.py:1028] (0/2) Epoch 4, batch 7600, loss[loss=0.4118, simple_loss=0.4017, pruned_loss=0.2109, over 13211.00 frames. ], tot_loss[loss=0.4114, simple_loss=0.3991, pruned_loss=0.2119, over 2577057.27 frames. ], batch size: 83, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:55:03,734 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0
2024-06-19 21:55:05,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.04 vs. limit=15.0
2024-06-19 21:55:08,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=69611.66666666667, ans=0.2
2024-06-19 21:55:14,537 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.917e+02 1.453e+03 1.737e+03 2.118e+03 4.377e+03, threshold=3.474e+03, percent-clipped=3.0
2024-06-19 21:55:21,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=69648.33333333333, ans=0.125
2024-06-19 21:55:27,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=69666.66666666667, ans=0.125
2024-06-19 21:55:28,327 INFO [train.py:1028] (0/2) Epoch 4, batch 7650, loss[loss=0.3663, simple_loss=0.3719, pruned_loss=0.1804, over 12873.00 frames. ], tot_loss[loss=0.4116, simple_loss=0.3994, pruned_loss=0.2119, over 2572067.81 frames. ], batch size: 33, lr: 1.34e-02, grad_scale: 1.0
2024-06-19 21:55:31,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0
2024-06-19 21:55:42,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.98 vs. limit=15.0
2024-06-19 21:56:01,637 INFO [train.py:1028] (0/2) Epoch 4, batch 7700, loss[loss=0.3917, simple_loss=0.3984, pruned_loss=0.1925, over 13316.00 frames. ], tot_loss[loss=0.4113, simple_loss=0.3993, pruned_loss=0.2116, over 2568076.80 frames. ], batch size: 63, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:56:04,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.87 vs. limit=10.0
2024-06-19 21:56:15,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=12.0
2024-06-19 21:56:23,841 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.436e+02 1.234e+03 1.419e+03 1.739e+03 3.552e+03, threshold=2.838e+03, percent-clipped=1.0
2024-06-19 21:56:34,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=69831.66666666667, ans=0.2
2024-06-19 21:56:34,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=69831.66666666667, ans=0.125
2024-06-19 21:56:36,553 INFO [train.py:1028] (0/2) Epoch 4, batch 7750, loss[loss=0.3915, simple_loss=0.3952, pruned_loss=0.1939, over 13227.00 frames. ], tot_loss[loss=0.4111, simple_loss=0.3992, pruned_loss=0.2115, over 2572378.17 frames. ], batch size: 72, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:56:45,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0
2024-06-19 21:56:49,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.66 vs. limit=10.0
2024-06-19 21:56:55,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=69886.66666666667, ans=0.125
2024-06-19 21:56:57,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=69886.66666666667, ans=0.125
2024-06-19 21:57:09,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.83 vs. limit=15.0
2024-06-19 21:57:10,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69923.33333333333, ans=0.125
2024-06-19 21:57:13,329 INFO [train.py:1028] (0/2) Epoch 4, batch 7800, loss[loss=0.4076, simple_loss=0.3959, pruned_loss=0.2096, over 13136.00 frames. ], tot_loss[loss=0.4096, simple_loss=0.3986, pruned_loss=0.2103, over 2577462.36 frames. ], batch size: 95, lr: 1.34e-02, grad_scale: 4.0
2024-06-19 21:57:18,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=69941.66666666667, ans=0.0
2024-06-19 21:57:19,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=69960.0, ans=0.2
2024-06-19 21:57:22,065 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-19 21:57:23,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=69960.0, ans=0.09899494936611666
2024-06-19 21:57:26,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69978.33333333333, ans=0.1
2024-06-19 21:57:27,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=69978.33333333333, ans=0.125
2024-06-19 21:57:32,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=69978.33333333333, ans=6.0
2024-06-19 21:57:34,980 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.683e+02 1.200e+03 1.462e+03 1.756e+03 3.966e+03, threshold=2.924e+03, percent-clipped=6.0
2024-06-19 21:57:35,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=69996.66666666667, ans=0.125
2024-06-19 21:57:40,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=70015.0, ans=0.125
2024-06-19 21:57:46,585 INFO [train.py:1028] (0/2) Epoch 4, batch 7850, loss[loss=0.4213, simple_loss=0.4009, pruned_loss=0.2209, over 12047.00 frames. ], tot_loss[loss=0.4127, simple_loss=0.4007, pruned_loss=0.2124, over 2571513.09 frames. ], batch size: 18, lr: 1.34e-02, grad_scale: 1.0
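Each progress line pairs loss[...] for the current batch with tot_loss[...], a frame-weighted running average whose frame count hovers near 2.57M instead of growing without bound. One plausible mechanism (an assumption about the bookkeeping, not a quote of train.py) is a decayed accumulator:

    class RunningLoss:
        # Assumed mechanics of the logged tot_loss: decayed sums of
        # loss*frames and frames, so the average tracks recent batches
        # and the effective frame count saturates.
        def __init__(self, decay=0.995):
            self.decay = decay
            self.loss_sum = 0.0
            self.frame_sum = 0.0

        def update(self, batch_loss, num_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss * num_frames
            self.frame_sum = self.decay * self.frame_sum + num_frames
            return self.loss_sum / self.frame_sum

    # With ~13k frames per batch and decay 0.995, the effective frame count
    # levels off near 13000 / 0.005 = 2.6M -- the order seen in the log.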
2024-06-19 21:57:46,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=70033.33333333333, ans=0.125
2024-06-19 21:57:50,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=70033.33333333333, ans=0.0
2024-06-19 21:57:52,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70033.33333333333, ans=0.125
2024-06-19 21:57:55,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70051.66666666667, ans=0.0
2024-06-19 21:58:01,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=70070.0, ans=0.125
2024-06-19 21:58:24,459 INFO [train.py:1028] (0/2) Epoch 4, batch 7900, loss[loss=0.3707, simple_loss=0.365, pruned_loss=0.1882, over 13135.00 frames. ], tot_loss[loss=0.4134, simple_loss=0.4013, pruned_loss=0.2128, over 2570721.16 frames. ], batch size: 77, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:58:31,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=70143.33333333333, ans=0.125
2024-06-19 21:58:36,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=70143.33333333333, ans=0.0
2024-06-19 21:58:42,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=70161.66666666667, ans=0.125
2024-06-19 21:58:44,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=70161.66666666667, ans=0.2
2024-06-19 21:58:48,959 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.483e+02 1.494e+03 1.721e+03 1.956e+03 3.878e+03, threshold=3.441e+03, percent-clipped=2.0
2024-06-19 21:59:00,918 INFO [train.py:1028] (0/2) Epoch 4, batch 7950, loss[loss=0.4266, simple_loss=0.4025, pruned_loss=0.2253, over 10663.00 frames. ], tot_loss[loss=0.4145, simple_loss=0.4026, pruned_loss=0.2132, over 2573953.20 frames. ], batch size: 303, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:59:08,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=70235.0, ans=0.035
2024-06-19 21:59:17,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=70253.33333333333, ans=0.0
2024-06-19 21:59:17,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.83 vs. limit=15.0
2024-06-19 21:59:18,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=70253.33333333333, ans=0.2
2024-06-19 21:59:18,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=70253.33333333333, ans=0.0
2024-06-19 21:59:23,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=70271.66666666667, ans=0.5
2024-06-19 21:59:32,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0
2024-06-19 21:59:34,565 INFO [train.py:1028] (0/2) Epoch 4, batch 8000, loss[loss=0.3764, simple_loss=0.3874, pruned_loss=0.1827, over 12808.00 frames. ], tot_loss[loss=0.4155, simple_loss=0.4039, pruned_loss=0.2135, over 2571119.09 frames. ], batch size: 29, lr: 1.34e-02, grad_scale: 2.0
2024-06-19 21:59:52,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=70345.0, ans=0.2
2024-06-19 21:59:53,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=70345.0, ans=0.125
2024-06-19 21:59:54,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=70363.33333333333, ans=0.125
2024-06-19 21:59:55,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=70363.33333333333, ans=0.0
2024-06-19 21:59:56,528 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.115e+03 1.657e+03 1.928e+03 2.330e+03 3.894e+03, threshold=3.857e+03, percent-clipped=4.0
2024-06-19 21:59:56,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=70363.33333333333, ans=0.125
2024-06-19 21:59:58,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=70363.33333333333, ans=0.05
2024-06-19 22:00:11,567 INFO [train.py:1028] (0/2) Epoch 4, batch 8050, loss[loss=0.4358, simple_loss=0.4126, pruned_loss=0.2295, over 13222.00 frames. ], tot_loss[loss=0.4155, simple_loss=0.404, pruned_loss=0.2135, over 2571853.96 frames. ], batch size: 83, lr: 1.34e-02, grad_scale: 1.0
2024-06-19 22:00:18,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70418.33333333333, ans=0.1
2024-06-19 22:00:20,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.39 vs. limit=10.0
2024-06-19 22:00:24,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=70436.66666666667, ans=0.125
2024-06-19 22:00:29,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.60 vs. limit=12.0
2024-06-19 22:00:33,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=12.0
2024-06-19 22:00:42,619 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.89 vs. limit=15.0
limit=15.0 2024-06-19 22:00:49,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=70491.66666666667, ans=0.0 2024-06-19 22:00:50,379 INFO [train.py:1028] (0/2) Epoch 4, batch 8100, loss[loss=0.4105, simple_loss=0.3924, pruned_loss=0.2143, over 13213.00 frames. ], tot_loss[loss=0.4163, simple_loss=0.4044, pruned_loss=0.2141, over 2575846.70 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:00:50,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70491.66666666667, ans=0.0 2024-06-19 22:00:52,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=70491.66666666667, ans=0.125 2024-06-19 22:00:55,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=70491.66666666667, ans=0.1 2024-06-19 22:00:56,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=70491.66666666667, ans=0.2 2024-06-19 22:00:57,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70510.0, ans=0.1 2024-06-19 22:00:57,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70510.0, ans=0.125 2024-06-19 22:00:58,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-19 22:01:00,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=70510.0, ans=0.09899494936611666 2024-06-19 22:01:02,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.26 vs. limit=15.0 2024-06-19 22:01:02,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.42 vs. limit=15.0 2024-06-19 22:01:09,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=15.0 2024-06-19 22:01:14,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=70546.66666666667, ans=0.0 2024-06-19 22:01:15,267 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+03 2.350e+03 2.726e+03 3.270e+03 7.314e+03, threshold=5.451e+03, percent-clipped=9.0 2024-06-19 22:01:15,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=70546.66666666667, ans=0.2 2024-06-19 22:01:16,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.23 vs. 
limit=12.0 2024-06-19 22:01:16,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=70546.66666666667, ans=0.2 2024-06-19 22:01:19,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70565.0, ans=0.125 2024-06-19 22:01:20,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.34 vs. limit=12.0 2024-06-19 22:01:25,032 INFO [train.py:1028] (0/2) Epoch 4, batch 8150, loss[loss=0.4085, simple_loss=0.3933, pruned_loss=0.2119, over 13138.00 frames. ], tot_loss[loss=0.4169, simple_loss=0.4051, pruned_loss=0.2143, over 2579815.57 frames. ], batch size: 121, lr: 1.33e-02, grad_scale: 0.5 2024-06-19 22:01:28,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.06 vs. limit=15.0 2024-06-19 22:01:31,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.46 vs. limit=15.0 2024-06-19 22:01:37,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=70601.66666666667, ans=0.09899494936611666 2024-06-19 22:01:44,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=70638.33333333333, ans=0.0 2024-06-19 22:01:45,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=70638.33333333333, ans=0.125 2024-06-19 22:01:52,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70656.66666666667, ans=0.125 2024-06-19 22:01:52,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=70656.66666666667, ans=0.2 2024-06-19 22:01:52,985 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:01:56,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70656.66666666667, ans=0.125 2024-06-19 22:01:57,834 INFO [train.py:1028] (0/2) Epoch 4, batch 8200, loss[loss=0.4163, simple_loss=0.401, pruned_loss=0.2158, over 13139.00 frames. ], tot_loss[loss=0.4139, simple_loss=0.4029, pruned_loss=0.2125, over 2583246.70 frames. 
], batch size: 112, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:02:06,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=70693.33333333333, ans=0.0 2024-06-19 22:02:20,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70711.66666666667, ans=0.1 2024-06-19 22:02:22,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=70730.0, ans=0.1 2024-06-19 22:02:23,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=70730.0, ans=0.125 2024-06-19 22:02:26,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70730.0, ans=0.125 2024-06-19 22:02:27,592 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.046e+03 1.630e+03 1.953e+03 2.331e+03 4.319e+03, threshold=3.906e+03, percent-clipped=0.0 2024-06-19 22:02:31,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=70748.33333333333, ans=0.125 2024-06-19 22:02:36,681 INFO [train.py:1028] (0/2) Epoch 4, batch 8250, loss[loss=0.3904, simple_loss=0.3968, pruned_loss=0.1919, over 13228.00 frames. ], tot_loss[loss=0.4135, simple_loss=0.403, pruned_loss=0.212, over 2583376.02 frames. ], batch size: 52, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:02:37,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=70766.66666666667, ans=0.0 2024-06-19 22:02:44,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=70766.66666666667, ans=0.125 2024-06-19 22:02:47,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=70785.0, ans=0.09899494936611666 2024-06-19 22:02:47,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=70785.0, ans=0.125 2024-06-19 22:02:50,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.80 vs. limit=15.0 2024-06-19 22:02:57,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70803.33333333333, ans=0.125 2024-06-19 22:02:57,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.86 vs. limit=15.0 2024-06-19 22:02:57,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70803.33333333333, ans=0.1 2024-06-19 22:03:04,778 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.41 vs. limit=15.0 2024-06-19 22:03:05,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.88 vs. 
limit=15.0 2024-06-19 22:03:05,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=70840.0, ans=0.0 2024-06-19 22:03:06,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=70840.0, ans=0.2 2024-06-19 22:03:10,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=70840.0, ans=0.125 2024-06-19 22:03:12,773 INFO [train.py:1028] (0/2) Epoch 4, batch 8300, loss[loss=0.423, simple_loss=0.4116, pruned_loss=0.2172, over 12991.00 frames. ], tot_loss[loss=0.412, simple_loss=0.4017, pruned_loss=0.2111, over 2581829.34 frames. ], batch size: 102, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:03:19,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70876.66666666667, ans=0.125 2024-06-19 22:03:23,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.58 vs. limit=10.0 2024-06-19 22:03:24,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70876.66666666667, ans=0.0 2024-06-19 22:03:31,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.25 vs. limit=15.0 2024-06-19 22:03:32,841 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:03:36,067 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.593e+02 1.268e+03 1.516e+03 1.853e+03 2.745e+03, threshold=3.032e+03, percent-clipped=0.0 2024-06-19 22:03:36,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=70913.33333333333, ans=0.1 2024-06-19 22:03:41,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70931.66666666667, ans=0.1 2024-06-19 22:03:45,232 INFO [train.py:1028] (0/2) Epoch 4, batch 8350, loss[loss=0.4216, simple_loss=0.4084, pruned_loss=0.2175, over 13159.00 frames. ], tot_loss[loss=0.4102, simple_loss=0.4009, pruned_loss=0.2097, over 2582499.01 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:03:45,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=70950.0, ans=0.1 2024-06-19 22:03:48,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=70950.0, ans=0.04949747468305833 2024-06-19 22:03:57,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70986.66666666667, ans=0.125 2024-06-19 22:04:05,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=71005.0, ans=0.09899494936611666 2024-06-19 22:04:10,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.12 vs. 
limit=22.5 2024-06-19 22:04:10,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=71005.0, ans=0.125 2024-06-19 22:04:23,640 INFO [train.py:1028] (0/2) Epoch 4, batch 8400, loss[loss=0.3813, simple_loss=0.3714, pruned_loss=0.1956, over 12927.00 frames. ], tot_loss[loss=0.4096, simple_loss=0.4004, pruned_loss=0.2094, over 2579496.67 frames. ], batch size: 39, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:04:30,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.77 vs. limit=15.0 2024-06-19 22:04:34,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=71060.0, ans=0.125 2024-06-19 22:04:44,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=71096.66666666667, ans=0.125 2024-06-19 22:04:51,706 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.103e+02 1.430e+03 1.737e+03 2.245e+03 5.249e+03, threshold=3.474e+03, percent-clipped=3.0 2024-06-19 22:04:52,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.24 vs. limit=10.0 2024-06-19 22:04:56,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=71115.0, ans=0.125 2024-06-19 22:04:56,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=71115.0, ans=0.125 2024-06-19 22:04:59,494 INFO [train.py:1028] (0/2) Epoch 4, batch 8450, loss[loss=0.3974, simple_loss=0.3976, pruned_loss=0.1986, over 13167.00 frames. ], tot_loss[loss=0.41, simple_loss=0.4011, pruned_loss=0.2095, over 2580473.23 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:05:08,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.60 vs. limit=12.0 2024-06-19 22:05:09,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=71151.66666666667, ans=0.1 2024-06-19 22:05:20,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=71188.33333333333, ans=0.125 2024-06-19 22:05:25,870 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2024-06-19 22:05:32,821 INFO [train.py:1028] (0/2) Epoch 4, batch 8500, loss[loss=0.3718, simple_loss=0.3617, pruned_loss=0.191, over 12643.00 frames. ], tot_loss[loss=0.4106, simple_loss=0.4016, pruned_loss=0.2098, over 2578642.12 frames. ], batch size: 29, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:05:46,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=71261.66666666667, ans=0.125 2024-06-19 22:05:47,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.99 vs. limit=15.0 2024-06-19 22:05:51,369 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.64 vs. 
limit=10.0 2024-06-19 22:05:58,522 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.587e+02 1.540e+03 1.992e+03 2.442e+03 6.048e+03, threshold=3.983e+03, percent-clipped=3.0 2024-06-19 22:06:06,453 INFO [train.py:1028] (0/2) Epoch 4, batch 8550, loss[loss=0.4014, simple_loss=0.3954, pruned_loss=0.2037, over 12449.00 frames. ], tot_loss[loss=0.4104, simple_loss=0.4016, pruned_loss=0.2096, over 2576434.84 frames. ], batch size: 22, lr: 1.33e-02, grad_scale: 1.0 2024-06-19 22:06:09,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=71316.66666666667, ans=0.0 2024-06-19 22:06:12,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=12.0 2024-06-19 22:06:14,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71335.0, ans=0.1 2024-06-19 22:06:20,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=71335.0, ans=0.125 2024-06-19 22:06:20,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=71335.0, ans=0.2 2024-06-19 22:06:21,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=71335.0, ans=0.035 2024-06-19 22:06:26,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=71353.33333333333, ans=0.125 2024-06-19 22:06:29,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=71371.66666666667, ans=0.025 2024-06-19 22:06:37,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=71390.0, ans=0.1 2024-06-19 22:06:42,871 INFO [train.py:1028] (0/2) Epoch 4, batch 8600, loss[loss=0.4059, simple_loss=0.3963, pruned_loss=0.2078, over 13137.00 frames. ], tot_loss[loss=0.4126, simple_loss=0.4031, pruned_loss=0.211, over 2573408.67 frames. ], batch size: 112, lr: 1.33e-02, grad_scale: 2.0 2024-06-19 22:06:48,465 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.33 vs. 
limit=12.0 2024-06-19 22:06:51,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71426.66666666667, ans=0.1 2024-06-19 22:07:00,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=71445.0, ans=0.025 2024-06-19 22:07:06,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71463.33333333333, ans=0.1 2024-06-19 22:07:06,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=71463.33333333333, ans=0.2 2024-06-19 22:07:07,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=71463.33333333333, ans=0.125 2024-06-19 22:07:13,387 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.182e+03 2.353e+03 2.804e+03 3.263e+03 1.111e+04, threshold=5.609e+03, percent-clipped=5.0 2024-06-19 22:07:15,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.61 vs. limit=15.0 2024-06-19 22:07:19,759 INFO [train.py:1028] (0/2) Epoch 4, batch 8650, loss[loss=0.4188, simple_loss=0.4097, pruned_loss=0.214, over 13053.00 frames. ], tot_loss[loss=0.4136, simple_loss=0.4041, pruned_loss=0.2116, over 2575769.08 frames. ], batch size: 102, lr: 1.33e-02, grad_scale: 0.5 2024-06-19 22:07:28,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=71518.33333333333, ans=0.09899494936611666 2024-06-19 22:07:30,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=71518.33333333333, ans=0.125 2024-06-19 22:07:31,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=71518.33333333333, ans=0.2 2024-06-19 22:07:34,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0 2024-06-19 22:07:37,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=71536.66666666667, ans=0.0 2024-06-19 22:07:41,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=71555.0, ans=0.2 2024-06-19 22:07:42,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71555.0, ans=0.1 2024-06-19 22:07:52,027 INFO [train.py:1028] (0/2) Epoch 4, batch 8700, loss[loss=0.4316, simple_loss=0.4297, pruned_loss=0.2168, over 13229.00 frames. ], tot_loss[loss=0.4149, simple_loss=0.4049, pruned_loss=0.2125, over 2572113.04 frames. ], batch size: 59, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:07:56,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.09 vs. limit=22.5 2024-06-19 22:07:58,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.91 vs. 
limit=6.0 2024-06-19 22:08:22,994 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.249e+03 1.970e+03 2.431e+03 2.908e+03 9.565e+03, threshold=4.861e+03, percent-clipped=5.0 2024-06-19 22:08:23,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=71665.0, ans=0.0 2024-06-19 22:08:29,002 INFO [train.py:1028] (0/2) Epoch 4, batch 8750, loss[loss=0.4236, simple_loss=0.4012, pruned_loss=0.223, over 13123.00 frames. ], tot_loss[loss=0.4145, simple_loss=0.4049, pruned_loss=0.2121, over 2568680.67 frames. ], batch size: 121, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:08:30,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=71683.33333333333, ans=0.2 2024-06-19 22:08:39,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=71701.66666666667, ans=0.125 2024-06-19 22:08:50,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71738.33333333333, ans=0.1 2024-06-19 22:09:05,308 INFO [train.py:1028] (0/2) Epoch 4, batch 8800, loss[loss=0.4539, simple_loss=0.4409, pruned_loss=0.2334, over 13182.00 frames. ], tot_loss[loss=0.4154, simple_loss=0.4057, pruned_loss=0.2125, over 2574628.52 frames. ], batch size: 72, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:09:16,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=71793.33333333333, ans=0.125 2024-06-19 22:09:23,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=71811.66666666667, ans=0.125 2024-06-19 22:09:26,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. limit=6.0 2024-06-19 22:09:33,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.055e+02 1.920e+03 2.207e+03 2.670e+03 3.964e+03, threshold=4.414e+03, percent-clipped=0.0 2024-06-19 22:09:34,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=71848.33333333333, ans=0.0 2024-06-19 22:09:34,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=71848.33333333333, ans=0.0 2024-06-19 22:09:38,898 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:09:39,289 INFO [train.py:1028] (0/2) Epoch 4, batch 8850, loss[loss=0.4475, simple_loss=0.4264, pruned_loss=0.2343, over 12584.00 frames. ], tot_loss[loss=0.415, simple_loss=0.405, pruned_loss=0.2125, over 2562386.08 frames. 
], batch size: 202, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:09:42,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=71866.66666666667, ans=0.0 2024-06-19 22:09:46,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=71885.0, ans=0.0 2024-06-19 22:09:52,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=71903.33333333333, ans=0.025 2024-06-19 22:09:55,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.66 vs. limit=12.0 2024-06-19 22:09:57,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71903.33333333333, ans=0.1 2024-06-19 22:10:00,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.15 vs. limit=15.0 2024-06-19 22:10:11,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2024-06-19 22:10:12,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=71940.0, ans=0.0 2024-06-19 22:10:12,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5 2024-06-19 22:10:12,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.94 vs. limit=15.0 2024-06-19 22:10:16,121 INFO [train.py:1028] (0/2) Epoch 4, batch 8900, loss[loss=0.3995, simple_loss=0.4017, pruned_loss=0.1986, over 12970.00 frames. ], tot_loss[loss=0.4155, simple_loss=0.4053, pruned_loss=0.2128, over 2561118.40 frames. ], batch size: 33, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:10:19,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=71958.33333333333, ans=0.2 2024-06-19 22:10:25,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=71976.66666666667, ans=0.125 2024-06-19 22:10:28,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=71995.0, ans=0.125 2024-06-19 22:10:29,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=71995.0, ans=0.0 2024-06-19 22:10:31,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71995.0, ans=0.1 2024-06-19 22:10:31,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=71995.0, ans=0.025 2024-06-19 22:10:32,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.45 vs. 
limit=15.0 2024-06-19 22:10:37,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72013.33333333333, ans=0.1 2024-06-19 22:10:37,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.72 vs. limit=15.0 2024-06-19 22:10:45,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=72013.33333333333, ans=0.0 2024-06-19 22:10:46,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=72013.33333333333, ans=0.1 2024-06-19 22:10:47,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=72031.66666666667, ans=0.04949747468305833 2024-06-19 22:10:49,419 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.200e+02 1.641e+03 2.088e+03 2.654e+03 1.266e+04, threshold=4.175e+03, percent-clipped=5.0 2024-06-19 22:10:50,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=72031.66666666667, ans=0.125 2024-06-19 22:10:54,227 INFO [train.py:1028] (0/2) Epoch 4, batch 8950, loss[loss=0.4573, simple_loss=0.4314, pruned_loss=0.2416, over 12490.00 frames. ], tot_loss[loss=0.4141, simple_loss=0.4047, pruned_loss=0.2117, over 2561556.51 frames. ], batch size: 202, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:11:01,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.67 vs. limit=12.0 2024-06-19 22:11:03,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72068.33333333333, ans=0.1 2024-06-19 22:11:04,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=72068.33333333333, ans=0.0 2024-06-19 22:11:08,400 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:11:14,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=72105.0, ans=0.0 2024-06-19 22:11:19,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.70 vs. limit=15.0 2024-06-19 22:11:20,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=72123.33333333333, ans=0.0 2024-06-19 22:11:27,174 INFO [train.py:1028] (0/2) Epoch 4, batch 9000, loss[loss=0.4193, simple_loss=0.4133, pruned_loss=0.2126, over 13281.00 frames. ], tot_loss[loss=0.4125, simple_loss=0.4042, pruned_loss=0.2104, over 2566376.47 frames. ], batch size: 46, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:11:27,174 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 22:11:34,867 INFO [train.py:1060] (0/2) Epoch 4, validation: loss=0.2624, simple_loss=0.3103, pruned_loss=0.1072, over 351949.00 frames. 
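The validation figures just above (loss=0.2624, simple_loss=0.3103, pruned_loss=0.1072 over 351949.00 frames) and the running tot_loss entries in the batch lines are frame-weighted averages rather than plain means over batches; weighting by frame count keeps very small batches (a batch size of 17 appears a little further down) from skewing the running numbers. Below is a minimal Python sketch of that bookkeeping; the class and method names are illustrative stand-ins, not the MetricsTracker machinery the recipe actually uses.

from collections import defaultdict

class FrameWeightedTracker:
    # Accumulates per-batch losses weighted by their frame counts, which is
    # how running entries like "tot_loss[...] over 2566376.47 frames." arise.
    def __init__(self):
        self.sums = defaultdict(float)  # loss name -> sum of loss * frames
        self.frames = 0.0               # total frames accumulated so far

    def update(self, losses, num_frames):
        for name, value in losses.items():
            self.sums[name] += value * num_frames
        self.frames += num_frames

    def averages(self):
        return {name: s / self.frames for name, s in self.sums.items()}

# Two per-batch losses taken from nearby entries in this log
# (batch 9000 over 13281 frames, batch 9050 over 11469 frames):
tracker = FrameWeightedTracker()
tracker.update({"loss": 0.4193, "simple_loss": 0.4133, "pruned_loss": 0.2126}, 13281)
tracker.update({"loss": 0.4202, "simple_loss": 0.4043, "pruned_loss": 0.2181}, 11469)
print(tracker.averages())  # frame-weighted, so the larger batch counts for more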
2024-06-19 22:11:34,868 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-19 22:11:36,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=72141.66666666667, ans=0.125 2024-06-19 22:11:41,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=72160.0, ans=0.1 2024-06-19 22:11:45,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72160.0, ans=0.1 2024-06-19 22:11:51,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=72178.33333333333, ans=0.0 2024-06-19 22:11:51,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=72178.33333333333, ans=0.125 2024-06-19 22:11:53,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=72178.33333333333, ans=0.09899494936611666 2024-06-19 22:11:59,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.97 vs. limit=22.5 2024-06-19 22:11:59,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72196.66666666667, ans=0.1 2024-06-19 22:12:03,723 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.013e+02 1.194e+03 1.439e+03 1.790e+03 3.062e+03, threshold=2.878e+03, percent-clipped=0.0 2024-06-19 22:12:07,688 INFO [train.py:1028] (0/2) Epoch 4, batch 9050, loss[loss=0.4202, simple_loss=0.4043, pruned_loss=0.2181, over 11469.00 frames. ], tot_loss[loss=0.4127, simple_loss=0.4044, pruned_loss=0.2106, over 2567212.99 frames. ], batch size: 17, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:12:11,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=72233.33333333333, ans=0.125 2024-06-19 22:12:11,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=72233.33333333333, ans=0.125 2024-06-19 22:12:23,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=72270.0, ans=0.05 2024-06-19 22:12:33,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=72306.66666666667, ans=0.125 2024-06-19 22:12:34,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=72306.66666666667, ans=0.0 2024-06-19 22:12:34,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=72306.66666666667, ans=0.04949747468305833 2024-06-19 22:12:35,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. 
limit=6.0 2024-06-19 22:12:38,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72306.66666666667, ans=0.1 2024-06-19 22:12:39,974 INFO [train.py:1028] (0/2) Epoch 4, batch 9100, loss[loss=0.3744, simple_loss=0.3827, pruned_loss=0.183, over 13273.00 frames. ], tot_loss[loss=0.4106, simple_loss=0.4031, pruned_loss=0.209, over 2568380.34 frames. ], batch size: 72, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:12:45,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=12.0 2024-06-19 22:12:48,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2024-06-19 22:12:49,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.01 vs. limit=22.5 2024-06-19 22:13:01,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=72380.0, ans=0.0 2024-06-19 22:13:02,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=72380.0, ans=0.1 2024-06-19 22:13:11,809 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.000e+02 1.268e+03 1.670e+03 2.168e+03 5.720e+03, threshold=3.339e+03, percent-clipped=8.0 2024-06-19 22:13:14,955 INFO [train.py:1028] (0/2) Epoch 4, batch 9150, loss[loss=0.4166, simple_loss=0.411, pruned_loss=0.2111, over 13158.00 frames. ], tot_loss[loss=0.4118, simple_loss=0.404, pruned_loss=0.2098, over 2569615.79 frames. ], batch size: 77, lr: 1.32e-02, grad_scale: 1.0 2024-06-19 22:13:22,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2024-06-19 22:13:35,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72471.66666666667, ans=0.125 2024-06-19 22:13:42,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=72490.0, ans=0.125 2024-06-19 22:13:45,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=72490.0, ans=0.2 2024-06-19 22:13:46,692 INFO [train.py:1028] (0/2) Epoch 4, batch 9200, loss[loss=0.3673, simple_loss=0.3819, pruned_loss=0.1763, over 12920.00 frames. ], tot_loss[loss=0.4123, simple_loss=0.4053, pruned_loss=0.2097, over 2571960.01 frames. ], batch size: 36, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:13:46,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=12.0 2024-06-19 22:13:52,581 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.52 vs. 
limit=6.0 2024-06-19 22:14:05,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=72545.0, ans=0.04949747468305833 2024-06-19 22:14:05,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=72545.0, ans=0.025 2024-06-19 22:14:06,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=15.0 2024-06-19 22:14:13,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=72563.33333333333, ans=0.125 2024-06-19 22:14:16,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=72581.66666666667, ans=0.02 2024-06-19 22:14:17,847 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.700e+02 1.527e+03 1.844e+03 2.145e+03 3.216e+03, threshold=3.688e+03, percent-clipped=0.0 2024-06-19 22:14:18,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=72581.66666666667, ans=0.125 2024-06-19 22:14:20,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=72581.66666666667, ans=0.125 2024-06-19 22:14:21,220 INFO [train.py:1028] (0/2) Epoch 4, batch 9250, loss[loss=0.4305, simple_loss=0.4215, pruned_loss=0.2198, over 13211.00 frames. ], tot_loss[loss=0.412, simple_loss=0.4051, pruned_loss=0.2094, over 2573700.74 frames. ], batch size: 67, lr: 1.32e-02, grad_scale: 2.0 2024-06-19 22:14:22,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=72600.0, ans=0.2 2024-06-19 22:14:26,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.32 vs. limit=15.0 2024-06-19 22:14:26,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.51 vs. limit=15.0 2024-06-19 22:14:29,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=72618.33333333333, ans=0.95 2024-06-19 22:14:43,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=72655.0, ans=0.025 2024-06-19 22:14:47,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=72673.33333333333, ans=0.0 2024-06-19 22:14:52,674 INFO [train.py:1028] (0/2) Epoch 4, batch 9300, loss[loss=0.3666, simple_loss=0.372, pruned_loss=0.1806, over 12984.00 frames. ], tot_loss[loss=0.4096, simple_loss=0.4037, pruned_loss=0.2078, over 2572070.03 frames. 
], batch size: 39, lr: 1.31e-02, grad_scale: 4.0 2024-06-19 22:14:59,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=72710.0, ans=0.125 2024-06-19 22:15:02,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=72710.0, ans=0.2 2024-06-19 22:15:21,678 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.206e+02 1.715e+03 2.037e+03 2.565e+03 4.082e+03, threshold=4.073e+03, percent-clipped=4.0 2024-06-19 22:15:23,528 INFO [train.py:1028] (0/2) Epoch 4, batch 9350, loss[loss=0.397, simple_loss=0.3965, pruned_loss=0.1987, over 12509.00 frames. ], tot_loss[loss=0.4099, simple_loss=0.4035, pruned_loss=0.2082, over 2568602.43 frames. ], batch size: 22, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:15:34,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=72801.66666666667, ans=0.0 2024-06-19 22:15:42,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.01 vs. limit=10.0 2024-06-19 22:15:54,906 INFO [train.py:1028] (0/2) Epoch 4, batch 9400, loss[loss=0.4133, simple_loss=0.4049, pruned_loss=0.2108, over 13254.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.4048, pruned_loss=0.2098, over 2568149.57 frames. ], batch size: 52, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:16:09,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=72911.66666666667, ans=0.125 2024-06-19 22:16:25,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.32 vs. limit=15.0 2024-06-19 22:16:26,677 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+03 2.438e+03 3.083e+03 3.703e+03 5.863e+03, threshold=6.166e+03, percent-clipped=12.0 2024-06-19 22:16:28,018 INFO [train.py:1028] (0/2) Epoch 4, batch 9450, loss[loss=0.4093, simple_loss=0.3956, pruned_loss=0.2115, over 12515.00 frames. ], tot_loss[loss=0.4141, simple_loss=0.4056, pruned_loss=0.2113, over 2568265.92 frames. ], batch size: 22, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:16:29,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=72966.66666666667, ans=0.0 2024-06-19 22:16:33,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs. limit=15.0 2024-06-19 22:16:38,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=72985.0, ans=0.125 2024-06-19 22:16:38,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72985.0, ans=0.1 2024-06-19 22:16:45,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.44 vs. 
limit=15.0 2024-06-19 22:16:52,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=73040.0, ans=0.025 2024-06-19 22:16:52,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=73040.0, ans=0.0 2024-06-19 22:16:58,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.59 vs. limit=15.0 2024-06-19 22:16:58,764 INFO [train.py:1028] (0/2) Epoch 4, batch 9500, loss[loss=0.4124, simple_loss=0.4047, pruned_loss=0.2101, over 13260.00 frames. ], tot_loss[loss=0.4126, simple_loss=0.405, pruned_loss=0.2102, over 2575969.18 frames. ], batch size: 43, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:17:08,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73076.66666666667, ans=0.1 2024-06-19 22:17:17,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=73095.0, ans=0.0 2024-06-19 22:17:23,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.98 vs. limit=15.0 2024-06-19 22:17:29,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=73131.66666666667, ans=0.0 2024-06-19 22:17:30,599 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.331e+03 1.893e+03 2.366e+03 3.061e+03 5.566e+03, threshold=4.732e+03, percent-clipped=0.0 2024-06-19 22:17:31,238 INFO [train.py:1028] (0/2) Epoch 4, batch 9550, loss[loss=0.3586, simple_loss=0.3644, pruned_loss=0.1763, over 13297.00 frames. ], tot_loss[loss=0.4126, simple_loss=0.4047, pruned_loss=0.2103, over 2572591.36 frames. ], batch size: 40, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:17:38,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-19 22:17:38,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=73168.33333333333, ans=0.125 2024-06-19 22:17:39,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=73168.33333333333, ans=0.125 2024-06-19 22:17:49,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=73186.66666666667, ans=22.5 2024-06-19 22:17:49,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=73205.0, ans=0.125 2024-06-19 22:17:50,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=73205.0, ans=0.125 2024-06-19 22:17:54,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2024-06-19 22:17:58,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=73223.33333333333, ans=0.0 2024-06-19 22:18:02,527 INFO [train.py:1028] (0/2) Epoch 4, batch 9600, loss[loss=0.4439, simple_loss=0.4073, pruned_loss=0.2403, over 10359.00 frames. 
], tot_loss[loss=0.4118, simple_loss=0.404, pruned_loss=0.2099, over 2570045.83 frames. ], batch size: 303, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:18:06,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=73241.66666666667, ans=0.125 2024-06-19 22:18:10,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=73260.0, ans=0.125 2024-06-19 22:18:14,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=73278.33333333333, ans=0.0 2024-06-19 22:18:14,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0 2024-06-19 22:18:16,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.46 vs. limit=12.0 2024-06-19 22:18:17,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=73278.33333333333, ans=12.0 2024-06-19 22:18:20,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=73296.66666666667, ans=0.125 2024-06-19 22:18:23,356 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:18:27,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=73315.0, ans=0.2 2024-06-19 22:18:32,494 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.105e+03 1.763e+03 2.190e+03 2.546e+03 3.338e+03, threshold=4.380e+03, percent-clipped=0.0 2024-06-19 22:18:32,670 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-40000.pt 2024-06-19 22:18:38,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=73333.33333333333, ans=0.125 2024-06-19 22:18:38,522 INFO [train.py:1028] (0/2) Epoch 4, batch 9650, loss[loss=0.3993, simple_loss=0.3867, pruned_loss=0.2059, over 13077.00 frames. ], tot_loss[loss=0.4131, simple_loss=0.4042, pruned_loss=0.211, over 2561641.45 frames. ], batch size: 132, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:18:43,674 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.28 vs. 
limit=10.0 2024-06-19 22:18:46,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=73351.66666666667, ans=0.125 2024-06-19 22:18:56,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=73370.0, ans=15.0 2024-06-19 22:18:59,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=73388.33333333333, ans=0.125 2024-06-19 22:19:03,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=73388.33333333333, ans=0.125 2024-06-19 22:19:03,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=73406.66666666667, ans=0.125 2024-06-19 22:19:04,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=15.0 2024-06-19 22:19:10,377 INFO [train.py:1028] (0/2) Epoch 4, batch 9700, loss[loss=0.3951, simple_loss=0.39, pruned_loss=0.2, over 12987.00 frames. ], tot_loss[loss=0.4125, simple_loss=0.4032, pruned_loss=0.2109, over 2556306.64 frames. ], batch size: 144, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:19:16,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0 2024-06-19 22:19:33,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2024-06-19 22:19:41,947 INFO [train.py:1028] (0/2) Epoch 4, batch 9750, loss[loss=0.3737, simple_loss=0.3633, pruned_loss=0.192, over 13058.00 frames. ], tot_loss[loss=0.4088, simple_loss=0.4005, pruned_loss=0.2086, over 2553376.16 frames. ], batch size: 132, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:19:42,476 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.263e+03 2.159e+03 2.709e+03 3.187e+03 5.514e+03, threshold=5.418e+03, percent-clipped=9.0 2024-06-19 22:19:51,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=73535.0, ans=0.125 2024-06-19 22:20:00,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=73571.66666666667, ans=0.0 2024-06-19 22:20:03,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=73571.66666666667, ans=0.2 2024-06-19 22:20:12,783 INFO [train.py:1028] (0/2) Epoch 4, batch 9800, loss[loss=0.4307, simple_loss=0.4236, pruned_loss=0.2189, over 12929.00 frames. ], tot_loss[loss=0.4068, simple_loss=0.3992, pruned_loss=0.2071, over 2545737.35 frames. 
], batch size: 39, lr: 1.31e-02, grad_scale: 2.0 2024-06-19 22:20:15,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=73608.33333333333, ans=0.125 2024-06-19 22:20:17,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=73608.33333333333, ans=0.125 2024-06-19 22:20:25,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=73645.0, ans=0.125 2024-06-19 22:20:25,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=73645.0, ans=0.09899494936611666 2024-06-19 22:20:29,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=73645.0, ans=0.0 2024-06-19 22:20:29,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2024-06-19 22:20:34,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=73663.33333333333, ans=0.2 2024-06-19 22:20:43,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=73681.66666666667, ans=0.0 2024-06-19 22:20:44,957 INFO [train.py:1028] (0/2) Epoch 4, batch 9850, loss[loss=0.4112, simple_loss=0.4028, pruned_loss=0.2098, over 13139.00 frames. ], tot_loss[loss=0.4052, simple_loss=0.398, pruned_loss=0.2062, over 2538428.07 frames. ], batch size: 103, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:20:46,076 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+03 3.182e+03 3.725e+03 4.492e+03 7.979e+03, threshold=7.451e+03, percent-clipped=7.0 2024-06-19 22:20:53,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=73718.33333333333, ans=0.0 2024-06-19 22:20:56,734 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=6.0 2024-06-19 22:20:58,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.78 vs. limit=22.5 2024-06-19 22:20:59,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=73736.66666666667, ans=0.2 2024-06-19 22:21:01,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.99 vs. limit=15.0 2024-06-19 22:21:11,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.43 vs. limit=15.0 2024-06-19 22:21:11,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.32 vs. limit=15.0 2024-06-19 22:21:16,613 INFO [train.py:1028] (0/2) Epoch 4, batch 9900, loss[loss=0.3477, simple_loss=0.3621, pruned_loss=0.1666, over 12946.00 frames. ], tot_loss[loss=0.4065, simple_loss=0.3983, pruned_loss=0.2073, over 2531198.68 frames. 
], batch size: 39, lr: 1.31e-02, grad_scale: 1.0 2024-06-19 22:21:21,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.02 vs. limit=15.0 2024-06-19 22:21:24,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=73810.0, ans=0.0 2024-06-19 22:21:39,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73846.66666666667, ans=0.125 2024-06-19 22:21:47,307 INFO [train.py:1028] (0/2) Epoch 4, batch 9950, loss[loss=0.4269, simple_loss=0.4154, pruned_loss=0.2192, over 12740.00 frames. ], tot_loss[loss=0.4071, simple_loss=0.3982, pruned_loss=0.208, over 2525485.85 frames. ], batch size: 29, lr: 1.30e-02, grad_scale: 0.5 2024-06-19 22:21:49,651 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+03 3.513e+03 4.367e+03 5.141e+03 1.511e+04, threshold=8.734e+03, percent-clipped=9.0 2024-06-19 22:21:59,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.46 vs. limit=15.0 2024-06-19 22:22:18,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=73975.0, ans=0.125 2024-06-19 22:22:19,459 INFO [train.py:1028] (0/2) Epoch 4, batch 10000, loss[loss=0.3736, simple_loss=0.3766, pruned_loss=0.1853, over 12653.00 frames. ], tot_loss[loss=0.4085, simple_loss=0.3985, pruned_loss=0.2093, over 2485379.67 frames. ], batch size: 22, lr: 1.30e-02, grad_scale: 1.0 2024-06-19 22:22:19,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.67 vs. limit=10.0 2024-06-19 22:22:22,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=73975.0, ans=0.0 2024-06-19 22:22:23,123 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.80 vs. limit=22.5 2024-06-19 22:22:26,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=73993.33333333333, ans=0.0 2024-06-19 22:22:26,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.16 vs. limit=10.0 2024-06-19 22:22:28,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=73993.33333333333, ans=0.0 2024-06-19 22:22:28,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=73993.33333333333, ans=0.125 2024-06-19 22:22:29,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=73993.33333333333, ans=0.0 2024-06-19 22:22:30,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=73993.33333333333, ans=0.125 2024-06-19 22:22:36,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.21 vs. 
limit=10.0 2024-06-19 22:22:40,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.39 vs. limit=15.0 2024-06-19 22:22:47,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74048.33333333333, ans=0.1 2024-06-19 22:22:49,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=74048.33333333333, ans=0.1 2024-06-19 22:22:51,261 INFO [train.py:1028] (0/2) Epoch 4, batch 10050, loss[loss=0.4186, simple_loss=0.4113, pruned_loss=0.2129, over 12492.00 frames. ], tot_loss[loss=0.4125, simple_loss=0.4002, pruned_loss=0.2125, over 2444021.37 frames. ], batch size: 22, lr: 1.30e-02, grad_scale: 1.0 2024-06-19 22:22:54,098 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.517e+03 2.745e+03 3.637e+03 4.570e+03 1.066e+04, threshold=7.273e+03, percent-clipped=3.0 2024-06-19 22:22:55,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.46 vs. limit=15.0 2024-06-19 22:22:57,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=74085.0, ans=0.125 2024-06-19 22:22:57,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.45 vs. limit=10.0 2024-06-19 22:22:59,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=74085.0, ans=0.125 2024-06-19 22:23:08,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.95 vs. limit=22.5 2024-06-19 22:23:20,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=74158.33333333333, ans=0.0 2024-06-19 22:23:21,335 INFO [train.py:1028] (0/2) Epoch 4, batch 10100, loss[loss=0.3592, simple_loss=0.3613, pruned_loss=0.1785, over 10953.00 frames. ], tot_loss[loss=0.4108, simple_loss=0.3992, pruned_loss=0.2112, over 2424910.74 frames. ], batch size: 16, lr: 1.30e-02, grad_scale: 1.0 2024-06-19 22:23:29,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.02 vs. limit=22.5 2024-06-19 22:23:35,402 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-4.pt 2024-06-19 22:25:35,950 INFO [train.py:1028] (0/2) Epoch 5, batch 0, loss[loss=0.3544, simple_loss=0.3534, pruned_loss=0.1777, over 12986.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.3534, pruned_loss=0.1777, over 12986.00 frames. ], batch size: 36, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:25:35,952 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 22:25:43,155 INFO [train.py:1060] (0/2) Epoch 5, validation: loss=0.2693, simple_loss=0.3155, pruned_loss=0.1116, over 351949.00 frames. 
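In the train.py:1028 entries above and below, loss[...] is the current batch alone, while tot_loss[...] is a frame-weighted running average in which older batches are gradually down-weighted, which is why its frame count climbs early in an epoch (573856.27 frames by batch 50) and then hovers near 2.5M. The validation figures are computed over the whole dev set (the 351949.00 frames above) at the start of each epoch, immediately after the end-of-epoch checkpoint (zipformer/exp/epoch-4.pt). A minimal sketch of the frame-weighted bookkeeping, with invented names and without the down-weighting; this is an illustration, not icefall's own tracker:

    class FrameWeightedMeter:
        """Accumulate per-batch losses weighted by how many acoustic frames
        each batch covers, so short batches do not dominate the average."""

        def __init__(self):
            self.sums = {}      # metric name -> frame-weighted sum
            self.frames = 0.0   # total frames accumulated so far

        def update(self, batch_metrics: dict, num_frames: float) -> None:
            self.frames += num_frames
            for name, value in batch_metrics.items():
                self.sums[name] = self.sums.get(name, 0.0) + value * num_frames

        def averages(self) -> dict:
            return {name: s / self.frames for name, s in self.sums.items()}

    meter = FrameWeightedMeter()
    # the figures a "loss[..., over N frames.]" entry reports for one batch:
    meter.update({"loss": 0.3544, "simple_loss": 0.3534, "pruned_loss": 0.1777}, 12986.0)
    meter.update({"loss": 0.3607, "simple_loss": 0.3655, "pruned_loss": 0.178}, 12702.0)
    print(meter.averages())  # comparable in spirit to the tot_loss[...] figures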
2024-06-19 22:25:43,156 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-19 22:25:48,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=74191.33333333333, ans=0.2 2024-06-19 22:25:52,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=74209.66666666667, ans=0.125 2024-06-19 22:26:01,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.64 vs. limit=12.0 2024-06-19 22:26:04,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=74228.0, ans=0.125 2024-06-19 22:26:06,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=74246.33333333333, ans=0.0 2024-06-19 22:26:08,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.63 vs. limit=15.0 2024-06-19 22:26:10,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.528e+03 2.542e+03 3.177e+03 3.813e+03 7.832e+03, threshold=6.355e+03, percent-clipped=1.0 2024-06-19 22:26:18,716 INFO [train.py:1028] (0/2) Epoch 5, batch 50, loss[loss=0.3607, simple_loss=0.3655, pruned_loss=0.178, over 12702.00 frames. ], tot_loss[loss=0.3861, simple_loss=0.3767, pruned_loss=0.1977, over 573856.27 frames. ], batch size: 29, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:26:29,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=74301.33333333333, ans=0.125 2024-06-19 22:26:30,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.45 vs. limit=15.0 2024-06-19 22:26:36,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=74319.66666666667, ans=0.0 2024-06-19 22:26:39,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=74338.0, ans=0.125 2024-06-19 22:26:41,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=15.0 2024-06-19 22:26:50,373 INFO [train.py:1028] (0/2) Epoch 5, batch 100, loss[loss=0.3715, simple_loss=0.3787, pruned_loss=0.1822, over 13316.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.3742, pruned_loss=0.1954, over 1017226.67 frames. ], batch size: 46, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:26:57,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=74393.0, ans=0.125 2024-06-19 22:26:59,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=74393.0, ans=0.125 2024-06-19 22:27:01,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74393.0, ans=0.125 2024-06-19 22:27:03,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.94 vs. 
limit=12.0 2024-06-19 22:27:06,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74411.33333333333, ans=0.1 2024-06-19 22:27:14,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=74429.66666666667, ans=0.125 2024-06-19 22:27:15,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=74429.66666666667, ans=0.02 2024-06-19 22:27:19,840 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.071e+03 2.488e+03 2.913e+03 3.534e+03 5.846e+03, threshold=5.825e+03, percent-clipped=0.0 2024-06-19 22:27:26,715 INFO [train.py:1028] (0/2) Epoch 5, batch 150, loss[loss=0.3701, simple_loss=0.3648, pruned_loss=0.1877, over 12641.00 frames. ], tot_loss[loss=0.3781, simple_loss=0.3717, pruned_loss=0.1922, over 1364921.32 frames. ], batch size: 29, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:27:28,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=74466.33333333333, ans=0.0 2024-06-19 22:27:33,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=74484.66666666667, ans=0.125 2024-06-19 22:27:34,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=74484.66666666667, ans=0.125 2024-06-19 22:27:37,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.25 vs. limit=15.0 2024-06-19 22:27:44,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=74503.0, ans=0.05 2024-06-19 22:27:44,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74503.0, ans=0.125 2024-06-19 22:27:46,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=74521.33333333333, ans=0.125 2024-06-19 22:27:47,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.51 vs. limit=15.0 2024-06-19 22:27:58,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=74539.66666666667, ans=0.035 2024-06-19 22:28:01,926 INFO [train.py:1028] (0/2) Epoch 5, batch 200, loss[loss=0.4105, simple_loss=0.3858, pruned_loss=0.2176, over 12488.00 frames. ], tot_loss[loss=0.3757, simple_loss=0.3705, pruned_loss=0.1905, over 1634673.34 frames. 
], batch size: 202, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:28:02,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=74558.0, ans=0.07 2024-06-19 22:28:02,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74558.0, ans=0.125 2024-06-19 22:28:06,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=74558.0, ans=0.125 2024-06-19 22:28:08,357 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.703e+02 2024-06-19 22:28:26,445 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.369e+03 2.216e+03 2.599e+03 3.185e+03 5.285e+03, threshold=5.197e+03, percent-clipped=0.0 2024-06-19 22:28:33,758 INFO [train.py:1028] (0/2) Epoch 5, batch 250, loss[loss=0.3583, simple_loss=0.3404, pruned_loss=0.1881, over 12986.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.3694, pruned_loss=0.1896, over 1845595.99 frames. ], batch size: 144, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:28:55,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=74704.66666666667, ans=0.2 2024-06-19 22:29:03,268 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.73 vs. limit=10.0 2024-06-19 22:29:05,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74723.0, ans=0.1 2024-06-19 22:29:06,832 INFO [train.py:1028] (0/2) Epoch 5, batch 300, loss[loss=0.379, simple_loss=0.369, pruned_loss=0.1945, over 13145.00 frames. ], tot_loss[loss=0.3751, simple_loss=0.3702, pruned_loss=0.19, over 2009022.74 frames. ], batch size: 112, lr: 1.21e-02, grad_scale: 2.0 2024-06-19 22:29:19,683 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0 2024-06-19 22:29:29,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.19 vs. limit=15.0 2024-06-19 22:29:36,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=74814.66666666667, ans=0.125 2024-06-19 22:29:37,244 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.640e+02 2.026e+03 2.377e+03 2.866e+03 4.081e+03, threshold=4.755e+03, percent-clipped=0.0 2024-06-19 22:29:39,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=74814.66666666667, ans=0.0 2024-06-19 22:29:42,514 INFO [train.py:1028] (0/2) Epoch 5, batch 350, loss[loss=0.3397, simple_loss=0.3515, pruned_loss=0.164, over 12977.00 frames. ], tot_loss[loss=0.3731, simple_loss=0.3691, pruned_loss=0.1886, over 2138544.39 frames. ], batch size: 33, lr: 1.21e-02, grad_scale: 0.5 2024-06-19 22:29:43,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=74833.0, ans=0.025
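The optim.py:487 warnings above summarize the gradient norms of recent optimizer steps as five quantiles (min, 25%, median, 75%, max). In every such message in this log the threshold equals Clipping_scale times the logged median, up to rounding (2.0 x 2.599e+03 gives threshold=5.197e+03; 2.0 x 2.377e+03 gives threshold=4.755e+03), and percent-clipped is the share of recent steps whose norm exceeded that threshold. A hedged sketch of median-based clipping consistent with those numbers; this is an illustration, not the ScaledAdam code behind optim.py:487:

    import numpy as np
    import torch

    class MedianGradClipper:
        def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
            self.clipping_scale = clipping_scale
            self.history = history
            self.norms = []    # grad norms of recent steps
            self.clipped = 0   # basis for a percent-clipped statistic

        def clip_(self, model: torch.nn.Module) -> float:
            params = [p for p in model.parameters() if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
            self.norms = (self.norms + [norm])[-self.history:]
            q = np.quantile(self.norms, [0.0, 0.25, 0.5, 0.75, 1.0])
            threshold = self.clipping_scale * q[2]  # scale times the median
            if norm > threshold:
                self.clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm)
            return threshold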
2024-06-19 22:29:46,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=15.64 vs. limit=15.0 2024-06-19 22:30:00,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=74869.66666666667, ans=0.0 2024-06-19 22:30:19,586 INFO [train.py:1028] (0/2) Epoch 5, batch 400, loss[loss=0.3485, simple_loss=0.3601, pruned_loss=0.1684, over 13270.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.3687, pruned_loss=0.1875, over 2238867.79 frames. ], batch size: 63, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:30:21,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=74924.66666666667, ans=0.0 2024-06-19 22:30:25,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74943.0, ans=0.1 2024-06-19 22:30:31,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0 2024-06-19 22:30:44,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.36 vs. limit=12.0 2024-06-19 22:30:46,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=74998.0, ans=0.0 2024-06-19 22:30:48,328 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.617e+03 2.745e+03 3.195e+03 3.885e+03 8.353e+03, threshold=6.389e+03, percent-clipped=7.0 2024-06-19 22:30:50,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=74998.0, ans=0.125 2024-06-19 22:30:52,851 INFO [train.py:1028] (0/2) Epoch 5, batch 450, loss[loss=0.357, simple_loss=0.3596, pruned_loss=0.1772, over 13271.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.368, pruned_loss=0.1868, over 2313343.39 frames. ], batch size: 67, lr: 1.21e-02, grad_scale: 0.5 2024-06-19 22:30:57,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=75016.33333333333, ans=0.125 2024-06-19 22:30:58,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.30 vs. limit=15.0 2024-06-19 22:31:04,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=75034.66666666667, ans=0.0 2024-06-19 22:31:16,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=75071.33333333333, ans=0.0 2024-06-19 22:31:26,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75089.66666666667, ans=0.1 2024-06-19 22:31:28,697 INFO [train.py:1028] (0/2) Epoch 5, batch 500, loss[loss=0.3856, simple_loss=0.3732, pruned_loss=0.1989, over 13107.00 frames. ], tot_loss[loss=0.3716, simple_loss=0.3688, pruned_loss=0.1872, over 2376214.57 frames. ], batch size: 121, lr: 1.21e-02, grad_scale: 1.0 2024-06-19 22:31:30,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=75108.0, ans=0.2 2024-06-19 22:31:45,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.36 vs.
limit=15.0 2024-06-19 22:31:46,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.97 vs. limit=10.0 2024-06-19 22:31:49,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=75163.0, ans=0.07 2024-06-19 22:31:50,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=75163.0, ans=0.125 2024-06-19 22:31:53,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=75163.0, ans=0.0 2024-06-19 22:31:56,748 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+03 3.059e+03 3.757e+03 4.671e+03 9.028e+03, threshold=7.514e+03, percent-clipped=3.0 2024-06-19 22:32:01,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=75181.33333333333, ans=0.0 2024-06-19 22:32:04,142 INFO [train.py:1028] (0/2) Epoch 5, batch 550, loss[loss=0.3931, simple_loss=0.3804, pruned_loss=0.2029, over 12911.00 frames. ], tot_loss[loss=0.3737, simple_loss=0.3701, pruned_loss=0.1886, over 2420619.10 frames. ], batch size: 158, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:32:05,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0 2024-06-19 22:32:24,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.79 vs. limit=15.0 2024-06-19 22:32:24,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=75254.66666666667, ans=0.04949747468305833 2024-06-19 22:32:34,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=8.0 2024-06-19 22:32:36,611 INFO [train.py:1028] (0/2) Epoch 5, batch 600, loss[loss=0.3581, simple_loss=0.3467, pruned_loss=0.1848, over 13042.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.3693, pruned_loss=0.1879, over 2459043.66 frames. ], batch size: 144, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:32:40,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=75291.33333333333, ans=0.2 2024-06-19 22:32:44,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.71 vs. limit=15.0 2024-06-19 22:32:46,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.94 vs. limit=12.0 2024-06-19 22:32:47,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.81 vs. 
limit=15.0 2024-06-19 22:32:51,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=75328.0, ans=0.125 2024-06-19 22:32:53,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75328.0, ans=0.1 2024-06-19 22:32:57,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=75346.33333333333, ans=0.125 2024-06-19 22:33:04,670 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+03 2.953e+03 3.401e+03 4.172e+03 7.114e+03, threshold=6.802e+03, percent-clipped=0.0 2024-06-19 22:33:05,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-19 22:33:08,727 INFO [train.py:1028] (0/2) Epoch 5, batch 650, loss[loss=0.375, simple_loss=0.3813, pruned_loss=0.1844, over 13234.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.3688, pruned_loss=0.1871, over 2489245.13 frames. ], batch size: 59, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:33:11,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=75383.0, ans=0.125 2024-06-19 22:33:12,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75383.0, ans=0.125 2024-06-19 22:33:14,305 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.864e-01 2024-06-19 22:33:38,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75456.33333333333, ans=0.1 2024-06-19 22:33:44,539 INFO [train.py:1028] (0/2) Epoch 5, batch 700, loss[loss=0.3316, simple_loss=0.3439, pruned_loss=0.1596, over 13270.00 frames. ], tot_loss[loss=0.3706, simple_loss=0.368, pruned_loss=0.1866, over 2511634.23 frames. ], batch size: 46, lr: 1.20e-02, grad_scale: 2.0 2024-06-19 22:33:50,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=75493.0, ans=0.125 2024-06-19 22:33:53,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=75493.0, ans=0.0 2024-06-19 22:33:58,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=75511.33333333333, ans=0.0 2024-06-19 22:33:59,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75511.33333333333, ans=0.1 2024-06-19 22:34:03,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75529.66666666667, ans=0.1 2024-06-19 22:34:10,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.22 vs. limit=12.0 2024-06-19 22:34:14,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.57 vs. 
limit=15.0 2024-06-19 22:34:16,329 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+03 2.432e+03 2.915e+03 3.440e+03 5.515e+03, threshold=5.830e+03, percent-clipped=0.0 2024-06-19 22:34:20,254 INFO [train.py:1028] (0/2) Epoch 5, batch 750, loss[loss=0.329, simple_loss=0.348, pruned_loss=0.155, over 13267.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.3684, pruned_loss=0.1861, over 2529407.63 frames. ], batch size: 63, lr: 1.20e-02, grad_scale: 2.0 2024-06-19 22:34:25,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.46 vs. limit=15.0 2024-06-19 22:34:31,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=75584.66666666667, ans=0.0 2024-06-19 22:34:40,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=75621.33333333333, ans=0.025 2024-06-19 22:34:45,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.75 vs. limit=15.0 2024-06-19 22:34:47,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=75639.66666666667, ans=0.2 2024-06-19 22:34:52,905 INFO [train.py:1028] (0/2) Epoch 5, batch 800, loss[loss=0.3422, simple_loss=0.3476, pruned_loss=0.1684, over 12946.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.3677, pruned_loss=0.1861, over 2541111.24 frames. ], batch size: 36, lr: 1.20e-02, grad_scale: 4.0 2024-06-19 22:35:08,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.61 vs. limit=15.0 2024-06-19 22:35:09,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=75694.66666666667, ans=0.025 2024-06-19 22:35:11,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75694.66666666667, ans=0.125 2024-06-19 22:35:22,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=75731.33333333333, ans=0.0 2024-06-19 22:35:23,796 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.379e+03 3.001e+03 3.685e+03 4.323e+03 7.197e+03, threshold=7.370e+03, percent-clipped=8.0 2024-06-19 22:35:24,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.00 vs. limit=15.0 2024-06-19 22:35:25,857 INFO [train.py:1028] (0/2) Epoch 5, batch 850, loss[loss=0.3685, simple_loss=0.3619, pruned_loss=0.1875, over 13125.00 frames. ], tot_loss[loss=0.3681, simple_loss=0.3663, pruned_loss=0.185, over 2550233.95 frames. ], batch size: 95, lr: 1.20e-02, grad_scale: 0.5
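The grad_scale printed at the end of each train.py:1028 entry is not a model quantity but the mixed-precision loss scale, and its movements here (2.0 at batch 750, 4.0 at batch 800, down to 0.5 by batch 850) follow the usual dynamic-scaling rule: grow the scale after a run of overflow-free steps, halve it whenever scaled fp16 gradients overflow. A generic PyTorch AMP step with those semantics; icefall's actual scaler policy may differ:

    import torch
    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler(init_scale=1.0, growth_factor=2.0, backoff_factor=0.5)

    def train_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with autocast(dtype=torch.float16):
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()  # scale the loss so fp16 grads stay representable
        scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
        scaler.update()                # halves the scale on overflow, grows it otherwise
        return loss.detach(), scaler.get_scale()  # get_scale() is the logged grad_scale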
2024-06-19 22:35:41,888 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2024-06-19 22:35:52,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=75804.66666666667, ans=0.0 2024-06-19 22:35:55,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=75823.0, ans=0.0 2024-06-19 22:35:55,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.59 vs. limit=10.0 2024-06-19 22:36:00,957 INFO [train.py:1028] (0/2) Epoch 5, batch 900, loss[loss=0.331, simple_loss=0.3364, pruned_loss=0.1628, over 12955.00 frames. ], tot_loss[loss=0.3681, simple_loss=0.366, pruned_loss=0.1851, over 2556202.18 frames. ], batch size: 36, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:36:02,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=75841.33333333333, ans=0.125 2024-06-19 22:36:20,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=75878.0, ans=0.125 2024-06-19 22:36:35,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+03 2.761e+03 3.312e+03 4.209e+03 7.224e+03, threshold=6.625e+03, percent-clipped=0.0 2024-06-19 22:36:36,540 INFO [train.py:1028] (0/2) Epoch 5, batch 950, loss[loss=0.3835, simple_loss=0.3876, pruned_loss=0.1897, over 13190.00 frames. ], tot_loss[loss=0.3683, simple_loss=0.3662, pruned_loss=0.1852, over 2560115.90 frames. ], batch size: 40, lr: 1.20e-02, grad_scale: 0.5 2024-06-19 22:36:41,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75933.0, ans=0.1 2024-06-19 22:36:42,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=75933.0, ans=0.125 2024-06-19 22:36:58,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=75988.0, ans=0.125 2024-06-19 22:36:58,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=75988.0, ans=0.125 2024-06-19 22:37:05,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76006.33333333333, ans=0.1 2024-06-19 22:37:08,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76024.66666666667, ans=0.1 2024-06-19 22:37:09,346 INFO [train.py:1028] (0/2) Epoch 5, batch 1000, loss[loss=0.3854, simple_loss=0.3835, pruned_loss=0.1936, over 13320.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.3653, pruned_loss=0.1848, over 2562543.79 frames. ], batch size: 49, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:37:12,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0 2024-06-19 22:37:15,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=12.0
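Each scaling.py:214 entry prints the current value (ans) of one ScheduledFloat hyperparameter at the given batch_count: dropout probabilities at 0.1, balancer probabilities at 0.125, skip rates that have decayed to 0.0, constants such as 0.025 and 0.2. These are schedules evaluated against the global batch count, and by batch_count near 76000 most have settled at their final values, which is why the same numbers repeat. A sketch of a piecewise-linear schedule in that spirit; the breakpoints below are invented for illustration, and the real class lives in icefall's scaling.py:

    import bisect

    class PiecewiseLinearFloat:
        """Value is linearly interpolated between (batch_count, value)
        breakpoints and held constant outside them."""

        def __init__(self, *points: tuple):
            self.xs = [x for x, _ in points]  # batch counts, ascending
            self.ys = [y for _, y in points]

        def value(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a dropout that decays from 0.3 to a floor of 0.1 over the first 20k batches:
    dropout_p = PiecewiseLinearFloat((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p.value(76006.33))  # -> 0.1, matching the flat values logged here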
2024-06-19 22:37:17,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.20 vs. limit=15.0 2024-06-19 22:37:18,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.48 vs. limit=22.5 2024-06-19 22:37:35,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.22 vs. limit=10.0 2024-06-19 22:37:44,327 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.337e+03 2.595e+03 2.899e+03 3.400e+03 5.036e+03, threshold=5.797e+03, percent-clipped=0.0 2024-06-19 22:37:45,019 INFO [train.py:1028] (0/2) Epoch 5, batch 1050, loss[loss=0.3506, simple_loss=0.3612, pruned_loss=0.17, over 13123.00 frames. ], tot_loss[loss=0.3679, simple_loss=0.3659, pruned_loss=0.1849, over 2565833.89 frames. ], batch size: 77, lr: 1.20e-02, grad_scale: 0.5 2024-06-19 22:37:46,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=76116.33333333333, ans=0.125 2024-06-19 22:37:48,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=76116.33333333333, ans=0.125 2024-06-19 22:37:49,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76116.33333333333, ans=0.1 2024-06-19 22:37:50,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.04 vs. limit=6.0 2024-06-19 22:38:04,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=76171.33333333333, ans=0.125 2024-06-19 22:38:10,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=76171.33333333333, ans=10.0 2024-06-19 22:38:16,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=76189.66666666667, ans=0.0 2024-06-19 22:38:21,046 INFO [train.py:1028] (0/2) Epoch 5, batch 1100, loss[loss=0.3839, simple_loss=0.383, pruned_loss=0.1924, over 13288.00 frames. ], tot_loss[loss=0.3683, simple_loss=0.3666, pruned_loss=0.185, over 2570896.04 frames. ], batch size: 52, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:38:26,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=76208.0, ans=0.2 2024-06-19 22:38:30,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=76226.33333333333, ans=0.0 2024-06-19 22:38:35,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.80 vs.
limit=22.5 2024-06-19 22:38:37,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76244.66666666667, ans=0.125 2024-06-19 22:38:40,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=76263.0, ans=0.5 2024-06-19 22:38:42,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=76263.0, ans=0.0 2024-06-19 22:38:43,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76263.0, ans=0.125 2024-06-19 22:38:43,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2024-06-19 22:38:43,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=8.0 2024-06-19 22:38:53,242 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.288e+03 1.994e+03 2.418e+03 2.914e+03 4.189e+03, threshold=4.835e+03, percent-clipped=0.0 2024-06-19 22:38:53,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=76299.66666666667, ans=0.05 2024-06-19 22:38:53,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=76299.66666666667, ans=15.0 2024-06-19 22:38:53,838 INFO [train.py:1028] (0/2) Epoch 5, batch 1150, loss[loss=0.3874, simple_loss=0.3885, pruned_loss=0.1931, over 13258.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.3673, pruned_loss=0.1856, over 2572305.68 frames. ], batch size: 52, lr: 1.20e-02, grad_scale: 1.0 2024-06-19 22:39:03,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=76318.0, ans=0.0 2024-06-19 22:39:17,810 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.63 vs. limit=22.5 2024-06-19 22:39:28,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=76373.0, ans=0.125 2024-06-19 22:39:29,148 INFO [train.py:1028] (0/2) Epoch 5, batch 1200, loss[loss=0.3672, simple_loss=0.3759, pruned_loss=0.1793, over 13236.00 frames. ], tot_loss[loss=0.368, simple_loss=0.3662, pruned_loss=0.1849, over 2574177.22 frames. ], batch size: 77, lr: 1.20e-02, grad_scale: 2.0 2024-06-19 22:39:32,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.09 vs. limit=15.0 2024-06-19 22:39:39,815 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.84 vs. limit=22.5 2024-06-19 22:39:40,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=76409.66666666667, ans=0.125 2024-06-19 22:39:46,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=76428.0, ans=0.125 2024-06-19 22:39:57,104 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.07 vs. 
limit=15.0 2024-06-19 22:39:58,861 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2024-06-19 22:39:59,719 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.331e+03 2.123e+03 2.518e+03 2.920e+03 6.180e+03, threshold=5.035e+03, percent-clipped=1.0 2024-06-19 22:40:00,332 INFO [train.py:1028] (0/2) Epoch 5, batch 1250, loss[loss=0.3651, simple_loss=0.3634, pruned_loss=0.1834, over 13166.00 frames. ], tot_loss[loss=0.3653, simple_loss=0.3647, pruned_loss=0.1829, over 2584108.89 frames. ], batch size: 112, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:40:01,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=76483.0, ans=0.125 2024-06-19 22:40:17,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.35 vs. limit=15.0 2024-06-19 22:40:17,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76519.66666666667, ans=0.1 2024-06-19 22:40:28,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.97 vs. limit=15.0 2024-06-19 22:40:31,572 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.38 vs. limit=22.5 2024-06-19 22:40:34,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=76556.33333333333, ans=0.125 2024-06-19 22:40:35,532 INFO [train.py:1028] (0/2) Epoch 5, batch 1300, loss[loss=0.3851, simple_loss=0.3684, pruned_loss=0.2009, over 12721.00 frames. ], tot_loss[loss=0.3653, simple_loss=0.3649, pruned_loss=0.1828, over 2584661.58 frames. ], batch size: 176, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:40:38,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=76574.66666666667, ans=0.025 2024-06-19 22:40:39,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=76574.66666666667, ans=0.0 2024-06-19 22:40:49,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=76611.33333333333, ans=0.05 2024-06-19 22:40:51,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=76611.33333333333, ans=0.2 2024-06-19 22:41:01,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.76 vs. limit=22.5 2024-06-19 22:41:02,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=76648.0, ans=0.04949747468305833
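A scaling.py:1023 entry is emitted when a Whiten module measures a covariance statistic above its configured limit, as in metric=22.76 vs. limit=22.5 above; the module responds by nudging the layer's output covariance back toward isotropy via a gradient penalty. One plausible such metric is the ratio of the mean squared eigenvalue to the squared mean eigenvalue of the per-group feature covariance, which is exactly 1.0 for perfectly white features and grows as a few directions dominate. This is an assumed formulation for illustration, not necessarily the definition used in scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """x: (num_frames, num_channels). Mean of squared covariance eigenvalues
        divided by the squared mean eigenvalue, averaged over channel groups."""
        n, c = x.shape
        assert c % num_groups == 0
        metrics = []
        for g in x.reshape(n, num_groups, c // num_groups).unbind(dim=1):
            cov = (g.T @ g) / n                # per-group covariance estimate
            eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending
            metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
        return torch.stack(metrics).mean().item()

    x = torch.randn(4000, 64)     # roughly isotropic features -> metric near 1.0
    print(whitening_metric(x))
    x[:, 0] *= 20.0               # one dominant direction -> metric grows sharply
    print(whitening_metric(x))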
2024-06-19 22:41:07,670 INFO [train.py:1028] (0/2) Epoch 5, batch 1350, loss[loss=0.3518, simple_loss=0.361, pruned_loss=0.1713, over 13255.00 frames. ], tot_loss[loss=0.364, simple_loss=0.3643, pruned_loss=0.1819, over 2587974.04 frames. ], batch size: 59, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:41:07,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76666.33333333333, ans=0.125 2024-06-19 22:41:08,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.26 vs. limit=15.0 2024-06-19 22:41:08,304 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.290e+03 2.207e+03 2.591e+03 2.953e+03 4.688e+03, threshold=5.183e+03, percent-clipped=0.0 2024-06-19 22:41:11,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76666.33333333333, ans=0.125 2024-06-19 22:41:14,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.91 vs. limit=15.0 2024-06-19 22:41:23,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=76703.0, ans=0.125 2024-06-19 22:41:24,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76703.0, ans=0.1 2024-06-19 22:41:32,418 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.97 vs. limit=6.0 2024-06-19 22:41:32,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=76721.33333333333, ans=0.2 2024-06-19 22:41:34,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=76721.33333333333, ans=0.0 2024-06-19 22:41:34,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76721.33333333333, ans=0.125 2024-06-19 22:41:35,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76721.33333333333, ans=0.125 2024-06-19 22:41:38,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=76739.66666666667, ans=0.125 2024-06-19 22:41:38,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.08 vs. limit=15.0 2024-06-19 22:41:43,418 INFO [train.py:1028] (0/2) Epoch 5, batch 1400, loss[loss=0.4381, simple_loss=0.4253, pruned_loss=0.2255, over 12348.00 frames. ], tot_loss[loss=0.3656, simple_loss=0.3654, pruned_loss=0.1829, over 2588525.17 frames. ], batch size: 25, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:41:55,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=15.0 2024-06-19 22:41:56,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76794.66666666667, ans=0.125 2024-06-19 22:42:12,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.16 vs.
limit=6.0 2024-06-19 22:42:12,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76831.33333333333, ans=0.1 2024-06-19 22:42:17,925 INFO [train.py:1028] (0/2) Epoch 5, batch 1450, loss[loss=0.3362, simple_loss=0.334, pruned_loss=0.1692, over 13155.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.3662, pruned_loss=0.1843, over 2588669.63 frames. ], batch size: 121, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:42:19,181 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.228e+03 2.164e+03 2.554e+03 2.866e+03 7.464e+03, threshold=5.107e+03, percent-clipped=1.0 2024-06-19 22:42:23,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76849.66666666667, ans=0.125 2024-06-19 22:42:26,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=76868.0, ans=0.2 2024-06-19 22:42:28,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=76868.0, ans=0.07 2024-06-19 22:42:29,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76868.0, ans=0.125 2024-06-19 22:42:34,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.29 vs. limit=22.5 2024-06-19 22:42:40,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=76904.66666666667, ans=0.025 2024-06-19 22:42:42,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=76904.66666666667, ans=0.05 2024-06-19 22:42:44,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=76923.0, ans=0.0 2024-06-19 22:42:44,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=76923.0, ans=0.125 2024-06-19 22:42:49,961 INFO [train.py:1028] (0/2) Epoch 5, batch 1500, loss[loss=0.3286, simple_loss=0.3441, pruned_loss=0.1566, over 13166.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.3664, pruned_loss=0.1844, over 2590538.53 frames. ], batch size: 83, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:42:57,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=76959.66666666667, ans=0.95 2024-06-19 22:42:59,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=76959.66666666667, ans=0.125 2024-06-19 22:42:59,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=76959.66666666667, ans=0.0 2024-06-19 22:43:05,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.55 vs. limit=10.0 2024-06-19 22:43:17,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=77014.66666666667, ans=0.025 2024-06-19 22:43:25,212 INFO [train.py:1028] (0/2) Epoch 5, batch 1550, loss[loss=0.4026, simple_loss=0.3854, pruned_loss=0.2099, over 13007.00 frames. 
], tot_loss[loss=0.3677, simple_loss=0.3663, pruned_loss=0.1845, over 2586170.97 frames. ], batch size: 102, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:43:27,138 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+03 2.711e+03 3.073e+03 3.675e+03 7.236e+03, threshold=6.147e+03, percent-clipped=5.0 2024-06-19 22:43:29,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=77033.0, ans=0.025 2024-06-19 22:43:41,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=77069.66666666667, ans=0.5 2024-06-19 22:43:42,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77069.66666666667, ans=0.1 2024-06-19 22:43:48,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.25 vs. limit=15.0 2024-06-19 22:44:00,601 INFO [train.py:1028] (0/2) Epoch 5, batch 1600, loss[loss=0.3734, simple_loss=0.3775, pruned_loss=0.1846, over 13199.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.3659, pruned_loss=0.1845, over 2582265.52 frames. ], batch size: 77, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:44:15,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=77161.33333333333, ans=0.125 2024-06-19 22:44:17,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=77161.33333333333, ans=0.125 2024-06-19 22:44:25,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=77198.0, ans=0.2 2024-06-19 22:44:27,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=77198.0, ans=0.07 2024-06-19 22:44:30,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=77198.0, ans=0.5 2024-06-19 22:44:31,881 INFO [train.py:1028] (0/2) Epoch 5, batch 1650, loss[loss=0.3711, simple_loss=0.3612, pruned_loss=0.1905, over 13184.00 frames. ], tot_loss[loss=0.37, simple_loss=0.3672, pruned_loss=0.1863, over 2577757.84 frames. 
], batch size: 95, lr: 1.19e-02, grad_scale: 0.5 2024-06-19 22:44:32,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=77216.33333333333, ans=0.125 2024-06-19 22:44:35,109 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+03 3.212e+03 3.825e+03 4.431e+03 9.177e+03, threshold=7.649e+03, percent-clipped=3.0 2024-06-19 22:44:38,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=77234.66666666667, ans=0.035 2024-06-19 22:44:44,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=77253.0, ans=0.125 2024-06-19 22:44:47,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=77253.0, ans=0.125 2024-06-19 22:44:49,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=77253.0, ans=0.0 2024-06-19 22:44:49,608 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.341e+01 2024-06-19 22:44:52,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=77271.33333333333, ans=0.0 2024-06-19 22:45:04,845 INFO [train.py:1028] (0/2) Epoch 5, batch 1700, loss[loss=0.3602, simple_loss=0.3715, pruned_loss=0.1745, over 12278.00 frames. ], tot_loss[loss=0.371, simple_loss=0.3684, pruned_loss=0.1869, over 2581575.62 frames. ], batch size: 25, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:45:11,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0 2024-06-19 22:45:12,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=77326.33333333333, ans=0.125 2024-06-19 22:45:13,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=77326.33333333333, ans=0.09899494936611666 2024-06-19 22:45:18,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=77344.66666666667, ans=0.125 2024-06-19 22:45:18,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77344.66666666667, ans=0.1 2024-06-19 22:45:19,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=77344.66666666667, ans=0.125 2024-06-19 22:45:35,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.46 vs. limit=8.0 2024-06-19 22:45:40,224 INFO [train.py:1028] (0/2) Epoch 5, batch 1750, loss[loss=0.382, simple_loss=0.3876, pruned_loss=0.1882, over 12448.00 frames. ], tot_loss[loss=0.372, simple_loss=0.369, pruned_loss=0.1875, over 2581484.51 frames. 
], batch size: 22, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:45:43,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.419e+02 1.793e+03 2.253e+03 2.706e+03 4.326e+03, threshold=4.507e+03, percent-clipped=0.0 2024-06-19 22:45:43,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=77399.66666666667, ans=0.025 2024-06-19 22:45:56,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=77436.33333333333, ans=0.125 2024-06-19 22:46:10,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=77473.0, ans=0.0 2024-06-19 22:46:11,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=77473.0, ans=0.125 2024-06-19 22:46:15,417 INFO [train.py:1028] (0/2) Epoch 5, batch 1800, loss[loss=0.3769, simple_loss=0.3798, pruned_loss=0.187, over 13187.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.3677, pruned_loss=0.1863, over 2582268.89 frames. ], batch size: 67, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:46:23,448 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.041e-03 2024-06-19 22:46:27,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=77509.66666666667, ans=0.2 2024-06-19 22:46:29,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=77528.0, ans=0.0 2024-06-19 22:46:43,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=77564.66666666667, ans=0.0 2024-06-19 22:46:44,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77564.66666666667, ans=0.1 2024-06-19 22:46:44,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=77564.66666666667, ans=0.0 2024-06-19 22:46:47,395 INFO [train.py:1028] (0/2) Epoch 5, batch 1850, loss[loss=0.4071, simple_loss=0.3944, pruned_loss=0.2099, over 13170.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.3671, pruned_loss=0.1857, over 2583900.54 frames. ], batch size: 83, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:46:50,635 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.400e+02 1.530e+03 1.810e+03 2.144e+03 3.543e+03, threshold=3.620e+03, percent-clipped=0.0 2024-06-19 22:47:03,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=77619.66666666667, ans=0.125 2024-06-19 22:47:06,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2024-06-19 22:47:06,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2024-06-19 22:47:19,031 INFO [train.py:1028] (0/2) Epoch 5, batch 1900, loss[loss=0.2974, simple_loss=0.3148, pruned_loss=0.14, over 13137.00 frames. ], tot_loss[loss=0.3682, simple_loss=0.3664, pruned_loss=0.185, over 2585614.33 frames. 
], batch size: 95, lr: 1.19e-02, grad_scale: 2.0 2024-06-19 22:47:20,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=77674.66666666667, ans=0.125 2024-06-19 22:47:33,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=77693.0, ans=0.125 2024-06-19 22:47:36,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=77711.33333333333, ans=0.125 2024-06-19 22:47:49,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=77729.66666666667, ans=0.125 2024-06-19 22:47:51,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=77748.0, ans=10.0 2024-06-19 22:47:57,885 INFO [train.py:1028] (0/2) Epoch 5, batch 1950, loss[loss=0.3699, simple_loss=0.3724, pruned_loss=0.1837, over 13284.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3647, pruned_loss=0.1835, over 2591312.17 frames. ], batch size: 52, lr: 1.19e-02, grad_scale: 1.0 2024-06-19 22:48:01,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=77766.33333333333, ans=0.2 2024-06-19 22:48:02,250 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.187e+03 1.877e+03 2.140e+03 2.417e+03 3.562e+03, threshold=4.280e+03, percent-clipped=0.0 2024-06-19 22:48:07,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=77784.66666666667, ans=0.125 2024-06-19 22:48:07,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=77784.66666666667, ans=0.125 2024-06-19 22:48:08,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=77784.66666666667, ans=0.0 2024-06-19 22:48:19,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.55 vs. limit=22.5 2024-06-19 22:48:20,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=77821.33333333333, ans=0.09899494936611666 2024-06-19 22:48:24,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=77839.66666666667, ans=0.125 2024-06-19 22:48:25,874 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 22:48:29,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77839.66666666667, ans=0.1 2024-06-19 22:48:30,207 INFO [train.py:1028] (0/2) Epoch 5, batch 2000, loss[loss=0.3406, simple_loss=0.3587, pruned_loss=0.1612, over 12529.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3648, pruned_loss=0.1834, over 2587585.96 frames. 
], batch size: 22, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:48:32,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=77858.0, ans=0.125 2024-06-19 22:48:36,781 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=8.352e+00 2024-06-19 22:48:39,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77876.33333333333, ans=0.1 2024-06-19 22:48:43,964 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.05 vs. limit=10.0 2024-06-19 22:48:47,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.19 vs. limit=22.5 2024-06-19 22:49:02,594 INFO [train.py:1028] (0/2) Epoch 5, batch 2050, loss[loss=0.3464, simple_loss=0.3565, pruned_loss=0.1682, over 12863.00 frames. ], tot_loss[loss=0.3653, simple_loss=0.3647, pruned_loss=0.1829, over 2581847.64 frames. ], batch size: 29, lr: 1.18e-02, grad_scale: 1.0 2024-06-19 22:49:04,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=77949.66666666667, ans=0.2 2024-06-19 22:49:07,840 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.876e+02 1.508e+03 1.777e+03 2.111e+03 4.263e+03, threshold=3.554e+03, percent-clipped=0.0 2024-06-19 22:49:14,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=77968.0, ans=0.2 2024-06-19 22:49:15,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=77986.33333333333, ans=10.0 2024-06-19 22:49:18,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=77986.33333333333, ans=0.125 2024-06-19 22:49:38,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=78041.33333333333, ans=0.1 2024-06-19 22:49:38,816 INFO [train.py:1028] (0/2) Epoch 5, batch 2100, loss[loss=0.3456, simple_loss=0.3577, pruned_loss=0.1668, over 13180.00 frames. ], tot_loss[loss=0.3638, simple_loss=0.3641, pruned_loss=0.1818, over 2584396.81 frames. ], batch size: 59, lr: 1.18e-02, grad_scale: 1.0 2024-06-19 22:49:43,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=20.89 vs. limit=15.0 2024-06-19 22:49:44,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=78041.33333333333, ans=0.2 2024-06-19 22:49:46,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=78059.66666666667, ans=0.125 2024-06-19 22:49:51,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=78078.0, ans=0.025 2024-06-19 22:49:52,287 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=6.806e+00 2024-06-19 22:50:03,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.27 vs. 
limit=22.5 2024-06-19 22:50:06,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.19 vs. limit=10.0 2024-06-19 22:50:06,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=78114.66666666667, ans=0.0 2024-06-19 22:50:06,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=78114.66666666667, ans=0.125 2024-06-19 22:50:06,993 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=2.537e-03 2024-06-19 22:50:07,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=78114.66666666667, ans=0.07 2024-06-19 22:50:13,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=78114.66666666667, ans=0.0 2024-06-19 22:50:14,178 INFO [train.py:1028] (0/2) Epoch 5, batch 2150, loss[loss=0.3358, simple_loss=0.3526, pruned_loss=0.1595, over 13256.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.3635, pruned_loss=0.1805, over 2586633.40 frames. ], batch size: 52, lr: 1.18e-02, grad_scale: 1.0 2024-06-19 22:50:17,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=78133.0, ans=0.0 2024-06-19 22:50:20,246 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.131e+03 1.605e+03 1.810e+03 2.282e+03 9.917e+03, threshold=3.620e+03, percent-clipped=4.0 2024-06-19 22:50:23,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78151.33333333333, ans=0.0 2024-06-19 22:50:24,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.12 vs. limit=22.5 2024-06-19 22:50:45,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=78206.33333333333, ans=0.0 2024-06-19 22:50:46,986 INFO [train.py:1028] (0/2) Epoch 5, batch 2200, loss[loss=0.3837, simple_loss=0.3734, pruned_loss=0.197, over 13259.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.364, pruned_loss=0.1811, over 2587620.95 frames. ], batch size: 83, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:50:47,484 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.02 vs. limit=6.0 2024-06-19 22:50:49,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78224.66666666667, ans=0.1 2024-06-19 22:50:51,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=78224.66666666667, ans=10.0 2024-06-19 22:50:59,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.14 vs. 
limit=15.0 2024-06-19 22:51:11,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=78279.66666666667, ans=0.0 2024-06-19 22:51:12,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=78298.0, ans=0.0 2024-06-19 22:51:17,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.31 vs. limit=15.0 2024-06-19 22:51:17,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-19 22:51:19,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=78316.33333333333, ans=0.2 2024-06-19 22:51:19,537 INFO [train.py:1028] (0/2) Epoch 5, batch 2250, loss[loss=0.3688, simple_loss=0.3688, pruned_loss=0.1844, over 13274.00 frames. ], tot_loss[loss=0.362, simple_loss=0.3634, pruned_loss=0.1803, over 2587353.86 frames. ], batch size: 63, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:51:25,204 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.003e+03 1.473e+03 1.702e+03 1.969e+03 3.738e+03, threshold=3.405e+03, percent-clipped=1.0 2024-06-19 22:51:29,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.61 vs. limit=15.0 2024-06-19 22:51:55,132 INFO [train.py:1028] (0/2) Epoch 5, batch 2300, loss[loss=0.3724, simple_loss=0.3728, pruned_loss=0.186, over 12835.00 frames. ], tot_loss[loss=0.36, simple_loss=0.3619, pruned_loss=0.179, over 2580730.30 frames. ], batch size: 33, lr: 1.18e-02, grad_scale: 4.0 2024-06-19 22:51:56,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.39 vs. limit=10.0 2024-06-19 22:52:19,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=78463.0, ans=0.125 2024-06-19 22:52:23,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=78463.0, ans=15.0 2024-06-19 22:52:30,722 INFO [train.py:1028] (0/2) Epoch 5, batch 2350, loss[loss=0.362, simple_loss=0.3669, pruned_loss=0.1786, over 13218.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.3621, pruned_loss=0.1791, over 2584922.84 frames. 
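A consistency check worth knowing when reading these records: the three numbers in every loss[...] and tot_loss[...] tuple satisfy loss = 0.5 * simple_loss + pruned_loss, i.e. the headline figure is the pruned-transducer loss plus a half-weighted simple (unpruned) term. Verified against the records just above:

```python
# "batch 2300" record: loss=0.3724, simple_loss=0.3728, pruned_loss=0.186
assert abs(0.5 * 0.3728 + 0.186 - 0.3724) < 1e-4

# "batch 2350" running total: loss=0.3601, simple_loss=0.3621, pruned_loss=0.1791
assert abs(0.5 * 0.3621 + 0.1791 - 0.3601) < 1e-4
```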
], batch size: 67, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:52:37,270 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.573e+02 1.753e+03 1.990e+03 2.317e+03 4.083e+03, threshold=3.979e+03, percent-clipped=3.0 2024-06-19 22:52:38,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=78518.0, ans=0.0 2024-06-19 22:52:39,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=78518.0, ans=0.125 2024-06-19 22:52:42,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=78518.0, ans=0.125 2024-06-19 22:52:50,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=78554.66666666667, ans=0.125 2024-06-19 22:52:50,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=78554.66666666667, ans=0.05 2024-06-19 22:52:53,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=78554.66666666667, ans=0.125 2024-06-19 22:53:03,087 INFO [train.py:1028] (0/2) Epoch 5, batch 2400, loss[loss=0.3383, simple_loss=0.3473, pruned_loss=0.1646, over 13276.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.3611, pruned_loss=0.1789, over 2587459.31 frames. ], batch size: 46, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:53:28,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=78646.33333333333, ans=0.125 2024-06-19 22:53:33,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=78664.66666666667, ans=0.2 2024-06-19 22:53:35,575 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=6.607e+01 2024-06-19 22:53:38,654 INFO [train.py:1028] (0/2) Epoch 5, batch 2450, loss[loss=0.3274, simple_loss=0.3399, pruned_loss=0.1575, over 13259.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.3602, pruned_loss=0.1791, over 2583671.60 frames. ], batch size: 63, lr: 1.18e-02, grad_scale: 1.0 2024-06-19 22:53:42,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=15.0 2024-06-19 22:53:46,526 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.235e+03 1.910e+03 2.342e+03 2.731e+03 5.956e+03, threshold=4.683e+03, percent-clipped=2.0 2024-06-19 22:53:55,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=78719.66666666667, ans=0.04949747468305833 2024-06-19 22:53:58,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=78719.66666666667, ans=0.015 2024-06-19 22:54:00,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=78738.0, ans=0.2 2024-06-19 22:54:03,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78738.0, ans=0.1 2024-06-19 22:54:13,802 INFO [train.py:1028] (0/2) Epoch 5, batch 2500, loss[loss=0.3048, simple_loss=0.3138, pruned_loss=0.1479, over 13211.00 frames. 
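The grad_scale field in the batch records is dynamic fp16 loss scaling at work: it is doubled after stretches of overflow-free steps (reaching 4.0 near batch 2300 above) and halved when scaled gradients overflow (back to 1.0 by batch 2450, and down to 0.5 further on in this excerpt). The behaviour matches PyTorch's stock GradScaler, sketched here; the growth interval is a placeholder, not this run's setting:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0,        # records above hover around grad_scale: 1.0-4.0
    growth_factor=2.0,     # doubled after `growth_interval` clean steps
    backoff_factor=0.5,    # halved whenever scaled grads contain inf/nan
    growth_interval=2000,  # placeholder value
)

def fp16_step(model, optimizer, batch, compute_loss):
    # compute_loss is a stand-in for the model's loss computation
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()  # backward through the scaled loss
    scaler.step(optimizer)         # silently skips the step on overflow
    scaler.update()                # grows or backs off the scale
    return scaler.get_scale()      # the "grad_scale" reported in the log
```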
], tot_loss[loss=0.3577, simple_loss=0.3589, pruned_loss=0.1782, over 2586419.73 frames. ], batch size: 83, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:54:19,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=78774.66666666667, ans=0.0 2024-06-19 22:54:30,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78811.33333333333, ans=0.1 2024-06-19 22:54:37,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=78829.66666666667, ans=0.0 2024-06-19 22:54:43,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.48 vs. limit=15.0 2024-06-19 22:54:44,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=78848.0, ans=0.125 2024-06-19 22:54:46,455 INFO [train.py:1028] (0/2) Epoch 5, batch 2550, loss[loss=0.3389, simple_loss=0.3527, pruned_loss=0.1625, over 12681.00 frames. ], tot_loss[loss=0.3558, simple_loss=0.3572, pruned_loss=0.1772, over 2585547.59 frames. ], batch size: 22, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:54:54,375 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.346e+03 1.883e+03 2.192e+03 2.712e+03 4.016e+03, threshold=4.384e+03, percent-clipped=0.0 2024-06-19 22:55:09,399 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=5.315e+01 2024-06-19 22:55:21,934 INFO [train.py:1028] (0/2) Epoch 5, batch 2600, loss[loss=0.3175, simple_loss=0.3321, pruned_loss=0.1514, over 13311.00 frames. ], tot_loss[loss=0.3541, simple_loss=0.3553, pruned_loss=0.1765, over 2585059.69 frames. ], batch size: 52, lr: 1.18e-02, grad_scale: 4.0 2024-06-19 22:55:22,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.03 vs. limit=15.0 2024-06-19 22:55:24,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=78958.0, ans=0.2 2024-06-19 22:55:24,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78958.0, ans=0.1 2024-06-19 22:55:25,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.55 vs. limit=22.5 2024-06-19 22:55:28,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=78976.33333333333, ans=0.125 2024-06-19 22:55:30,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=78976.33333333333, ans=0.025 2024-06-19 22:55:30,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=78976.33333333333, ans=0.025 2024-06-19 22:55:33,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=78976.33333333333, ans=0.0 2024-06-19 22:55:41,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. 
limit=12.0 2024-06-19 22:55:57,564 INFO [train.py:1028] (0/2) Epoch 5, batch 2650, loss[loss=0.321, simple_loss=0.3167, pruned_loss=0.1626, over 13008.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3532, pruned_loss=0.1753, over 2585018.43 frames. ], batch size: 144, lr: 1.18e-02, grad_scale: 1.0 2024-06-19 22:56:06,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79068.0, ans=0.125 2024-06-19 22:56:06,410 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.025e+03 1.809e+03 2.066e+03 2.309e+03 3.584e+03, threshold=4.133e+03, percent-clipped=0.0 2024-06-19 22:56:14,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.65 vs. limit=10.0 2024-06-19 22:56:15,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=79086.33333333333, ans=0.125 2024-06-19 22:56:24,645 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.259e-02 2024-06-19 22:56:28,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=79123.0, ans=0.125 2024-06-19 22:56:29,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=79141.33333333333, ans=0.2 2024-06-19 22:56:29,608 INFO [train.py:1028] (0/2) Epoch 5, batch 2700, loss[loss=0.3382, simple_loss=0.3433, pruned_loss=0.1665, over 13205.00 frames. ], tot_loss[loss=0.3503, simple_loss=0.3515, pruned_loss=0.1745, over 2582369.19 frames. ], batch size: 89, lr: 1.18e-02, grad_scale: 2.0 2024-06-19 22:56:31,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2024-06-19 22:56:31,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=79141.33333333333, ans=0.125 2024-06-19 22:56:32,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=79141.33333333333, ans=0.07 2024-06-19 22:56:39,222 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.11 vs. limit=22.5 2024-06-19 22:56:42,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=79178.0, ans=0.125 2024-06-19 22:56:46,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2024-06-19 22:56:46,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.34 vs. limit=22.5 2024-06-19 22:56:59,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79214.66666666667, ans=0.1 2024-06-19 22:57:02,267 INFO [train.py:1028] (0/2) Epoch 5, batch 2750, loss[loss=0.374, simple_loss=0.3633, pruned_loss=0.1923, over 13318.00 frames. ], tot_loss[loss=0.348, simple_loss=0.35, pruned_loss=0.173, over 2580330.11 frames. 
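Note how each tot_loss[...] is reported "over" roughly 2.58e6 frames no matter how far training has progressed; the total does not grow across the epoch, which points to an exponentially decaying average of recent batches rather than an epoch-wide sum. A sketch under that assumption; the decay constant is a guess, chosen so that ~13k frames per batch settles near the logged total:

```python
class RunningLossSketch:
    """Frame-weighted exponential moving average, one plausible reading of
    the tot_loss[... over ~2.58e6 frames] records: decay=0.995 with ~13k
    frames per batch settles near 13000 / (1 - 0.995) = 2.6e6 frames."""

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_frames = 0.0  # decayed sum of loss * num_frames
        self.frames = 0.0       # decayed sum of num_frames

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_frames = self.decay * self.loss_frames + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    def value(self):
        # e.g. "tot_loss[loss=..., over 2585018.43 frames]"
        return self.loss_frames / self.frames, self.frames
```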
], batch size: 43, lr: 1.17e-02, grad_scale: 1.0 2024-06-19 22:57:15,298 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.076e+03 1.770e+03 2.077e+03 2.494e+03 5.562e+03, threshold=4.154e+03, percent-clipped=4.0 2024-06-19 22:57:23,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=8.0 2024-06-19 22:57:31,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=79288.0, ans=0.125 2024-06-19 22:57:32,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=79288.0, ans=0.125 2024-06-19 22:57:34,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79306.33333333333, ans=0.125 2024-06-19 22:57:34,766 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.973e+01 2024-06-19 22:57:41,169 INFO [train.py:1028] (0/2) Epoch 5, batch 2800, loss[loss=0.3876, simple_loss=0.3584, pruned_loss=0.2084, over 10780.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.35, pruned_loss=0.1735, over 2579159.78 frames. ], batch size: 303, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 22:57:42,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=79324.66666666667, ans=0.125 2024-06-19 22:57:43,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=79324.66666666667, ans=0.0 2024-06-19 22:57:47,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=79343.0, ans=0.125 2024-06-19 22:57:48,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=79343.0, ans=0.025 2024-06-19 22:57:54,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79361.33333333333, ans=0.1 2024-06-19 22:58:02,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79379.66666666667, ans=0.1 2024-06-19 22:58:07,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=79398.0, ans=0.05 2024-06-19 22:58:13,154 INFO [train.py:1028] (0/2) Epoch 5, batch 2850, loss[loss=0.3685, simple_loss=0.3768, pruned_loss=0.1801, over 12987.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.3497, pruned_loss=0.1738, over 2577527.72 frames. ], batch size: 48, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 22:58:22,989 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+03 2.131e+03 2.550e+03 2.976e+03 4.403e+03, threshold=5.100e+03, percent-clipped=1.0 2024-06-19 22:58:27,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=79453.0, ans=0.125 2024-06-19 22:58:44,411 INFO [train.py:1028] (0/2) Epoch 5, batch 2900, loss[loss=0.3258, simple_loss=0.3298, pruned_loss=0.1609, over 13152.00 frames. ], tot_loss[loss=0.3467, simple_loss=0.3476, pruned_loss=0.1729, over 2585999.63 frames. 
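The learning rate drifts down smoothly (1.19e-02 at the top of this excerpt, 1.17e-02 here, 1.15e-02 by batch 4300) rather than in steps, which is characteristic of an inverse-power decay in both the batch index and the epoch, in the style of icefall's Eden scheduler. A hedged sketch of that shape; the constants below are placeholders, not this run's configuration:

```python
def eden_style_lr(base_lr: float, batch: int, epoch: float,
                  lr_batches: float = 5000.0, lr_epochs: float = 4.0) -> float:
    """Smooth decay, roughly flat early on and ~ batch^-0.5 * epoch^-0.5
    asymptotically. Sketch only: constants are illustrative."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```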
], batch size: 55, lr: 1.17e-02, grad_scale: 1.0 2024-06-19 22:58:45,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.43 vs. limit=15.0 2024-06-19 22:58:48,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79508.0, ans=0.1 2024-06-19 22:59:00,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=79544.66666666667, ans=10.0 2024-06-19 22:59:01,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=79544.66666666667, ans=0.0 2024-06-19 22:59:02,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.20 vs. limit=15.0 2024-06-19 22:59:02,683 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.28 vs. limit=22.5 2024-06-19 22:59:05,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=79544.66666666667, ans=0.025 2024-06-19 22:59:06,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=79544.66666666667, ans=0.0 2024-06-19 22:59:13,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=79563.0, ans=0.125 2024-06-19 22:59:15,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=79581.33333333333, ans=0.125 2024-06-19 22:59:24,546 INFO [train.py:1028] (0/2) Epoch 5, batch 2950, loss[loss=0.3211, simple_loss=0.3293, pruned_loss=0.1564, over 13263.00 frames. ], tot_loss[loss=0.3458, simple_loss=0.3469, pruned_loss=0.1723, over 2579563.50 frames. ], batch size: 43, lr: 1.17e-02, grad_scale: 1.0 2024-06-19 22:59:25,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=79599.66666666667, ans=0.125 2024-06-19 22:59:31,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79618.0, ans=0.125 2024-06-19 22:59:31,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.44 vs. limit=22.5 2024-06-19 22:59:36,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.60 vs. limit=10.0 2024-06-19 22:59:36,315 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.181e+03 2.454e+03 2.906e+03 3.485e+03 8.729e+03, threshold=5.811e+03, percent-clipped=3.0 2024-06-19 22:59:41,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.53 vs. 
limit=15.0 2024-06-19 22:59:42,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=79636.33333333333, ans=0.2 2024-06-19 22:59:49,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=79654.66666666667, ans=0.125 2024-06-19 22:59:53,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=79673.0, ans=0.2 2024-06-19 22:59:53,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.75 vs. limit=5.0 2024-06-19 22:59:57,762 INFO [train.py:1028] (0/2) Epoch 5, batch 3000, loss[loss=0.3668, simple_loss=0.3669, pruned_loss=0.1834, over 13243.00 frames. ], tot_loss[loss=0.3444, simple_loss=0.3454, pruned_loss=0.1717, over 2578267.03 frames. ], batch size: 59, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 22:59:57,763 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 23:00:03,773 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.7400, 4.9612, 3.4136, 5.4952], device='cuda:0') 2024-06-19 23:00:05,487 INFO [train.py:1060] (0/2) Epoch 5, validation: loss=0.2538, simple_loss=0.3037, pruned_loss=0.1019, over 351949.00 frames. 2024-06-19 23:00:05,487 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-19 23:00:10,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79691.33333333333, ans=0.1 2024-06-19 23:00:15,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=79709.66666666667, ans=0.125 2024-06-19 23:00:20,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=79728.0, ans=0.0 2024-06-19 23:00:27,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=79746.33333333333, ans=0.125 2024-06-19 23:00:29,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2024-06-19 23:00:39,762 INFO [train.py:1028] (0/2) Epoch 5, batch 3050, loss[loss=0.3172, simple_loss=0.3369, pruned_loss=0.1487, over 13287.00 frames. ], tot_loss[loss=0.345, simple_loss=0.3453, pruned_loss=0.1724, over 2578189.54 frames. ], batch size: 46, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 23:00:45,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=79783.0, ans=0.125 2024-06-19 23:00:51,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+03 2.722e+03 3.308e+03 3.742e+03 5.908e+03, threshold=6.617e+03, percent-clipped=1.0 2024-06-19 23:01:02,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=79838.0, ans=0.0 2024-06-19 23:01:08,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79838.0, ans=0.125 2024-06-19 23:01:09,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.17 vs. 
limit=10.0 2024-06-19 23:01:10,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79856.33333333333, ans=0.125 2024-06-19 23:01:11,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=79856.33333333333, ans=0.0 2024-06-19 23:01:16,087 INFO [train.py:1028] (0/2) Epoch 5, batch 3100, loss[loss=0.3331, simple_loss=0.3305, pruned_loss=0.1678, over 13002.00 frames. ], tot_loss[loss=0.344, simple_loss=0.3444, pruned_loss=0.1718, over 2579610.45 frames. ], batch size: 144, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 23:01:42,295 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:01:43,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=79929.66666666667, ans=0.125 2024-06-19 23:01:48,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=79948.0, ans=0.125 2024-06-19 23:01:50,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.74 vs. limit=6.0 2024-06-19 23:01:52,557 INFO [train.py:1028] (0/2) Epoch 5, batch 3150, loss[loss=0.3737, simple_loss=0.3597, pruned_loss=0.1939, over 12873.00 frames. ], tot_loss[loss=0.3421, simple_loss=0.343, pruned_loss=0.1706, over 2581439.07 frames. ], batch size: 158, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 23:01:59,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79984.66666666667, ans=0.125 2024-06-19 23:02:03,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=79984.66666666667, ans=0.0 2024-06-19 23:02:04,594 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.371e+03 2.414e+03 2.944e+03 3.348e+03 4.771e+03, threshold=5.888e+03, percent-clipped=0.0 2024-06-19 23:02:13,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.28 vs. limit=15.0 2024-06-19 23:02:13,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=80021.33333333333, ans=0.0 2024-06-19 23:02:16,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=80021.33333333333, ans=0.125 2024-06-19 23:02:21,092 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.006e+01 2024-06-19 23:02:23,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=80039.66666666667, ans=0.125 2024-06-19 23:02:26,188 INFO [train.py:1028] (0/2) Epoch 5, batch 3200, loss[loss=0.3237, simple_loss=0.3334, pruned_loss=0.157, over 13131.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3435, pruned_loss=0.171, over 2580817.19 frames. 
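The Whitening lines compare a per-module metric against a limit. The metric is scale-invariant: it equals 1.0 when the feature covariance is isotropic ("white") and grows as variance concentrates in fewer directions, and the module only intervenes in the backward pass once the limit is exceeded, so entries like metric=5.74 vs. limit=6.0 are unremarkable. A sketch of one way to compute such a metric from covariance eigenvalues; this is the quantity the lines appear to report, not the exact code:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """E[lambda^2] / E[lambda]^2 over covariance eigenvalues: 1.0 for white
    features, larger for anisotropic ones. Sketch of the logged 'metric'."""
    x = x.reshape(-1, x.shape[-1]).float()
    x = x - x.mean(dim=0, keepdim=True)   # centered features
    cov = x.T @ x / x.shape[0]            # (C, C) covariance
    lam = torch.linalg.eigvalsh(cov)      # eigenvalues, ascending
    return (lam ** 2).mean() / lam.mean() ** 2

# isotropic noise sits near the lower bound:
print(whitening_metric(torch.randn(10000, 192)))  # ~1.0
```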
], batch size: 55, lr: 1.17e-02, grad_scale: 4.0 2024-06-19 23:02:28,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=80058.0, ans=0.0 2024-06-19 23:02:28,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80058.0, ans=0.1 2024-06-19 23:02:34,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=80076.33333333333, ans=0.1 2024-06-19 23:02:41,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=80094.66666666667, ans=0.125 2024-06-19 23:02:41,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=80094.66666666667, ans=0.125 2024-06-19 23:02:53,819 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:03:02,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=80149.66666666667, ans=0.125 2024-06-19 23:03:02,745 INFO [train.py:1028] (0/2) Epoch 5, batch 3250, loss[loss=0.3449, simple_loss=0.3457, pruned_loss=0.1721, over 13298.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3432, pruned_loss=0.1712, over 2585051.56 frames. ], batch size: 72, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 23:03:02,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=80149.66666666667, ans=0.125 2024-06-19 23:03:08,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=80149.66666666667, ans=0.125 2024-06-19 23:03:11,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=80168.0, ans=0.125 2024-06-19 23:03:16,208 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+03 2.523e+03 2.997e+03 3.669e+03 6.402e+03, threshold=5.993e+03, percent-clipped=1.0 2024-06-19 23:03:28,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=80204.66666666667, ans=0.0 2024-06-19 23:03:28,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80204.66666666667, ans=0.1 2024-06-19 23:03:30,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.78 vs. limit=15.0 2024-06-19 23:03:39,787 INFO [train.py:1028] (0/2) Epoch 5, batch 3300, loss[loss=0.3894, simple_loss=0.3712, pruned_loss=0.2038, over 12720.00 frames. ], tot_loss[loss=0.343, simple_loss=0.3433, pruned_loss=0.1713, over 2581983.33 frames. 
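The validation pass logged a little above (batch 3000: "Computing validation loss", then "Epoch 5, validation: loss=0.2538 ... over 351949.00 frames") fires on a fixed batch interval; the same 0.5 * simple + pruned relation holds there too (0.5 * 0.3037 + 0.1019 = 0.2538). A sketch of the surrounding control flow, with the interval inferred from the log and compute_loss a hypothetical helper:

```python
import torch

def maybe_validate(model, dev_loader, compute_loss, batch_idx: int,
                   interval: int = 3000):
    """Every `interval` training batches, score the dev set with gradients
    off, then drop back into training mode. Sketch only."""
    if batch_idx == 0 or batch_idx % interval != 0:
        return None
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)  # hypothetical helper
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames  # e.g. "validation: loss=0.2538 ..."
```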
], batch size: 176, lr: 1.17e-02, grad_scale: 2.0 2024-06-19 23:03:39,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80241.33333333333, ans=0.1 2024-06-19 23:03:42,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=80241.33333333333, ans=0.0 2024-06-19 23:03:59,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=80296.33333333333, ans=0.025 2024-06-19 23:04:02,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=80296.33333333333, ans=0.2 2024-06-19 23:04:12,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.87 vs. limit=15.0 2024-06-19 23:04:12,849 INFO [train.py:1028] (0/2) Epoch 5, batch 3350, loss[loss=0.3685, simple_loss=0.3552, pruned_loss=0.1909, over 12892.00 frames. ], tot_loss[loss=0.3422, simple_loss=0.342, pruned_loss=0.1712, over 2577940.86 frames. ], batch size: 158, lr: 1.17e-02, grad_scale: 0.5 2024-06-19 23:04:18,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=80333.0, ans=0.2 2024-06-19 23:04:18,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=80333.0, ans=0.125 2024-06-19 23:04:27,466 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.950e+02 1.730e+03 2.047e+03 2.434e+03 4.918e+03, threshold=4.093e+03, percent-clipped=0.0 2024-06-19 23:04:29,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=80369.66666666667, ans=0.0 2024-06-19 23:04:31,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=80369.66666666667, ans=0.0 2024-06-19 23:04:51,752 INFO [train.py:1028] (0/2) Epoch 5, batch 3400, loss[loss=0.3569, simple_loss=0.3608, pruned_loss=0.1765, over 12670.00 frames. ], tot_loss[loss=0.3423, simple_loss=0.3418, pruned_loss=0.1714, over 2575346.50 frames. ], batch size: 22, lr: 1.17e-02, grad_scale: 1.0 2024-06-19 23:05:00,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=80443.0, ans=0.125 2024-06-19 23:05:01,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=80443.0, ans=0.0 2024-06-19 23:05:04,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=80461.33333333333, ans=0.125 2024-06-19 23:05:05,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.94 vs. limit=10.0 2024-06-19 23:05:07,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=80461.33333333333, ans=0.125 2024-06-19 23:05:19,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.16 vs. limit=15.0 2024-06-19 23:05:26,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. 
limit=15.0 2024-06-19 23:05:28,960 INFO [train.py:1028] (0/2) Epoch 5, batch 3450, loss[loss=0.3437, simple_loss=0.3316, pruned_loss=0.1779, over 12745.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3402, pruned_loss=0.17, over 2576311.67 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 1.0 2024-06-19 23:05:29,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=80516.33333333333, ans=0.125 2024-06-19 23:05:29,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=80516.33333333333, ans=0.125 2024-06-19 23:05:32,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.54 vs. limit=12.0 2024-06-19 23:05:35,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=80534.66666666667, ans=0.125 2024-06-19 23:05:37,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=80534.66666666667, ans=0.0 2024-06-19 23:05:42,861 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.572e+02 1.500e+03 1.792e+03 2.258e+03 5.058e+03, threshold=3.585e+03, percent-clipped=2.0 2024-06-19 23:05:47,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=80571.33333333333, ans=0.1 2024-06-19 23:05:53,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=80589.66666666667, ans=0.125 2024-06-19 23:05:54,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=80589.66666666667, ans=0.125 2024-06-19 23:05:54,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=80589.66666666667, ans=0.125 2024-06-19 23:05:57,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=80589.66666666667, ans=0.125 2024-06-19 23:06:01,252 INFO [train.py:1028] (0/2) Epoch 5, batch 3500, loss[loss=0.3402, simple_loss=0.35, pruned_loss=0.1652, over 12924.00 frames. ], tot_loss[loss=0.3386, simple_loss=0.3394, pruned_loss=0.1689, over 2575667.14 frames. ], batch size: 33, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:06:07,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=80626.33333333333, ans=0.2 2024-06-19 23:06:13,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-19 23:06:16,968 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.54 vs. 
limit=15.0 2024-06-19 23:06:17,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=80644.66666666667, ans=0.125 2024-06-19 23:06:21,981 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-44000.pt 2024-06-19 23:06:32,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=80681.33333333333, ans=0.0 2024-06-19 23:06:34,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=80681.33333333333, ans=0.2 2024-06-19 23:06:36,499 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:06:39,639 INFO [train.py:1028] (0/2) Epoch 5, batch 3550, loss[loss=0.2935, simple_loss=0.3042, pruned_loss=0.1414, over 13114.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.3379, pruned_loss=0.1675, over 2578230.25 frames. ], batch size: 95, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:06:53,703 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.280e+02 1.174e+03 1.356e+03 1.634e+03 4.681e+03, threshold=2.712e+03, percent-clipped=1.0 2024-06-19 23:06:57,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.86 vs. limit=22.5 2024-06-19 23:07:10,109 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.24 vs. limit=22.5 2024-06-19 23:07:13,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=80773.0, ans=0.2 2024-06-19 23:07:14,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=80773.0, ans=0.125 2024-06-19 23:07:15,729 INFO [train.py:1028] (0/2) Epoch 5, batch 3600, loss[loss=0.2977, simple_loss=0.3181, pruned_loss=0.1386, over 13098.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3372, pruned_loss=0.167, over 2580602.59 frames. ], batch size: 48, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:07:15,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=80791.33333333333, ans=0.2 2024-06-19 23:07:19,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=80791.33333333333, ans=0.125 2024-06-19 23:07:33,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80828.0, ans=0.125 2024-06-19 23:07:35,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=80828.0, ans=0.0 2024-06-19 23:07:35,555 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=12.0 2024-06-19 23:07:42,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=80846.33333333333, ans=0.0 2024-06-19 23:07:51,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.79 vs. 
limit=15.0 2024-06-19 23:07:52,182 INFO [train.py:1028] (0/2) Epoch 5, batch 3650, loss[loss=0.3189, simple_loss=0.3192, pruned_loss=0.1593, over 12994.00 frames. ], tot_loss[loss=0.3346, simple_loss=0.3367, pruned_loss=0.1663, over 2579602.15 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:07:58,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0 2024-06-19 23:08:07,135 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.756e+02 1.233e+03 1.484e+03 1.778e+03 3.637e+03, threshold=2.968e+03, percent-clipped=4.0 2024-06-19 23:08:11,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.27 vs. limit=15.0 2024-06-19 23:08:16,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=80938.0, ans=0.125 2024-06-19 23:08:21,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80956.33333333333, ans=0.1 2024-06-19 23:08:25,133 INFO [train.py:1028] (0/2) Epoch 5, batch 3700, loss[loss=0.3216, simple_loss=0.3404, pruned_loss=0.1514, over 13303.00 frames. ], tot_loss[loss=0.3326, simple_loss=0.3351, pruned_loss=0.165, over 2583665.77 frames. ], batch size: 72, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:08:34,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=80993.0, ans=0.025 2024-06-19 23:08:35,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=80993.0, ans=0.125 2024-06-19 23:08:58,561 INFO [train.py:1028] (0/2) Epoch 5, batch 3750, loss[loss=0.3456, simple_loss=0.3544, pruned_loss=0.1683, over 12378.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3348, pruned_loss=0.1648, over 2585781.75 frames. ], batch size: 22, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:09:09,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=81084.66666666667, ans=0.025 2024-06-19 23:09:10,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=81084.66666666667, ans=0.0 2024-06-19 23:09:18,505 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.153e+03 2.043e+03 2.365e+03 2.732e+03 4.895e+03, threshold=4.730e+03, percent-clipped=19.0 2024-06-19 23:09:29,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=81139.66666666667, ans=0.125 2024-06-19 23:09:38,260 INFO [train.py:1028] (0/2) Epoch 5, batch 3800, loss[loss=0.3206, simple_loss=0.3266, pruned_loss=0.1573, over 13240.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3348, pruned_loss=0.1648, over 2584329.75 frames. 
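The checkpoint.py record further up ("Saving checkpoint to zipformer/exp/checkpoint-44000.pt") is batch-count-driven checkpointing, with the global batch index baked into the filename. A minimal sketch; every_n is an assumption (44000 is merely consistent with a 4000-batch cadence):

```python
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: str, every_n: int = 4000) -> None:
    """Save model/optimizer state every `every_n` global batches, naming the
    file after the batch index (cf. checkpoint-44000.pt above). Sketch only."""
    if batch_idx_train == 0 or batch_idx_train % every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        f"{exp_dir}/checkpoint-{batch_idx_train}.pt",
    )
```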
], batch size: 83, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:09:41,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=81158.0, ans=0.125 2024-06-19 23:09:46,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=81176.33333333333, ans=0.125 2024-06-19 23:09:50,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81176.33333333333, ans=0.1 2024-06-19 23:09:53,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.88 vs. limit=12.0 2024-06-19 23:10:03,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2024-06-19 23:10:11,402 INFO [train.py:1028] (0/2) Epoch 5, batch 3850, loss[loss=0.3347, simple_loss=0.3258, pruned_loss=0.1718, over 13031.00 frames. ], tot_loss[loss=0.331, simple_loss=0.334, pruned_loss=0.164, over 2584109.45 frames. ], batch size: 144, lr: 1.16e-02, grad_scale: 0.5 2024-06-19 23:10:20,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=81268.0, ans=0.125 2024-06-19 23:10:23,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=81268.0, ans=0.025 2024-06-19 23:10:25,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=20.56 vs. limit=15.0 2024-06-19 23:10:29,092 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.493e+03 2.337e+03 2.713e+03 3.169e+03 6.664e+03, threshold=5.425e+03, percent-clipped=2.0 2024-06-19 23:10:33,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=81304.66666666667, ans=10.0 2024-06-19 23:10:43,590 INFO [train.py:1028] (0/2) Epoch 5, batch 3900, loss[loss=0.3041, simple_loss=0.3075, pruned_loss=0.1504, over 13177.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3336, pruned_loss=0.1642, over 2587490.15 frames. ], batch size: 83, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:10:45,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=81341.33333333333, ans=0.125 2024-06-19 23:10:50,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=81359.66666666667, ans=0.125 2024-06-19 23:11:07,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=81396.33333333333, ans=0.0 2024-06-19 23:11:08,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=81396.33333333333, ans=0.125 2024-06-19 23:11:09,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.60 vs. limit=22.5 2024-06-19 23:11:09,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. 
limit=15.0 2024-06-19 23:11:10,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.43 vs. limit=22.5 2024-06-19 23:11:16,356 INFO [train.py:1028] (0/2) Epoch 5, batch 3950, loss[loss=0.3117, simple_loss=0.315, pruned_loss=0.1542, over 13076.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3318, pruned_loss=0.1624, over 2589434.12 frames. ], batch size: 132, lr: 1.16e-02, grad_scale: 1.0 2024-06-19 23:11:22,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=81433.0, ans=0.025 2024-06-19 23:11:23,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=81451.33333333333, ans=0.0 2024-06-19 23:11:34,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=81469.66666666667, ans=0.0 2024-06-19 23:11:38,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=81469.66666666667, ans=0.125 2024-06-19 23:11:40,266 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.221e+03 1.594e+03 1.880e+03 2.309e+03 4.851e+03, threshold=3.760e+03, percent-clipped=0.0 2024-06-19 23:11:44,042 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.72 vs. limit=10.0 2024-06-19 23:11:45,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81488.0, ans=0.1 2024-06-19 23:11:45,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=81488.0, ans=0.125 2024-06-19 23:11:45,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81488.0, ans=0.1 2024-06-19 23:11:48,884 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.54 vs. limit=15.0 2024-06-19 23:11:55,500 INFO [train.py:1028] (0/2) Epoch 5, batch 4000, loss[loss=0.342, simple_loss=0.3446, pruned_loss=0.1698, over 12955.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3311, pruned_loss=0.1621, over 2583336.35 frames. ], batch size: 39, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:11:56,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.56 vs. limit=22.5 2024-06-19 23:12:03,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.90 vs. limit=15.0 2024-06-19 23:12:14,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=81561.33333333333, ans=0.0 2024-06-19 23:12:15,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=81579.66666666667, ans=0.015 2024-06-19 23:12:17,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=81579.66666666667, ans=0.0 2024-06-19 23:12:28,479 INFO [train.py:1028] (0/2) Epoch 5, batch 4050, loss[loss=0.3772, simple_loss=0.3459, pruned_loss=0.2043, over 11044.00 frames. 
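Many of the scheduled names in these records end in balancer parameters (min_positive with ans=0.025 just above, max_abs with ans=10.0 and prob with ans=0.125 elsewhere in this excerpt). A balancer is an identity in the forward pass; in the backward pass it nudges per-channel activation statistics, such as the fraction of positive values and the mean magnitude, toward a target range. The real module is considerably more careful; this is a deliberately simplified sketch with invented constants, assuming x has shape (batch, channels):

```python
import torch

class BalancerSketch(torch.autograd.Function):
    """Identity in forward; in backward, adds a small gradient term pushing
    the fraction of positive activations above min_positive and the mean
    |activation| below max_abs. Simplified sketch, not the real module."""

    @staticmethod
    def forward(ctx, x, min_positive=0.05, max_abs=10.0, scale=1e-4):
        ctx.save_for_backward(x)
        ctx.cfg = (min_positive, max_abs, scale)
        return x

    @staticmethod
    def backward(ctx, grad):
        (x,) = ctx.saved_tensors
        min_positive, max_abs, scale = ctx.cfg
        extra = torch.zeros_like(x)
        # too few positive values in a channel -> negative grad pushes x up
        pos_frac = (x > 0).float().mean(dim=0, keepdim=True)
        extra = extra - scale * (pos_frac < min_positive).float()
        # mean magnitude too large -> grad with the sign of x shrinks it
        too_big = (x.abs().mean(dim=0, keepdim=True) > max_abs).float()
        extra = extra + scale * too_big * x.sign()
        return grad + extra, None, None, None
```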
], tot_loss[loss=0.327, simple_loss=0.3305, pruned_loss=0.1618, over 2581698.22 frames. ], batch size: 304, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:12:37,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=12.0 2024-06-19 23:12:39,540 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.520e-01 2024-06-19 23:12:45,802 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.157e+03 1.932e+03 2.305e+03 2.608e+03 6.099e+03, threshold=4.610e+03, percent-clipped=3.0 2024-06-19 23:12:55,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=81689.66666666667, ans=0.025 2024-06-19 23:12:56,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=81689.66666666667, ans=0.025 2024-06-19 23:13:01,025 INFO [train.py:1028] (0/2) Epoch 5, batch 4100, loss[loss=0.3635, simple_loss=0.3509, pruned_loss=0.1881, over 13020.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3314, pruned_loss=0.1629, over 2578187.87 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 4.0 2024-06-19 23:13:22,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=81744.66666666667, ans=0.0 2024-06-19 23:13:24,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=81763.0, ans=0.125 2024-06-19 23:13:27,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=81763.0, ans=15.0 2024-06-19 23:13:31,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=81781.33333333333, ans=0.125 2024-06-19 23:13:34,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=81781.33333333333, ans=0.025 2024-06-19 23:13:35,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=81781.33333333333, ans=0.125 2024-06-19 23:13:36,111 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:13:41,111 INFO [train.py:1028] (0/2) Epoch 5, batch 4150, loss[loss=0.3299, simple_loss=0.3346, pruned_loss=0.1626, over 13181.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3311, pruned_loss=0.1626, over 2576975.32 frames. 
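The WithLoss records (loss-sum=2.520e-01 just above, often exactly 0.000e+00 when nothing needs correcting) track small named penalties attached to attention weights; the penalty rides along with the main loss while the tensor itself passes through unchanged. A sketch of the bookkeeping, with the penalty formula invented for illustration:

```python
import torch

AUX_LOSSES = {}  # name -> accumulated penalty, cf. the "loss-sum=" records

def with_loss(x: torch.Tensor, penalty: torch.Tensor, name: str) -> torch.Tensor:
    """Return x unchanged but stash a named auxiliary penalty for the training
    loop to add to the main loss and log. Sketch of the mechanism only."""
    AUX_LOSSES[name] = AUX_LOSSES.get(name, 0.0) + penalty.sum()
    return x

# hypothetical use on attention weights: penalize rows that saturate
attn = torch.softmax(torch.randn(4, 16, 16), dim=-1)
penalty = (attn.max(dim=-1).values - 0.95).clamp(min=0.0)
attn = with_loss(attn, penalty, "encoder.layers.1.self_attn_weights")
```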
], batch size: 55, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:13:42,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=81799.66666666667, ans=0.125 2024-06-19 23:13:42,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=81799.66666666667, ans=0.125 2024-06-19 23:13:44,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=81799.66666666667, ans=0.125 2024-06-19 23:13:47,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=81818.0, ans=0.0 2024-06-19 23:13:48,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.25 vs. limit=15.0 2024-06-19 23:13:53,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=81836.33333333333, ans=0.0 2024-06-19 23:13:59,581 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.352e+02 1.498e+03 1.854e+03 2.110e+03 3.690e+03, threshold=3.707e+03, percent-clipped=0.0 2024-06-19 23:14:00,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=81854.66666666667, ans=0.0 2024-06-19 23:14:05,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=81854.66666666667, ans=0.025 2024-06-19 23:14:07,158 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.15 vs. limit=15.0 2024-06-19 23:14:14,109 INFO [train.py:1028] (0/2) Epoch 5, batch 4200, loss[loss=0.3121, simple_loss=0.3197, pruned_loss=0.1522, over 13050.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3303, pruned_loss=0.162, over 2580186.15 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 4.0 2024-06-19 23:14:25,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.27 vs. limit=15.0 2024-06-19 23:14:35,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=21.27 vs. limit=15.0 2024-06-19 23:14:36,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=15.0 2024-06-19 23:14:46,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=81964.66666666667, ans=0.125 2024-06-19 23:14:47,671 INFO [train.py:1028] (0/2) Epoch 5, batch 4250, loss[loss=0.3198, simple_loss=0.3319, pruned_loss=0.1539, over 13236.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3298, pruned_loss=0.1613, over 2582096.99 frames. ], batch size: 46, lr: 1.16e-02, grad_scale: 2.0 2024-06-19 23:14:53,372 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. limit=6.0 2024-06-19 23:14:58,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=22.5
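Each WARNING from optim.py:487 prints the min/25%/50%/75%/max of recently observed gradient norms, the clipping threshold, and the share of recent batches that were clipped. Throughout this log the threshold equals Clipping_scale times the median, e.g. 2.0 * 1.854e+03 ≈ 3.707e+03 in the warning above, which points to median-based adaptive clipping along the following lines (the class name and history length are illustrative assumptions):

import collections
import torch

class MedianGradClipper:
    # Keep a window of recent gradient norms and clip to
    # clipping_scale * median, mirroring the quartile warnings above.
    def __init__(self, clipping_scale: float = 2.0, history: int = 400):
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=history)

    def clip_(self, parameters) -> float:
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g) for g in grads]))
        self.norms.append(float(norm))
        quartiles = torch.quantile(torch.tensor(list(self.norms)),
                                   torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * float(quartiles[2])
        if float(norm) > threshold:
            for g in grads:
                g.mul_(threshold / float(norm))  # rescale gradients in place
        return threshold

percent-clipped then falls out as the fraction of batches in the window whose norm exceeded the threshold when they were processed.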
2024-06-19 23:15:07,165 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.140e+03 1.770e+03 1.968e+03 2.259e+03 4.961e+03, threshold=3.936e+03, percent-clipped=1.0 2024-06-19 23:15:10,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.41 vs. limit=10.0 2024-06-19 23:15:20,172 INFO [train.py:1028] (0/2) Epoch 5, batch 4300, loss[loss=0.3414, simple_loss=0.3491, pruned_loss=0.1669, over 13194.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3298, pruned_loss=0.1611, over 2583033.34 frames. ], batch size: 59, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:15:20,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82074.66666666667, ans=0.1 2024-06-19 23:15:27,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=82074.66666666667, ans=0.125 2024-06-19 23:15:32,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=82093.0, ans=0.125 2024-06-19 23:15:37,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82111.33333333333, ans=0.1 2024-06-19 23:15:38,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=82111.33333333333, ans=0.0 2024-06-19 23:15:47,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=82129.66666666667, ans=0.2 2024-06-19 23:15:47,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=82129.66666666667, ans=0.2 2024-06-19 23:15:54,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=82148.0, ans=0.125 2024-06-19 23:15:54,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2024-06-19 23:15:56,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2024-06-19 23:15:59,103 INFO [train.py:1028] (0/2) Epoch 5, batch 4350, loss[loss=0.3194, simple_loss=0.3325, pruned_loss=0.1531, over 13189.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.3284, pruned_loss=0.1601, over 2587484.91 frames. ], batch size: 59, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:15:59,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=82166.33333333333, ans=0.125 2024-06-19 23:16:02,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=82166.33333333333, ans=0.125 2024-06-19 23:16:07,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs.
limit=15.0 2024-06-19 23:16:17,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=82203.0, ans=0.125 2024-06-19 23:16:18,867 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.348e+02 1.875e+03 2.161e+03 2.490e+03 4.207e+03, threshold=4.321e+03, percent-clipped=2.0 2024-06-19 23:16:19,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=82221.33333333333, ans=0.0 2024-06-19 23:16:24,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=82239.66666666667, ans=0.2 2024-06-19 23:16:29,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=82239.66666666667, ans=0.0 2024-06-19 23:16:32,142 INFO [train.py:1028] (0/2) Epoch 5, batch 4400, loss[loss=0.3435, simple_loss=0.344, pruned_loss=0.1715, over 13233.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3278, pruned_loss=0.1596, over 2587647.95 frames. ], batch size: 83, lr: 1.15e-02, grad_scale: 4.0 2024-06-19 23:16:33,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.60 vs. limit=10.0 2024-06-19 23:16:36,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=82258.0, ans=0.0 2024-06-19 23:16:47,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=82294.66666666667, ans=0.0 2024-06-19 23:16:50,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.51 vs. limit=22.5 2024-06-19 23:16:54,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.93 vs. limit=12.0 2024-06-19 23:16:58,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=82331.33333333333, ans=0.125 2024-06-19 23:16:59,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=82331.33333333333, ans=0.0 2024-06-19 23:17:03,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=82331.33333333333, ans=0.025 2024-06-19 23:17:05,057 INFO [train.py:1028] (0/2) Epoch 5, batch 4450, loss[loss=0.348, simple_loss=0.3473, pruned_loss=0.1744, over 12979.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3278, pruned_loss=0.1599, over 2582355.10 frames. ], batch size: 33, lr: 1.15e-02, grad_scale: 1.0 2024-06-19 23:17:16,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.90 vs. limit=15.0 2024-06-19 23:17:20,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.90 vs. limit=15.0 2024-06-19 23:17:21,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.03 vs. limit=15.0
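The scaling.py:214 lines each print one named ScheduledFloat together with the global batch_count and the value it currently resolves to (ans=...). This deep into training almost all of them sit at a constant (0.1, 0.125, 0.025, ...), which is what a piecewise-linear schedule over batch count would produce once past its last breakpoint. A sketch of such a schedule; the class below is a guess at the behaviour, and the breakpoints in the example are invented:

class ScheduledFloat:
    # A float hyperparameter (dropout prob, skip rate, ...) interpolated
    # piecewise-linearly against the global batch count, and held flat
    # outside the first and last breakpoints.
    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))  # invented breakpoints
print(dropout_p(82203.0))  # -> 0.1, flat past the last breakpoint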
2024-06-19 23:17:25,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=82386.33333333333, ans=0.5 2024-06-19 23:17:28,456 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.476e+02 1.365e+03 1.583e+03 1.911e+03 3.208e+03, threshold=3.166e+03, percent-clipped=0.0 2024-06-19 23:17:43,284 INFO [train.py:1028] (0/2) Epoch 5, batch 4500, loss[loss=0.3091, simple_loss=0.3102, pruned_loss=0.154, over 13226.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3266, pruned_loss=0.159, over 2586620.13 frames. ], batch size: 89, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:17:54,363 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.582e+01 2024-06-19 23:17:54,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.82 vs. limit=10.0 2024-06-19 23:18:01,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=82478.0, ans=0.0 2024-06-19 23:18:16,737 INFO [train.py:1028] (0/2) Epoch 5, batch 4550, loss[loss=0.3064, simple_loss=0.3216, pruned_loss=0.1456, over 13237.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3271, pruned_loss=0.1594, over 2588946.94 frames. ], batch size: 52, lr: 1.15e-02, grad_scale: 1.0 2024-06-19 23:18:18,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=82533.0, ans=0.125 2024-06-19 23:18:24,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=82551.33333333333, ans=0.05 2024-06-19 23:18:31,011 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.18 vs. limit=22.5 2024-06-19 23:18:39,451 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.159e+03 1.604e+03 1.872e+03 2.106e+03 4.111e+03, threshold=3.744e+03, percent-clipped=3.0 2024-06-19 23:18:40,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=82588.0, ans=0.125 2024-06-19 23:18:42,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=82606.33333333333, ans=0.0 2024-06-19 23:18:50,229 INFO [train.py:1028] (0/2) Epoch 5, batch 4600, loss[loss=0.3766, simple_loss=0.3598, pruned_loss=0.1967, over 12531.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3279, pruned_loss=0.1599, over 2585457.00 frames. ], batch size: 202, lr: 1.15e-02, grad_scale: 1.0 2024-06-19 23:18:57,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=82643.0, ans=0.125 2024-06-19 23:18:58,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82643.0, ans=0.1 2024-06-19 23:18:58,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.48 vs.
limit=15.0 2024-06-19 23:19:09,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=82679.66666666667, ans=0.0 2024-06-19 23:19:11,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=82679.66666666667, ans=0.0 2024-06-19 23:19:11,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82679.66666666667, ans=0.1 2024-06-19 23:19:17,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=82698.0, ans=0.2 2024-06-19 23:19:25,564 INFO [train.py:1028] (0/2) Epoch 5, batch 4650, loss[loss=0.3016, simple_loss=0.3055, pruned_loss=0.1488, over 13089.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3261, pruned_loss=0.1586, over 2588215.06 frames. ], batch size: 132, lr: 1.15e-02, grad_scale: 1.0 2024-06-19 23:19:30,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=82716.33333333333, ans=0.09899494936611666 2024-06-19 23:19:34,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=82734.66666666667, ans=6.0 2024-06-19 23:19:45,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=82753.0, ans=0.125 2024-06-19 23:19:50,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=82771.33333333333, ans=15.0 2024-06-19 23:19:53,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.035e+03 1.408e+03 1.621e+03 2.006e+03 4.457e+03, threshold=3.241e+03, percent-clipped=1.0 2024-06-19 23:19:54,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=82771.33333333333, ans=22.5 2024-06-19 23:19:55,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=15.0 2024-06-19 23:19:57,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=82789.66666666667, ans=0.125 2024-06-19 23:20:01,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=82789.66666666667, ans=0.0 2024-06-19 23:20:04,050 INFO [train.py:1028] (0/2) Epoch 5, batch 4700, loss[loss=0.3209, simple_loss=0.3341, pruned_loss=0.1538, over 12410.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.327, pruned_loss=0.1592, over 2583970.92 frames. ], batch size: 25, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:20:08,878 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:20:14,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.18 vs. 
limit=22.5 2024-06-19 23:20:30,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=82881.33333333333, ans=0.025 2024-06-19 23:20:30,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=82881.33333333333, ans=0.125 2024-06-19 23:20:31,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=82881.33333333333, ans=0.07 2024-06-19 23:20:33,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2024-06-19 23:20:34,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=82881.33333333333, ans=0.0 2024-06-19 23:20:37,436 INFO [train.py:1028] (0/2) Epoch 5, batch 4750, loss[loss=0.3407, simple_loss=0.3338, pruned_loss=0.1738, over 12583.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3256, pruned_loss=0.1587, over 2581171.21 frames. ], batch size: 202, lr: 1.15e-02, grad_scale: 1.0 2024-06-19 23:20:39,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=82899.66666666667, ans=0.125 2024-06-19 23:20:45,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=82918.0, ans=0.0 2024-06-19 23:20:51,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=82936.33333333333, ans=0.125 2024-06-19 23:20:52,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=82936.33333333333, ans=0.125 2024-06-19 23:20:52,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=82936.33333333333, ans=0.0 2024-06-19 23:21:00,415 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.091e+02 1.211e+03 1.524e+03 1.744e+03 5.539e+03, threshold=3.047e+03, percent-clipped=1.0 2024-06-19 23:21:10,668 INFO [train.py:1028] (0/2) Epoch 5, batch 4800, loss[loss=0.3054, simple_loss=0.3101, pruned_loss=0.1504, over 13266.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3253, pruned_loss=0.158, over 2576772.00 frames. ], batch size: 63, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:21:22,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.22 vs. limit=6.0 2024-06-19 23:21:37,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-19 23:21:45,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=15.0 2024-06-19 23:21:48,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=15.0 2024-06-19 23:21:49,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=23.27 vs. 
limit=22.5 2024-06-19 23:21:49,687 INFO [train.py:1028] (0/2) Epoch 5, batch 4850, loss[loss=0.3017, simple_loss=0.3058, pruned_loss=0.1488, over 13204.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3249, pruned_loss=0.1575, over 2574528.96 frames. ], batch size: 89, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:21:54,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=83083.0, ans=0.0 2024-06-19 23:21:55,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.95 vs. limit=5.0 2024-06-19 23:22:02,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=83119.66666666667, ans=0.1 2024-06-19 23:22:09,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=83138.0, ans=0.125 2024-06-19 23:22:09,831 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:22:13,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.833e+02 1.308e+03 1.488e+03 1.873e+03 4.798e+03, threshold=2.977e+03, percent-clipped=7.0 2024-06-19 23:22:23,160 INFO [train.py:1028] (0/2) Epoch 5, batch 4900, loss[loss=0.2752, simple_loss=0.2972, pruned_loss=0.1266, over 13255.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3258, pruned_loss=0.1582, over 2576226.51 frames. ], batch size: 59, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:22:29,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.74 vs. limit=15.0 2024-06-19 23:22:31,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0 2024-06-19 23:22:38,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=12.0 2024-06-19 23:22:38,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=83211.33333333333, ans=0.07 2024-06-19 23:22:44,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=83229.66666666667, ans=0.125 2024-06-19 23:22:48,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=83229.66666666667, ans=0.0 2024-06-19 23:22:50,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=83248.0, ans=0.125 2024-06-19 23:22:54,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=83248.0, ans=0.0 2024-06-19 23:22:55,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=83266.33333333333, ans=0.125 2024-06-19 23:22:56,141 INFO [train.py:1028] (0/2) Epoch 5, batch 4950, loss[loss=0.3641, simple_loss=0.3424, pruned_loss=0.1929, over 10976.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.325, pruned_loss=0.1584, over 2569449.10 frames. ], batch size: 303, lr: 1.15e-02, grad_scale: 1.0
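The Whitening lines compare a per-module metric against a limit that varies by module type (6.0 for whiten_keys, 22.5 for attention whitening, 12.0 or 15.0 elsewhere). One plausible definition of such a metric, offered here as an assumption rather than a statement about scaling.py, is the spread of the covariance spectrum of a module's output: E[lambda^2] / E[lambda]^2 over the eigenvalues equals 1.0 for perfectly whitened channels and grows as variance concentrates in a few directions.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels), with channels split into groups as in
    # the num_groups field of the log entries.
    n, c = x.shape
    group_size = c // num_groups
    vals = []
    for k in range(num_groups):
        xg = x[:, k * group_size:(k + 1) * group_size]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = xg.T @ xg / n                # per-group channel covariance
        eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending
        vals.append((eigs ** 2).mean() / eigs.mean() ** 2)
    return float(torch.stack(vals).mean())

print(whitening_metric(torch.randn(2000, 256)))  # close to 1.0 for white noise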
2024-06-19 23:23:05,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83284.66666666667, ans=0.1 2024-06-19 23:23:08,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=83303.0, ans=0.125 2024-06-19 23:23:10,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=83303.0, ans=0.125 2024-06-19 23:23:23,600 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.403e+02 1.084e+03 1.336e+03 1.537e+03 4.727e+03, threshold=2.672e+03, percent-clipped=2.0 2024-06-19 23:23:31,997 INFO [train.py:1028] (0/2) Epoch 5, batch 5000, loss[loss=0.3117, simple_loss=0.3133, pruned_loss=0.1551, over 13166.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.324, pruned_loss=0.1569, over 2573642.97 frames. ], batch size: 95, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:23:33,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=83358.0, ans=0.125 2024-06-19 23:23:41,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=83376.33333333333, ans=0.125 2024-06-19 23:23:52,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=83394.66666666667, ans=0.0 2024-06-19 23:23:54,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=83394.66666666667, ans=0.025 2024-06-19 23:24:00,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.46 vs. limit=15.0 2024-06-19 23:24:09,145 INFO [train.py:1028] (0/2) Epoch 5, batch 5050, loss[loss=0.3058, simple_loss=0.3251, pruned_loss=0.1433, over 12958.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3228, pruned_loss=0.1555, over 2572928.12 frames. ], batch size: 36, lr: 1.15e-02, grad_scale: 2.0 2024-06-19 23:24:16,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.57 vs. limit=15.0 2024-06-19 23:24:19,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=83468.0, ans=0.125 2024-06-19 23:24:27,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs.
limit=15.0 2024-06-19 23:24:31,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=83504.66666666667, ans=0.5 2024-06-19 23:24:32,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=83504.66666666667, ans=0.0 2024-06-19 23:24:33,158 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.560e+02 1.207e+03 1.422e+03 1.667e+03 3.146e+03, threshold=2.844e+03, percent-clipped=1.0 2024-06-19 23:24:40,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83523.0, ans=0.1 2024-06-19 23:24:42,037 INFO [train.py:1028] (0/2) Epoch 5, batch 5100, loss[loss=0.3157, simple_loss=0.3316, pruned_loss=0.1499, over 12928.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.3232, pruned_loss=0.1561, over 2569926.62 frames. ], batch size: 39, lr: 1.14e-02, grad_scale: 4.0 2024-06-19 23:24:42,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=83541.33333333333, ans=0.2 2024-06-19 23:24:49,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=83559.66666666667, ans=0.0 2024-06-19 23:24:59,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.29 vs. limit=22.5 2024-06-19 23:25:02,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=83596.33333333333, ans=0.125 2024-06-19 23:25:05,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=83596.33333333333, ans=0.125 2024-06-19 23:25:09,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0 2024-06-19 23:25:09,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=83614.66666666667, ans=0.125 2024-06-19 23:25:10,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=83614.66666666667, ans=0.05 2024-06-19 23:25:10,717 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=12.0 2024-06-19 23:25:12,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83614.66666666667, ans=0.1 2024-06-19 23:25:18,714 INFO [train.py:1028] (0/2) Epoch 5, batch 5150, loss[loss=0.3365, simple_loss=0.3326, pruned_loss=0.1702, over 13077.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.3234, pruned_loss=0.157, over 2572306.20 frames. 
], batch size: 132, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:25:22,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=83633.0, ans=0.0 2024-06-19 23:25:39,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83669.66666666667, ans=0.1 2024-06-19 23:25:39,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=83669.66666666667, ans=0.125 2024-06-19 23:25:45,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=83688.0, ans=0.0 2024-06-19 23:25:47,459 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.751e+02 1.183e+03 1.402e+03 1.679e+03 3.109e+03, threshold=2.805e+03, percent-clipped=1.0 2024-06-19 23:25:53,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.29 vs. limit=10.0 2024-06-19 23:25:54,423 INFO [train.py:1028] (0/2) Epoch 5, batch 5200, loss[loss=0.3388, simple_loss=0.3358, pruned_loss=0.1709, over 13223.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3231, pruned_loss=0.1566, over 2575801.56 frames. ], batch size: 95, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:25:57,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83724.66666666667, ans=0.1 2024-06-19 23:26:02,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83743.0, ans=0.1 2024-06-19 23:26:03,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=83743.0, ans=0.125 2024-06-19 23:26:14,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.70 vs. limit=15.0 2024-06-19 23:26:25,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=83798.0, ans=0.07 2024-06-19 23:26:27,339 INFO [train.py:1028] (0/2) Epoch 5, batch 5250, loss[loss=0.3072, simple_loss=0.322, pruned_loss=0.1463, over 13280.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.323, pruned_loss=0.1563, over 2571448.09 frames. ], batch size: 52, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:26:32,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. 
limit=6.0 2024-06-19 23:26:33,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=83834.66666666667, ans=0.125 2024-06-19 23:26:33,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=83834.66666666667, ans=0.2 2024-06-19 23:26:38,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=83834.66666666667, ans=0.0 2024-06-19 23:26:38,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=83834.66666666667, ans=0.2 2024-06-19 23:26:44,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=15.0 2024-06-19 23:26:53,979 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.357e+02 1.471e+03 1.693e+03 1.982e+03 4.407e+03, threshold=3.386e+03, percent-clipped=3.0 2024-06-19 23:27:00,606 INFO [train.py:1028] (0/2) Epoch 5, batch 5300, loss[loss=0.3457, simple_loss=0.3405, pruned_loss=0.1754, over 13003.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3227, pruned_loss=0.1561, over 2568076.05 frames. ], batch size: 144, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:27:10,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.69 vs. limit=15.0 2024-06-19 23:27:11,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83926.33333333333, ans=0.125 2024-06-19 23:27:11,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=83926.33333333333, ans=0.0 2024-06-19 23:27:22,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=83944.66666666667, ans=0.025 2024-06-19 23:27:23,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=83963.0, ans=0.0 2024-06-19 23:27:35,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=83981.33333333333, ans=0.125 2024-06-19 23:27:37,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=83981.33333333333, ans=0.0 2024-06-19 23:27:41,803 INFO [train.py:1028] (0/2) Epoch 5, batch 5350, loss[loss=0.3178, simple_loss=0.3342, pruned_loss=0.1507, over 11781.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3215, pruned_loss=0.1556, over 2574977.03 frames. 
], batch size: 17, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:27:47,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=83999.66666666667, ans=0.125 2024-06-19 23:28:03,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=84054.66666666667, ans=0.2 2024-06-19 23:28:07,983 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.023e+02 1.502e+03 1.884e+03 2.236e+03 3.309e+03, threshold=3.768e+03, percent-clipped=0.0 2024-06-19 23:28:09,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=84073.0, ans=0.025 2024-06-19 23:28:13,886 INFO [train.py:1028] (0/2) Epoch 5, batch 5400, loss[loss=0.42, simple_loss=0.3825, pruned_loss=0.2287, over 12220.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3229, pruned_loss=0.1574, over 2567777.24 frames. ], batch size: 240, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:28:19,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84091.33333333333, ans=0.125 2024-06-19 23:28:21,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=84109.66666666667, ans=0.0 2024-06-19 23:28:40,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=84164.66666666667, ans=0.04949747468305833 2024-06-19 23:28:43,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=84164.66666666667, ans=0.0 2024-06-19 23:28:45,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=24.41 vs. limit=15.0 2024-06-19 23:28:47,216 INFO [train.py:1028] (0/2) Epoch 5, batch 5450, loss[loss=0.3252, simple_loss=0.3241, pruned_loss=0.1632, over 12513.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3227, pruned_loss=0.1565, over 2572816.52 frames. ], batch size: 25, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:28:57,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=84201.33333333333, ans=0.125 2024-06-19 23:28:57,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=84201.33333333333, ans=0.0 2024-06-19 23:28:59,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84201.33333333333, ans=0.125 2024-06-19 23:29:00,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=84219.66666666667, ans=0.125 2024-06-19 23:29:00,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.10 vs. limit=15.0 2024-06-19 23:29:02,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.65 vs. 
limit=10.0 2024-06-19 23:29:09,012 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:29:12,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84238.0, ans=0.125 2024-06-19 23:29:13,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.15 vs. limit=22.5 2024-06-19 23:29:16,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=84256.33333333333, ans=0.125 2024-06-19 23:29:21,322 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.850e+02 1.426e+03 1.698e+03 2.025e+03 5.135e+03, threshold=3.397e+03, percent-clipped=4.0 2024-06-19 23:29:26,551 INFO [train.py:1028] (0/2) Epoch 5, batch 5500, loss[loss=0.3748, simple_loss=0.3498, pruned_loss=0.1998, over 12227.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3223, pruned_loss=0.1562, over 2565881.77 frames. ], batch size: 240, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:29:27,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=84274.66666666667, ans=0.0 2024-06-19 23:29:30,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=15.0 2024-06-19 23:29:33,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=84293.0, ans=0.125 2024-06-19 23:29:34,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=84293.0, ans=0.0 2024-06-19 23:29:41,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=84311.33333333333, ans=0.2 2024-06-19 23:29:42,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.47 vs. limit=15.0 2024-06-19 23:29:49,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=84329.66666666667, ans=0.125 2024-06-19 23:29:50,564 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.181e+00 2024-06-19 23:29:52,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.52 vs. limit=22.5 2024-06-19 23:29:53,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=84348.0, ans=0.125 2024-06-19 23:29:57,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=28.58 vs. limit=22.5 2024-06-19 23:29:59,557 INFO [train.py:1028] (0/2) Epoch 5, batch 5550, loss[loss=0.2923, simple_loss=0.308, pruned_loss=0.1383, over 13146.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3216, pruned_loss=0.1554, over 2568589.39 frames. ], batch size: 43, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:30:01,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0
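The scaling.py:1119 lines report, for certain attention-weight tensors, the magnitude of an auxiliary loss attached to the module (loss-sum), zero for most modules here but 6.181e+00 for encoders.2.encoder.layers.1 above. This suggests a wrapper that is an identity in the forward direction while accumulating a penalty for the trainer to fold into the total objective; the quadratic penalty below is an invented stand-in for whatever regulariser is actually applied:

import torch

class WithLoss(torch.nn.Module):
    # Pass activations through unchanged; stash an auxiliary penalty for
    # the training loop to aggregate and to report as loss-sum.
    def __init__(self, name: str, scale: float = 1e-4):
        super().__init__()
        self.name = name
        self.scale = scale
        self.aux_loss = torch.tensor(0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.aux_loss = self.scale * (x ** 2).sum()  # illustrative penalty
        return x

Under this reading the trainer adds the sum of aux_loss over all wrapped modules to the batch loss, and a logged loss-sum of 0.000e+00 simply means the penalty is inactive for that module at this point in training.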
2024-06-19 23:30:01,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=15.0 2024-06-19 23:30:05,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.69 vs. limit=15.0 2024-06-19 23:30:08,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84384.66666666667, ans=0.1 2024-06-19 23:30:17,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=84403.0, ans=0.125 2024-06-19 23:30:27,854 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.331e+02 1.136e+03 1.301e+03 1.603e+03 4.722e+03, threshold=2.601e+03, percent-clipped=3.0 2024-06-19 23:30:32,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=84458.0, ans=0.125 2024-06-19 23:30:32,997 INFO [train.py:1028] (0/2) Epoch 5, batch 5600, loss[loss=0.2792, simple_loss=0.2908, pruned_loss=0.1338, over 13251.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3212, pruned_loss=0.155, over 2571031.06 frames. ], batch size: 89, lr: 1.14e-02, grad_scale: 4.0 2024-06-19 23:30:40,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84476.33333333333, ans=0.1 2024-06-19 23:30:45,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.99 vs. limit=22.5 2024-06-19 23:30:54,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=84513.0, ans=0.125 2024-06-19 23:30:57,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=84513.0, ans=0.125 2024-06-19 23:31:07,898 INFO [train.py:1028] (0/2) Epoch 5, batch 5650, loss[loss=0.3736, simple_loss=0.3583, pruned_loss=0.1944, over 12542.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3208, pruned_loss=0.1546, over 2575805.11 frames. ], batch size: 202, lr: 1.14e-02, grad_scale: 4.0 2024-06-19 23:31:16,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=84549.66666666667, ans=0.125 2024-06-19 23:31:29,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=84586.33333333333, ans=0.025 2024-06-19 23:31:36,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=84604.66666666667, ans=0.0 2024-06-19 23:31:43,830 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.063e+02 1.196e+03 1.373e+03 1.555e+03 2.667e+03, threshold=2.746e+03, percent-clipped=1.0 2024-06-19 23:31:49,063 INFO [train.py:1028] (0/2) Epoch 5, batch 5700, loss[loss=0.2892, simple_loss=0.3041, pruned_loss=0.1371, over 13262.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.32, pruned_loss=0.1537, over 2579084.60 frames. ], batch size: 63, lr: 1.14e-02, grad_scale: 8.0 2024-06-19 23:32:01,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.95 vs.
limit=15.0 2024-06-19 23:32:05,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=84678.0, ans=0.0 2024-06-19 23:32:16,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=84714.66666666667, ans=0.1 2024-06-19 23:32:22,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=84733.0, ans=0.125 2024-06-19 23:32:22,704 INFO [train.py:1028] (0/2) Epoch 5, batch 5750, loss[loss=0.3416, simple_loss=0.3368, pruned_loss=0.1732, over 12777.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3214, pruned_loss=0.1546, over 2580901.24 frames. ], batch size: 176, lr: 1.14e-02, grad_scale: 4.0 2024-06-19 23:32:22,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=84733.0, ans=0.0 2024-06-19 23:32:33,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=84751.33333333333, ans=0.2 2024-06-19 23:32:34,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=84751.33333333333, ans=0.025 2024-06-19 23:32:35,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.82 vs. limit=15.0 2024-06-19 23:32:37,661 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:32:40,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84769.66666666667, ans=0.0 2024-06-19 23:32:52,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=84806.33333333333, ans=0.0 2024-06-19 23:32:52,851 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.243e+02 1.120e+03 1.334e+03 1.668e+03 3.505e+03, threshold=2.668e+03, percent-clipped=2.0 2024-06-19 23:32:56,026 INFO [train.py:1028] (0/2) Epoch 5, batch 5800, loss[loss=0.3786, simple_loss=0.3632, pruned_loss=0.197, over 12826.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3232, pruned_loss=0.1562, over 2579866.57 frames. ], batch size: 176, lr: 1.14e-02, grad_scale: 2.0 2024-06-19 23:33:05,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=84843.0, ans=0.0 2024-06-19 23:33:22,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=84879.66666666667, ans=0.125 2024-06-19 23:33:26,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=84879.66666666667, ans=0.125 2024-06-19 23:33:35,596 INFO [train.py:1028] (0/2) Epoch 5, batch 5850, loss[loss=0.3576, simple_loss=0.3496, pruned_loss=0.1828, over 12532.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.326, pruned_loss=0.1579, over 2578532.60 frames. ], batch size: 202, lr: 1.14e-02, grad_scale: 1.0 2024-06-19 23:33:41,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.69 vs. 
limit=15.0 2024-06-19 23:33:47,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=84934.66666666667, ans=0.0 2024-06-19 23:33:47,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=84934.66666666667, ans=0.95 2024-06-19 23:33:50,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=84953.0, ans=0.1 2024-06-19 23:33:55,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=84971.33333333333, ans=0.125 2024-06-19 23:34:01,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=84989.66666666667, ans=0.125 2024-06-19 23:34:03,123 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:34:05,953 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2024-06-19 23:34:06,278 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.258e+02 1.085e+03 1.305e+03 1.663e+03 4.351e+03, threshold=2.609e+03, percent-clipped=1.0 2024-06-19 23:34:09,189 INFO [train.py:1028] (0/2) Epoch 5, batch 5900, loss[loss=0.3159, simple_loss=0.3165, pruned_loss=0.1577, over 13068.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3288, pruned_loss=0.1594, over 2577736.70 frames. ], batch size: 121, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:34:19,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.87 vs. limit=15.0 2024-06-19 23:34:42,254 INFO [train.py:1028] (0/2) Epoch 5, batch 5950, loss[loss=0.3295, simple_loss=0.3318, pruned_loss=0.1636, over 13068.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3314, pruned_loss=0.1608, over 2582113.88 frames. ], batch size: 121, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:34:43,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.36 vs. limit=22.5 2024-06-19 23:34:47,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=85099.66666666667, ans=0.1 2024-06-19 23:34:52,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=85118.0, ans=10.0 2024-06-19 23:34:56,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=85136.33333333333, ans=0.125 2024-06-19 23:34:59,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.35 vs. 
limit=15.0 2024-06-19 23:35:02,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85154.66666666667, ans=0.1 2024-06-19 23:35:08,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=85154.66666666667, ans=0.025 2024-06-19 23:35:13,106 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.870e+02 7.865e+02 9.410e+02 1.036e+03 2.310e+03, threshold=1.882e+03, percent-clipped=0.0 2024-06-19 23:35:19,113 INFO [train.py:1028] (0/2) Epoch 5, batch 6000, loss[loss=0.4253, simple_loss=0.3948, pruned_loss=0.2279, over 12188.00 frames. ], tot_loss[loss=0.329, simple_loss=0.3336, pruned_loss=0.1622, over 2575354.32 frames. ], batch size: 240, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:35:19,114 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-19 23:35:25,923 INFO [train.py:1060] (0/2) Epoch 5, validation: loss=0.2463, simple_loss=0.2986, pruned_loss=0.09699, over 351949.00 frames. 2024-06-19 23:35:25,924 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-19 23:35:28,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=85191.33333333333, ans=0.125 2024-06-19 23:35:30,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=85191.33333333333, ans=0.125 2024-06-19 23:35:41,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=85228.0, ans=0.125 2024-06-19 23:35:47,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=85246.33333333333, ans=0.025 2024-06-19 23:35:49,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=85246.33333333333, ans=0.09899494936611666 2024-06-19 23:35:49,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=85246.33333333333, ans=0.125 2024-06-19 23:35:59,528 INFO [train.py:1028] (0/2) Epoch 5, batch 6050, loss[loss=0.3127, simple_loss=0.3252, pruned_loss=0.1501, over 12950.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3345, pruned_loss=0.1621, over 2577846.64 frames. ], batch size: 39, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:36:00,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=85283.0, ans=0.125 2024-06-19 23:36:03,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=85283.0, ans=0.0 2024-06-19 23:36:03,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85283.0, ans=0.125 2024-06-19 23:36:07,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.73 vs. limit=22.5
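Batch 6000 above triggers the periodic validation pass: gradients are disabled, the entire dev set is scored (the frame count, 351949.00, is identical at every validation), and peak GPU memory is printed afterwards. The validation loss (0.2463) sits well below the running training loss (0.329), as one would expect once augmentation and dropout are inactive in eval mode. A sketch of such a pass, with an assumed model/batch interface:

import torch

def compute_validation_loss(model, dev_loader, device) -> float:
    # Hypothetical interface: each batch carries padded features under
    # "inputs", and the model returns a frame-summed loss plus the number
    # of frames it covered.
    model.eval()
    tot_loss = 0.0
    tot_frames = 0.0
    with torch.no_grad():
        for batch in dev_loader:
            feats = batch["inputs"].to(device)
            loss, num_frames = model.score(feats)
            tot_loss += float(loss)
            tot_frames += num_frames
    model.train()
    mb = torch.cuda.max_memory_allocated(device) // (1024 ** 2)
    print(f"validation: loss={tot_loss / tot_frames:.4f}; "
          f"Maximum memory allocated so far is {mb}MB")
    return tot_loss / tot_frames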
2024-06-19 23:36:12,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=85319.66666666667, ans=15.0 2024-06-19 23:36:15,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=85319.66666666667, ans=0.125 2024-06-19 23:36:15,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=85319.66666666667, ans=0.125 2024-06-19 23:36:18,376 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:36:26,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=85356.33333333333, ans=0.125 2024-06-19 23:36:30,750 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.703e+02 7.655e+02 8.977e+02 1.084e+03 2.525e+03, threshold=1.795e+03, percent-clipped=3.0 2024-06-19 23:36:32,787 INFO [train.py:1028] (0/2) Epoch 5, batch 6100, loss[loss=0.3118, simple_loss=0.3149, pruned_loss=0.1543, over 13178.00 frames. ], tot_loss[loss=0.3304, simple_loss=0.3357, pruned_loss=0.1626, over 2579576.54 frames. ], batch size: 121, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:36:32,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=85374.66666666667, ans=0.0 2024-06-19 23:36:43,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85393.0, ans=0.1 2024-06-19 23:36:43,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.68 vs. limit=10.0 2024-06-19 23:36:44,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=85393.0, ans=0.0 2024-06-19 23:36:49,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=15.0 2024-06-19 23:37:02,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=85448.0, ans=0.2 2024-06-19 23:37:07,061 INFO [train.py:1028] (0/2) Epoch 5, batch 6150, loss[loss=0.3839, simple_loss=0.365, pruned_loss=0.2014, over 10916.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.338, pruned_loss=0.1639, over 2578864.96 frames.
], batch size: 304, lr: 1.13e-02, grad_scale: 1.0 2024-06-19 23:37:08,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=85466.33333333333, ans=0.0 2024-06-19 23:37:10,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=85466.33333333333, ans=0.0 2024-06-19 23:37:11,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=85466.33333333333, ans=0.07 2024-06-19 23:37:13,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=85484.66666666667, ans=0.125 2024-06-19 23:37:34,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=85521.33333333333, ans=0.125 2024-06-19 23:37:34,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=85521.33333333333, ans=0.025 2024-06-19 23:37:44,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=85539.66666666667, ans=0.125 2024-06-19 23:37:45,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=85539.66666666667, ans=0.5 2024-06-19 23:37:46,912 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.288e+02 1.186e+03 1.451e+03 1.754e+03 3.034e+03, threshold=2.901e+03, percent-clipped=21.0 2024-06-19 23:37:47,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=85558.0, ans=0.0 2024-06-19 23:37:47,687 INFO [train.py:1028] (0/2) Epoch 5, batch 6200, loss[loss=0.3768, simple_loss=0.376, pruned_loss=0.1888, over 13194.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3412, pruned_loss=0.1661, over 2574943.51 frames. ], batch size: 89, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:37:52,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=85558.0, ans=0.2 2024-06-19 23:37:53,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=85558.0, ans=0.07 2024-06-19 23:38:10,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85613.0, ans=0.1 2024-06-19 23:38:19,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=85631.33333333333, ans=0.0 2024-06-19 23:38:20,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=85631.33333333333, ans=0.125 2024-06-19 23:38:22,419 INFO [train.py:1028] (0/2) Epoch 5, batch 6250, loss[loss=0.3419, simple_loss=0.3465, pruned_loss=0.1687, over 13246.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3426, pruned_loss=0.1669, over 2568654.33 frames. ], batch size: 83, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:38:24,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.99 vs. 
limit=12.0 2024-06-19 23:38:38,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=85686.33333333333, ans=0.125 2024-06-19 23:38:39,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.97 vs. limit=15.0 2024-06-19 23:38:49,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.16 vs. limit=10.0 2024-06-19 23:38:54,141 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.928e+02 1.007e+03 1.251e+03 1.504e+03 2.374e+03, threshold=2.503e+03, percent-clipped=0.0 2024-06-19 23:38:54,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=85741.33333333333, ans=0.2 2024-06-19 23:38:54,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85741.33333333333, ans=0.1 2024-06-19 23:38:54,934 INFO [train.py:1028] (0/2) Epoch 5, batch 6300, loss[loss=0.3164, simple_loss=0.3215, pruned_loss=0.1557, over 11766.00 frames. ], tot_loss[loss=0.3402, simple_loss=0.3443, pruned_loss=0.168, over 2564528.75 frames. ], batch size: 17, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:39:34,570 INFO [train.py:1028] (0/2) Epoch 5, batch 6350, loss[loss=0.3993, simple_loss=0.3824, pruned_loss=0.2081, over 12579.00 frames. ], tot_loss[loss=0.3406, simple_loss=0.3458, pruned_loss=0.1677, over 2573707.55 frames. ], batch size: 202, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:39:36,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.24 vs. limit=15.0 2024-06-19 23:39:42,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=85851.33333333333, ans=0.0 2024-06-19 23:39:42,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-19 23:39:44,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=85851.33333333333, ans=0.025 2024-06-19 23:39:48,887 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.10 vs. limit=15.0 2024-06-19 23:39:51,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=85869.66666666667, ans=0.125 2024-06-19 23:39:56,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.07 vs. limit=10.0 2024-06-19 23:39:59,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.37 vs. 
limit=12.0 2024-06-19 23:40:00,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=85906.33333333333, ans=0.125 2024-06-19 23:40:07,125 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.677e+02 9.941e+02 1.140e+03 1.373e+03 2.975e+03, threshold=2.281e+03, percent-clipped=1.0 2024-06-19 23:40:07,152 INFO [train.py:1028] (0/2) Epoch 5, batch 6400, loss[loss=0.3349, simple_loss=0.3449, pruned_loss=0.1625, over 13201.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3477, pruned_loss=0.1687, over 2575736.84 frames. ], batch size: 67, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:40:09,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=85924.66666666667, ans=0.0 2024-06-19 23:40:13,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.10 vs. limit=15.0 2024-06-19 23:40:17,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85943.0, ans=0.1 2024-06-19 23:40:31,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=85979.66666666667, ans=0.0 2024-06-19 23:40:36,395 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:40:38,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=85998.0, ans=0.125 2024-06-19 23:40:38,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=86016.33333333333, ans=0.0 2024-06-19 23:40:39,239 INFO [train.py:1028] (0/2) Epoch 5, batch 6450, loss[loss=0.3972, simple_loss=0.3833, pruned_loss=0.2055, over 12562.00 frames. ], tot_loss[loss=0.3442, simple_loss=0.3494, pruned_loss=0.1695, over 2581591.43 frames. ], batch size: 203, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:40:40,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=86016.33333333333, ans=0.0 2024-06-19 23:40:55,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=86053.0, ans=0.125 2024-06-19 23:41:01,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=86071.33333333333, ans=0.125 2024-06-19 23:41:06,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86089.66666666667, ans=0.1 2024-06-19 23:41:10,820 INFO [train.py:1028] (0/2) Epoch 5, batch 6500, loss[loss=0.3785, simple_loss=0.362, pruned_loss=0.1975, over 10718.00 frames. ], tot_loss[loss=0.3464, simple_loss=0.3515, pruned_loss=0.1707, over 2585556.13 frames. 
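
The recurring WARNING [optim.py:487] lines summarize the distribution of recent gradient norms: five quantiles (min, quartiles, max), the clipping threshold in force, and the fraction of recent steps clipped. In every warning in this section the threshold equals Clipping_scale times the printed median, up to rounding (2.0 * 1.451e+03 against the logged threshold=2.901e+03 at batch 6200, where percent-clipped spikes to 21.0). The sketch below reproduces that bookkeeping; the window size, the per-step clipping, and the printout format are assumptions, and the actual optimizer may maintain its statistics differently.

```python
from collections import deque

import torch


class GradNormClipper:
    """Track recent gradient norms, clip to clipping_scale x their median,
    and report quantiles in the style of the optim.py warnings (a sketch)."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent total gradient norms
        self.num_clipped = 0
        self.num_steps = 0

    def __call__(self, params) -> float:
        params = [p for p in params if p.grad is not None]
        norm = torch.cat([p.grad.flatten() for p in params]).norm().item()
        self.norms.append(norm)
        self.num_steps += 1

        hist = torch.tensor(list(self.norms))
        quantiles = [hist.quantile(q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quantiles[2]  # scale x median, as logged

        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)

        print("grad-norm quartiles "
              + " ".join(f"{q:.3e}" for q in quantiles)
              + f", threshold={threshold:.3e}, "
              + f"percent-clipped={100.0 * self.num_clipped / self.num_steps:.1f}")
        return norm
```
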
], batch size: 303, lr: 1.13e-02, grad_scale: 4.0 2024-06-19 23:41:11,438 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.886e+02 1.068e+03 1.234e+03 1.425e+03 2.067e+03, threshold=2.467e+03, percent-clipped=0.0 2024-06-19 23:41:11,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=86108.0, ans=0.0 2024-06-19 23:41:12,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=86108.0, ans=0.2 2024-06-19 23:41:37,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=86163.0, ans=0.125 2024-06-19 23:41:40,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86163.0, ans=0.0 2024-06-19 23:41:47,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0 2024-06-19 23:41:50,406 INFO [train.py:1028] (0/2) Epoch 5, batch 6550, loss[loss=0.3456, simple_loss=0.3646, pruned_loss=0.1633, over 12636.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.3525, pruned_loss=0.1709, over 2589222.47 frames. ], batch size: 22, lr: 1.13e-02, grad_scale: 1.0 2024-06-19 23:41:57,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86218.0, ans=0.1 2024-06-19 23:41:57,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.94 vs. limit=15.0 2024-06-19 23:42:02,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=86218.0, ans=0.125 2024-06-19 23:42:10,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.31 vs. limit=15.0 2024-06-19 23:42:12,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=86254.66666666667, ans=0.2 2024-06-19 23:42:17,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=86273.0, ans=0.125 2024-06-19 23:42:19,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86273.0, ans=0.0 2024-06-19 23:42:21,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.88 vs. limit=10.0 2024-06-19 23:42:23,538 INFO [train.py:1028] (0/2) Epoch 5, batch 6600, loss[loss=0.3398, simple_loss=0.3523, pruned_loss=0.1637, over 13243.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.352, pruned_loss=0.1704, over 2591044.17 frames. ], batch size: 72, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:42:24,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.33 vs. 
limit=12.0 2024-06-19 23:42:25,650 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.611e+02 8.385e+02 1.022e+03 1.228e+03 2.669e+03, threshold=2.045e+03, percent-clipped=1.0 2024-06-19 23:42:33,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.71 vs. limit=15.0 2024-06-19 23:42:52,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.86 vs. limit=10.0 2024-06-19 23:42:54,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=86364.66666666667, ans=0.125 2024-06-19 23:42:57,419 INFO [train.py:1028] (0/2) Epoch 5, batch 6650, loss[loss=0.3665, simple_loss=0.3623, pruned_loss=0.1853, over 12961.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.3541, pruned_loss=0.1716, over 2585761.07 frames. ], batch size: 158, lr: 1.13e-02, grad_scale: 1.0 2024-06-19 23:42:59,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=86383.0, ans=0.0 2024-06-19 23:42:59,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.69 vs. limit=12.0 2024-06-19 23:43:13,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=86419.66666666667, ans=0.025 2024-06-19 23:43:21,920 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:43:38,962 INFO [train.py:1028] (0/2) Epoch 5, batch 6700, loss[loss=0.3885, simple_loss=0.382, pruned_loss=0.1975, over 12785.00 frames. ], tot_loss[loss=0.3503, simple_loss=0.3558, pruned_loss=0.1724, over 2585599.40 frames. ], batch size: 176, lr: 1.13e-02, grad_scale: 2.0 2024-06-19 23:43:41,603 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.296e+02 9.031e+02 1.069e+03 1.271e+03 2.365e+03, threshold=2.138e+03, percent-clipped=1.0 2024-06-19 23:43:42,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=86474.66666666667, ans=0.2 2024-06-19 23:43:43,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=86474.66666666667, ans=0.0 2024-06-19 23:44:00,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=86529.66666666667, ans=0.125 2024-06-19 23:44:00,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=86529.66666666667, ans=12.0 2024-06-19 23:44:13,293 INFO [train.py:1028] (0/2) Epoch 5, batch 6750, loss[loss=0.4732, simple_loss=0.4401, pruned_loss=0.2532, over 12189.00 frames. ], tot_loss[loss=0.3513, simple_loss=0.3565, pruned_loss=0.173, over 2578849.86 frames. ], batch size: 240, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:44:14,825 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.20 vs. 
limit=15.0 2024-06-19 23:44:26,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=86603.0, ans=0.125 2024-06-19 23:44:27,739 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.88 vs. limit=15.0 2024-06-19 23:44:37,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=86621.33333333333, ans=0.125 2024-06-19 23:44:40,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86639.66666666667, ans=0.1 2024-06-19 23:44:45,916 INFO [train.py:1028] (0/2) Epoch 5, batch 6800, loss[loss=0.3202, simple_loss=0.3338, pruned_loss=0.1533, over 13235.00 frames. ], tot_loss[loss=0.3528, simple_loss=0.3581, pruned_loss=0.1737, over 2580926.56 frames. ], batch size: 67, lr: 1.12e-02, grad_scale: 4.0 2024-06-19 23:44:46,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.26 vs. limit=15.0 2024-06-19 23:44:48,353 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.581e+02 9.542e+02 1.164e+03 1.366e+03 1.896e+03, threshold=2.327e+03, percent-clipped=0.0 2024-06-19 23:44:53,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2024-06-19 23:44:55,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=86676.33333333333, ans=0.125 2024-06-19 23:45:18,490 INFO [train.py:1028] (0/2) Epoch 5, batch 6850, loss[loss=0.4244, simple_loss=0.428, pruned_loss=0.2104, over 13252.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3583, pruned_loss=0.1732, over 2584819.05 frames. ], batch size: 63, lr: 1.12e-02, grad_scale: 4.0 2024-06-19 23:45:18,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=86749.66666666667, ans=0.025 2024-06-19 23:45:21,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=86749.66666666667, ans=0.125 2024-06-19 23:45:25,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.62 vs. limit=22.5 2024-06-19 23:45:32,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=86786.33333333333, ans=0.0 2024-06-19 23:45:34,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=86786.33333333333, ans=0.2 2024-06-19 23:45:39,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=86786.33333333333, ans=0.0 2024-06-19 23:45:48,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=12.0 2024-06-19 23:45:57,471 INFO [train.py:1028] (0/2) Epoch 5, batch 6900, loss[loss=0.3835, simple_loss=0.4011, pruned_loss=0.183, over 13011.00 frames. ], tot_loss[loss=0.3535, simple_loss=0.3596, pruned_loss=0.1737, over 2586472.83 frames. 
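
Most lines in this section are ScheduledFloat prints: regularization hyperparameters (dropout probabilities, skip rates, balancer bounds, whitening limits) are functions of the number of batches seen, and each print records the value (ans) in force at the current batch_count. A generic piecewise-linear schedule reproduces the behavior; the class below and its breakpoints are illustrative, not the repo's ScheduledFloat implementation.

```python
import bisect


class ScheduledFloat:
    """Piecewise-linear schedule over batch_count (an illustrative stand-in).

    Built from (batch_count, value) breakpoints; linearly interpolated
    between them and clamped outside them.
    """

    def __init__(self, *points: tuple[float, float]):
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        i = bisect.bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)


# Assumed breakpoints: a dropout that anneals from 0.3 to 0.1 by batch 20000
# and then holds; consistent with ans=0.1 at batch_count ~ 85k above.
dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(85154.67))  # -> 0.1
```
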
], batch size: 48, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:46:02,031 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 9.084e+02 1.101e+03 1.310e+03 2.113e+03, threshold=2.202e+03, percent-clipped=0.0 2024-06-19 23:46:12,367 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=4.240e+00 2024-06-19 23:46:22,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=86896.33333333333, ans=10.0 2024-06-19 23:46:31,037 INFO [train.py:1028] (0/2) Epoch 5, batch 6950, loss[loss=0.2997, simple_loss=0.3158, pruned_loss=0.1418, over 11733.00 frames. ], tot_loss[loss=0.3537, simple_loss=0.3601, pruned_loss=0.1736, over 2580481.83 frames. ], batch size: 17, lr: 1.12e-02, grad_scale: 1.0 2024-06-19 23:46:31,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=15.0 2024-06-19 23:46:39,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=86951.33333333333, ans=0.2 2024-06-19 23:46:42,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=86951.33333333333, ans=0.0 2024-06-19 23:46:51,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=86988.0, ans=0.0 2024-06-19 23:46:55,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2024-06-19 23:46:58,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=87006.33333333333, ans=0.2 2024-06-19 23:47:00,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=87006.33333333333, ans=0.125 2024-06-19 23:47:00,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87006.33333333333, ans=0.1 2024-06-19 23:47:03,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=87006.33333333333, ans=0.125 2024-06-19 23:47:04,300 INFO [train.py:1028] (0/2) Epoch 5, batch 7000, loss[loss=0.3526, simple_loss=0.3545, pruned_loss=0.1753, over 12930.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3597, pruned_loss=0.1733, over 2576659.85 frames. ], batch size: 158, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:47:07,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=87024.66666666667, ans=0.025 2024-06-19 23:47:08,667 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.456e+02 9.861e+02 1.162e+03 1.346e+03 3.973e+03, threshold=2.324e+03, percent-clipped=6.0 2024-06-19 23:47:09,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.31 vs. 
limit=10.0 2024-06-19 23:47:18,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=87061.33333333333, ans=0.0 2024-06-19 23:47:23,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=87061.33333333333, ans=0.07 2024-06-19 23:47:30,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=87079.66666666667, ans=0.025 2024-06-19 23:47:30,868 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.585e-02 2024-06-19 23:47:32,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=87098.0, ans=0.125 2024-06-19 23:47:36,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=87098.0, ans=0.0 2024-06-19 23:47:42,122 INFO [train.py:1028] (0/2) Epoch 5, batch 7050, loss[loss=0.3902, simple_loss=0.3872, pruned_loss=0.1966, over 12836.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.3611, pruned_loss=0.1738, over 2583704.47 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:48:01,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=87153.0, ans=0.05 2024-06-19 23:48:01,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=87153.0, ans=0.0 2024-06-19 23:48:01,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=87153.0, ans=0.0 2024-06-19 23:48:02,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=87153.0, ans=0.125 2024-06-19 23:48:02,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=87153.0, ans=0.125 2024-06-19 23:48:12,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.55 vs. limit=22.5 2024-06-19 23:48:13,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87189.66666666667, ans=0.1 2024-06-19 23:48:17,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=87189.66666666667, ans=0.0 2024-06-19 23:48:20,056 INFO [train.py:1028] (0/2) Epoch 5, batch 7100, loss[loss=0.392, simple_loss=0.3969, pruned_loss=0.1935, over 13204.00 frames. ], tot_loss[loss=0.3548, simple_loss=0.3612, pruned_loss=0.1742, over 2576906.06 frames. ], batch size: 112, lr: 1.12e-02, grad_scale: 4.0 2024-06-19 23:48:24,788 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 8.838e+02 1.024e+03 1.247e+03 3.216e+03, threshold=2.049e+03, percent-clipped=1.0 2024-06-19 23:48:29,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=87226.33333333333, ans=10.0 2024-06-19 23:48:29,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.47 vs. 
limit=12.0 2024-06-19 23:48:33,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.22 vs. limit=10.0 2024-06-19 23:48:40,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=87263.0, ans=0.125 2024-06-19 23:48:47,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=18.29 vs. limit=15.0 2024-06-19 23:48:48,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=87281.33333333333, ans=0.125 2024-06-19 23:48:53,275 INFO [train.py:1028] (0/2) Epoch 5, batch 7150, loss[loss=0.4328, simple_loss=0.4119, pruned_loss=0.2268, over 12562.00 frames. ], tot_loss[loss=0.3556, simple_loss=0.3625, pruned_loss=0.1744, over 2574401.67 frames. ], batch size: 202, lr: 1.12e-02, grad_scale: 4.0 2024-06-19 23:49:00,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=87318.0, ans=0.125 2024-06-19 23:49:03,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.11 vs. limit=22.5 2024-06-19 23:49:25,859 INFO [train.py:1028] (0/2) Epoch 5, batch 7200, loss[loss=0.3744, simple_loss=0.3799, pruned_loss=0.1844, over 13192.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.3642, pruned_loss=0.1746, over 2579205.72 frames. ], batch size: 112, lr: 1.12e-02, grad_scale: 8.0 2024-06-19 23:49:29,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87391.33333333333, ans=0.1 2024-06-19 23:49:30,390 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.863e+02 8.283e+02 9.668e+02 1.107e+03 1.833e+03, threshold=1.934e+03, percent-clipped=0.0 2024-06-19 23:49:30,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=87391.33333333333, ans=0.125 2024-06-19 23:49:34,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=15.0 2024-06-19 23:49:37,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=87409.66666666667, ans=0.0 2024-06-19 23:49:40,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=87428.0, ans=0.0 2024-06-19 23:49:43,909 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:49:44,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.48 vs. 
limit=12.0 2024-06-19 23:49:45,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=87446.33333333333, ans=0.125 2024-06-19 23:49:46,426 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:49:46,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=87446.33333333333, ans=0.0 2024-06-19 23:50:00,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.50 vs. limit=22.5 2024-06-19 23:50:04,607 INFO [train.py:1028] (0/2) Epoch 5, batch 7250, loss[loss=0.334, simple_loss=0.3578, pruned_loss=0.1551, over 12981.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.3644, pruned_loss=0.1741, over 2579531.93 frames. ], batch size: 36, lr: 1.12e-02, grad_scale: 4.0 2024-06-19 23:50:10,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=87483.0, ans=0.125 2024-06-19 23:50:11,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.92 vs. limit=6.0 2024-06-19 23:50:11,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87501.33333333333, ans=0.1 2024-06-19 23:50:17,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=87519.66666666667, ans=0.125 2024-06-19 23:50:18,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2024-06-19 23:50:19,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=87519.66666666667, ans=0.125 2024-06-19 23:50:21,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=87519.66666666667, ans=0.0 2024-06-19 23:50:27,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=87538.0, ans=0.0 2024-06-19 23:50:31,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=87556.33333333333, ans=0.125 2024-06-19 23:50:31,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=87556.33333333333, ans=0.0 2024-06-19 23:50:36,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=87556.33333333333, ans=0.125 2024-06-19 23:50:37,406 INFO [train.py:1028] (0/2) Epoch 5, batch 7300, loss[loss=0.3463, simple_loss=0.3608, pruned_loss=0.1659, over 12987.00 frames. ], tot_loss[loss=0.358, simple_loss=0.3659, pruned_loss=0.175, over 2578928.01 frames. 
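
The Whitening lines compare a per-module anisotropy statistic of the activation covariance against a scheduled limit, and this stretch of the log shows both comfortable margins (5.92 vs. limit=6.0 on the attention keys) and clear violations (26.50 vs. limit=22.5). What follows is one plausible reading of the metric, stated as an assumption rather than taken from scaling.py: the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue, which is 1.0 for perfectly white features and grows as variance concentrates in a few directions.

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Anisotropy of the covariance of x, shape (frames, channels):
    mean(eig^2) / mean(eig)^2 per channel group, averaged over groups.
    1.0 means a covariance proportional to the identity (fully white);
    large values mean a few directions dominate (an assumed definition)."""
    frames, channels = x.shape
    group = channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group:(g + 1) * group]
        xg = xg - xg.mean(dim=0)
        cov = (xg.T @ xg) / frames
        eig = torch.linalg.eigvalsh(cov)
        metrics.append((eig ** 2).mean() / eig.mean() ** 2)
    return torch.stack(metrics).mean().item()


x = torch.randn(2000, 192)   # near-white activations
print(whitening_metric(x))   # close to 1
x[:, 0] *= 30.0              # one dominant direction
print(whitening_metric(x))   # far above a limit like 15.0
```
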
], batch size: 36, lr: 1.12e-02, grad_scale: 4.0 2024-06-19 23:50:43,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=87574.66666666667, ans=0.2 2024-06-19 23:50:44,946 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.496e+02 8.648e+02 1.055e+03 1.180e+03 2.542e+03, threshold=2.110e+03, percent-clipped=4.0 2024-06-19 23:50:45,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=87593.0, ans=0.125 2024-06-19 23:51:07,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=87648.0, ans=0.125 2024-06-19 23:51:08,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-19 23:51:11,004 INFO [train.py:1028] (0/2) Epoch 5, batch 7350, loss[loss=0.3816, simple_loss=0.3913, pruned_loss=0.186, over 13277.00 frames. ], tot_loss[loss=0.3581, simple_loss=0.3664, pruned_loss=0.175, over 2579708.27 frames. ], batch size: 46, lr: 1.12e-02, grad_scale: 1.0 2024-06-19 23:51:14,401 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=4.099e+00 2024-06-19 23:51:32,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87721.33333333333, ans=0.1 2024-06-19 23:51:35,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.90 vs. limit=10.0 2024-06-19 23:51:44,403 INFO [train.py:1028] (0/2) Epoch 5, batch 7400, loss[loss=0.3578, simple_loss=0.3724, pruned_loss=0.1716, over 13254.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.3667, pruned_loss=0.1752, over 2586228.07 frames. ], batch size: 63, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:51:57,579 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.130e+02 9.127e+02 1.084e+03 1.229e+03 2.601e+03, threshold=2.169e+03, percent-clipped=2.0 2024-06-19 23:52:05,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=15.0 2024-06-19 23:52:06,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.07 vs. limit=22.5 2024-06-19 23:52:11,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=87794.66666666667, ans=0.0 2024-06-19 23:52:27,091 INFO [train.py:1028] (0/2) Epoch 5, batch 7450, loss[loss=0.3228, simple_loss=0.3427, pruned_loss=0.1515, over 12649.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3658, pruned_loss=0.1743, over 2580041.11 frames. ], batch size: 29, lr: 1.12e-02, grad_scale: 1.0 2024-06-19 23:52:31,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2024-06-19 23:52:34,656 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. 
limit=15.0 2024-06-19 23:52:55,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=87923.0, ans=0.2 2024-06-19 23:53:00,898 INFO [train.py:1028] (0/2) Epoch 5, batch 7500, loss[loss=0.3641, simple_loss=0.3539, pruned_loss=0.1872, over 10607.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.3673, pruned_loss=0.1756, over 2577808.98 frames. ], batch size: 303, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:53:04,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=87941.33333333333, ans=0.125 2024-06-19 23:53:09,203 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.630e+02 1.095e+03 1.381e+03 1.649e+03 5.842e+03, threshold=2.762e+03, percent-clipped=6.0 2024-06-19 23:53:10,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=87959.66666666667, ans=0.0 2024-06-19 23:53:12,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=87959.66666666667, ans=0.025 2024-06-19 23:53:15,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=87978.0, ans=0.2 2024-06-19 23:53:15,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=87978.0, ans=0.125 2024-06-19 23:53:18,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87978.0, ans=0.1 2024-06-19 23:53:21,840 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-48000.pt 2024-06-19 23:53:29,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=87996.33333333333, ans=0.0 2024-06-19 23:53:39,328 INFO [train.py:1028] (0/2) Epoch 5, batch 7550, loss[loss=0.3643, simple_loss=0.3595, pruned_loss=0.1845, over 12901.00 frames. ], tot_loss[loss=0.3612, simple_loss=0.3686, pruned_loss=0.1769, over 2577216.42 frames. ], batch size: 158, lr: 1.12e-02, grad_scale: 1.0 2024-06-19 23:53:41,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=88033.0, ans=0.125 2024-06-19 23:53:59,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=88069.66666666667, ans=0.0 2024-06-19 23:54:00,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=88069.66666666667, ans=0.125 2024-06-19 23:54:02,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0 2024-06-19 23:54:03,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.02 vs. limit=6.0 2024-06-19 23:54:11,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-19 23:54:18,931 INFO [train.py:1028] (0/2) Epoch 5, batch 7600, loss[loss=0.3814, simple_loss=0.3813, pruned_loss=0.1907, over 13230.00 frames. 
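
The lone [checkpoint.py:75] line above writes a batch-indexed checkpoint (zipformer/exp/checkpoint-48000.pt) in the middle of an epoch, so training can resume from a batch count rather than an epoch boundary. A minimal sketch of that pattern; only the path format is taken from the log, while the state-dict keys and the interval parameter are assumptions.

```python
from pathlib import Path

import torch


def maybe_save_checkpoint(model, optimizer, scaler,
                          batch_idx_train: int, exp_dir: Path,
                          save_every_n: int) -> None:
    """Every save_every_n batches, write exp_dir/checkpoint-<batch>.pt
    (the field names below are assumed, not read from checkpoint.py)."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "grad_scaler": scaler.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    torch.save(state, exp_dir / f"checkpoint-{batch_idx_train}.pt")
```
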
], tot_loss[loss=0.3626, simple_loss=0.3697, pruned_loss=0.1778, over 2577954.49 frames. ], batch size: 83, lr: 1.12e-02, grad_scale: 2.0 2024-06-19 23:54:22,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88124.66666666667, ans=0.1 2024-06-19 23:54:24,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=88124.66666666667, ans=0.0 2024-06-19 23:54:26,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=88143.0, ans=0.125 2024-06-19 23:54:27,858 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.439e+02 1.498e+03 1.714e+03 2.001e+03 4.296e+03, threshold=3.428e+03, percent-clipped=5.0 2024-06-19 23:54:33,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.02 vs. limit=22.5 2024-06-19 23:54:33,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.14 vs. limit=15.0 2024-06-19 23:54:34,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88161.33333333333, ans=0.1 2024-06-19 23:54:38,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.62 vs. limit=12.0 2024-06-19 23:54:48,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.56 vs. limit=15.0 2024-06-19 23:54:52,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.75 vs. limit=15.0 2024-06-19 23:54:52,737 INFO [train.py:1028] (0/2) Epoch 5, batch 7650, loss[loss=0.3726, simple_loss=0.3756, pruned_loss=0.1848, over 12953.00 frames. ], tot_loss[loss=0.3633, simple_loss=0.3702, pruned_loss=0.1781, over 2572942.30 frames. ], batch size: 33, lr: 1.11e-02, grad_scale: 2.0 2024-06-19 23:54:54,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=88216.33333333333, ans=0.125 2024-06-19 23:54:58,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=88216.33333333333, ans=0.125 2024-06-19 23:55:00,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88234.66666666667, ans=0.125 2024-06-19 23:55:14,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=88271.33333333333, ans=0.125 2024-06-19 23:55:26,096 INFO [train.py:1028] (0/2) Epoch 5, batch 7700, loss[loss=0.3929, simple_loss=0.3996, pruned_loss=0.1931, over 13267.00 frames. ], tot_loss[loss=0.3636, simple_loss=0.3707, pruned_loss=0.1782, over 2569136.31 frames. 
], batch size: 63, lr: 1.11e-02, grad_scale: 4.0 2024-06-19 23:55:28,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=88308.0, ans=0.2 2024-06-19 23:55:33,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=88326.33333333333, ans=0.2 2024-06-19 23:55:34,152 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.572e+02 1.418e+03 1.625e+03 2.054e+03 5.058e+03, threshold=3.250e+03, percent-clipped=2.0 2024-06-19 23:55:34,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=88326.33333333333, ans=0.025 2024-06-19 23:55:35,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.47 vs. limit=15.0 2024-06-19 23:55:39,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=88344.66666666667, ans=0.125 2024-06-19 23:55:57,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=88381.33333333333, ans=0.125 2024-06-19 23:56:02,951 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.206e+01 2024-06-19 23:56:03,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=88381.33333333333, ans=0.125 2024-06-19 23:56:03,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=88381.33333333333, ans=0.0 2024-06-19 23:56:04,743 INFO [train.py:1028] (0/2) Epoch 5, batch 7750, loss[loss=0.3678, simple_loss=0.3845, pruned_loss=0.1756, over 13041.00 frames. ], tot_loss[loss=0.3656, simple_loss=0.3721, pruned_loss=0.1796, over 2572943.64 frames. ], batch size: 71, lr: 1.11e-02, grad_scale: 2.0 2024-06-19 23:56:05,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=88399.66666666667, ans=0.125 2024-06-19 23:56:22,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=88436.33333333333, ans=0.125 2024-06-19 23:56:29,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.35 vs. limit=22.5 2024-06-19 23:56:37,872 INFO [train.py:1028] (0/2) Epoch 5, batch 7800, loss[loss=0.3586, simple_loss=0.3661, pruned_loss=0.1756, over 13194.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3725, pruned_loss=0.1797, over 2577872.48 frames. ], batch size: 95, lr: 1.11e-02, grad_scale: 4.0 2024-06-19 23:56:42,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.42 vs. 
limit=15.0 2024-06-19 23:56:46,665 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.082e+03 1.940e+03 2.273e+03 2.600e+03 4.681e+03, threshold=4.546e+03, percent-clipped=5.0 2024-06-19 23:56:51,883 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-19 23:56:58,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=88546.33333333333, ans=0.125 2024-06-19 23:57:02,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.85 vs. limit=15.0 2024-06-19 23:57:03,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88546.33333333333, ans=0.125 2024-06-19 23:57:11,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2024-06-19 23:57:11,336 INFO [train.py:1028] (0/2) Epoch 5, batch 7850, loss[loss=0.3709, simple_loss=0.3791, pruned_loss=0.1814, over 12400.00 frames. ], tot_loss[loss=0.3679, simple_loss=0.3743, pruned_loss=0.1808, over 2573026.38 frames. ], batch size: 19, lr: 1.11e-02, grad_scale: 1.0 2024-06-19 23:57:19,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=88601.33333333333, ans=0.0 2024-06-19 23:57:25,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=88619.66666666667, ans=0.125 2024-06-19 23:57:32,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=88638.0, ans=0.125 2024-06-19 23:57:33,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=88638.0, ans=0.125 2024-06-19 23:57:35,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.34 vs. limit=22.5 2024-06-19 23:57:44,063 INFO [train.py:1028] (0/2) Epoch 5, batch 7900, loss[loss=0.3771, simple_loss=0.3831, pruned_loss=0.1856, over 13200.00 frames. ], tot_loss[loss=0.3694, simple_loss=0.3754, pruned_loss=0.1817, over 2571055.72 frames. ], batch size: 77, lr: 1.11e-02, grad_scale: 2.0 2024-06-19 23:57:54,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=88693.0, ans=10.0 2024-06-19 23:58:00,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=88693.0, ans=0.0 2024-06-19 23:58:01,123 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.112e+03 2.322e+03 2.611e+03 2.986e+03 5.679e+03, threshold=5.223e+03, percent-clipped=0.0 2024-06-19 23:58:01,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=88693.0, ans=0.0 2024-06-19 23:58:08,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=88711.33333333333, ans=0.2 2024-06-19 23:58:14,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.78 vs. 
limit=22.5 2024-06-19 23:58:21,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=88748.0, ans=0.125 2024-06-19 23:58:23,505 INFO [train.py:1028] (0/2) Epoch 5, batch 7950, loss[loss=0.399, simple_loss=0.3823, pruned_loss=0.2079, over 10578.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.3764, pruned_loss=0.1826, over 2574029.30 frames. ], batch size: 303, lr: 1.11e-02, grad_scale: 0.5 2024-06-19 23:58:24,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=88766.33333333333, ans=0.125 2024-06-19 23:58:25,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=88766.33333333333, ans=0.2 2024-06-19 23:58:29,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=16.12 vs. limit=15.0 2024-06-19 23:58:30,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=88784.66666666667, ans=0.0 2024-06-19 23:58:30,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=88784.66666666667, ans=0.0 2024-06-19 23:58:31,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=88784.66666666667, ans=0.04949747468305833 2024-06-19 23:58:34,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=88784.66666666667, ans=0.04949747468305833 2024-06-19 23:58:56,962 INFO [train.py:1028] (0/2) Epoch 5, batch 8000, loss[loss=0.3162, simple_loss=0.3348, pruned_loss=0.1488, over 12531.00 frames. ], tot_loss[loss=0.3714, simple_loss=0.377, pruned_loss=0.1829, over 2571481.93 frames. ], batch size: 29, lr: 1.11e-02, grad_scale: 1.0 2024-06-19 23:58:59,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=88858.0, ans=0.0 2024-06-19 23:59:07,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=88876.33333333333, ans=0.0 2024-06-19 23:59:08,650 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.385e+03 2.201e+03 2.738e+03 3.249e+03 1.176e+04, threshold=5.475e+03, percent-clipped=4.0 2024-06-19 23:59:12,739 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=12.0 2024-06-19 23:59:18,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88913.0, ans=0.1 2024-06-19 23:59:24,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=88931.33333333333, ans=0.125 2024-06-19 23:59:28,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.21 vs. limit=22.5 2024-06-19 23:59:31,150 INFO [train.py:1028] (0/2) Epoch 5, batch 8050, loss[loss=0.3795, simple_loss=0.3853, pruned_loss=0.1868, over 13159.00 frames. ], tot_loss[loss=0.3711, simple_loss=0.3767, pruned_loss=0.1828, over 2570566.59 frames. 
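
The grad_scale field in each batch summary is the dynamic loss-scaling factor for mixed-precision training: it is halved whenever scaled gradients overflow (it has fallen to 0.5 at batch 7950 above) and doubled again after a stretch of stable steps (it reached 8.0 back at batch 7200). The standard PyTorch mechanism looks like the generic AMP loop below; this is not the repo's train step, just the stock GradScaler behavior that produces this kind of trace.

```python
import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(init_scale=1.0)

for step in range(100):
    x = torch.randn(16, 80, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # backward through the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # halves the scale on overflow, grows it later
    print(f"grad_scale: {scaler.get_scale()}")
```
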
], batch size: 83, lr: 1.11e-02, grad_scale: 1.0 2024-06-19 23:59:40,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=88968.0, ans=0.0 2024-06-19 23:59:52,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.75 vs. limit=22.5 2024-06-19 23:59:58,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=89004.66666666667, ans=0.0 2024-06-20 00:00:08,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.42 vs. limit=15.0 2024-06-20 00:00:09,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=89023.0, ans=0.0 2024-06-20 00:00:10,539 INFO [train.py:1028] (0/2) Epoch 5, batch 8100, loss[loss=0.3669, simple_loss=0.3711, pruned_loss=0.1813, over 13141.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.3766, pruned_loss=0.1826, over 2574989.62 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 2.0 2024-06-20 00:00:19,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=89059.66666666667, ans=0.1 2024-06-20 00:00:23,184 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.036e+03 2.085e+03 2.552e+03 2.957e+03 4.737e+03, threshold=5.105e+03, percent-clipped=0.0 2024-06-20 00:00:24,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=89078.0, ans=0.125 2024-06-20 00:00:27,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=89078.0, ans=0.2 2024-06-20 00:00:29,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=89078.0, ans=0.125 2024-06-20 00:00:30,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.11 vs. limit=15.0 2024-06-20 00:00:33,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=89096.33333333333, ans=0.0 2024-06-20 00:00:42,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=89114.66666666667, ans=0.0 2024-06-20 00:00:44,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=89114.66666666667, ans=10.0 2024-06-20 00:00:45,935 INFO [train.py:1028] (0/2) Epoch 5, batch 8150, loss[loss=0.3653, simple_loss=0.366, pruned_loss=0.1823, over 13077.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.3773, pruned_loss=0.1822, over 2579153.30 frames. ], batch size: 121, lr: 1.11e-02, grad_scale: 1.0 2024-06-20 00:00:46,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=89133.0, ans=0.125 2024-06-20 00:00:53,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.22 vs. 
limit=6.0 2024-06-20 00:00:58,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=89169.66666666667, ans=0.1 2024-06-20 00:01:05,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=89188.0, ans=0.2 2024-06-20 00:01:11,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=89188.0, ans=0.2 2024-06-20 00:01:14,208 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2024-06-20 00:01:18,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=89224.66666666667, ans=0.5 2024-06-20 00:01:19,340 INFO [train.py:1028] (0/2) Epoch 5, batch 8200, loss[loss=0.3697, simple_loss=0.3783, pruned_loss=0.1805, over 13133.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.3775, pruned_loss=0.182, over 2582366.85 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 2.0 2024-06-20 00:01:23,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=89224.66666666667, ans=0.2 2024-06-20 00:01:31,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=89243.0, ans=0.2 2024-06-20 00:01:32,581 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.148e+03 1.594e+03 1.937e+03 2.171e+03 4.125e+03, threshold=3.874e+03, percent-clipped=0.0 2024-06-20 00:01:49,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.41 vs. limit=22.5 2024-06-20 00:01:53,267 INFO [train.py:1028] (0/2) Epoch 5, batch 8250, loss[loss=0.3667, simple_loss=0.378, pruned_loss=0.1777, over 13243.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.3772, pruned_loss=0.1817, over 2583782.64 frames. ], batch size: 52, lr: 1.11e-02, grad_scale: 2.0 2024-06-20 00:02:01,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=89316.33333333333, ans=0.125 2024-06-20 00:02:02,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=89316.33333333333, ans=15.0 2024-06-20 00:02:10,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=89353.0, ans=0.125 2024-06-20 00:02:16,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=89371.33333333333, ans=15.0 2024-06-20 00:02:24,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=89389.66666666667, ans=0.0 2024-06-20 00:02:29,107 INFO [train.py:1028] (0/2) Epoch 5, batch 8300, loss[loss=0.3789, simple_loss=0.378, pruned_loss=0.1899, over 13043.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.3761, pruned_loss=0.1809, over 2579990.01 frames. 
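
Many scheduled names end in balancer*.min_positive, balancer*.max_abs, or balancer*.min_abs: per-channel constraints on activation statistics (fraction of positive values, mean magnitude), with the balancer*.prob values (typically 0.125 here) setting how often they are enforced. The sketch below covers only the measurement side, with bounds echoing values visible in the log (min_positive=0.025, max_abs=10.0); the corrective-gradient machinery is omitted and the function is invented for illustration.

```python
import torch


def balancer_violations(x: torch.Tensor,
                        min_positive: float = 0.025,
                        max_abs: float = 10.0) -> dict:
    """Count channels of x (frames, channels) violating balancer-style bounds."""
    frac_positive = (x > 0).float().mean(dim=0)  # fraction of positive values
    mean_abs = x.abs().mean(dim=0)               # mean magnitude per channel
    return {
        "dead_channels": int((frac_positive < min_positive).sum()),
        "oversized_channels": int((mean_abs > max_abs).sum()),
    }


x = torch.randn(1000, 256)
x[:, 0] = -x[:, 0].abs()       # a channel that is never positive
x[:, 1] *= 50.0                # a channel with oversized magnitude
print(balancer_violations(x))  # {'dead_channels': 1, 'oversized_channels': 1}
```
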
], batch size: 102, lr: 1.11e-02, grad_scale: 2.0 2024-06-20 00:02:29,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=89408.0, ans=0.125 2024-06-20 00:02:30,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2024-06-20 00:02:33,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=89408.0, ans=0.2 2024-06-20 00:02:36,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=89426.33333333333, ans=0.0 2024-06-20 00:02:36,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=89426.33333333333, ans=0.0 2024-06-20 00:02:41,980 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.296e+02 1.441e+03 1.644e+03 1.921e+03 4.770e+03, threshold=3.288e+03, percent-clipped=2.0 2024-06-20 00:02:49,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=89463.0, ans=0.125 2024-06-20 00:02:52,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=89463.0, ans=0.0 2024-06-20 00:02:57,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=89481.33333333333, ans=0.0 2024-06-20 00:02:58,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=89481.33333333333, ans=0.125 2024-06-20 00:03:02,454 INFO [train.py:1028] (0/2) Epoch 5, batch 8350, loss[loss=0.3725, simple_loss=0.3755, pruned_loss=0.1848, over 13175.00 frames. ], tot_loss[loss=0.368, simple_loss=0.3758, pruned_loss=0.1801, over 2581362.04 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 2.0 2024-06-20 00:03:10,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89518.0, ans=0.1 2024-06-20 00:03:10,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=89518.0, ans=0.125 2024-06-20 00:03:11,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.49 vs. limit=15.0 2024-06-20 00:03:35,962 INFO [train.py:1028] (0/2) Epoch 5, batch 8400, loss[loss=0.3563, simple_loss=0.3677, pruned_loss=0.1724, over 12988.00 frames. ], tot_loss[loss=0.3679, simple_loss=0.3755, pruned_loss=0.1802, over 2576125.44 frames. ], batch size: 39, lr: 1.11e-02, grad_scale: 4.0 2024-06-20 00:03:36,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=89591.33333333333, ans=0.125 2024-06-20 00:03:42,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=89609.66666666667, ans=0.0 2024-06-20 00:03:45,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.00 vs. 
limit=6.0 2024-06-20 00:03:53,171 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.857e+02 1.754e+03 2.142e+03 2.602e+03 5.238e+03, threshold=4.285e+03, percent-clipped=5.0 2024-06-20 00:04:02,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=89646.33333333333, ans=0.025 2024-06-20 00:04:07,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.84 vs. limit=10.0 2024-06-20 00:04:11,527 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.85 vs. limit=22.5 2024-06-20 00:04:14,916 INFO [train.py:1028] (0/2) Epoch 5, batch 8450, loss[loss=0.3764, simple_loss=0.3808, pruned_loss=0.186, over 13144.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.3773, pruned_loss=0.1815, over 2577789.62 frames. ], batch size: 112, lr: 1.11e-02, grad_scale: 1.0 2024-06-20 00:04:28,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=89719.66666666667, ans=0.035 2024-06-20 00:04:29,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=89719.66666666667, ans=0.125 2024-06-20 00:04:33,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89738.0, ans=0.1 2024-06-20 00:04:35,645 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.472e-01 2024-06-20 00:04:36,246 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:04:47,643 INFO [train.py:1028] (0/2) Epoch 5, batch 8500, loss[loss=0.3426, simple_loss=0.3506, pruned_loss=0.1673, over 12626.00 frames. ], tot_loss[loss=0.3717, simple_loss=0.3786, pruned_loss=0.1824, over 2577553.96 frames. ], batch size: 29, lr: 1.11e-02, grad_scale: 2.0 2024-06-20 00:04:52,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=89774.66666666667, ans=0.025 2024-06-20 00:04:58,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2024-06-20 00:05:02,445 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.611e+02 1.623e+03 1.909e+03 2.315e+03 3.191e+03, threshold=3.818e+03, percent-clipped=0.0 2024-06-20 00:05:02,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=89811.33333333333, ans=0.125 2024-06-20 00:05:03,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=89811.33333333333, ans=0.0 2024-06-20 00:05:12,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=89829.66666666667, ans=0.2 2024-06-20 00:05:21,158 INFO [train.py:1028] (0/2) Epoch 5, batch 8550, loss[loss=0.3998, simple_loss=0.4076, pruned_loss=0.1961, over 12490.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.3778, pruned_loss=0.1812, over 2577625.22 frames. 
], batch size: 22, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:05:21,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=89866.33333333333, ans=0.0 2024-06-20 00:05:23,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=89866.33333333333, ans=0.125 2024-06-20 00:05:24,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.01 vs. limit=15.0 2024-06-20 00:05:24,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=89866.33333333333, ans=0.0 2024-06-20 00:05:26,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.58 vs. limit=10.0 2024-06-20 00:05:42,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=89921.33333333333, ans=0.125 2024-06-20 00:05:58,209 INFO [train.py:1028] (0/2) Epoch 5, batch 8600, loss[loss=0.3597, simple_loss=0.3659, pruned_loss=0.1767, over 13130.00 frames. ], tot_loss[loss=0.3702, simple_loss=0.3777, pruned_loss=0.1813, over 2575303.75 frames. ], batch size: 112, lr: 1.10e-02, grad_scale: 4.0 2024-06-20 00:05:59,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=89958.0, ans=0.1 2024-06-20 00:06:05,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=89976.33333333333, ans=0.125 2024-06-20 00:06:06,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=89976.33333333333, ans=0.2 2024-06-20 00:06:11,722 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.426e+03 2024-06-20 00:06:13,074 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=6.102e+01 2024-06-20 00:06:14,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.76 vs. limit=15.0 2024-06-20 00:06:15,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=89994.66666666667, ans=0.125 2024-06-20 00:06:16,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=12.0 2024-06-20 00:06:17,117 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.359e+02 1.529e+03 1.779e+03 2.066e+03 2.906e+03, threshold=3.559e+03, percent-clipped=0.0 2024-06-20 00:06:24,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90013.0, ans=0.1 2024-06-20 00:06:26,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=90013.0, ans=0.125 2024-06-20 00:06:35,083 INFO [train.py:1028] (0/2) Epoch 5, batch 8650, loss[loss=0.3468, simple_loss=0.3656, pruned_loss=0.164, over 13009.00 frames. ], tot_loss[loss=0.3716, simple_loss=0.3791, pruned_loss=0.182, over 2577171.74 frames. 
], batch size: 102, lr: 1.10e-02, grad_scale: 1.0 2024-06-20 00:06:40,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=90068.0, ans=0.0 2024-06-20 00:06:49,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=90086.33333333333, ans=0.125 2024-06-20 00:06:53,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=90086.33333333333, ans=0.2 2024-06-20 00:07:02,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=90123.0, ans=0.125 2024-06-20 00:07:04,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.76 vs. limit=15.0 2024-06-20 00:07:08,031 INFO [train.py:1028] (0/2) Epoch 5, batch 8700, loss[loss=0.3418, simple_loss=0.3631, pruned_loss=0.1603, over 13133.00 frames. ], tot_loss[loss=0.3742, simple_loss=0.3808, pruned_loss=0.1838, over 2573780.92 frames. ], batch size: 59, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:07:08,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=90141.33333333333, ans=0.125 2024-06-20 00:07:10,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=90141.33333333333, ans=0.0 2024-06-20 00:07:15,452 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.091e+01 2024-06-20 00:07:23,800 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.033e+03 1.559e+03 1.828e+03 2.182e+03 3.915e+03, threshold=3.656e+03, percent-clipped=3.0 2024-06-20 00:07:24,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=90178.0, ans=0.2 2024-06-20 00:07:24,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.93 vs. limit=15.0 2024-06-20 00:07:25,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.47 vs. limit=15.0 2024-06-20 00:07:35,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=90214.66666666667, ans=0.0 2024-06-20 00:07:41,054 INFO [train.py:1028] (0/2) Epoch 5, batch 8750, loss[loss=0.3793, simple_loss=0.3773, pruned_loss=0.1907, over 13134.00 frames. ], tot_loss[loss=0.3733, simple_loss=0.3801, pruned_loss=0.1833, over 2569850.45 frames. ], batch size: 121, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:07:51,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=90251.33333333333, ans=0.0 2024-06-20 00:07:56,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=90251.33333333333, ans=0.125 2024-06-20 00:07:57,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.44 vs. 
limit=15.0 2024-06-20 00:08:14,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=90288.0, ans=6.0 2024-06-20 00:08:18,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.59 vs. limit=22.5 2024-06-20 00:08:22,509 INFO [train.py:1028] (0/2) Epoch 5, batch 8800, loss[loss=0.3555, simple_loss=0.3727, pruned_loss=0.1691, over 13277.00 frames. ], tot_loss[loss=0.3749, simple_loss=0.3811, pruned_loss=0.1844, over 2574710.19 frames. ], batch size: 72, lr: 1.10e-02, grad_scale: 4.0 2024-06-20 00:08:25,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=90324.66666666667, ans=0.0 2024-06-20 00:08:30,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.65 vs. limit=15.0 2024-06-20 00:08:31,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=90343.0, ans=0.0 2024-06-20 00:08:35,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=90361.33333333333, ans=0.125 2024-06-20 00:08:36,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=90361.33333333333, ans=0.125 2024-06-20 00:08:38,402 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.591e+02 1.580e+03 1.812e+03 2.178e+03 3.237e+03, threshold=3.624e+03, percent-clipped=0.0 2024-06-20 00:08:44,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=90379.66666666667, ans=0.0 2024-06-20 00:08:47,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=90379.66666666667, ans=0.2 2024-06-20 00:08:48,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=90398.0, ans=0.125 2024-06-20 00:08:50,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90398.0, ans=0.1 2024-06-20 00:08:55,990 INFO [train.py:1028] (0/2) Epoch 5, batch 8850, loss[loss=0.4114, simple_loss=0.4003, pruned_loss=0.2113, over 12557.00 frames. ], tot_loss[loss=0.3755, simple_loss=0.3812, pruned_loss=0.1849, over 2563988.06 frames. ], batch size: 202, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:09:01,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.64 vs. limit=15.0 2024-06-20 00:09:01,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.94 vs. 
limit=15.0 2024-06-20 00:09:05,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=90434.66666666667, ans=0.125 2024-06-20 00:09:07,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=90434.66666666667, ans=0.2 2024-06-20 00:09:09,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=90453.0, ans=0.125 2024-06-20 00:09:20,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.21 vs. limit=15.0 2024-06-20 00:09:29,492 INFO [train.py:1028] (0/2) Epoch 5, batch 8900, loss[loss=0.3664, simple_loss=0.3825, pruned_loss=0.1752, over 13023.00 frames. ], tot_loss[loss=0.3761, simple_loss=0.3819, pruned_loss=0.1852, over 2562513.15 frames. ], batch size: 33, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:09:32,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=90508.0, ans=0.2 2024-06-20 00:09:34,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=90508.0, ans=0.05 2024-06-20 00:09:40,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.11 vs. limit=10.0 2024-06-20 00:09:49,569 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.875e+02 1.655e+03 1.930e+03 2.365e+03 3.999e+03, threshold=3.860e+03, percent-clipped=2.0 2024-06-20 00:09:51,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=90563.0, ans=0.125 2024-06-20 00:09:57,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=90563.0, ans=0.1 2024-06-20 00:10:01,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=90563.0, ans=0.2 2024-06-20 00:10:09,178 INFO [train.py:1028] (0/2) Epoch 5, batch 8950, loss[loss=0.4415, simple_loss=0.4318, pruned_loss=0.2256, over 12575.00 frames. ], tot_loss[loss=0.3749, simple_loss=0.3816, pruned_loss=0.1841, over 2561581.03 frames. ], batch size: 203, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:10:20,896 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.293e-01 2024-06-20 00:10:21,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=90618.0, ans=0.125 2024-06-20 00:10:35,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=90654.66666666667, ans=0.125 2024-06-20 00:10:43,088 INFO [train.py:1028] (0/2) Epoch 5, batch 9000, loss[loss=0.3697, simple_loss=0.383, pruned_loss=0.1782, over 13273.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.382, pruned_loss=0.1834, over 2566662.70 frames. ], batch size: 46, lr: 1.10e-02, grad_scale: 4.0 2024-06-20 00:10:43,103 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 00:10:50,905 INFO [train.py:1060] (0/2) Epoch 5, validation: loss=0.2399, simple_loss=0.2944, pruned_loss=0.0927, over 351949.00 frames. 
2024-06-20 00:10:50,906 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-20 00:10:51,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.95 vs. limit=22.5 2024-06-20 00:11:03,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=90709.66666666667, ans=0.1 2024-06-20 00:11:08,278 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.025e+03 1.496e+03 1.864e+03 2.244e+03 3.253e+03, threshold=3.728e+03, percent-clipped=0.0 2024-06-20 00:11:11,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=90746.33333333333, ans=0.0 2024-06-20 00:11:13,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.44 vs. limit=22.5 2024-06-20 00:11:14,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=90746.33333333333, ans=0.125 2024-06-20 00:11:19,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=90764.66666666667, ans=0.125 2024-06-20 00:11:20,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=90764.66666666667, ans=0.0 2024-06-20 00:11:22,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.20 vs. limit=12.0 2024-06-20 00:11:24,183 INFO [train.py:1028] (0/2) Epoch 5, batch 9050, loss[loss=0.3383, simple_loss=0.3508, pruned_loss=0.163, over 11128.00 frames. ], tot_loss[loss=0.3759, simple_loss=0.383, pruned_loss=0.1844, over 2565445.92 frames. ], batch size: 16, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:11:27,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=90783.0, ans=0.0 2024-06-20 00:11:27,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90783.0, ans=0.1 2024-06-20 00:11:38,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=90819.66666666667, ans=0.0 2024-06-20 00:11:47,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=90838.0, ans=0.04949747468305833 2024-06-20 00:11:56,783 INFO [train.py:1028] (0/2) Epoch 5, batch 9100, loss[loss=0.3863, simple_loss=0.399, pruned_loss=0.1868, over 13247.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.3819, pruned_loss=0.1834, over 2566427.85 frames. ], batch size: 72, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:12:00,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=90874.66666666667, ans=0.025 2024-06-20 00:12:06,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=90893.0, ans=0.125 2024-06-20 00:12:13,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.57 vs. 
limit=6.0 2024-06-20 00:12:14,858 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.275e+02 1.278e+03 1.542e+03 1.838e+03 2.613e+03, threshold=3.085e+03, percent-clipped=0.0 2024-06-20 00:12:18,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.02 vs. limit=6.0 2024-06-20 00:12:29,022 INFO [train.py:1028] (0/2) Epoch 5, batch 9150, loss[loss=0.3518, simple_loss=0.3681, pruned_loss=0.1677, over 13133.00 frames. ], tot_loss[loss=0.3741, simple_loss=0.3815, pruned_loss=0.1834, over 2568434.64 frames. ], batch size: 77, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:12:34,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=90966.33333333333, ans=0.125 2024-06-20 00:12:38,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=90984.66666666667, ans=0.125 2024-06-20 00:12:42,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.77 vs. limit=15.0 2024-06-20 00:12:50,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=91003.0, ans=0.0 2024-06-20 00:12:50,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=91021.33333333333, ans=0.0 2024-06-20 00:12:56,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.20 vs. limit=10.0 2024-06-20 00:12:57,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=91039.66666666667, ans=0.0 2024-06-20 00:13:01,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=91039.66666666667, ans=0.2 2024-06-20 00:13:03,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=91039.66666666667, ans=0.125 2024-06-20 00:13:04,337 INFO [train.py:1028] (0/2) Epoch 5, batch 9200, loss[loss=0.3924, simple_loss=0.3987, pruned_loss=0.193, over 12963.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.3808, pruned_loss=0.182, over 2571071.39 frames. ], batch size: 36, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:13:05,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=91058.0, ans=0.1 2024-06-20 00:13:19,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.54 vs. 
limit=22.5 2024-06-20 00:13:20,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=91094.66666666667, ans=0.2 2024-06-20 00:13:22,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=91094.66666666667, ans=0.125 2024-06-20 00:13:25,641 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.805e+02 1.447e+03 1.625e+03 1.984e+03 3.493e+03, threshold=3.250e+03, percent-clipped=1.0 2024-06-20 00:13:29,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=91113.0, ans=0.125 2024-06-20 00:13:37,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=91131.33333333333, ans=0.0 2024-06-20 00:13:38,750 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:13:39,188 INFO [train.py:1028] (0/2) Epoch 5, batch 9250, loss[loss=0.366, simple_loss=0.3841, pruned_loss=0.174, over 13207.00 frames. ], tot_loss[loss=0.3716, simple_loss=0.3805, pruned_loss=0.1814, over 2573518.36 frames. ], batch size: 67, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:13:39,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=91149.66666666667, ans=0.125 2024-06-20 00:13:53,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=91186.33333333333, ans=0.125 2024-06-20 00:14:10,996 INFO [train.py:1028] (0/2) Epoch 5, batch 9300, loss[loss=0.3456, simple_loss=0.3549, pruned_loss=0.1682, over 12964.00 frames. ], tot_loss[loss=0.371, simple_loss=0.3799, pruned_loss=0.181, over 2571184.57 frames. ], batch size: 39, lr: 1.10e-02, grad_scale: 4.0 2024-06-20 00:14:22,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=91259.66666666667, ans=0.0 2024-06-20 00:14:23,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=91278.0, ans=0.0 2024-06-20 00:14:29,961 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.124e+03 1.794e+03 2.134e+03 2.597e+03 3.646e+03, threshold=4.268e+03, percent-clipped=3.0 2024-06-20 00:14:34,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=91296.33333333333, ans=0.125 2024-06-20 00:14:35,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.36 vs. limit=22.5 2024-06-20 00:14:38,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.44 vs. limit=15.0 2024-06-20 00:14:38,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0 2024-06-20 00:14:41,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=91314.66666666667, ans=0.125 2024-06-20 00:14:42,305 INFO [train.py:1028] (0/2) Epoch 5, batch 9350, loss[loss=0.3628, simple_loss=0.3728, pruned_loss=0.1764, over 12475.00 frames. 
], tot_loss[loss=0.3709, simple_loss=0.3797, pruned_loss=0.181, over 2568169.21 frames. ], batch size: 22, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:14:44,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=91333.0, ans=0.05 2024-06-20 00:15:00,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=91388.0, ans=0.125 2024-06-20 00:15:03,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=91388.0, ans=0.05 2024-06-20 00:15:09,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=91406.33333333333, ans=0.0 2024-06-20 00:15:12,449 INFO [train.py:1028] (0/2) Epoch 5, batch 9400, loss[loss=0.3855, simple_loss=0.3925, pruned_loss=0.1893, over 13200.00 frames. ], tot_loss[loss=0.3717, simple_loss=0.3801, pruned_loss=0.1816, over 2567527.68 frames. ], batch size: 52, lr: 1.10e-02, grad_scale: 2.0 2024-06-20 00:15:15,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=91424.66666666667, ans=0.125 2024-06-20 00:15:22,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.67 vs. limit=10.0 2024-06-20 00:15:28,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.52 vs. limit=10.0 2024-06-20 00:15:29,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=91461.33333333333, ans=0.125 2024-06-20 00:15:32,113 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.090e+03 1.739e+03 2.062e+03 2.444e+03 4.574e+03, threshold=4.125e+03, percent-clipped=1.0 2024-06-20 00:15:35,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. limit=6.0 2024-06-20 00:15:43,359 INFO [train.py:1028] (0/2) Epoch 5, batch 9450, loss[loss=0.3551, simple_loss=0.3597, pruned_loss=0.1752, over 12612.00 frames. ], tot_loss[loss=0.3733, simple_loss=0.3809, pruned_loss=0.1829, over 2567352.97 frames. 
], batch size: 22, lr: 1.09e-02, grad_scale: 1.0 2024-06-20 00:15:47,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=91516.33333333333, ans=0.125 2024-06-20 00:15:49,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=91534.66666666667, ans=0.125 2024-06-20 00:15:50,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=91534.66666666667, ans=0.04949747468305833 2024-06-20 00:15:52,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=91534.66666666667, ans=0.0 2024-06-20 00:15:52,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=91534.66666666667, ans=0.125 2024-06-20 00:15:56,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=91553.0, ans=0.125 2024-06-20 00:16:04,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=91571.33333333333, ans=6.0 2024-06-20 00:16:12,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.00 vs. limit=15.0 2024-06-20 00:16:14,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=91589.66666666667, ans=0.125 2024-06-20 00:16:15,489 INFO [train.py:1028] (0/2) Epoch 5, batch 9500, loss[loss=0.3226, simple_loss=0.3436, pruned_loss=0.1508, over 13279.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.3797, pruned_loss=0.1816, over 2576423.05 frames. ], batch size: 43, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:16:27,681 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:16:27,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.99 vs. limit=10.0 2024-06-20 00:16:37,263 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.047e+03 1.499e+03 1.754e+03 2.022e+03 3.467e+03, threshold=3.507e+03, percent-clipped=0.0 2024-06-20 00:16:48,801 INFO [train.py:1028] (0/2) Epoch 5, batch 9550, loss[loss=0.3629, simple_loss=0.3756, pruned_loss=0.1751, over 13172.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.3799, pruned_loss=0.1819, over 2572680.74 frames. ], batch size: 40, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:17:06,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=91736.33333333333, ans=0.125 2024-06-20 00:17:09,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=91754.66666666667, ans=0.0 2024-06-20 00:17:13,140 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.48 vs. limit=22.5 2024-06-20 00:17:18,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.41 vs. 
limit=15.0 2024-06-20 00:17:19,694 INFO [train.py:1028] (0/2) Epoch 5, batch 9600, loss[loss=0.4071, simple_loss=0.3878, pruned_loss=0.2131, over 10640.00 frames. ], tot_loss[loss=0.3722, simple_loss=0.38, pruned_loss=0.1822, over 2570177.33 frames. ], batch size: 304, lr: 1.09e-02, grad_scale: 4.0 2024-06-20 00:17:25,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=91791.33333333333, ans=0.125 2024-06-20 00:17:27,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2024-06-20 00:17:28,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=91809.66666666667, ans=0.1 2024-06-20 00:17:32,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.31 vs. limit=15.0 2024-06-20 00:17:34,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=91828.0, ans=0.125 2024-06-20 00:17:34,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=91828.0, ans=0.025 2024-06-20 00:17:36,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=91828.0, ans=0.125 2024-06-20 00:17:39,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=91846.33333333333, ans=0.0 2024-06-20 00:17:39,554 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.798e+02 1.541e+03 1.790e+03 2.144e+03 3.440e+03, threshold=3.580e+03, percent-clipped=0.0 2024-06-20 00:17:46,126 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.94 vs. limit=22.5 2024-06-20 00:17:48,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.95 vs. limit=10.0 2024-06-20 00:17:49,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=91864.66666666667, ans=0.125 2024-06-20 00:17:50,494 INFO [train.py:1028] (0/2) Epoch 5, batch 9650, loss[loss=0.3466, simple_loss=0.3529, pruned_loss=0.1701, over 13084.00 frames. ], tot_loss[loss=0.3741, simple_loss=0.3809, pruned_loss=0.1836, over 2560570.44 frames. 
], batch size: 132, lr: 1.09e-02, grad_scale: 4.0 2024-06-20 00:17:56,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=91901.33333333333, ans=0.0 2024-06-20 00:17:59,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=91901.33333333333, ans=0.125 2024-06-20 00:18:05,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=91919.66666666667, ans=0.125 2024-06-20 00:18:11,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=91938.0, ans=0.125 2024-06-20 00:18:15,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=91956.33333333333, ans=0.125 2024-06-20 00:18:21,653 INFO [train.py:1028] (0/2) Epoch 5, batch 9700, loss[loss=0.3788, simple_loss=0.3738, pruned_loss=0.1919, over 13031.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.3805, pruned_loss=0.184, over 2555233.80 frames. ], batch size: 144, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:18:33,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=91993.0, ans=0.025 2024-06-20 00:18:40,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=92011.33333333333, ans=0.2 2024-06-20 00:18:44,931 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.066e+03 1.514e+03 1.825e+03 2.118e+03 3.438e+03, threshold=3.650e+03, percent-clipped=0.0 2024-06-20 00:18:45,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=92029.66666666667, ans=0.0 2024-06-20 00:18:54,082 INFO [train.py:1028] (0/2) Epoch 5, batch 9750, loss[loss=0.3722, simple_loss=0.3664, pruned_loss=0.189, over 13142.00 frames. ], tot_loss[loss=0.3721, simple_loss=0.3789, pruned_loss=0.1826, over 2551930.03 frames. ], batch size: 132, lr: 1.09e-02, grad_scale: 0.5 2024-06-20 00:18:58,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=92066.33333333333, ans=0.125 2024-06-20 00:19:01,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=92084.66666666667, ans=0.125 2024-06-20 00:19:03,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.18 vs. 
limit=15.0 2024-06-20 00:19:06,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=92084.66666666667, ans=0.125 2024-06-20 00:19:13,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=92103.0, ans=0.0 2024-06-20 00:19:20,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=92139.66666666667, ans=0.125 2024-06-20 00:19:23,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=92139.66666666667, ans=0.125 2024-06-20 00:19:23,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.22 vs. limit=15.0 2024-06-20 00:19:25,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=92158.0, ans=0.2 2024-06-20 00:19:26,423 INFO [train.py:1028] (0/2) Epoch 5, batch 9800, loss[loss=0.3774, simple_loss=0.3862, pruned_loss=0.1842, over 13242.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.3772, pruned_loss=0.1813, over 2545126.46 frames. ], batch size: 40, lr: 1.09e-02, grad_scale: 1.0 2024-06-20 00:19:44,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.92 vs. limit=15.0 2024-06-20 00:19:44,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=92213.0, ans=0.125 2024-06-20 00:19:44,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=92213.0, ans=0.0 2024-06-20 00:19:48,368 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.579e+02 1.085e+03 1.322e+03 1.560e+03 2.640e+03, threshold=2.645e+03, percent-clipped=0.0 2024-06-20 00:19:56,889 INFO [train.py:1028] (0/2) Epoch 5, batch 9850, loss[loss=0.378, simple_loss=0.3776, pruned_loss=0.1892, over 13115.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.3764, pruned_loss=0.1804, over 2538246.64 frames. ], batch size: 103, lr: 1.09e-02, grad_scale: 1.0 2024-06-20 00:20:13,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.54 vs. limit=6.0 2024-06-20 00:20:23,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=92323.0, ans=0.125 2024-06-20 00:20:29,359 INFO [train.py:1028] (0/2) Epoch 5, batch 9900, loss[loss=0.3348, simple_loss=0.357, pruned_loss=0.1563, over 12956.00 frames. ], tot_loss[loss=0.368, simple_loss=0.3755, pruned_loss=0.1803, over 2529661.78 frames. ], batch size: 39, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:20:32,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=92341.33333333333, ans=0.0 2024-06-20 00:20:34,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=92341.33333333333, ans=0.125 2024-06-20 00:20:41,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.94 vs. 
limit=15.0 2024-06-20 00:20:51,965 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.070e+02 1.144e+03 1.320e+03 1.530e+03 5.096e+03, threshold=2.641e+03, percent-clipped=5.0 2024-06-20 00:20:52,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=92396.33333333333, ans=0.125 2024-06-20 00:20:54,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=92414.66666666667, ans=0.125 2024-06-20 00:20:56,363 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:20:56,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=92414.66666666667, ans=0.125 2024-06-20 00:20:56,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=92414.66666666667, ans=0.125 2024-06-20 00:21:00,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2024-06-20 00:21:00,482 INFO [train.py:1028] (0/2) Epoch 5, batch 9950, loss[loss=0.3705, simple_loss=0.3795, pruned_loss=0.1807, over 12702.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3731, pruned_loss=0.1794, over 2524696.90 frames. ], batch size: 29, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:21:06,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=92433.0, ans=0.0 2024-06-20 00:21:10,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.78 vs. limit=15.0 2024-06-20 00:21:10,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=15.0 2024-06-20 00:21:15,034 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:21:24,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=92488.0, ans=0.125 2024-06-20 00:21:25,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=15.0 2024-06-20 00:21:32,787 INFO [train.py:1028] (0/2) Epoch 5, batch 10000, loss[loss=0.3125, simple_loss=0.3324, pruned_loss=0.1463, over 12718.00 frames. ], tot_loss[loss=0.3667, simple_loss=0.3735, pruned_loss=0.18, over 2488556.90 frames. ], batch size: 22, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:21:40,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.89 vs. 
limit=22.5 2024-06-20 00:21:45,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92561.33333333333, ans=0.1 2024-06-20 00:21:46,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=92561.33333333333, ans=0.95 2024-06-20 00:21:48,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=92561.33333333333, ans=0.125 2024-06-20 00:21:56,597 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.006e+02 1.104e+03 1.322e+03 1.552e+03 3.840e+03, threshold=2.643e+03, percent-clipped=5.0 2024-06-20 00:22:01,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=92598.0, ans=0.125 2024-06-20 00:22:04,828 INFO [train.py:1028] (0/2) Epoch 5, batch 10050, loss[loss=0.374, simple_loss=0.3851, pruned_loss=0.1814, over 12417.00 frames. ], tot_loss[loss=0.3684, simple_loss=0.3738, pruned_loss=0.1814, over 2444886.29 frames. ], batch size: 22, lr: 1.09e-02, grad_scale: 2.0 2024-06-20 00:22:08,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=92616.33333333333, ans=0.125 2024-06-20 00:22:18,538 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:22:19,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=92653.0, ans=0.125 2024-06-20 00:22:27,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.22 vs. limit=22.5 2024-06-20 00:22:30,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=92689.66666666667, ans=0.5 2024-06-20 00:22:34,605 INFO [train.py:1028] (0/2) Epoch 5, batch 10100, loss[loss=0.3175, simple_loss=0.3277, pruned_loss=0.1536, over 10756.00 frames. ], tot_loss[loss=0.366, simple_loss=0.3728, pruned_loss=0.1796, over 2424230.59 frames. ], batch size: 16, lr: 1.09e-02, grad_scale: 4.0 2024-06-20 00:22:39,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92708.0, ans=0.1 2024-06-20 00:22:40,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.72 vs. limit=15.0 2024-06-20 00:22:41,335 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.08 vs. limit=15.0 2024-06-20 00:22:42,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92726.33333333333, ans=0.1 2024-06-20 00:22:48,642 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-5.pt 2024-06-20 00:24:51,472 INFO [train.py:1028] (0/2) Epoch 6, batch 0, loss[loss=0.2988, simple_loss=0.3204, pruned_loss=0.1386, over 13009.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3204, pruned_loss=0.1386, over 13009.00 frames. 
], batch size: 36, lr: 1.02e-02, grad_scale: 8.0 2024-06-20 00:24:51,472 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 00:24:58,685 INFO [train.py:1060] (0/2) Epoch 6, validation: loss=0.2433, simple_loss=0.2974, pruned_loss=0.09461, over 351949.00 frames. 2024-06-20 00:24:58,686 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-20 00:25:01,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=92741.0, ans=0.125 2024-06-20 00:25:06,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92759.33333333333, ans=0.1 2024-06-20 00:25:07,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=92759.33333333333, ans=0.2 2024-06-20 00:25:09,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=92759.33333333333, ans=0.125 2024-06-20 00:25:10,043 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.18 vs. limit=15.0 2024-06-20 00:25:12,341 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.989e+02 1.139e+03 1.266e+03 1.488e+03 4.107e+03, threshold=2.532e+03, percent-clipped=1.0 2024-06-20 00:25:15,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92777.66666666667, ans=0.1 2024-06-20 00:25:24,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=92796.0, ans=0.2 2024-06-20 00:25:26,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=92814.33333333333, ans=0.025 2024-06-20 00:25:27,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=92814.33333333333, ans=0.125 2024-06-20 00:25:32,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=92832.66666666667, ans=0.125 2024-06-20 00:25:35,479 INFO [train.py:1028] (0/2) Epoch 6, batch 50, loss[loss=0.3129, simple_loss=0.3437, pruned_loss=0.1411, over 12641.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.348, pruned_loss=0.1654, over 573960.40 frames. ], batch size: 29, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:25:39,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92832.66666666667, ans=0.1 2024-06-20 00:25:41,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.67 vs. 
limit=15.0 2024-06-20 00:25:42,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92851.0, ans=0.1 2024-06-20 00:25:46,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=92851.0, ans=0.125 2024-06-20 00:25:49,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=92869.33333333333, ans=0.125 2024-06-20 00:25:50,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.45 vs. limit=22.5 2024-06-20 00:25:53,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=92869.33333333333, ans=0.1 2024-06-20 00:25:54,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=92887.66666666667, ans=0.0 2024-06-20 00:25:59,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=92887.66666666667, ans=0.5 2024-06-20 00:26:04,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2024-06-20 00:26:05,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=92906.0, ans=0.025 2024-06-20 00:26:10,160 INFO [train.py:1028] (0/2) Epoch 6, batch 100, loss[loss=0.3208, simple_loss=0.3443, pruned_loss=0.1486, over 13296.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3469, pruned_loss=0.1642, over 1017334.57 frames. ], batch size: 46, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:26:17,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92942.66666666667, ans=0.1 2024-06-20 00:26:21,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2024-06-20 00:26:23,364 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.162e+03 1.710e+03 2.043e+03 2.456e+03 4.156e+03, threshold=4.087e+03, percent-clipped=23.0 2024-06-20 00:26:30,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=92979.33333333333, ans=0.125 2024-06-20 00:26:30,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=92979.33333333333, ans=10.0 2024-06-20 00:26:37,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=92997.66666666667, ans=0.07 2024-06-20 00:26:37,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=92997.66666666667, ans=0.125 2024-06-20 00:26:42,093 INFO [train.py:1028] (0/2) Epoch 6, batch 150, loss[loss=0.3366, simple_loss=0.3475, pruned_loss=0.1628, over 12588.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3462, pruned_loss=0.1623, over 1365448.58 frames. 
], batch size: 29, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:26:47,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0 2024-06-20 00:26:54,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=93052.66666666667, ans=0.0 2024-06-20 00:27:05,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=93071.0, ans=0.125 2024-06-20 00:27:08,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=93089.33333333333, ans=0.2 2024-06-20 00:27:08,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=93089.33333333333, ans=0.125 2024-06-20 00:27:09,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=93089.33333333333, ans=0.0 2024-06-20 00:27:13,816 INFO [train.py:1028] (0/2) Epoch 6, batch 200, loss[loss=0.3814, simple_loss=0.3708, pruned_loss=0.196, over 12622.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3462, pruned_loss=0.1622, over 1635192.33 frames. ], batch size: 202, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:27:15,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=93107.66666666667, ans=0.125 2024-06-20 00:27:15,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.17 vs. limit=22.5 2024-06-20 00:27:27,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.36 vs. limit=15.0 2024-06-20 00:27:27,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.052e+03 1.606e+03 1.902e+03 2.292e+03 3.851e+03, threshold=3.805e+03, percent-clipped=0.0 2024-06-20 00:27:27,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=93144.33333333333, ans=0.1 2024-06-20 00:27:28,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=93144.33333333333, ans=0.0 2024-06-20 00:27:44,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93181.0, ans=0.1 2024-06-20 00:27:48,505 INFO [train.py:1028] (0/2) Epoch 6, batch 250, loss[loss=0.3299, simple_loss=0.3283, pruned_loss=0.1658, over 13045.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.3454, pruned_loss=0.1615, over 1847736.03 frames. 
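grad_scale in the batch summaries wanders between 1.0 and 8.0 over a few hundred batches, the signature of dynamic fp16 loss scaling: the scale is halved when a step produces non-finite gradients and grown back after a run of clean steps. A minimal sketch of that policy, assuming the standard AMP-style rule rather than whatever train.py actually implements:

```python
class DynamicGradScaler:
    """Halve the scale on overflow, double it after growth_interval
    consecutive finite steps: the usual fp16 loss-scaling policy."""

    def __init__(self, init_scale: float = 8.0, growth_interval: int = 1000,
                 min_scale: float = 1.0, max_scale: float = 2.0 ** 16):
        self.scale = init_scale
        self.good_steps = 0
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self.max_scale = max_scale

    def update(self, found_inf: bool) -> float:
        if found_inf:
            # Overflowed step: skip the update and back off the scale.
            self.scale = max(self.min_scale, self.scale / 2.0)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale = min(self.max_scale, self.scale * 2.0)
        return self.scale
```

This explains why grad_scale in the summaries above is not monotone: each dip corresponds to an overflowed fp16 step, each doubling to a long clean stretch.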
], batch size: 144, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:27:50,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=93199.33333333333, ans=0.125 2024-06-20 00:28:02,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=93236.0, ans=0.0 2024-06-20 00:28:06,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=93236.0, ans=0.125 2024-06-20 00:28:15,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=93254.33333333333, ans=0.125 2024-06-20 00:28:20,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.41 vs. limit=15.0 2024-06-20 00:28:23,562 INFO [train.py:1028] (0/2) Epoch 6, batch 300, loss[loss=0.3537, simple_loss=0.3588, pruned_loss=0.1743, over 13188.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3467, pruned_loss=0.1625, over 2010005.48 frames. ], batch size: 112, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:28:25,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=15.0 2024-06-20 00:28:39,219 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.877e+02 2.032e+03 2.416e+03 2.827e+03 4.360e+03, threshold=4.831e+03, percent-clipped=3.0 2024-06-20 00:28:46,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.04 vs. limit=6.0 2024-06-20 00:28:46,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=93346.0, ans=0.125 2024-06-20 00:28:49,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=93364.33333333333, ans=0.2 2024-06-20 00:28:53,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=93364.33333333333, ans=0.0 2024-06-20 00:28:54,885 INFO [train.py:1028] (0/2) Epoch 6, batch 350, loss[loss=0.3452, simple_loss=0.3544, pruned_loss=0.168, over 12945.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.3459, pruned_loss=0.1613, over 2139361.56 frames. ], batch size: 33, lr: 1.01e-02, grad_scale: 1.0 2024-06-20 00:28:57,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93382.66666666667, ans=0.1 2024-06-20 00:29:05,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=93401.0, ans=0.0 2024-06-20 00:29:17,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=93437.66666666667, ans=0.125 2024-06-20 00:29:22,270 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=33.13 vs. 
limit=22.5 2024-06-20 00:29:27,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=93456.0, ans=0.0 2024-06-20 00:29:28,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=93474.33333333333, ans=0.125 2024-06-20 00:29:29,369 INFO [train.py:1028] (0/2) Epoch 6, batch 400, loss[loss=0.3495, simple_loss=0.362, pruned_loss=0.1685, over 13252.00 frames. ], tot_loss[loss=0.3334, simple_loss=0.3455, pruned_loss=0.1607, over 2239429.47 frames. ], batch size: 63, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:29:35,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=93492.66666666667, ans=0.125 2024-06-20 00:29:42,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2024-06-20 00:29:44,935 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.309e+02 1.605e+03 1.845e+03 2.027e+03 2.972e+03, threshold=3.690e+03, percent-clipped=0.0 2024-06-20 00:29:48,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=93529.33333333333, ans=0.125 2024-06-20 00:29:56,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93547.66666666667, ans=0.1 2024-06-20 00:29:59,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.65 vs. limit=10.0 2024-06-20 00:30:00,362 INFO [train.py:1028] (0/2) Epoch 6, batch 450, loss[loss=0.3519, simple_loss=0.3691, pruned_loss=0.1673, over 13224.00 frames. ], tot_loss[loss=0.3321, simple_loss=0.3448, pruned_loss=0.1597, over 2312848.46 frames. ], batch size: 67, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:30:01,792 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=4.028e+00 2024-06-20 00:30:08,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=93584.33333333333, ans=0.125 2024-06-20 00:30:16,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=93602.66666666667, ans=0.5 2024-06-20 00:30:37,196 INFO [train.py:1028] (0/2) Epoch 6, batch 500, loss[loss=0.2895, simple_loss=0.3102, pruned_loss=0.1344, over 13150.00 frames. ], tot_loss[loss=0.3315, simple_loss=0.3448, pruned_loss=0.1591, over 2375494.93 frames. ], batch size: 121, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:30:38,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=93657.66666666667, ans=0.125 2024-06-20 00:30:53,966 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.459e+02 1.224e+03 1.470e+03 1.758e+03 2.835e+03, threshold=2.940e+03, percent-clipped=0.0 2024-06-20 00:30:56,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=93694.33333333333, ans=0.2 2024-06-20 00:31:10,262 INFO [train.py:1028] (0/2) Epoch 6, batch 550, loss[loss=0.3587, simple_loss=0.361, pruned_loss=0.1782, over 12904.00 frames. 
], tot_loss[loss=0.3306, simple_loss=0.3441, pruned_loss=0.1585, over 2421071.08 frames. ], batch size: 158, lr: 1.01e-02, grad_scale: 1.0 2024-06-20 00:31:31,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=93804.33333333333, ans=0.2 2024-06-20 00:31:37,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.83 vs. limit=22.5 2024-06-20 00:31:38,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=93822.66666666667, ans=0.125 2024-06-20 00:31:39,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=93822.66666666667, ans=0.0 2024-06-20 00:31:41,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=93822.66666666667, ans=0.125 2024-06-20 00:31:45,006 INFO [train.py:1028] (0/2) Epoch 6, batch 600, loss[loss=0.3292, simple_loss=0.33, pruned_loss=0.1642, over 13047.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3432, pruned_loss=0.1581, over 2459129.05 frames. ], batch size: 144, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:31:45,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.84 vs. limit=15.0 2024-06-20 00:31:47,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=93841.0, ans=0.1 2024-06-20 00:31:51,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.39 vs. limit=15.0 2024-06-20 00:31:56,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=93859.33333333333, ans=0.1 2024-06-20 00:31:57,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.20 vs. limit=15.0 2024-06-20 00:31:57,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.18 vs. limit=22.5 2024-06-20 00:32:00,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=93877.66666666667, ans=0.125 2024-06-20 00:32:02,626 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.196e+02 1.254e+03 1.386e+03 1.678e+03 4.024e+03, threshold=2.772e+03, percent-clipped=1.0 2024-06-20 00:32:08,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=93896.0, ans=0.0 2024-06-20 00:32:17,829 INFO [train.py:1028] (0/2) Epoch 6, batch 650, loss[loss=0.3187, simple_loss=0.3346, pruned_loss=0.1514, over 13174.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3435, pruned_loss=0.1579, over 2490516.47 frames. 
], batch size: 59, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:32:25,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=93932.66666666667, ans=0.125 2024-06-20 00:32:26,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=93932.66666666667, ans=0.07 2024-06-20 00:32:39,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93969.33333333333, ans=0.0 2024-06-20 00:32:43,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.27 vs. limit=22.5 2024-06-20 00:32:51,579 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.58 vs. limit=15.0 2024-06-20 00:32:52,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=94006.0, ans=0.2 2024-06-20 00:32:53,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=94024.33333333333, ans=0.0 2024-06-20 00:32:53,757 INFO [train.py:1028] (0/2) Epoch 6, batch 700, loss[loss=0.3115, simple_loss=0.3331, pruned_loss=0.1449, over 13256.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3438, pruned_loss=0.1583, over 2512758.23 frames. ], batch size: 46, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:32:55,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=94024.33333333333, ans=0.125 2024-06-20 00:32:56,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.04 vs. limit=15.0 2024-06-20 00:33:01,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=94042.66666666667, ans=0.125 2024-06-20 00:33:05,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=94061.0, ans=0.125 2024-06-20 00:33:09,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=94061.0, ans=0.025 2024-06-20 00:33:11,277 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.574e+02 1.248e+03 1.504e+03 1.761e+03 3.876e+03, threshold=3.008e+03, percent-clipped=3.0 2024-06-20 00:33:15,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=94079.33333333333, ans=0.09899494936611666 2024-06-20 00:33:19,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.72 vs. limit=15.0 2024-06-20 00:33:25,630 INFO [train.py:1028] (0/2) Epoch 6, batch 750, loss[loss=0.3085, simple_loss=0.3341, pruned_loss=0.1414, over 13227.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3439, pruned_loss=0.1577, over 2528979.00 frames. 
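The scaling.py:1023 Whitening lines fire when a layer's feature covariance drifts away from white, comparing a whiteness statistic against a limit (e.g. metric=14.36 vs. limit=15.0). One statistic with the right behaviour is d * tr(C^2) / tr(C)^2 for a d x d covariance C: it equals 1 when C is proportional to the identity and grows as energy concentrates in a few directions. The sketch below uses that form as a plausible reconstruction; icefall's exact formula may differ.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns a whiteness statistic that
    is 1.0 when each group's covariance is proportional to the identity
    and grows as the eigenvalue spectrum becomes uneven."""
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups                       # channels per group
    xg = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)  # (G, N, cpg)
    cov = xg.transpose(1, 2) @ xg / num_frames                   # (G, cpg, cpg)
    tr_cov = cov.diagonal(dim1=1, dim2=2).sum(-1)                # tr(C)
    tr_cov2 = (cov * cov.transpose(1, 2)).sum((1, 2))            # tr(C @ C)
    metric = cpg * tr_cov2 / (tr_cov ** 2 + 1e-20)
    return metric.mean().item()

x = torch.randn(1000, 384)
print(whitening_metric(x))                                  # near 1: white
print(whitening_metric(x * torch.linspace(0.1, 3.0, 384)))  # uneven: larger
```

Under this reading, a line like "metric=22.68 vs. limit=22.5" marks a layer whose activations just crossed the allowed eigenvalue spread, triggering the corrective whitening penalty.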
], batch size: 63, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:33:28,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=94116.0, ans=0.125 2024-06-20 00:33:31,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2024-06-20 00:33:49,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=94171.0, ans=0.2 2024-06-20 00:33:56,347 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.091e+03 2024-06-20 00:33:58,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94189.33333333333, ans=0.125 2024-06-20 00:34:00,730 INFO [train.py:1028] (0/2) Epoch 6, batch 800, loss[loss=0.3027, simple_loss=0.3338, pruned_loss=0.1358, over 12869.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3437, pruned_loss=0.1576, over 2542606.35 frames. ], batch size: 36, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:34:01,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.67 vs. limit=15.0 2024-06-20 00:34:14,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.19 vs. limit=10.0 2024-06-20 00:34:17,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=94244.33333333333, ans=0.0 2024-06-20 00:34:19,353 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.132e+02 1.242e+03 1.526e+03 1.877e+03 2.567e+03, threshold=3.051e+03, percent-clipped=0.0 2024-06-20 00:34:24,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2024-06-20 00:34:34,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=94281.0, ans=0.0 2024-06-20 00:34:37,233 INFO [train.py:1028] (0/2) Epoch 6, batch 850, loss[loss=0.2979, simple_loss=0.3204, pruned_loss=0.1377, over 13151.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3433, pruned_loss=0.1572, over 2552322.54 frames. ], batch size: 95, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:34:39,960 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:34:43,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.74 vs. 
limit=10.0 2024-06-20 00:34:44,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=94317.66666666667, ans=0.95 2024-06-20 00:34:46,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=94317.66666666667, ans=0.125 2024-06-20 00:34:51,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=94336.0, ans=0.125 2024-06-20 00:34:52,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=15.0 2024-06-20 00:34:58,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=94354.33333333333, ans=0.2 2024-06-20 00:35:08,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=94372.66666666667, ans=0.95 2024-06-20 00:35:09,416 INFO [train.py:1028] (0/2) Epoch 6, batch 900, loss[loss=0.3197, simple_loss=0.3363, pruned_loss=0.1516, over 12895.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3425, pruned_loss=0.1574, over 2557089.11 frames. ], batch size: 36, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:35:09,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=94391.0, ans=0.2 2024-06-20 00:35:19,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=94409.33333333333, ans=0.0 2024-06-20 00:35:22,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0 2024-06-20 00:35:23,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.90 vs. limit=15.0 2024-06-20 00:35:25,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.33 vs. limit=15.0 2024-06-20 00:35:27,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=94427.66666666667, ans=0.125 2024-06-20 00:35:29,248 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.399e+02 9.491e+02 1.186e+03 1.373e+03 2.094e+03, threshold=2.372e+03, percent-clipped=0.0 2024-06-20 00:35:41,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=94464.33333333333, ans=0.125 2024-06-20 00:35:41,672 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2024-06-20 00:35:42,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=94464.33333333333, ans=0.125 2024-06-20 00:35:45,305 INFO [train.py:1028] (0/2) Epoch 6, batch 950, loss[loss=0.2909, simple_loss=0.3258, pruned_loss=0.128, over 13169.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3424, pruned_loss=0.1572, over 2559901.72 frames. 
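The frame counts inside tot_loss[...] grow fast at first (573960.40 at batch 50, 1017334.57 at batch 100) and then flatten toward roughly 2.59e6, and they are fractional; both facts point to a geometrically decayed running sum rather than a plain cumulative total. A decay of 0.995 per batch with roughly 13k frames per batch reproduces the logged counts to within a fraction of a percent. The accumulator below encodes that inference (the decay constant is fitted to the numbers above, not read out of train.py):

```python
class DecayingLossTracker:
    """tot_loss as a geometrically decayed running sum:
    state <- state * decay + this_batch. With decay=0.995 the effective
    window is ~200 batches and the frame count saturates near
    frames_per_batch / (1 - decay)."""

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

t = DecayingLossTracker()
for _ in range(150):
    t.update(batch_loss_sum=0.33 * 12900, batch_frames=12900)
print(t.frames)  # ~1.36e6, close to the batch-150 frame count logged above
```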
], batch size: 40, lr: 1.01e-02, grad_scale: 2.0 2024-06-20 00:35:51,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=94501.0, ans=0.025 2024-06-20 00:35:54,126 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.68 vs. limit=22.5 2024-06-20 00:35:55,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs. limit=15.0 2024-06-20 00:35:55,693 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=4.367e+01 2024-06-20 00:36:02,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=94519.33333333333, ans=0.0 2024-06-20 00:36:13,693 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:36:16,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=94574.33333333333, ans=0.04949747468305833 2024-06-20 00:36:17,079 INFO [train.py:1028] (0/2) Epoch 6, batch 1000, loss[loss=0.3307, simple_loss=0.3495, pruned_loss=0.1559, over 13051.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3421, pruned_loss=0.1572, over 2560966.36 frames. ], batch size: 48, lr: 1.01e-02, grad_scale: 4.0 2024-06-20 00:36:28,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=94592.66666666667, ans=0.125 2024-06-20 00:36:31,127 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.060e-02 2024-06-20 00:36:33,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=94611.0, ans=0.2 2024-06-20 00:36:35,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94611.0, ans=0.1 2024-06-20 00:36:39,520 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.808e+02 8.900e+02 9.853e+02 1.075e+03 2.399e+03, threshold=1.971e+03, percent-clipped=1.0 2024-06-20 00:36:40,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=94629.33333333333, ans=0.125 2024-06-20 00:36:44,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.18 vs. limit=15.0 2024-06-20 00:36:52,438 INFO [train.py:1028] (0/2) Epoch 6, batch 1050, loss[loss=0.3169, simple_loss=0.337, pruned_loss=0.1484, over 13169.00 frames. ], tot_loss[loss=0.3293, simple_loss=0.3434, pruned_loss=0.1576, over 2563940.21 frames. 
], batch size: 77, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:37:05,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=94702.66666666667, ans=0.125 2024-06-20 00:37:06,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=94702.66666666667, ans=0.0 2024-06-20 00:37:17,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=94739.33333333333, ans=0.125 2024-06-20 00:37:24,380 INFO [train.py:1028] (0/2) Epoch 6, batch 1100, loss[loss=0.3494, simple_loss=0.3635, pruned_loss=0.1677, over 13308.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3434, pruned_loss=0.1575, over 2569038.50 frames. ], batch size: 52, lr: 1.00e-02, grad_scale: 8.0 2024-06-20 00:37:25,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=94757.66666666667, ans=0.1 2024-06-20 00:37:25,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=94757.66666666667, ans=0.2 2024-06-20 00:37:46,732 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.709e+02 9.361e+02 1.069e+03 1.325e+03 2.009e+03, threshold=2.138e+03, percent-clipped=1.0 2024-06-20 00:37:48,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=94812.66666666667, ans=0.0 2024-06-20 00:37:49,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=94812.66666666667, ans=0.125 2024-06-20 00:37:59,706 INFO [train.py:1028] (0/2) Epoch 6, batch 1150, loss[loss=0.3451, simple_loss=0.3542, pruned_loss=0.168, over 13298.00 frames. ], tot_loss[loss=0.3309, simple_loss=0.3445, pruned_loss=0.1587, over 2570618.90 frames. ], batch size: 52, lr: 1.00e-02, grad_scale: 2.0 2024-06-20 00:38:04,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.90 vs. limit=12.0 2024-06-20 00:38:04,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=94849.33333333333, ans=0.025 2024-06-20 00:38:04,649 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.67 vs. limit=10.0 2024-06-20 00:38:12,334 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.06 vs. limit=22.5 2024-06-20 00:38:12,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=8.0 2024-06-20 00:38:20,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=94886.0, ans=0.2 2024-06-20 00:38:34,847 INFO [train.py:1028] (0/2) Epoch 6, batch 1200, loss[loss=0.3297, simple_loss=0.3451, pruned_loss=0.1571, over 13179.00 frames. ], tot_loss[loss=0.3314, simple_loss=0.3445, pruned_loss=0.1591, over 2572866.92 frames. 
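lr decays smoothly within the epoch (1.01e-02 near batch 50, 9.92e-03 by batch 2400), consistent with icefall's Eden scheduler, which damps a base learning rate by inverse quarter powers of both the global batch index and the fractional epoch. A sketch of that rule follows; the lr_batches/lr_epochs/base_lr constants appear to be this run's configured values, and the batch/epoch arguments in the demo call are rough estimates, so treat the printed number as indicative only.

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Eden-style schedule: smooth inverse-quarter-power decay in both
    the global batch index and the (fractional) epoch."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Around global batch ~52000 in epoch ~5-6 this lands near the
# ~1.0e-02 printed in the batch summaries above.
print(eden_lr(0.035, batch=52000, epoch=5.2))
```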
], batch size: 77, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:38:36,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=94941.0, ans=0.0 2024-06-20 00:38:42,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=94959.33333333333, ans=0.125 2024-06-20 00:38:55,921 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.126e+02 1.133e+03 1.284e+03 1.495e+03 2.434e+03, threshold=2.569e+03, percent-clipped=1.0 2024-06-20 00:39:06,713 INFO [train.py:1028] (0/2) Epoch 6, batch 1250, loss[loss=0.3059, simple_loss=0.319, pruned_loss=0.1465, over 13168.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3439, pruned_loss=0.1584, over 2581863.19 frames. ], batch size: 112, lr: 1.00e-02, grad_scale: 2.0 2024-06-20 00:39:06,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=95032.66666666667, ans=0.125 2024-06-20 00:39:12,269 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=22.5 2024-06-20 00:39:12,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.84 vs. limit=15.0 2024-06-20 00:39:20,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=95069.33333333333, ans=0.125 2024-06-20 00:39:20,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.39 vs. limit=15.0 2024-06-20 00:39:22,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=95069.33333333333, ans=0.0 2024-06-20 00:39:26,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95087.66666666667, ans=0.125 2024-06-20 00:39:32,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=95106.0, ans=0.125 2024-06-20 00:39:34,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=95106.0, ans=0.05 2024-06-20 00:39:39,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=95106.0, ans=0.2 2024-06-20 00:39:41,961 INFO [train.py:1028] (0/2) Epoch 6, batch 1300, loss[loss=0.3557, simple_loss=0.3587, pruned_loss=0.1764, over 12707.00 frames. ], tot_loss[loss=0.3293, simple_loss=0.3431, pruned_loss=0.1577, over 2582809.67 frames. ], batch size: 176, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:39:47,964 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.05 vs. limit=12.0 2024-06-20 00:39:50,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=95142.66666666667, ans=0.2 2024-06-20 00:39:54,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=1.94 vs. 
limit=15.0 2024-06-20 00:39:56,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=95161.0, ans=0.125 2024-06-20 00:39:58,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=95161.0, ans=0.125 2024-06-20 00:39:59,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=95161.0, ans=0.1 2024-06-20 00:40:00,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=95179.33333333333, ans=0.125 2024-06-20 00:40:02,739 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.915e+02 1.024e+03 1.223e+03 1.462e+03 2.953e+03, threshold=2.447e+03, percent-clipped=1.0 2024-06-20 00:40:10,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=95197.66666666667, ans=0.025 2024-06-20 00:40:12,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.55 vs. limit=10.0 2024-06-20 00:40:13,828 INFO [train.py:1028] (0/2) Epoch 6, batch 1350, loss[loss=0.3187, simple_loss=0.3398, pruned_loss=0.1488, over 13222.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3435, pruned_loss=0.1575, over 2584146.18 frames. ], batch size: 59, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:40:18,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=95216.0, ans=0.125 2024-06-20 00:40:23,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=95234.33333333333, ans=0.025 2024-06-20 00:40:33,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=95252.66666666667, ans=0.125 2024-06-20 00:40:39,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=95271.0, ans=0.125 2024-06-20 00:40:46,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=95289.33333333333, ans=0.125 2024-06-20 00:40:50,142 INFO [train.py:1028] (0/2) Epoch 6, batch 1400, loss[loss=0.3572, simple_loss=0.3674, pruned_loss=0.1735, over 12795.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3435, pruned_loss=0.1575, over 2587374.55 frames. ], batch size: 26, lr: 1.00e-02, grad_scale: 8.0 2024-06-20 00:40:53,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=95307.66666666667, ans=0.1 2024-06-20 00:40:53,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95307.66666666667, ans=0.125 2024-06-20 00:40:58,235 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-52000.pt 2024-06-20 00:41:05,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=95326.0, ans=0.2 2024-06-20 00:41:07,865 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.25 vs. 
limit=10.0 2024-06-20 00:41:10,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95344.33333333333, ans=0.1 2024-06-20 00:41:13,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.36 vs. limit=15.0 2024-06-20 00:41:16,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.646e+02 1.005e+03 1.172e+03 1.351e+03 2.205e+03, threshold=2.344e+03, percent-clipped=0.0 2024-06-20 00:41:17,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.58 vs. limit=15.0 2024-06-20 00:41:26,980 INFO [train.py:1028] (0/2) Epoch 6, batch 1450, loss[loss=0.3094, simple_loss=0.3203, pruned_loss=0.1492, over 13109.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.343, pruned_loss=0.1574, over 2588232.98 frames. ], batch size: 121, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:41:36,956 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.80 vs. limit=15.0 2024-06-20 00:41:38,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=95417.66666666667, ans=0.025 2024-06-20 00:41:39,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=95436.0, ans=0.125 2024-06-20 00:41:44,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.07 vs. limit=15.0 2024-06-20 00:41:49,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95454.33333333333, ans=0.1 2024-06-20 00:41:53,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=95454.33333333333, ans=0.1 2024-06-20 00:41:56,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95472.66666666667, ans=0.125 2024-06-20 00:41:58,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=95472.66666666667, ans=0.0 2024-06-20 00:42:02,048 INFO [train.py:1028] (0/2) Epoch 6, batch 1500, loss[loss=0.3104, simple_loss=0.3222, pruned_loss=0.1492, over 13226.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3427, pruned_loss=0.1574, over 2590713.07 frames. ], batch size: 83, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:42:03,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.29 vs. limit=15.0 2024-06-20 00:42:13,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=95509.33333333333, ans=0.125 2024-06-20 00:42:22,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.98 vs. 
limit=10.0 2024-06-20 00:42:24,243 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.613e+02 1.105e+03 1.268e+03 1.432e+03 2.252e+03, threshold=2.535e+03, percent-clipped=0.0 2024-06-20 00:42:33,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=95564.33333333333, ans=0.0 2024-06-20 00:42:36,786 INFO [train.py:1028] (0/2) Epoch 6, batch 1550, loss[loss=0.3523, simple_loss=0.3542, pruned_loss=0.1752, over 12964.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3433, pruned_loss=0.1582, over 2586337.71 frames. ], batch size: 102, lr: 1.00e-02, grad_scale: 4.0 2024-06-20 00:42:37,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=95582.66666666667, ans=0.125 2024-06-20 00:42:38,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=95582.66666666667, ans=0.0 2024-06-20 00:42:42,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=95582.66666666667, ans=0.125 2024-06-20 00:42:45,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=95601.0, ans=0.0 2024-06-20 00:42:56,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=95637.66666666667, ans=0.125 2024-06-20 00:43:03,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=95656.0, ans=0.07 2024-06-20 00:43:09,523 INFO [train.py:1028] (0/2) Epoch 6, batch 1600, loss[loss=0.3203, simple_loss=0.3419, pruned_loss=0.1494, over 13153.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3433, pruned_loss=0.1577, over 2581210.98 frames. ], batch size: 77, lr: 1.00e-02, grad_scale: 8.0 2024-06-20 00:43:17,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95692.66666666667, ans=0.1 2024-06-20 00:43:23,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=95711.0, ans=0.125 2024-06-20 00:43:34,858 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 9.513e+02 1.099e+03 1.225e+03 2.280e+03, threshold=2.199e+03, percent-clipped=0.0 2024-06-20 00:43:38,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=95747.66666666667, ans=0.125 2024-06-20 00:43:38,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.60 vs. limit=22.5 2024-06-20 00:43:44,659 INFO [train.py:1028] (0/2) Epoch 6, batch 1650, loss[loss=0.3343, simple_loss=0.346, pruned_loss=0.1614, over 13187.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3433, pruned_loss=0.158, over 2577204.84 frames. ], batch size: 95, lr: 9.99e-03, grad_scale: 4.0 2024-06-20 00:44:03,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.16 vs. 
limit=10.0 2024-06-20 00:44:04,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=95821.0, ans=0.0 2024-06-20 00:44:06,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=95821.0, ans=0.025 2024-06-20 00:44:08,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=95821.0, ans=0.125 2024-06-20 00:44:09,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=95821.0, ans=0.025 2024-06-20 00:44:11,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=95839.33333333333, ans=0.2 2024-06-20 00:44:16,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=95839.33333333333, ans=0.5 2024-06-20 00:44:17,628 INFO [train.py:1028] (0/2) Epoch 6, batch 1700, loss[loss=0.2976, simple_loss=0.3262, pruned_loss=0.1345, over 12549.00 frames. ], tot_loss[loss=0.3274, simple_loss=0.3417, pruned_loss=0.1566, over 2581077.50 frames. ], batch size: 25, lr: 9.99e-03, grad_scale: 2.0 2024-06-20 00:44:19,167 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.67 vs. limit=22.5 2024-06-20 00:44:29,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.31 vs. limit=15.0 2024-06-20 00:44:44,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.89 vs. limit=10.0 2024-06-20 00:44:45,093 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.963e+02 9.664e+02 1.209e+03 1.481e+03 7.673e+03, threshold=2.417e+03, percent-clipped=3.0 2024-06-20 00:44:52,405 INFO [train.py:1028] (0/2) Epoch 6, batch 1750, loss[loss=0.304, simple_loss=0.339, pruned_loss=0.1345, over 12564.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.342, pruned_loss=0.1563, over 2581262.23 frames. ], batch size: 22, lr: 9.98e-03, grad_scale: 1.0 2024-06-20 00:44:55,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95949.33333333333, ans=0.1 2024-06-20 00:44:56,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95949.33333333333, ans=0.1 2024-06-20 00:44:59,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=95967.66666666667, ans=0.125 2024-06-20 00:45:12,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.09 vs. limit=15.0 2024-06-20 00:45:14,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=96004.33333333333, ans=0.025 2024-06-20 00:45:16,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.23 vs. 
limit=10.0 2024-06-20 00:45:24,204 INFO [train.py:1028] (0/2) Epoch 6, batch 1800, loss[loss=0.3088, simple_loss=0.3269, pruned_loss=0.1454, over 13201.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3427, pruned_loss=0.157, over 2581506.22 frames. ], batch size: 67, lr: 9.98e-03, grad_scale: 2.0 2024-06-20 00:45:37,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=96059.33333333333, ans=0.025 2024-06-20 00:45:48,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96096.0, ans=0.125 2024-06-20 00:45:51,898 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.788e+02 1.255e+03 1.422e+03 1.636e+03 2.353e+03, threshold=2.844e+03, percent-clipped=0.0 2024-06-20 00:45:56,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.52 vs. limit=22.5 2024-06-20 00:45:58,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.75 vs. limit=22.5 2024-06-20 00:45:58,979 INFO [train.py:1028] (0/2) Epoch 6, batch 1850, loss[loss=0.3298, simple_loss=0.3412, pruned_loss=0.1592, over 13259.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3429, pruned_loss=0.157, over 2583008.08 frames. ], batch size: 83, lr: 9.97e-03, grad_scale: 2.0 2024-06-20 00:46:03,910 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.244e+02 2024-06-20 00:46:05,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=96151.0, ans=0.125 2024-06-20 00:46:32,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.83 vs. limit=15.0 2024-06-20 00:46:34,267 INFO [train.py:1028] (0/2) Epoch 6, batch 1900, loss[loss=0.3109, simple_loss=0.3267, pruned_loss=0.1476, over 13126.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3424, pruned_loss=0.1569, over 2585404.29 frames. ], batch size: 95, lr: 9.97e-03, grad_scale: 4.0 2024-06-20 00:46:41,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=96242.66666666667, ans=0.0 2024-06-20 00:46:43,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=96242.66666666667, ans=0.2 2024-06-20 00:46:43,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=96242.66666666667, ans=0.2 2024-06-20 00:46:56,346 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 00:46:59,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=96279.33333333333, ans=0.95 2024-06-20 00:47:00,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. 
limit=15.0 2024-06-20 00:47:00,481 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.301e+02 1.223e+03 1.406e+03 1.603e+03 3.193e+03, threshold=2.812e+03, percent-clipped=1.0 2024-06-20 00:47:02,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=96297.66666666667, ans=0.125 2024-06-20 00:47:06,608 INFO [train.py:1028] (0/2) Epoch 6, batch 1950, loss[loss=0.322, simple_loss=0.3421, pruned_loss=0.151, over 13202.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.341, pruned_loss=0.1565, over 2591583.81 frames. ], batch size: 52, lr: 9.96e-03, grad_scale: 2.0 2024-06-20 00:47:09,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=96316.0, ans=0.125 2024-06-20 00:47:14,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=15.0 2024-06-20 00:47:19,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=15.0 2024-06-20 00:47:27,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=96352.66666666667, ans=0.0 2024-06-20 00:47:39,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=96389.33333333333, ans=0.125 2024-06-20 00:47:40,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=96389.33333333333, ans=0.0 2024-06-20 00:47:42,546 INFO [train.py:1028] (0/2) Epoch 6, batch 2000, loss[loss=0.3828, simple_loss=0.3951, pruned_loss=0.1853, over 12439.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3422, pruned_loss=0.1576, over 2587562.94 frames. ], batch size: 22, lr: 9.96e-03, grad_scale: 4.0 2024-06-20 00:47:44,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.41 vs. limit=15.0 2024-06-20 00:47:52,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=96426.0, ans=0.2 2024-06-20 00:47:53,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=96426.0, ans=0.1 2024-06-20 00:48:08,794 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.157e+02 1.131e+03 1.258e+03 1.415e+03 2.587e+03, threshold=2.515e+03, percent-clipped=0.0 2024-06-20 00:48:14,261 INFO [train.py:1028] (0/2) Epoch 6, batch 2050, loss[loss=0.2891, simple_loss=0.3108, pruned_loss=0.1337, over 12810.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3417, pruned_loss=0.1573, over 2582776.41 frames. 
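The checkpoint.py:75 line earlier in this excerpt writes zipformer/exp/checkpoint-52000.pt: checkpoints are named by the global batch index and land on round multiples of the save interval, i.e. saving is batch-triggered rather than epoch-triggered. A sketch of that trigger (maybe_save_checkpoint is a hypothetical helper; the real writer also stores optimizer, scheduler and sampler state):

```python
from pathlib import Path
import torch

def maybe_save_checkpoint(model: torch.nn.Module, exp_dir: Path,
                          batch_idx_train: int, save_every_n: int = 4000):
    """Save exp_dir/checkpoint-<batch>.pt every save_every_n global
    batches; returns the path written, or None if this batch was skipped."""
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save({"model": model.state_dict(),
                    "batch_idx_train": batch_idx_train}, path)
        return path
    return None
```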
], batch size: 29, lr: 9.95e-03, grad_scale: 2.0
2024-06-20 00:48:14,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=96499.33333333333, ans=0.2
2024-06-20 00:48:30,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=96536.0, ans=0.0
2024-06-20 00:48:30,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=96536.0, ans=0.125
2024-06-20 00:48:36,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=96554.33333333333, ans=0.0
2024-06-20 00:48:36,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5
2024-06-20 00:48:36,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=96554.33333333333, ans=0.2
2024-06-20 00:48:38,925 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 00:48:39,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=96554.33333333333, ans=0.125
2024-06-20 00:48:48,831 INFO [train.py:1028] (0/2) Epoch 6, batch 2100, loss[loss=0.374, simple_loss=0.3899, pruned_loss=0.1791, over 13194.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3423, pruned_loss=0.1568, over 2584899.12 frames. ], batch size: 59, lr: 9.95e-03, grad_scale: 4.0
2024-06-20 00:48:53,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=15.0
2024-06-20 00:49:02,000 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.072e+02
2024-06-20 00:49:02,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.36 vs. limit=15.0
2024-06-20 00:49:04,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.40 vs. limit=15.0
2024-06-20 00:49:05,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=96627.66666666667, ans=0.05
2024-06-20 00:49:15,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.879e+02 1.045e+03 1.221e+03 1.457e+03 2.375e+03, threshold=2.443e+03, percent-clipped=0.0
2024-06-20 00:49:19,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=96664.33333333333, ans=0.025
2024-06-20 00:49:21,640 INFO [train.py:1028] (0/2) Epoch 6, batch 2150, loss[loss=0.2843, simple_loss=0.321, pruned_loss=0.1238, over 13265.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3419, pruned_loss=0.1559, over 2587607.45 frames. ], batch size: 52, lr: 9.95e-03, grad_scale: 4.0
2024-06-20 00:49:27,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=96682.66666666667, ans=0.05
2024-06-20 00:49:27,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=96701.0, ans=0.2
2024-06-20 00:49:28,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.64 vs. limit=15.0
2024-06-20 00:49:28,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.64 vs. limit=15.0
2024-06-20 00:49:49,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=96756.0, ans=0.125
2024-06-20 00:49:56,971 INFO [train.py:1028] (0/2) Epoch 6, batch 2200, loss[loss=0.3094, simple_loss=0.3274, pruned_loss=0.1457, over 13169.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.342, pruned_loss=0.1562, over 2587780.03 frames. ], batch size: 83, lr: 9.94e-03, grad_scale: 8.0
2024-06-20 00:50:05,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96792.66666666667, ans=0.1
2024-06-20 00:50:11,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.14 vs. limit=15.0
2024-06-20 00:50:12,761 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0
2024-06-20 00:50:13,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=96811.0, ans=0.2
2024-06-20 00:50:23,733 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.685e+02 9.284e+02 1.080e+03 1.260e+03 2.663e+03, threshold=2.161e+03, percent-clipped=1.0
2024-06-20 00:50:31,911 INFO [train.py:1028] (0/2) Epoch 6, batch 2250, loss[loss=0.3508, simple_loss=0.3626, pruned_loss=0.1695, over 13318.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3412, pruned_loss=0.1552, over 2587081.17 frames. ], batch size: 63, lr: 9.94e-03, grad_scale: 4.0
2024-06-20 00:50:32,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=15.0
2024-06-20 00:50:45,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=96902.66666666667, ans=0.125
2024-06-20 00:50:47,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.75 vs. limit=15.0
2024-06-20 00:50:49,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.01 vs. limit=15.0
2024-06-20 00:50:50,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs. limit=6.0
2024-06-20 00:50:50,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96921.0, ans=0.1
2024-06-20 00:50:57,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=96939.33333333333, ans=0.0
2024-06-20 00:51:00,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=96939.33333333333, ans=0.09899494936611666
2024-06-20 00:51:03,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.72 vs. limit=22.5
2024-06-20 00:51:03,895 INFO [train.py:1028] (0/2) Epoch 6, batch 2300, loss[loss=0.2991, simple_loss=0.3192, pruned_loss=0.1395, over 12875.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3411, pruned_loss=0.1547, over 2582646.19 frames. ], batch size: 33, lr: 9.93e-03, grad_scale: 4.0
2024-06-20 00:51:10,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.17 vs. limit=15.0
2024-06-20 00:51:28,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=97012.66666666667, ans=0.0
2024-06-20 00:51:31,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=97012.66666666667, ans=0.0
2024-06-20 00:51:34,526 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.140e+02 9.588e+02 1.125e+03 1.295e+03 2.878e+03, threshold=2.249e+03, percent-clipped=1.0
2024-06-20 00:51:38,708 INFO [train.py:1028] (0/2) Epoch 6, batch 2350, loss[loss=0.3174, simple_loss=0.3363, pruned_loss=0.1492, over 13246.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.3411, pruned_loss=0.1552, over 2585916.98 frames. ], batch size: 67, lr: 9.93e-03, grad_scale: 4.0
2024-06-20 00:51:41,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97049.33333333333, ans=0.125
2024-06-20 00:51:48,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.60 vs. limit=15.0
2024-06-20 00:52:02,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=97104.33333333333, ans=0.125
2024-06-20 00:52:03,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=97104.33333333333, ans=0.0
2024-06-20 00:52:09,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=97122.66666666667, ans=22.5
2024-06-20 00:52:10,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=97141.0, ans=0.025
2024-06-20 00:52:10,957 INFO [train.py:1028] (0/2) Epoch 6, batch 2400, loss[loss=0.3224, simple_loss=0.3423, pruned_loss=0.1513, over 13318.00 frames. ], tot_loss[loss=0.326, simple_loss=0.3407, pruned_loss=0.1556, over 2588651.66 frames. ], batch size: 46, lr: 9.92e-03, grad_scale: 2.0
2024-06-20 00:52:11,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=97141.0, ans=0.0
2024-06-20 00:52:34,162 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.94 vs. limit=15.0
2024-06-20 00:52:41,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=97214.33333333333, ans=0.125
2024-06-20 00:52:43,044 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.036e+02 1.197e+03 1.416e+03 1.654e+03 3.018e+03, threshold=2.832e+03, percent-clipped=7.0
2024-06-20 00:52:43,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=97214.33333333333, ans=0.125
2024-06-20 00:52:44,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.12 vs. limit=15.0
2024-06-20 00:52:46,006 INFO [train.py:1028] (0/2) Epoch 6, batch 2450, loss[loss=0.3319, simple_loss=0.3477, pruned_loss=0.158, over 13237.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.3398, pruned_loss=0.1558, over 2585466.68 frames. ], batch size: 63, lr: 9.92e-03, grad_scale: 2.0
2024-06-20 00:52:53,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=97251.0, ans=0.125
2024-06-20 00:53:11,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=97306.0, ans=0.125
2024-06-20 00:53:16,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=97306.0, ans=0.2
2024-06-20 00:53:18,077 INFO [train.py:1028] (0/2) Epoch 6, batch 2500, loss[loss=0.3191, simple_loss=0.3311, pruned_loss=0.1536, over 13183.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3378, pruned_loss=0.1544, over 2587110.25 frames. ], batch size: 83, lr: 9.91e-03, grad_scale: 2.0
2024-06-20 00:53:19,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=15.0
2024-06-20 00:53:24,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=97324.33333333333, ans=0.0
2024-06-20 00:53:33,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.68 vs. limit=10.0
2024-06-20 00:53:51,061 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.813e+02 1.158e+03 1.296e+03 1.557e+03 3.109e+03, threshold=2.592e+03, percent-clipped=1.0
2024-06-20 00:53:53,724 INFO [train.py:1028] (0/2) Epoch 6, batch 2550, loss[loss=0.3269, simple_loss=0.3494, pruned_loss=0.1522, over 12783.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.337, pruned_loss=0.1542, over 2587371.89 frames. ], batch size: 22, lr: 9.91e-03, grad_scale: 2.0
2024-06-20 00:53:53,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=97416.0, ans=0.2
2024-06-20 00:54:00,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.38 vs. limit=22.5
2024-06-20 00:54:01,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.02 vs. limit=22.5
2024-06-20 00:54:07,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=97452.66666666667, ans=0.125
2024-06-20 00:54:09,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97452.66666666667, ans=0.0
2024-06-20 00:54:17,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=97471.0, ans=0.0
2024-06-20 00:54:21,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=97471.0, ans=0.0
2024-06-20 00:54:26,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=97489.33333333333, ans=0.125
2024-06-20 00:54:28,699 INFO [train.py:1028] (0/2) Epoch 6, batch 2600, loss[loss=0.3106, simple_loss=0.3307, pruned_loss=0.1452, over 13270.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3365, pruned_loss=0.1543, over 2587665.41 frames. ], batch size: 52, lr: 9.90e-03, grad_scale: 4.0
2024-06-20 00:54:30,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=97507.66666666667, ans=0.025
2024-06-20 00:54:35,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=97526.0, ans=0.025
2024-06-20 00:54:37,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=97526.0, ans=0.125
2024-06-20 00:54:55,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=97581.0, ans=0.09899494936611666
2024-06-20 00:54:58,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.63 vs. limit=15.0
2024-06-20 00:54:58,771 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.595e+02 9.546e+02 1.128e+03 1.358e+03 2.681e+03, threshold=2.255e+03, percent-clipped=1.0
2024-06-20 00:55:01,354 INFO [train.py:1028] (0/2) Epoch 6, batch 2650, loss[loss=0.3141, simple_loss=0.323, pruned_loss=0.1526, over 13002.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3351, pruned_loss=0.1538, over 2588236.42 frames. ], batch size: 144, lr: 9.90e-03, grad_scale: 4.0
2024-06-20 00:55:09,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=15.0
2024-06-20 00:55:17,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=97636.0, ans=0.0
2024-06-20 00:55:27,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.53 vs. limit=15.0
2024-06-20 00:55:28,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=97654.33333333333, ans=0.0
2024-06-20 00:55:31,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.90 vs. limit=15.0
2024-06-20 00:55:35,608 INFO [train.py:1028] (0/2) Epoch 6, batch 2700, loss[loss=0.3153, simple_loss=0.3245, pruned_loss=0.1531, over 13260.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3333, pruned_loss=0.1532, over 2585637.79 frames. ], batch size: 89, lr: 9.89e-03, grad_scale: 4.0
2024-06-20 00:55:51,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=97727.66666666667, ans=0.0
2024-06-20 00:55:54,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=97727.66666666667, ans=0.2
2024-06-20 00:55:55,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=97746.0, ans=0.125
2024-06-20 00:55:57,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=97746.0, ans=0.1
2024-06-20 00:55:58,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0
2024-06-20 00:56:09,169 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.884e+02 1.128e+03 1.362e+03 1.723e+03 2.860e+03, threshold=2.723e+03, percent-clipped=7.0
2024-06-20 00:56:11,272 INFO [train.py:1028] (0/2) Epoch 6, batch 2750, loss[loss=0.2993, simple_loss=0.3216, pruned_loss=0.1385, over 13241.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3312, pruned_loss=0.1513, over 2583135.35 frames. ], batch size: 43, lr: 9.89e-03, grad_scale: 4.0
2024-06-20 00:56:14,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=97782.66666666667, ans=0.0
2024-06-20 00:56:23,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=97801.0, ans=0.0
2024-06-20 00:56:39,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=97856.0, ans=0.125
2024-06-20 00:56:44,700 INFO [train.py:1028] (0/2) Epoch 6, batch 2800, loss[loss=0.3427, simple_loss=0.3383, pruned_loss=0.1735, over 10956.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3308, pruned_loss=0.1514, over 2581449.92 frames. ], batch size: 304, lr: 9.89e-03, grad_scale: 8.0
2024-06-20 00:56:51,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=97892.66666666667, ans=0.125
2024-06-20 00:56:53,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=97892.66666666667, ans=0.125
2024-06-20 00:56:54,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=97892.66666666667, ans=0.125
2024-06-20 00:56:58,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=15.0
2024-06-20 00:57:18,842 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.360e+02 1.298e+03 1.508e+03 1.762e+03 2.773e+03, threshold=3.017e+03, percent-clipped=1.0
2024-06-20 00:57:19,353 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=12.0
2024-06-20 00:57:19,592 INFO [train.py:1028] (0/2) Epoch 6, batch 2850, loss[loss=0.3274, simple_loss=0.3392, pruned_loss=0.1578, over 13363.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3312, pruned_loss=0.1523, over 2578500.28 frames. ], batch size: 49, lr: 9.88e-03, grad_scale: 2.0
2024-06-20 00:57:29,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=97984.33333333333, ans=0.125
2024-06-20 00:57:32,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=98002.66666666667, ans=0.125
2024-06-20 00:57:47,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=98039.33333333333, ans=0.025
2024-06-20 00:57:50,712 INFO [train.py:1028] (0/2) Epoch 6, batch 2900, loss[loss=0.3051, simple_loss=0.3234, pruned_loss=0.1434, over 13174.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.329, pruned_loss=0.1512, over 2587432.21 frames. ], batch size: 55, lr: 9.88e-03, grad_scale: 4.0
2024-06-20 00:57:52,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=98057.66666666667, ans=0.125
2024-06-20 00:58:04,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=98076.0, ans=0.0
2024-06-20 00:58:08,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.74 vs. limit=10.0
2024-06-20 00:58:11,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=98094.33333333333, ans=0.025
2024-06-20 00:58:14,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=98112.66666666667, ans=0.1
2024-06-20 00:58:14,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=98112.66666666667, ans=10.0
2024-06-20 00:58:24,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=98131.0, ans=0.125
2024-06-20 00:58:24,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.55 vs. limit=15.0
2024-06-20 00:58:25,912 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 9.233e+02 1.132e+03 1.336e+03 1.963e+03, threshold=2.264e+03, percent-clipped=0.0
2024-06-20 00:58:26,759 INFO [train.py:1028] (0/2) Epoch 6, batch 2950, loss[loss=0.2841, simple_loss=0.3075, pruned_loss=0.1304, over 13240.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3281, pruned_loss=0.1502, over 2582185.39 frames. ], batch size: 43, lr: 9.87e-03, grad_scale: 4.0
2024-06-20 00:58:35,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=98167.66666666667, ans=0.2
2024-06-20 00:58:57,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=98222.66666666667, ans=0.0
2024-06-20 00:59:03,014 INFO [train.py:1028] (0/2) Epoch 6, batch 3000, loss[loss=0.2955, simple_loss=0.3181, pruned_loss=0.1365, over 13175.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3259, pruned_loss=0.1484, over 2580411.06 frames. ], batch size: 59, lr: 9.87e-03, grad_scale: 8.0
2024-06-20 00:59:03,015 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 00:59:09,620 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.6172, 2.1888, 1.8115, 1.6195], device='cuda:0')
2024-06-20 00:59:10,982 INFO [train.py:1060] (0/2) Epoch 6, validation: loss=0.232, simple_loss=0.2887, pruned_loss=0.08766, over 351949.00 frames.
2024-06-20 00:59:10,982 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB
2024-06-20 00:59:31,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=98296.0, ans=0.2
2024-06-20 00:59:33,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=98296.0, ans=0.125
2024-06-20 00:59:42,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=98314.33333333333, ans=0.125
2024-06-20 00:59:44,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.452e+02 7.112e+02 8.266e+02 9.922e+02 2.231e+03, threshold=1.653e+03, percent-clipped=0.0
2024-06-20 00:59:44,117 INFO [train.py:1028] (0/2) Epoch 6, batch 3050, loss[loss=0.2901, simple_loss=0.3083, pruned_loss=0.136, over 13287.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.325, pruned_loss=0.1487, over 2578973.77 frames. ], batch size: 46, lr: 9.86e-03, grad_scale: 4.0
2024-06-20 00:59:47,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=98332.66666666667, ans=0.125
2024-06-20 00:59:50,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=98351.0, ans=0.125
2024-06-20 00:59:53,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.14 vs. limit=22.5
2024-06-20 01:00:06,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=98387.66666666667, ans=0.0
2024-06-20 01:00:09,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=98387.66666666667, ans=0.1
2024-06-20 01:00:15,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=98406.0, ans=0.125
2024-06-20 01:00:20,690 INFO [train.py:1028] (0/2) Epoch 6, batch 3100, loss[loss=0.3038, simple_loss=0.3175, pruned_loss=0.145, over 13007.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3228, pruned_loss=0.147, over 2580307.56 frames. ], batch size: 144, lr: 9.86e-03, grad_scale: 8.0
2024-06-20 01:00:23,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=98424.33333333333, ans=0.0
2024-06-20 01:00:29,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=98442.66666666667, ans=0.125
2024-06-20 01:00:29,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.38 vs. limit=15.0
2024-06-20 01:00:29,672 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.90 vs. limit=15.0
2024-06-20 01:00:36,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=98461.0, ans=0.0
2024-06-20 01:00:39,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98461.0, ans=0.1
2024-06-20 01:00:42,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=98479.33333333333, ans=0.125
2024-06-20 01:00:42,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.83 vs. limit=15.0
2024-06-20 01:00:54,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=98497.66666666667, ans=0.0
2024-06-20 01:00:57,352 INFO [train.py:1028] (0/2) Epoch 6, batch 3150, loss[loss=0.3341, simple_loss=0.3382, pruned_loss=0.165, over 12891.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3212, pruned_loss=0.1457, over 2582221.55 frames. ], batch size: 158, lr: 9.85e-03, grad_scale: 2.0
2024-06-20 01:00:58,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.873e+02 8.652e+02 1.048e+03 1.238e+03 2.093e+03, threshold=2.096e+03, percent-clipped=1.0
2024-06-20 01:00:58,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=98516.0, ans=0.0
2024-06-20 01:01:01,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=15.0
2024-06-20 01:01:02,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=98516.0, ans=0.0
2024-06-20 01:01:11,491 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:01:17,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=98571.0, ans=0.125
2024-06-20 01:01:23,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=98589.33333333333, ans=0.0
2024-06-20 01:01:28,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=98589.33333333333, ans=0.125
2024-06-20 01:01:30,689 INFO [train.py:1028] (0/2) Epoch 6, batch 3200, loss[loss=0.2979, simple_loss=0.3206, pruned_loss=0.1376, over 13093.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3209, pruned_loss=0.1455, over 2582617.06 frames. ], batch size: 55, lr: 9.85e-03, grad_scale: 2.0
2024-06-20 01:01:48,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=98644.33333333333, ans=0.125
2024-06-20 01:01:50,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.48 vs. limit=15.0
2024-06-20 01:02:06,576 INFO [train.py:1028] (0/2) Epoch 6, batch 3250, loss[loss=0.2716, simple_loss=0.2971, pruned_loss=0.123, over 13252.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3204, pruned_loss=0.1455, over 2586784.16 frames. ], batch size: 72, lr: 9.85e-03, grad_scale: 2.0
2024-06-20 01:02:08,521 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.328e+02 1.125e+03 1.346e+03 1.559e+03 2.738e+03, threshold=2.692e+03, percent-clipped=3.0
2024-06-20 01:02:19,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=98717.66666666667, ans=0.125
2024-06-20 01:02:22,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=98736.0, ans=0.0
2024-06-20 01:02:41,023 INFO [train.py:1028] (0/2) Epoch 6, batch 3300, loss[loss=0.3319, simple_loss=0.3361, pruned_loss=0.1639, over 12759.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3198, pruned_loss=0.1453, over 2583774.97 frames. ], batch size: 176, lr: 9.84e-03, grad_scale: 2.0
2024-06-20 01:02:47,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.16 vs. limit=22.5
2024-06-20 01:02:52,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=98809.33333333333, ans=0.125
2024-06-20 01:03:03,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=98846.0, ans=0.0
2024-06-20 01:03:05,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.55 vs. limit=22.5
2024-06-20 01:03:17,085 INFO [train.py:1028] (0/2) Epoch 6, batch 3350, loss[loss=0.3141, simple_loss=0.3205, pruned_loss=0.1538, over 12926.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3196, pruned_loss=0.1461, over 2578011.21 frames. ], batch size: 158, lr: 9.84e-03, grad_scale: 2.0
2024-06-20 01:03:19,709 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.617e+02 9.980e+02 1.171e+03 1.408e+03 2.435e+03, threshold=2.343e+03, percent-clipped=0.0
2024-06-20 01:03:23,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=98901.0, ans=0.125
2024-06-20 01:03:27,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98901.0, ans=0.1
2024-06-20 01:03:30,004 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=3.774e+02
2024-06-20 01:03:30,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=12.0
2024-06-20 01:03:32,450 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.52 vs. limit=15.0
2024-06-20 01:03:33,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=15.0
2024-06-20 01:03:37,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.19 vs. limit=15.0
2024-06-20 01:03:42,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98937.66666666667, ans=0.1
2024-06-20 01:03:42,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98937.66666666667, ans=0.125
2024-06-20 01:03:44,563 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:03:51,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=98974.33333333333, ans=0.025
2024-06-20 01:03:51,960 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:03:52,405 INFO [train.py:1028] (0/2) Epoch 6, batch 3400, loss[loss=0.3149, simple_loss=0.3248, pruned_loss=0.1525, over 12491.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3195, pruned_loss=0.1465, over 2576433.08 frames. ], batch size: 22, lr: 9.83e-03, grad_scale: 4.0
2024-06-20 01:03:52,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5
2024-06-20 01:03:56,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=98974.33333333333, ans=0.0
2024-06-20 01:03:57,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=98974.33333333333, ans=0.07
2024-06-20 01:03:59,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=98992.66666666667, ans=0.0
2024-06-20 01:04:25,317 INFO [train.py:1028] (0/2) Epoch 6, batch 3450, loss[loss=0.3393, simple_loss=0.3423, pruned_loss=0.1681, over 12763.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3186, pruned_loss=0.1454, over 2578742.17 frames. ], batch size: 177, lr: 9.83e-03, grad_scale: 4.0
2024-06-20 01:04:26,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.52 vs. limit=22.5
2024-06-20 01:04:27,889 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.766e+02 1.070e+03 1.267e+03 1.523e+03 2.267e+03, threshold=2.535e+03, percent-clipped=0.0
2024-06-20 01:04:33,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=99084.33333333333, ans=0.0
2024-06-20 01:04:49,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.26 vs. limit=15.0
2024-06-20 01:04:54,627 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.653e+00
2024-06-20 01:04:55,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=99139.33333333333, ans=0.04949747468305833
2024-06-20 01:05:01,034 INFO [train.py:1028] (0/2) Epoch 6, batch 3500, loss[loss=0.3108, simple_loss=0.3239, pruned_loss=0.1489, over 12962.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.318, pruned_loss=0.1451, over 2576598.57 frames. ], batch size: 33, lr: 9.82e-03, grad_scale: 8.0
2024-06-20 01:05:09,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99176.0, ans=0.125
2024-06-20 01:05:20,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.27 vs. limit=15.0
2024-06-20 01:05:23,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99212.66666666667, ans=0.125
2024-06-20 01:05:24,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.80 vs. limit=15.0
2024-06-20 01:05:25,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=99212.66666666667, ans=0.0
2024-06-20 01:05:29,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=12.0
2024-06-20 01:05:32,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99231.0, ans=0.1
2024-06-20 01:05:33,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99249.33333333333, ans=0.125
2024-06-20 01:05:34,117 INFO [train.py:1028] (0/2) Epoch 6, batch 3550, loss[loss=0.2592, simple_loss=0.2815, pruned_loss=0.1184, over 13117.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3166, pruned_loss=0.1438, over 2577382.95 frames. ], batch size: 95, lr: 9.82e-03, grad_scale: 8.0
2024-06-20 01:05:36,628 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.414e+02 7.417e+02 8.185e+02 9.702e+02 1.522e+03, threshold=1.637e+03, percent-clipped=0.0
2024-06-20 01:05:37,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=99249.33333333333, ans=0.0
2024-06-20 01:05:38,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=99249.33333333333, ans=0.2
2024-06-20 01:05:44,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=99267.66666666667, ans=0.125
2024-06-20 01:05:46,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=99286.0, ans=0.125
2024-06-20 01:06:02,544 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.43 vs. limit=22.5
2024-06-20 01:06:09,893 INFO [train.py:1028] (0/2) Epoch 6, batch 3600, loss[loss=0.2815, simple_loss=0.2983, pruned_loss=0.1324, over 13031.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3155, pruned_loss=0.1431, over 2579896.44 frames. ], batch size: 48, lr: 9.81e-03, grad_scale: 8.0
2024-06-20 01:06:12,115 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.706e-01
2024-06-20 01:06:16,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=15.0
2024-06-20 01:06:33,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.40 vs. limit=10.0
2024-06-20 01:06:34,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99396.0, ans=0.125
2024-06-20 01:06:34,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99396.0, ans=0.125
2024-06-20 01:06:38,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99414.33333333333, ans=0.125
2024-06-20 01:06:42,940 INFO [train.py:1028] (0/2) Epoch 6, batch 3650, loss[loss=0.2696, simple_loss=0.2894, pruned_loss=0.1249, over 13033.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3149, pruned_loss=0.1425, over 2578792.79 frames. ], batch size: 102, lr: 9.81e-03, grad_scale: 8.0
2024-06-20 01:06:49,027 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.186e+02 7.368e+02 8.742e+02 1.006e+03 1.542e+03, threshold=1.748e+03, percent-clipped=0.0
2024-06-20 01:06:49,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=99432.66666666667, ans=0.125
2024-06-20 01:07:10,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.75 vs. limit=15.0
2024-06-20 01:07:13,412 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:07:15,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99506.0, ans=0.125
2024-06-20 01:07:18,211 INFO [train.py:1028] (0/2) Epoch 6, batch 3700, loss[loss=0.2767, simple_loss=0.2972, pruned_loss=0.1281, over 13246.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3134, pruned_loss=0.1411, over 2584137.37 frames. ], batch size: 72, lr: 9.81e-03, grad_scale: 8.0
2024-06-20 01:07:24,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=99542.66666666667, ans=0.07
2024-06-20 01:07:48,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.09 vs. limit=15.0
2024-06-20 01:07:50,426 INFO [train.py:1028] (0/2) Epoch 6, batch 3750, loss[loss=0.3074, simple_loss=0.3258, pruned_loss=0.1445, over 12479.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3136, pruned_loss=0.1411, over 2585456.80 frames. ], batch size: 22, lr: 9.80e-03, grad_scale: 4.0
2024-06-20 01:07:55,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=99616.0, ans=0.0
2024-06-20 01:07:57,442 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.954e+02 6.943e+02 8.097e+02 9.691e+02 1.370e+03, threshold=1.619e+03, percent-clipped=0.0
2024-06-20 01:08:00,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=99634.33333333333, ans=0.125
2024-06-20 01:08:02,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=99634.33333333333, ans=0.0
2024-06-20 01:08:08,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.59 vs. limit=15.0
2024-06-20 01:08:10,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=99652.66666666667, ans=0.125
2024-06-20 01:08:10,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0
2024-06-20 01:08:20,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=99689.33333333333, ans=0.0
2024-06-20 01:08:26,784 INFO [train.py:1028] (0/2) Epoch 6, batch 3800, loss[loss=0.2988, simple_loss=0.3175, pruned_loss=0.14, over 13197.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3134, pruned_loss=0.1407, over 2584419.63 frames. ], batch size: 83, lr: 9.80e-03, grad_scale: 8.0
2024-06-20 01:08:37,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99726.0, ans=0.125
2024-06-20 01:08:40,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=99744.33333333333, ans=0.035
2024-06-20 01:08:41,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.60 vs. limit=15.0
2024-06-20 01:08:43,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99744.33333333333, ans=0.1
2024-06-20 01:08:48,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0
2024-06-20 01:09:02,372 INFO [train.py:1028] (0/2) Epoch 6, batch 3850, loss[loss=0.3377, simple_loss=0.3344, pruned_loss=0.1705, over 13032.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3128, pruned_loss=0.1402, over 2583554.58 frames. ], batch size: 144, lr: 9.79e-03, grad_scale: 8.0
2024-06-20 01:09:06,247 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.717e+02 7.554e+02 8.331e+02 9.832e+02 1.639e+03, threshold=1.666e+03, percent-clipped=2.0
2024-06-20 01:09:06,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.50 vs. limit=15.0
2024-06-20 01:09:19,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.15 vs. limit=15.0
2024-06-20 01:09:22,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=99854.33333333333, ans=0.125
2024-06-20 01:09:25,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99854.33333333333, ans=0.1
2024-06-20 01:09:31,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=99872.66666666667, ans=0.125
2024-06-20 01:09:32,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=99872.66666666667, ans=0.025
2024-06-20 01:09:33,140 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.22 vs. limit=15.0
2024-06-20 01:09:34,674 INFO [train.py:1028] (0/2) Epoch 6, batch 3900, loss[loss=0.2747, simple_loss=0.2963, pruned_loss=0.1265, over 13189.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3127, pruned_loss=0.1405, over 2586708.53 frames. ], batch size: 83, lr: 9.79e-03, grad_scale: 8.0
2024-06-20 01:09:35,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.34 vs. limit=10.0
2024-06-20 01:09:43,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99909.33333333333, ans=0.1
2024-06-20 01:10:09,905 INFO [train.py:1028] (0/2) Epoch 6, batch 3950, loss[loss=0.2941, simple_loss=0.3008, pruned_loss=0.1437, over 13155.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3113, pruned_loss=0.1392, over 2588246.16 frames. ], batch size: 132, lr: 9.78e-03, grad_scale: 8.0
2024-06-20 01:10:13,610 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.561e+02 8.249e+02 9.979e+02 1.174e+03 2.298e+03, threshold=1.996e+03, percent-clipped=2.0
2024-06-20 01:10:16,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100001.0, ans=0.1
2024-06-20 01:10:21,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.81 vs. limit=15.0
2024-06-20 01:10:24,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=100019.33333333333, ans=0.04949747468305833
2024-06-20 01:10:24,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=100019.33333333333, ans=0.0
2024-06-20 01:10:25,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.94 vs. limit=15.0
2024-06-20 01:10:30,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=100037.66666666667, ans=0.0
2024-06-20 01:10:36,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=100056.0, ans=0.125
2024-06-20 01:10:41,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=100056.0, ans=0.2
2024-06-20 01:10:42,433 INFO [train.py:1028] (0/2) Epoch 6, batch 4000, loss[loss=0.3199, simple_loss=0.3412, pruned_loss=0.1493, over 12938.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3114, pruned_loss=0.1399, over 2583116.75 frames. ], batch size: 39, lr: 9.78e-03, grad_scale: 4.0
2024-06-20 01:10:59,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.21 vs. limit=22.5
2024-06-20 01:11:09,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=100129.33333333333, ans=0.025
2024-06-20 01:11:18,313 INFO [train.py:1028] (0/2) Epoch 6, batch 4050, loss[loss=0.3495, simple_loss=0.3412, pruned_loss=0.1789, over 11081.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.312, pruned_loss=0.1405, over 2581157.78 frames. ], batch size: 304, lr: 9.77e-03, grad_scale: 4.0
2024-06-20 01:11:23,350 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.151e+02 7.981e+02 9.352e+02 1.132e+03 2.256e+03, threshold=1.870e+03, percent-clipped=1.0
2024-06-20 01:11:29,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=100184.33333333333, ans=0.0
2024-06-20 01:11:29,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=100184.33333333333, ans=0.0
2024-06-20 01:11:37,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=100221.0, ans=0.0
2024-06-20 01:11:42,335 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0
2024-06-20 01:11:48,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0
2024-06-20 01:11:51,281 INFO [train.py:1028] (0/2) Epoch 6, batch 4100, loss[loss=0.2797, simple_loss=0.2968, pruned_loss=0.1313, over 13110.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3121, pruned_loss=0.1407, over 2577276.12 frames. ], batch size: 103, lr: 9.77e-03, grad_scale: 8.0
2024-06-20 01:11:51,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=100257.66666666667, ans=0.125
2024-06-20 01:11:54,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=100257.66666666667, ans=0.95
2024-06-20 01:11:54,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100257.66666666667, ans=0.1
2024-06-20 01:12:08,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100294.33333333333, ans=0.1
2024-06-20 01:12:12,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=100294.33333333333, ans=0.2
2024-06-20 01:12:28,157 INFO [train.py:1028] (0/2) Epoch 6, batch 4150, loss[loss=0.283, simple_loss=0.3035, pruned_loss=0.1312, over 13118.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3112, pruned_loss=0.14, over 2576184.27 frames. ], batch size: 55, lr: 9.77e-03, grad_scale: 8.0
2024-06-20 01:12:28,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=100349.33333333333, ans=0.0
2024-06-20 01:12:33,409 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 6.239e+02 7.485e+02 8.636e+02 1.277e+03, threshold=1.497e+03, percent-clipped=0.0
2024-06-20 01:12:43,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.94 vs. limit=15.0
2024-06-20 01:12:44,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=100386.0, ans=0.125
2024-06-20 01:12:55,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=100404.33333333333, ans=0.0
2024-06-20 01:12:55,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=100404.33333333333, ans=0.0
2024-06-20 01:13:01,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=100422.66666666667, ans=0.0
2024-06-20 01:13:04,531 INFO [train.py:1028] (0/2) Epoch 6, batch 4200, loss[loss=0.2844, simple_loss=0.3007, pruned_loss=0.1341, over 13182.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.31, pruned_loss=0.1392, over 2579147.94 frames. ], batch size: 103, lr: 9.76e-03, grad_scale: 8.0
2024-06-20 01:13:06,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=100441.0, ans=0.05
2024-06-20 01:13:09,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=100441.0, ans=0.025
2024-06-20 01:13:11,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.91 vs. limit=10.0
2024-06-20 01:13:11,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=100459.33333333333, ans=0.07
2024-06-20 01:13:29,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=100496.0, ans=0.025
2024-06-20 01:13:30,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=100514.33333333333, ans=0.2
2024-06-20 01:13:37,938 INFO [train.py:1028] (0/2) Epoch 6, batch 4250, loss[loss=0.2981, simple_loss=0.3198, pruned_loss=0.1382, over 13313.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3092, pruned_loss=0.1388, over 2581922.86 frames. ], batch size: 46, lr: 9.76e-03, grad_scale: 4.0
2024-06-20 01:13:43,830 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.087e+02 7.802e+02 8.816e+02 9.960e+02 1.586e+03, threshold=1.763e+03, percent-clipped=1.0
2024-06-20 01:13:44,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=100551.0, ans=0.2
2024-06-20 01:13:45,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.30 vs. limit=15.0
2024-06-20 01:13:46,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=100551.0, ans=0.0
2024-06-20 01:13:48,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=100551.0, ans=0.0
2024-06-20 01:13:49,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=100551.0, ans=0.0
2024-06-20 01:13:55,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.76 vs. limit=15.0
2024-06-20 01:13:56,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=100569.33333333333, ans=0.0
2024-06-20 01:14:07,765 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 01:14:13,782 INFO [train.py:1028] (0/2) Epoch 6, batch 4300, loss[loss=0.3415, simple_loss=0.3591, pruned_loss=0.162, over 13162.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3094, pruned_loss=0.1391, over 2582506.13 frames. ], batch size: 59, lr: 9.75e-03, grad_scale: 8.0
2024-06-20 01:14:19,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.30 vs. limit=15.0
2024-06-20 01:14:21,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=100642.66666666667, ans=0.125
2024-06-20 01:14:30,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0
2024-06-20 01:14:35,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=100679.33333333333, ans=0.0
2024-06-20 01:14:46,250 INFO [train.py:1028] (0/2) Epoch 6, batch 4350, loss[loss=0.3209, simple_loss=0.3341, pruned_loss=0.1539, over 13213.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3085, pruned_loss=0.1385, over 2586487.73 frames. ], batch size: 59, lr: 9.75e-03, grad_scale: 2.0
2024-06-20 01:14:56,908 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.685e+02 9.767e+02 1.180e+03 1.401e+03 3.558e+03, threshold=2.361e+03, percent-clipped=7.0
2024-06-20 01:15:08,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.87 vs. limit=15.0
2024-06-20 01:15:15,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=100789.33333333333, ans=0.2
2024-06-20 01:15:22,788 INFO [train.py:1028] (0/2) Epoch 6, batch 4400, loss[loss=0.304, simple_loss=0.3126, pruned_loss=0.1477, over 13237.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3083, pruned_loss=0.1384, over 2586631.40 frames. ], batch size: 83, lr: 9.74e-03, grad_scale: 4.0
2024-06-20 01:15:23,269 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.38 vs. limit=15.0
2024-06-20 01:15:33,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=100826.0, ans=0.125
2024-06-20 01:15:51,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100881.0, ans=0.1
2024-06-20 01:15:55,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=100899.33333333333, ans=0.125
2024-06-20 01:15:55,797 INFO [train.py:1028] (0/2) Epoch 6, batch 4450, loss[loss=0.2504, simple_loss=0.287, pruned_loss=0.1069, over 12824.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3094, pruned_loss=0.1395, over 2581500.02 frames. ], batch size: 33, lr: 9.74e-03, grad_scale: 4.0
2024-06-20 01:16:01,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=100899.33333333333, ans=0.05
2024-06-20 01:16:03,120 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.177e+02 8.814e+02 1.069e+03 1.192e+03 3.153e+03, threshold=2.139e+03, percent-clipped=1.0
2024-06-20 01:16:07,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=100917.66666666667, ans=0.125
2024-06-20 01:16:13,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=100936.0, ans=0.0
2024-06-20 01:16:22,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=100954.33333333333, ans=0.125
2024-06-20 01:16:26,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.59 vs. limit=10.0
2024-06-20 01:16:31,836 INFO [train.py:1028] (0/2) Epoch 6, batch 4500, loss[loss=0.2692, simple_loss=0.2919, pruned_loss=0.1232, over 13277.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3088, pruned_loss=0.1393, over 2586223.70 frames. ], batch size: 89, lr: 9.73e-03, grad_scale: 8.0
2024-06-20 01:16:32,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=100991.0, ans=0.025
2024-06-20 01:16:38,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=101009.33333333333, ans=0.125
2024-06-20 01:16:59,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.87 vs. limit=15.0
2024-06-20 01:17:00,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=101064.33333333333, ans=0.125
2024-06-20 01:17:07,950 INFO [train.py:1028] (0/2) Epoch 6, batch 4550, loss[loss=0.2646, simple_loss=0.2884, pruned_loss=0.1204, over 13260.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3084, pruned_loss=0.139, over 2589608.22 frames. ], batch size: 52, lr: 9.73e-03, grad_scale: 8.0
2024-06-20 01:17:08,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0
2024-06-20 01:17:14,802 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.372e+02 1.112e+03 1.273e+03 1.464e+03 2.363e+03, threshold=2.546e+03, percent-clipped=1.0
2024-06-20 01:17:16,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=101101.0, ans=0.2
2024-06-20 01:17:16,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.27 vs. limit=6.0
2024-06-20 01:17:18,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.09 vs. limit=6.0
2024-06-20 01:17:26,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=101137.66666666667, ans=0.125
2024-06-20 01:17:39,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.74 vs. limit=15.0
2024-06-20 01:17:40,155 INFO [train.py:1028] (0/2) Epoch 6, batch 4600, loss[loss=0.3093, simple_loss=0.3154, pruned_loss=0.1516, over 12545.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3081, pruned_loss=0.1386, over 2584521.29 frames. ], batch size: 202, lr: 9.73e-03, grad_scale: 8.0
2024-06-20 01:17:40,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=101174.33333333333, ans=0.025
2024-06-20 01:17:42,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=101174.33333333333, ans=0.125
2024-06-20 01:17:43,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=101174.33333333333, ans=0.125
2024-06-20 01:17:50,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=101192.66666666667, ans=0.2
2024-06-20 01:17:56,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101211.0, ans=0.125
2024-06-20 01:17:58,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101211.0, ans=0.1
2024-06-20 01:18:01,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=101229.33333333333, ans=0.125
2024-06-20 01:18:10,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=101247.66666666667, ans=0.0
2024-06-20 01:18:12,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=101247.66666666667, ans=0.0
2024-06-20 01:18:16,541 INFO [train.py:1028] (0/2) Epoch 6, batch 4650, loss[loss=0.3024, simple_loss=0.3139, pruned_loss=0.1454, over 13070.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3071, pruned_loss=0.1379, over 2587162.22 frames. ], batch size: 132, lr: 9.72e-03, grad_scale: 2.0
2024-06-20 01:18:18,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=101266.0, ans=10.0
2024-06-20 01:18:23,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=101284.33333333333, ans=0.0
2024-06-20 01:18:25,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=101284.33333333333, ans=0.0
2024-06-20 01:18:25,547 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.841e+02 9.136e+02 1.077e+03 1.233e+03 1.735e+03, threshold=2.153e+03, percent-clipped=0.0
2024-06-20 01:18:28,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=101284.33333333333, ans=0.05
2024-06-20 01:18:31,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.12 vs. limit=22.5
2024-06-20 01:18:34,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=101302.66666666667, ans=0.125
2024-06-20 01:18:37,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101321.0, ans=0.1
2024-06-20 01:18:53,005 INFO [train.py:1028] (0/2) Epoch 6, batch 4700, loss[loss=0.2825, simple_loss=0.3103, pruned_loss=0.1274, over 12270.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3073, pruned_loss=0.1378, over 2582511.33 frames. ], batch size: 25, lr: 9.72e-03, grad_scale: 4.0
2024-06-20 01:18:53,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101357.66666666667, ans=0.1
2024-06-20 01:18:56,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=101357.66666666667, ans=0.125
2024-06-20 01:18:59,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=101376.0, ans=0.125
2024-06-20 01:19:17,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.49 vs. limit=15.0
2024-06-20 01:19:23,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=101431.0, ans=0.2
2024-06-20 01:19:25,973 INFO [train.py:1028] (0/2) Epoch 6, batch 4750, loss[loss=0.3333, simple_loss=0.3296, pruned_loss=0.1685, over 12515.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3071, pruned_loss=0.1381, over 2579748.86 frames. ], batch size: 202, lr: 9.71e-03, grad_scale: 2.0
2024-06-20 01:19:32,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101467.66666666667, ans=0.125
2024-06-20 01:19:34,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=101467.66666666667, ans=0.125
2024-06-20 01:19:35,841 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.428e+02 8.777e+02 1.007e+03 1.182e+03 2.243e+03, threshold=2.013e+03, percent-clipped=1.0
2024-06-20 01:19:44,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=101486.0, ans=0.025
2024-06-20 01:19:48,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.96 vs. limit=15.0
2024-06-20 01:19:58,991 INFO [train.py:1028] (0/2) Epoch 6, batch 4800, loss[loss=0.2714, simple_loss=0.2931, pruned_loss=0.1248, over 13229.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3061, pruned_loss=0.1369, over 2575583.66 frames. ], batch size: 63, lr: 9.71e-03, grad_scale: 4.0
2024-06-20 01:20:10,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.76 vs. limit=10.0
2024-06-20 01:20:22,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=101596.0, ans=0.0
2024-06-20 01:20:22,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=101596.0, ans=0.0
2024-06-20 01:20:23,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.40 vs.
limit=15.0 2024-06-20 01:20:25,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=101596.0, ans=0.125 2024-06-20 01:20:26,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=101596.0, ans=0.125 2024-06-20 01:20:26,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=101596.0, ans=0.0 2024-06-20 01:20:34,757 INFO [train.py:1028] (0/2) Epoch 6, batch 4850, loss[loss=0.2584, simple_loss=0.2752, pruned_loss=0.1208, over 13229.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3061, pruned_loss=0.1368, over 2573956.03 frames. ], batch size: 89, lr: 9.70e-03, grad_scale: 2.0 2024-06-20 01:20:36,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=101632.66666666667, ans=0.05 2024-06-20 01:20:48,454 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.975e+02 9.063e+02 1.146e+03 1.329e+03 2.365e+03, threshold=2.292e+03, percent-clipped=3.0 2024-06-20 01:20:51,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=101669.33333333333, ans=0.0 2024-06-20 01:20:52,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=101669.33333333333, ans=22.5 2024-06-20 01:20:53,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101669.33333333333, ans=0.1 2024-06-20 01:21:01,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=101687.66666666667, ans=0.0 2024-06-20 01:21:11,499 INFO [train.py:1028] (0/2) Epoch 6, batch 4900, loss[loss=0.2832, simple_loss=0.3037, pruned_loss=0.1314, over 13162.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3067, pruned_loss=0.1373, over 2574123.09 frames. ], batch size: 59, lr: 9.70e-03, grad_scale: 4.0 2024-06-20 01:21:28,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=101761.0, ans=0.125 2024-06-20 01:21:34,873 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.29 vs. limit=15.0 2024-06-20 01:21:44,234 INFO [train.py:1028] (0/2) Epoch 6, batch 4950, loss[loss=0.3179, simple_loss=0.3166, pruned_loss=0.1595, over 10995.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.307, pruned_loss=0.1383, over 2568014.39 frames. ], batch size: 304, lr: 9.70e-03, grad_scale: 4.0 2024-06-20 01:21:55,128 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 8.228e+02 1.276e+03 1.546e+03 1.810e+03 3.035e+03, threshold=3.093e+03, percent-clipped=7.0 2024-06-20 01:22:08,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=101871.0, ans=0.1 2024-06-20 01:22:19,906 INFO [train.py:1028] (0/2) Epoch 6, batch 5000, loss[loss=0.2853, simple_loss=0.3041, pruned_loss=0.1333, over 13214.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3062, pruned_loss=0.1375, over 2573454.43 frames. 
], batch size: 95, lr: 9.69e-03, grad_scale: 4.0 2024-06-20 01:22:20,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=101907.66666666667, ans=0.125 2024-06-20 01:22:26,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101926.0, ans=0.1 2024-06-20 01:22:26,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101926.0, ans=0.125 2024-06-20 01:22:27,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101926.0, ans=0.1 2024-06-20 01:22:27,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=101926.0, ans=0.0 2024-06-20 01:22:29,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=101926.0, ans=0.0 2024-06-20 01:22:33,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.39 vs. limit=15.0 2024-06-20 01:22:36,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=101944.33333333333, ans=0.2 2024-06-20 01:22:38,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.20 vs. limit=22.5 2024-06-20 01:22:38,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=15.0 2024-06-20 01:22:51,781 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:22:56,219 INFO [train.py:1028] (0/2) Epoch 6, batch 5050, loss[loss=0.2762, simple_loss=0.3053, pruned_loss=0.1235, over 12937.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3067, pruned_loss=0.1374, over 2572560.58 frames. ], batch size: 36, lr: 9.69e-03, grad_scale: 2.0 2024-06-20 01:23:07,517 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.998e+02 1.271e+03 1.495e+03 1.762e+03 2.864e+03, threshold=2.991e+03, percent-clipped=0.0 2024-06-20 01:23:16,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.95 vs. limit=15.0 2024-06-20 01:23:28,477 INFO [train.py:1028] (0/2) Epoch 6, batch 5100, loss[loss=0.3069, simple_loss=0.3272, pruned_loss=0.1433, over 12904.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3072, pruned_loss=0.1384, over 2569900.66 frames. 
], batch size: 39, lr: 9.68e-03, grad_scale: 2.0 2024-06-20 01:23:35,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=102109.33333333333, ans=0.0 2024-06-20 01:23:35,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=102109.33333333333, ans=0.07 2024-06-20 01:23:38,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=102109.33333333333, ans=0.0 2024-06-20 01:23:50,736 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:24:01,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.04 vs. limit=10.0 2024-06-20 01:24:02,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102164.33333333333, ans=0.125 2024-06-20 01:24:03,702 INFO [train.py:1028] (0/2) Epoch 6, batch 5150, loss[loss=0.2685, simple_loss=0.2782, pruned_loss=0.1294, over 13104.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3069, pruned_loss=0.1388, over 2571180.85 frames. ], batch size: 132, lr: 9.68e-03, grad_scale: 2.0 2024-06-20 01:24:04,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2024-06-20 01:24:11,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=102201.0, ans=0.125 2024-06-20 01:24:16,273 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 6.962e+02 1.438e+03 1.698e+03 1.975e+03 4.116e+03, threshold=3.396e+03, percent-clipped=1.0 2024-06-20 01:24:17,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102219.33333333333, ans=0.125 2024-06-20 01:24:18,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=102219.33333333333, ans=0.5 2024-06-20 01:24:19,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=102219.33333333333, ans=0.2 2024-06-20 01:24:23,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=102237.66666666667, ans=0.025 2024-06-20 01:24:31,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=102256.0, ans=0.125 2024-06-20 01:24:34,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102256.0, ans=0.1 2024-06-20 01:24:35,907 INFO [train.py:1028] (0/2) Epoch 6, batch 5200, loss[loss=0.2949, simple_loss=0.311, pruned_loss=0.1394, over 13113.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3071, pruned_loss=0.1389, over 2574466.27 frames. 
], batch size: 95, lr: 9.67e-03, grad_scale: 4.0 2024-06-20 01:24:40,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=102274.33333333333, ans=0.2 2024-06-20 01:24:53,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=102311.0, ans=0.125 2024-06-20 01:24:55,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=102311.0, ans=0.125 2024-06-20 01:25:07,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=102347.66666666667, ans=0.125 2024-06-20 01:25:12,032 INFO [train.py:1028] (0/2) Epoch 6, batch 5250, loss[loss=0.2891, simple_loss=0.3064, pruned_loss=0.1358, over 13260.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3076, pruned_loss=0.1391, over 2569014.65 frames. ], batch size: 52, lr: 9.67e-03, grad_scale: 2.0 2024-06-20 01:25:12,832 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:25:13,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=102366.0, ans=0.125 2024-06-20 01:25:25,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.309e+02 1.320e+03 1.559e+03 1.852e+03 2.798e+03, threshold=3.117e+03, percent-clipped=0.0 2024-06-20 01:25:26,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=102402.66666666667, ans=0.1 2024-06-20 01:25:36,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.60 vs. limit=6.0 2024-06-20 01:25:39,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=102439.33333333333, ans=0.0 2024-06-20 01:25:39,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=102439.33333333333, ans=0.125 2024-06-20 01:25:45,477 INFO [train.py:1028] (0/2) Epoch 6, batch 5300, loss[loss=0.275, simple_loss=0.2879, pruned_loss=0.131, over 13028.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3072, pruned_loss=0.1388, over 2565117.29 frames. ], batch size: 144, lr: 9.67e-03, grad_scale: 4.0 2024-06-20 01:25:59,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=102476.0, ans=0.125 2024-06-20 01:26:03,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=102494.33333333333, ans=0.0 2024-06-20 01:26:08,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2024-06-20 01:26:20,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=102531.0, ans=0.0 2024-06-20 01:26:22,312 INFO [train.py:1028] (0/2) Epoch 6, batch 5350, loss[loss=0.2803, simple_loss=0.3058, pruned_loss=0.1274, over 11584.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3063, pruned_loss=0.1382, over 2573105.89 frames. 
], batch size: 17, lr: 9.66e-03, grad_scale: 2.0 2024-06-20 01:26:23,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5 2024-06-20 01:26:24,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.99 vs. limit=15.0 2024-06-20 01:26:32,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0 2024-06-20 01:26:37,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2024-06-20 01:26:40,280 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 9.071e+02 1.349e+03 1.608e+03 1.884e+03 2.769e+03, threshold=3.216e+03, percent-clipped=0.0 2024-06-20 01:26:41,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=102586.0, ans=0.05 2024-06-20 01:26:45,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.17 vs. limit=22.5 2024-06-20 01:26:47,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=102604.33333333333, ans=0.125 2024-06-20 01:26:48,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=102604.33333333333, ans=0.125 2024-06-20 01:26:48,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.16 vs. limit=15.0 2024-06-20 01:26:58,299 INFO [train.py:1028] (0/2) Epoch 6, batch 5400, loss[loss=0.3356, simple_loss=0.3262, pruned_loss=0.1724, over 12248.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3071, pruned_loss=0.1391, over 2565788.35 frames. ], batch size: 240, lr: 9.66e-03, grad_scale: 2.0 2024-06-20 01:27:01,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=102641.0, ans=0.0 2024-06-20 01:27:06,930 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-56000.pt 2024-06-20 01:27:21,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=102677.66666666667, ans=0.05 2024-06-20 01:27:22,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.40 vs. limit=22.5 2024-06-20 01:27:25,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=102696.0, ans=0.2 2024-06-20 01:27:26,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=102696.0, ans=0.125 2024-06-20 01:27:26,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.18 vs. 
limit=22.5 2024-06-20 01:27:30,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=102714.33333333333, ans=0.125 2024-06-20 01:27:30,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=102714.33333333333, ans=0.0 2024-06-20 01:27:33,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=102714.33333333333, ans=0.0 2024-06-20 01:27:36,697 INFO [train.py:1028] (0/2) Epoch 6, batch 5450, loss[loss=0.3368, simple_loss=0.3439, pruned_loss=0.1649, over 12902.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3069, pruned_loss=0.1386, over 2570413.11 frames. ], batch size: 26, lr: 9.65e-03, grad_scale: 1.0 2024-06-20 01:27:36,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=102732.66666666667, ans=0.0 2024-06-20 01:27:36,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=102732.66666666667, ans=0.0 2024-06-20 01:27:46,081 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 01:27:54,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=102769.33333333333, ans=0.2 2024-06-20 01:27:54,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=102769.33333333333, ans=0.07 2024-06-20 01:27:55,965 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.928e+02 9.785e+02 1.164e+03 1.397e+03 5.624e+03, threshold=2.327e+03, percent-clipped=2.0 2024-06-20 01:27:59,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=102769.33333333333, ans=0.125 2024-06-20 01:28:10,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=102806.0, ans=0.125 2024-06-20 01:28:11,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=21.81 vs. limit=15.0 2024-06-20 01:28:13,653 INFO [train.py:1028] (0/2) Epoch 6, batch 5500, loss[loss=0.3319, simple_loss=0.3257, pruned_loss=0.1691, over 12282.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3071, pruned_loss=0.1385, over 2564230.97 frames. ], batch size: 240, lr: 9.65e-03, grad_scale: 2.0 2024-06-20 01:28:24,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=102842.66666666667, ans=0.125 2024-06-20 01:28:26,521 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. 
limit=15.0 2024-06-20 01:28:29,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=102861.0, ans=0.0 2024-06-20 01:28:36,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=102879.33333333333, ans=0.0 2024-06-20 01:28:46,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102897.66666666667, ans=0.125 2024-06-20 01:28:49,914 INFO [train.py:1028] (0/2) Epoch 6, batch 5550, loss[loss=0.2867, simple_loss=0.3149, pruned_loss=0.1293, over 13222.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.306, pruned_loss=0.1373, over 2568056.83 frames. ], batch size: 43, lr: 9.65e-03, grad_scale: 2.0 2024-06-20 01:28:55,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102934.33333333333, ans=0.1 2024-06-20 01:28:58,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=102934.33333333333, ans=0.0 2024-06-20 01:29:05,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.07 vs. limit=22.5 2024-06-20 01:29:05,519 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 7.860e+02 1.026e+03 1.226e+03 1.535e+03 6.286e+03, threshold=2.452e+03, percent-clipped=5.0 2024-06-20 01:29:09,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.27 vs. limit=15.0 2024-06-20 01:29:17,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.61 vs. limit=10.0 2024-06-20 01:29:18,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=102989.33333333333, ans=0.125 2024-06-20 01:29:19,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=102989.33333333333, ans=0.125 2024-06-20 01:29:22,245 INFO [train.py:1028] (0/2) Epoch 6, batch 5600, loss[loss=0.2864, simple_loss=0.3005, pruned_loss=0.1361, over 13237.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.306, pruned_loss=0.1374, over 2570866.10 frames. ], batch size: 89, lr: 9.64e-03, grad_scale: 2.0 2024-06-20 01:29:25,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=103007.66666666667, ans=10.0 2024-06-20 01:29:34,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=103026.0, ans=0.125 2024-06-20 01:29:40,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.14 vs. 
limit=15.0 2024-06-20 01:29:42,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103062.66666666667, ans=0.1 2024-06-20 01:29:43,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103062.66666666667, ans=0.1 2024-06-20 01:29:48,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=103062.66666666667, ans=0.125 2024-06-20 01:29:58,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.64 vs. limit=10.0 2024-06-20 01:29:58,994 INFO [train.py:1028] (0/2) Epoch 6, batch 5650, loss[loss=0.3252, simple_loss=0.3218, pruned_loss=0.1643, over 12530.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3061, pruned_loss=0.1369, over 2576135.72 frames. ], batch size: 202, lr: 9.64e-03, grad_scale: 2.0 2024-06-20 01:30:08,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2024-06-20 01:30:15,272 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.115e+02 8.754e+02 9.782e+02 1.242e+03 3.604e+03, threshold=1.956e+03, percent-clipped=2.0 2024-06-20 01:30:19,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103154.33333333333, ans=0.1 2024-06-20 01:30:30,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=103172.66666666667, ans=0.0 2024-06-20 01:30:32,841 INFO [train.py:1028] (0/2) Epoch 6, batch 5700, loss[loss=0.2563, simple_loss=0.2844, pruned_loss=0.1141, over 13263.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3046, pruned_loss=0.136, over 2579353.34 frames. ], batch size: 63, lr: 9.63e-03, grad_scale: 4.0 2024-06-20 01:30:40,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=103191.0, ans=0.025 2024-06-20 01:30:41,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=103209.33333333333, ans=0.125 2024-06-20 01:30:51,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2024-06-20 01:30:53,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=103227.66666666667, ans=0.125 2024-06-20 01:30:57,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=103246.0, ans=0.1 2024-06-20 01:31:03,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=103264.33333333333, ans=0.0 2024-06-20 01:31:07,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=103264.33333333333, ans=0.0 2024-06-20 01:31:09,032 INFO [train.py:1028] (0/2) Epoch 6, batch 5750, loss[loss=0.3352, simple_loss=0.3349, pruned_loss=0.1678, over 12776.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3068, pruned_loss=0.1375, over 2579609.41 frames. 
], batch size: 176, lr: 9.63e-03, grad_scale: 1.0 2024-06-20 01:31:10,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.38 vs. limit=15.0 2024-06-20 01:31:18,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=103301.0, ans=0.125 2024-06-20 01:31:26,249 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.927e+02 9.370e+02 1.129e+03 1.328e+03 3.062e+03, threshold=2.258e+03, percent-clipped=3.0 2024-06-20 01:31:32,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103337.66666666667, ans=0.1 2024-06-20 01:31:32,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=103337.66666666667, ans=0.125 2024-06-20 01:31:33,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2024-06-20 01:31:37,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=103356.0, ans=0.125 2024-06-20 01:31:40,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=103356.0, ans=0.125 2024-06-20 01:31:41,797 INFO [train.py:1028] (0/2) Epoch 6, batch 5800, loss[loss=0.315, simple_loss=0.3213, pruned_loss=0.1543, over 12815.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3082, pruned_loss=0.1389, over 2579101.46 frames. ], batch size: 176, lr: 9.62e-03, grad_scale: 2.0 2024-06-20 01:32:14,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=103447.66666666667, ans=0.125 2024-06-20 01:32:18,800 INFO [train.py:1028] (0/2) Epoch 6, batch 5850, loss[loss=0.339, simple_loss=0.3365, pruned_loss=0.1707, over 12537.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3109, pruned_loss=0.1408, over 2577753.36 frames. ], batch size: 202, lr: 9.62e-03, grad_scale: 2.0 2024-06-20 01:32:39,328 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.489e+02 1.111e+03 1.278e+03 1.486e+03 2.956e+03, threshold=2.557e+03, percent-clipped=1.0 2024-06-20 01:32:40,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103502.66666666667, ans=0.1 2024-06-20 01:32:42,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103521.0, ans=0.1 2024-06-20 01:32:45,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=103521.0, ans=12.0 2024-06-20 01:32:49,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=103539.33333333333, ans=0.0 2024-06-20 01:32:55,382 INFO [train.py:1028] (0/2) Epoch 6, batch 5900, loss[loss=0.2962, simple_loss=0.306, pruned_loss=0.1432, over 13029.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3133, pruned_loss=0.1419, over 2578167.32 frames. ], batch size: 121, lr: 9.62e-03, grad_scale: 4.0 2024-06-20 01:33:06,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.87 vs. 
limit=12.0 2024-06-20 01:33:17,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.69 vs. limit=15.0 2024-06-20 01:33:18,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.79 vs. limit=15.0 2024-06-20 01:33:27,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0 2024-06-20 01:33:28,991 INFO [train.py:1028] (0/2) Epoch 6, batch 5950, loss[loss=0.3094, simple_loss=0.3133, pruned_loss=0.1528, over 13135.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3146, pruned_loss=0.1426, over 2581754.18 frames. ], batch size: 121, lr: 9.61e-03, grad_scale: 4.0 2024-06-20 01:33:40,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=103667.66666666667, ans=0.125 2024-06-20 01:33:43,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=103686.0, ans=0.0 2024-06-20 01:33:45,657 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.269e+02 8.909e+02 1.064e+03 1.243e+03 1.858e+03, threshold=2.127e+03, percent-clipped=0.0 2024-06-20 01:33:51,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=103704.33333333333, ans=0.025 2024-06-20 01:33:52,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=103704.33333333333, ans=0.125 2024-06-20 01:33:58,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.02 vs. limit=15.0 2024-06-20 01:34:04,618 INFO [train.py:1028] (0/2) Epoch 6, batch 6000, loss[loss=0.4103, simple_loss=0.3804, pruned_loss=0.2201, over 12166.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3172, pruned_loss=0.1445, over 2575735.20 frames. ], batch size: 240, lr: 9.61e-03, grad_scale: 8.0 2024-06-20 01:34:04,619 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 01:34:12,614 INFO [train.py:1060] (0/2) Epoch 6, validation: loss=0.2312, simple_loss=0.2879, pruned_loss=0.08728, over 351949.00 frames. 2024-06-20 01:34:12,615 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-20 01:34:20,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=103759.33333333333, ans=0.0 2024-06-20 01:34:21,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.09 vs. limit=15.0 2024-06-20 01:34:34,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.00 vs. limit=10.0 2024-06-20 01:34:51,372 INFO [train.py:1028] (0/2) Epoch 6, batch 6050, loss[loss=0.2921, simple_loss=0.3063, pruned_loss=0.1389, over 12880.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3183, pruned_loss=0.1447, over 2577820.07 frames. 
], batch size: 39, lr: 9.60e-03, grad_scale: 8.0 2024-06-20 01:34:51,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=103832.66666666667, ans=0.09899494936611666 2024-06-20 01:35:03,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.76 vs. limit=10.0 2024-06-20 01:35:08,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=103869.33333333333, ans=0.0 2024-06-20 01:35:08,732 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.774e+02 7.397e+02 8.831e+02 1.017e+03 1.469e+03, threshold=1.766e+03, percent-clipped=0.0 2024-06-20 01:35:09,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=103869.33333333333, ans=0.125 2024-06-20 01:35:11,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.40 vs. limit=22.5 2024-06-20 01:35:17,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.84 vs. limit=15.0 2024-06-20 01:35:24,970 INFO [train.py:1028] (0/2) Epoch 6, batch 6100, loss[loss=0.3088, simple_loss=0.3157, pruned_loss=0.151, over 13136.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3192, pruned_loss=0.1448, over 2579586.30 frames. ], batch size: 121, lr: 9.60e-03, grad_scale: 8.0 2024-06-20 01:35:31,898 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.046e+01 2024-06-20 01:35:37,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=103942.66666666667, ans=0.5 2024-06-20 01:35:39,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=103961.0, ans=0.125 2024-06-20 01:35:54,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=103997.66666666667, ans=0.04949747468305833 2024-06-20 01:35:55,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=103997.66666666667, ans=0.025 2024-06-20 01:35:58,622 INFO [train.py:1028] (0/2) Epoch 6, batch 6150, loss[loss=0.3259, simple_loss=0.326, pruned_loss=0.1629, over 10870.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3212, pruned_loss=0.1459, over 2578253.01 frames. ], batch size: 303, lr: 9.59e-03, grad_scale: 8.0 2024-06-20 01:36:09,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=104034.33333333333, ans=0.125 2024-06-20 01:36:11,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.46 vs. 
limit=22.5 2024-06-20 01:36:12,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=104034.33333333333, ans=0.125 2024-06-20 01:36:14,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=104034.33333333333, ans=0.025 2024-06-20 01:36:18,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=104052.66666666667, ans=0.125 2024-06-20 01:36:18,809 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.575e+02 7.844e+02 8.898e+02 1.040e+03 1.564e+03, threshold=1.780e+03, percent-clipped=0.0 2024-06-20 01:36:35,067 INFO [train.py:1028] (0/2) Epoch 6, batch 6200, loss[loss=0.3162, simple_loss=0.3291, pruned_loss=0.1516, over 13251.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3232, pruned_loss=0.147, over 2575797.93 frames. ], batch size: 89, lr: 9.59e-03, grad_scale: 8.0 2024-06-20 01:36:35,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104107.66666666667, ans=0.1 2024-06-20 01:36:37,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=104107.66666666667, ans=0.2 2024-06-20 01:36:44,446 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.113e+01 2024-06-20 01:36:45,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=104126.0, ans=0.0 2024-06-20 01:36:50,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=104126.0, ans=0.2 2024-06-20 01:36:56,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=104144.33333333333, ans=0.025 2024-06-20 01:37:01,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=104162.66666666667, ans=0.125 2024-06-20 01:37:12,644 INFO [train.py:1028] (0/2) Epoch 6, batch 6250, loss[loss=0.2878, simple_loss=0.31, pruned_loss=0.1328, over 13199.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3246, pruned_loss=0.1479, over 2567800.85 frames. ], batch size: 83, lr: 9.59e-03, grad_scale: 4.0 2024-06-20 01:37:23,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=104217.66666666667, ans=0.125 2024-06-20 01:37:29,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104236.0, ans=0.1 2024-06-20 01:37:30,888 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.952e+02 6.215e+02 7.331e+02 8.348e+02 1.298e+03, threshold=1.466e+03, percent-clipped=0.0 2024-06-20 01:37:35,162 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.04 vs. limit=22.5 2024-06-20 01:37:41,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=104272.66666666667, ans=0.1 2024-06-20 01:37:43,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. 
limit=15.0 2024-06-20 01:37:45,012 INFO [train.py:1028] (0/2) Epoch 6, batch 6300, loss[loss=0.2794, simple_loss=0.3055, pruned_loss=0.1266, over 11334.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3263, pruned_loss=0.1484, over 2563038.55 frames. ], batch size: 16, lr: 9.58e-03, grad_scale: 8.0 2024-06-20 01:37:50,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104291.0, ans=0.1 2024-06-20 01:37:56,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=104309.33333333333, ans=0.125 2024-06-20 01:38:04,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.32 vs. limit=22.5 2024-06-20 01:38:07,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.44 vs. limit=15.0 2024-06-20 01:38:13,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=104346.0, ans=0.125 2024-06-20 01:38:15,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=104364.33333333333, ans=0.0 2024-06-20 01:38:16,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104364.33333333333, ans=0.1 2024-06-20 01:38:21,054 INFO [train.py:1028] (0/2) Epoch 6, batch 6350, loss[loss=0.3761, simple_loss=0.3676, pruned_loss=0.1923, over 12541.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.328, pruned_loss=0.1485, over 2573875.03 frames. ], batch size: 202, lr: 9.58e-03, grad_scale: 4.0 2024-06-20 01:38:25,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=104382.66666666667, ans=15.0 2024-06-20 01:38:27,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104401.0, ans=0.0 2024-06-20 01:38:35,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104419.33333333333, ans=0.1 2024-06-20 01:38:39,577 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.984e+02 6.094e+02 7.184e+02 8.120e+02 1.749e+03, threshold=1.437e+03, percent-clipped=3.0 2024-06-20 01:38:49,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.57 vs. limit=22.5 2024-06-20 01:38:49,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=104456.0, ans=0.025 2024-06-20 01:38:56,887 INFO [train.py:1028] (0/2) Epoch 6, batch 6400, loss[loss=0.2928, simple_loss=0.3206, pruned_loss=0.1325, over 13219.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3303, pruned_loss=0.1494, over 2574959.09 frames. ], batch size: 67, lr: 9.57e-03, grad_scale: 8.0 2024-06-20 01:38:59,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=104474.33333333333, ans=0.025 2024-06-20 01:38:59,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.45 vs. 
limit=15.0 2024-06-20 01:39:05,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=104492.66666666667, ans=0.125 2024-06-20 01:39:14,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=104511.0, ans=0.125 2024-06-20 01:39:19,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.04 vs. limit=10.0 2024-06-20 01:39:20,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=104529.33333333333, ans=0.2 2024-06-20 01:39:29,522 INFO [train.py:1028] (0/2) Epoch 6, batch 6450, loss[loss=0.3595, simple_loss=0.3599, pruned_loss=0.1795, over 12584.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3321, pruned_loss=0.1505, over 2580434.15 frames. ], batch size: 202, lr: 9.57e-03, grad_scale: 8.0 2024-06-20 01:39:38,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=104584.33333333333, ans=0.05 2024-06-20 01:39:48,867 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.998e+02 7.295e+02 8.514e+02 9.694e+02 1.702e+03, threshold=1.703e+03, percent-clipped=2.0 2024-06-20 01:39:49,018 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=1.253e+00 2024-06-20 01:40:01,390 INFO [train.py:1028] (0/2) Epoch 6, batch 6500, loss[loss=0.3425, simple_loss=0.3416, pruned_loss=0.1717, over 10785.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3333, pruned_loss=0.1509, over 2584561.32 frames. ], batch size: 303, lr: 9.57e-03, grad_scale: 4.0 2024-06-20 01:40:04,868 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.738e+01 2024-06-20 01:40:18,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=104694.33333333333, ans=0.04949747468305833 2024-06-20 01:40:23,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=104694.33333333333, ans=0.0 2024-06-20 01:40:24,075 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=12.0 2024-06-20 01:40:30,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.19 vs. limit=15.0 2024-06-20 01:40:33,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104731.0, ans=0.1 2024-06-20 01:40:35,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.46 vs. limit=15.0 2024-06-20 01:40:36,995 INFO [train.py:1028] (0/2) Epoch 6, batch 6550, loss[loss=0.2904, simple_loss=0.3266, pruned_loss=0.1271, over 12606.00 frames. ], tot_loss[loss=0.319, simple_loss=0.335, pruned_loss=0.1515, over 2588936.44 frames. ], batch size: 22, lr: 9.56e-03, grad_scale: 2.0 2024-06-20 01:40:38,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.70 vs. 
limit=6.0 2024-06-20 01:40:39,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.69 vs. limit=15.0 2024-06-20 01:40:55,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2024-06-20 01:41:01,508 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.581e+02 7.885e+02 9.108e+02 1.072e+03 2.590e+03, threshold=1.822e+03, percent-clipped=4.0 2024-06-20 01:41:03,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=104804.33333333333, ans=0.0 2024-06-20 01:41:04,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=104804.33333333333, ans=0.025 2024-06-20 01:41:06,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=104822.66666666667, ans=0.09899494936611666 2024-06-20 01:41:07,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=104822.66666666667, ans=0.125 2024-06-20 01:41:11,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104822.66666666667, ans=0.0 2024-06-20 01:41:12,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-20 01:41:13,245 INFO [train.py:1028] (0/2) Epoch 6, batch 6600, loss[loss=0.3137, simple_loss=0.331, pruned_loss=0.1482, over 13273.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3354, pruned_loss=0.1517, over 2591403.39 frames. ], batch size: 72, lr: 9.56e-03, grad_scale: 4.0 2024-06-20 01:41:23,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=104859.33333333333, ans=0.07 2024-06-20 01:41:38,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=104896.0, ans=0.0 2024-06-20 01:41:39,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.04 vs. limit=15.0 2024-06-20 01:41:44,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=15.0 2024-06-20 01:41:46,043 INFO [train.py:1028] (0/2) Epoch 6, batch 6650, loss[loss=0.3314, simple_loss=0.3368, pruned_loss=0.163, over 12998.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3377, pruned_loss=0.153, over 2585084.98 frames. ], batch size: 158, lr: 9.55e-03, grad_scale: 4.0 2024-06-20 01:41:57,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.07 vs. 
limit=12.0 2024-06-20 01:42:00,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104969.33333333333, ans=0.0 2024-06-20 01:42:06,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=104987.66666666667, ans=0.0 2024-06-20 01:42:06,882 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.050e+02 7.308e+02 8.435e+02 9.713e+02 1.453e+03, threshold=1.687e+03, percent-clipped=0.0 2024-06-20 01:42:20,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=105006.0, ans=0.125 2024-06-20 01:42:20,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=105006.0, ans=0.125 2024-06-20 01:42:22,588 INFO [train.py:1028] (0/2) Epoch 6, batch 6700, loss[loss=0.3714, simple_loss=0.3734, pruned_loss=0.1847, over 12789.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3388, pruned_loss=0.1539, over 2583943.40 frames. ], batch size: 176, lr: 9.55e-03, grad_scale: 8.0 2024-06-20 01:42:27,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105024.33333333333, ans=0.125 2024-06-20 01:42:27,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=105024.33333333333, ans=0.125 2024-06-20 01:42:30,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.30 vs. limit=15.0 2024-06-20 01:42:37,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=105061.0, ans=0.125 2024-06-20 01:42:39,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=105061.0, ans=0.0 2024-06-20 01:42:42,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=105079.33333333333, ans=15.0 2024-06-20 01:42:55,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105097.66666666667, ans=0.0 2024-06-20 01:43:00,675 INFO [train.py:1028] (0/2) Epoch 6, batch 6750, loss[loss=0.4333, simple_loss=0.4143, pruned_loss=0.2262, over 12177.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3391, pruned_loss=0.1544, over 2577005.53 frames. ], batch size: 241, lr: 9.55e-03, grad_scale: 4.0 2024-06-20 01:43:07,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=105134.33333333333, ans=0.0 2024-06-20 01:43:10,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.34 vs. limit=22.5 2024-06-20 01:43:10,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.70 vs. limit=15.0 2024-06-20 01:43:19,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.92 vs. 
limit=12.0 2024-06-20 01:43:21,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105171.0, ans=0.0 2024-06-20 01:43:22,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=105171.0, ans=0.025 2024-06-20 01:43:22,896 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.541e+02 6.858e+02 8.793e+02 1.036e+03 4.493e+03, threshold=1.759e+03, percent-clipped=1.0 2024-06-20 01:43:28,794 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.85 vs. limit=15.0 2024-06-20 01:43:29,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=105189.33333333333, ans=0.0 2024-06-20 01:43:33,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=105207.66666666667, ans=10.0 2024-06-20 01:43:33,454 INFO [train.py:1028] (0/2) Epoch 6, batch 6800, loss[loss=0.3078, simple_loss=0.3332, pruned_loss=0.1412, over 13247.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3413, pruned_loss=0.1553, over 2580198.04 frames. ], batch size: 67, lr: 9.54e-03, grad_scale: 4.0 2024-06-20 01:43:42,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=105226.0, ans=0.2 2024-06-20 01:43:57,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.17 vs. limit=15.0 2024-06-20 01:44:05,923 INFO [train.py:1028] (0/2) Epoch 6, batch 6850, loss[loss=0.3267, simple_loss=0.3532, pruned_loss=0.1501, over 13248.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3422, pruned_loss=0.155, over 2583544.71 frames. ], batch size: 63, lr: 9.54e-03, grad_scale: 4.0 2024-06-20 01:44:07,631 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.050e-01 2024-06-20 01:44:24,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.02 vs. limit=10.0 2024-06-20 01:44:31,557 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.802e+02 7.452e+02 8.768e+02 1.027e+03 3.457e+03, threshold=1.754e+03, percent-clipped=4.0 2024-06-20 01:44:35,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=105372.66666666667, ans=0.125 2024-06-20 01:44:41,751 INFO [train.py:1028] (0/2) Epoch 6, batch 6900, loss[loss=0.3162, simple_loss=0.3349, pruned_loss=0.1488, over 13317.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3439, pruned_loss=0.1559, over 2585802.72 frames. ], batch size: 49, lr: 9.53e-03, grad_scale: 8.0 2024-06-20 01:44:57,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.62 vs. 
limit=22.5 2024-06-20 01:44:59,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=105427.66666666667, ans=0.05 2024-06-20 01:45:08,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=105446.0, ans=0.2 2024-06-20 01:45:10,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2024-06-20 01:45:17,195 INFO [train.py:1028] (0/2) Epoch 6, batch 6950, loss[loss=0.2895, simple_loss=0.3092, pruned_loss=0.1349, over 11366.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3441, pruned_loss=0.1558, over 2580246.76 frames. ], batch size: 16, lr: 9.53e-03, grad_scale: 4.0 2024-06-20 01:45:21,297 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.654e+01 2024-06-20 01:45:32,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=105519.33333333333, ans=0.0 2024-06-20 01:45:32,458 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.30 vs. limit=15.0 2024-06-20 01:45:39,874 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 1.025e+03 1.227e+03 1.439e+03 2.544e+03, threshold=2.454e+03, percent-clipped=9.0 2024-06-20 01:45:49,417 INFO [train.py:1028] (0/2) Epoch 6, batch 7000, loss[loss=0.3456, simple_loss=0.3549, pruned_loss=0.1681, over 12954.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.344, pruned_loss=0.1556, over 2576645.05 frames. ], batch size: 158, lr: 9.52e-03, grad_scale: 8.0 2024-06-20 01:45:53,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=105574.33333333333, ans=0.2 2024-06-20 01:45:54,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=105574.33333333333, ans=0.09899494936611666 2024-06-20 01:46:01,201 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.22 vs. limit=22.5 2024-06-20 01:46:04,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=24.12 vs. limit=22.5 2024-06-20 01:46:09,407 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.139e+01 2024-06-20 01:46:10,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=105629.33333333333, ans=0.0 2024-06-20 01:46:10,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.15 vs. 
limit=12.0 2024-06-20 01:46:11,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105629.33333333333, ans=0.1 2024-06-20 01:46:22,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=105647.66666666667, ans=0.2 2024-06-20 01:46:26,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=105666.0, ans=0.125 2024-06-20 01:46:26,602 INFO [train.py:1028] (0/2) Epoch 6, batch 7050, loss[loss=0.3905, simple_loss=0.3876, pruned_loss=0.1968, over 12838.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3457, pruned_loss=0.1566, over 2582909.64 frames. ], batch size: 177, lr: 9.52e-03, grad_scale: 1.0 2024-06-20 01:46:30,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=105666.0, ans=0.0 2024-06-20 01:46:33,900 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=5.812e+01 2024-06-20 01:46:39,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=105702.66666666667, ans=0.2 2024-06-20 01:46:41,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=105702.66666666667, ans=0.125 2024-06-20 01:46:46,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=105721.0, ans=0.125 2024-06-20 01:46:54,528 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.469e+02 1.226e+03 1.456e+03 1.740e+03 3.307e+03, threshold=2.912e+03, percent-clipped=4.0 2024-06-20 01:47:02,474 INFO [train.py:1028] (0/2) Epoch 6, batch 7100, loss[loss=0.3582, simple_loss=0.3746, pruned_loss=0.1709, over 13135.00 frames. ], tot_loss[loss=0.3304, simple_loss=0.3462, pruned_loss=0.1573, over 2574618.74 frames. ], batch size: 112, lr: 9.52e-03, grad_scale: 2.0 2024-06-20 01:47:07,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105757.66666666667, ans=0.1 2024-06-20 01:47:12,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=105776.0, ans=0.125 2024-06-20 01:47:17,329 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.68 vs. limit=10.0 2024-06-20 01:47:18,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=105794.33333333333, ans=0.0 2024-06-20 01:47:18,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=105794.33333333333, ans=0.125 2024-06-20 01:47:27,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=105812.66666666667, ans=0.125 2024-06-20 01:47:28,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.93 vs. 
limit=22.5 2024-06-20 01:47:29,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=105831.0, ans=0.0 2024-06-20 01:47:35,686 INFO [train.py:1028] (0/2) Epoch 6, batch 7150, loss[loss=0.4059, simple_loss=0.4013, pruned_loss=0.2053, over 12543.00 frames. ], tot_loss[loss=0.3308, simple_loss=0.3471, pruned_loss=0.1573, over 2572534.36 frames. ], batch size: 202, lr: 9.51e-03, grad_scale: 2.0 2024-06-20 01:47:39,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=105849.33333333333, ans=0.0 2024-06-20 01:47:40,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105849.33333333333, ans=0.1 2024-06-20 01:47:41,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.58 vs. limit=22.5 2024-06-20 01:47:58,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105904.33333333333, ans=0.1 2024-06-20 01:48:00,812 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.638e+02 9.300e+02 1.129e+03 1.295e+03 2.373e+03, threshold=2.257e+03, percent-clipped=0.0 2024-06-20 01:48:07,652 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.48 vs. limit=15.0 2024-06-20 01:48:08,478 INFO [train.py:1028] (0/2) Epoch 6, batch 7200, loss[loss=0.3591, simple_loss=0.3689, pruned_loss=0.1746, over 13180.00 frames. ], tot_loss[loss=0.3318, simple_loss=0.3483, pruned_loss=0.1577, over 2578024.19 frames. ], batch size: 112, lr: 9.51e-03, grad_scale: 4.0 2024-06-20 01:48:08,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=105941.0, ans=0.2 2024-06-20 01:48:11,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=105941.0, ans=0.025 2024-06-20 01:48:11,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=105941.0, ans=0.025 2024-06-20 01:48:17,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=105959.33333333333, ans=0.125 2024-06-20 01:48:22,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=105977.66666666667, ans=0.0 2024-06-20 01:48:34,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=105996.0, ans=0.025 2024-06-20 01:48:36,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2024-06-20 01:48:42,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=106014.33333333333, ans=0.0 2024-06-20 01:48:43,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.50 vs. limit=15.0 2024-06-20 01:48:44,592 INFO [train.py:1028] (0/2) Epoch 6, batch 7250, loss[loss=0.3306, simple_loss=0.3502, pruned_loss=0.1555, over 13028.00 frames. 
], tot_loss[loss=0.333, simple_loss=0.3494, pruned_loss=0.1582, over 2579929.75 frames. ], batch size: 36, lr: 9.50e-03, grad_scale: 4.0 2024-06-20 01:48:48,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2024-06-20 01:48:52,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=106051.0, ans=0.125 2024-06-20 01:49:10,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=106087.66666666667, ans=0.0 2024-06-20 01:49:11,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=106087.66666666667, ans=0.0 2024-06-20 01:49:13,543 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 8.240e+02 9.829e+02 1.196e+03 1.932e+03, threshold=1.966e+03, percent-clipped=0.0 2024-06-20 01:49:13,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=106087.66666666667, ans=0.0 2024-06-20 01:49:19,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=106106.0, ans=22.5 2024-06-20 01:49:21,334 INFO [train.py:1028] (0/2) Epoch 6, batch 7300, loss[loss=0.336, simple_loss=0.3643, pruned_loss=0.1539, over 12905.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3505, pruned_loss=0.1588, over 2579552.32 frames. ], batch size: 36, lr: 9.50e-03, grad_scale: 4.0 2024-06-20 01:49:22,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=106124.33333333333, ans=0.04949747468305833 2024-06-20 01:49:22,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=106124.33333333333, ans=0.1 2024-06-20 01:49:35,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=106161.0, ans=0.125 2024-06-20 01:49:47,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106197.66666666667, ans=0.0 2024-06-20 01:49:53,703 INFO [train.py:1028] (0/2) Epoch 6, batch 7350, loss[loss=0.3611, simple_loss=0.3745, pruned_loss=0.1738, over 13337.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3509, pruned_loss=0.1589, over 2581378.02 frames. ], batch size: 46, lr: 9.50e-03, grad_scale: 2.0 2024-06-20 01:49:59,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.53 vs. limit=15.0 2024-06-20 01:49:59,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.63 vs. limit=22.5 2024-06-20 01:50:06,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=106252.66666666667, ans=0.07 2024-06-20 01:50:16,653 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.39 vs. 
limit=15.0 2024-06-20 01:50:19,648 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.852e+02 8.095e+02 9.523e+02 1.084e+03 2.294e+03, threshold=1.905e+03, percent-clipped=1.0 2024-06-20 01:50:26,114 INFO [train.py:1028] (0/2) Epoch 6, batch 7400, loss[loss=0.3251, simple_loss=0.3517, pruned_loss=0.1492, over 13200.00 frames. ], tot_loss[loss=0.3325, simple_loss=0.3496, pruned_loss=0.1577, over 2586579.06 frames. ], batch size: 63, lr: 9.49e-03, grad_scale: 4.0 2024-06-20 01:50:29,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=106307.66666666667, ans=0.2 2024-06-20 01:50:30,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=106307.66666666667, ans=0.125 2024-06-20 01:50:31,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=106307.66666666667, ans=0.2 2024-06-20 01:50:47,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=106344.33333333333, ans=0.125 2024-06-20 01:50:53,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.74 vs. limit=15.0 2024-06-20 01:50:55,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=106381.0, ans=0.2 2024-06-20 01:51:06,278 INFO [train.py:1028] (0/2) Epoch 6, batch 7450, loss[loss=0.2606, simple_loss=0.2935, pruned_loss=0.1139, over 12730.00 frames. ], tot_loss[loss=0.3314, simple_loss=0.349, pruned_loss=0.1569, over 2579643.01 frames. ], batch size: 29, lr: 9.49e-03, grad_scale: 2.0 2024-06-20 01:51:07,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=106399.33333333333, ans=0.0 2024-06-20 01:51:21,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=106436.0, ans=10.0 2024-06-20 01:51:22,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106436.0, ans=0.1 2024-06-20 01:51:23,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=106436.0, ans=0.09899494936611666 2024-06-20 01:51:34,180 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.406e+02 8.515e+02 9.635e+02 1.171e+03 3.444e+03, threshold=1.927e+03, percent-clipped=3.0 2024-06-20 01:51:37,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-20 01:51:40,256 INFO [train.py:1028] (0/2) Epoch 6, batch 7500, loss[loss=0.376, simple_loss=0.3685, pruned_loss=0.1917, over 10613.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.351, pruned_loss=0.1583, over 2577241.04 frames. ], batch size: 304, lr: 9.48e-03, grad_scale: 4.0 2024-06-20 01:51:41,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=106491.0, ans=0.0 2024-06-20 01:51:52,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.38 vs. 
limit=22.5 2024-06-20 01:52:10,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=106564.33333333333, ans=0.125 2024-06-20 01:52:10,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=106564.33333333333, ans=0.0 2024-06-20 01:52:12,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=106564.33333333333, ans=0.125 2024-06-20 01:52:13,417 INFO [train.py:1028] (0/2) Epoch 6, batch 7550, loss[loss=0.3431, simple_loss=0.3524, pruned_loss=0.1669, over 12945.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3526, pruned_loss=0.1598, over 2576692.46 frames. ], batch size: 158, lr: 9.48e-03, grad_scale: 4.0 2024-06-20 01:52:32,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.73 vs. limit=15.0 2024-06-20 01:52:33,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=106619.33333333333, ans=0.2 2024-06-20 01:52:40,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=106637.66666666667, ans=0.125 2024-06-20 01:52:43,569 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 7.838e+02 8.819e+02 9.949e+02 1.629e+03, threshold=1.764e+03, percent-clipped=0.0 2024-06-20 01:52:45,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.04 vs. limit=22.5 2024-06-20 01:52:47,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.48 vs. limit=15.0 2024-06-20 01:52:49,160 INFO [train.py:1028] (0/2) Epoch 6, batch 7600, loss[loss=0.3443, simple_loss=0.3517, pruned_loss=0.1684, over 13234.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3536, pruned_loss=0.1604, over 2577250.99 frames. ], batch size: 83, lr: 9.48e-03, grad_scale: 8.0 2024-06-20 01:53:05,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=106711.0, ans=0.0 2024-06-20 01:53:19,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=106747.66666666667, ans=0.0 2024-06-20 01:53:25,857 INFO [train.py:1028] (0/2) Epoch 6, batch 7650, loss[loss=0.3084, simple_loss=0.3325, pruned_loss=0.1421, over 12997.00 frames. ], tot_loss[loss=0.3375, simple_loss=0.3541, pruned_loss=0.1604, over 2573001.41 frames. 
], batch size: 33, lr: 9.47e-03, grad_scale: 4.0 2024-06-20 01:53:27,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=106766.0, ans=0.125 2024-06-20 01:53:41,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=106802.66666666667, ans=0.0 2024-06-20 01:53:42,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=106802.66666666667, ans=0.0 2024-06-20 01:53:43,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=106802.66666666667, ans=0.07 2024-06-20 01:53:50,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=106821.0, ans=0.2 2024-06-20 01:53:53,492 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.245e+02 8.799e+02 1.036e+03 1.566e+03, threshold=1.760e+03, percent-clipped=0.0 2024-06-20 01:53:54,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=106839.33333333333, ans=0.0 2024-06-20 01:53:58,980 INFO [train.py:1028] (0/2) Epoch 6, batch 7700, loss[loss=0.3166, simple_loss=0.3423, pruned_loss=0.1455, over 13298.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.355, pruned_loss=0.1612, over 2569864.78 frames. ], batch size: 63, lr: 9.47e-03, grad_scale: 8.0 2024-06-20 01:54:02,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2024-06-20 01:54:04,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=106857.66666666667, ans=0.125 2024-06-20 01:54:08,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=12.0 2024-06-20 01:54:09,419 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.359e+01 2024-06-20 01:54:19,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=106912.66666666667, ans=0.125 2024-06-20 01:54:32,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.55 vs. limit=10.0 2024-06-20 01:54:35,134 INFO [train.py:1028] (0/2) Epoch 6, batch 7750, loss[loss=0.3173, simple_loss=0.3326, pruned_loss=0.1509, over 13086.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3556, pruned_loss=0.1618, over 2574414.41 frames. ], batch size: 71, lr: 9.46e-03, grad_scale: 8.0 2024-06-20 01:54:41,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=106967.66666666667, ans=0.2 2024-06-20 01:54:55,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=15.0 2024-06-20 01:55:01,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=107004.33333333333, ans=0.0 2024-06-20 01:55:06,119 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.436e+02 6.608e+02 7.848e+02 9.254e+02 1.503e+03, threshold=1.570e+03, percent-clipped=0.0 2024-06-20 01:55:11,753 INFO [train.py:1028] (0/2) Epoch 6, batch 7800, loss[loss=0.3161, simple_loss=0.3431, pruned_loss=0.1445, over 13154.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3551, pruned_loss=0.1606, over 2578970.82 frames. ], batch size: 95, lr: 9.46e-03, grad_scale: 8.0 2024-06-20 01:55:11,844 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.849e+01 2024-06-20 01:55:13,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=107041.0, ans=0.125 2024-06-20 01:55:13,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.92 vs. limit=10.0 2024-06-20 01:55:17,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=107041.0, ans=0.125 2024-06-20 01:55:23,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107059.33333333333, ans=0.1 2024-06-20 01:55:24,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.21 vs. limit=12.0 2024-06-20 01:55:29,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.96 vs. limit=6.0 2024-06-20 01:55:44,998 INFO [train.py:1028] (0/2) Epoch 6, batch 7850, loss[loss=0.2742, simple_loss=0.3106, pruned_loss=0.1189, over 11607.00 frames. ], tot_loss[loss=0.3403, simple_loss=0.3571, pruned_loss=0.1617, over 2572357.41 frames. ], batch size: 17, lr: 9.46e-03, grad_scale: 4.0 2024-06-20 01:55:46,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=107132.66666666667, ans=0.2 2024-06-20 01:55:46,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=107132.66666666667, ans=0.5 2024-06-20 01:55:47,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=107132.66666666667, ans=0.2 2024-06-20 01:55:47,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=107132.66666666667, ans=0.125 2024-06-20 01:55:53,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=107151.0, ans=0.025 2024-06-20 01:56:00,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.74 vs. 
limit=15.0 2024-06-20 01:56:07,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=107187.66666666667, ans=0.125 2024-06-20 01:56:08,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=107187.66666666667, ans=0.025 2024-06-20 01:56:12,837 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.744e+02 5.089e+02 6.002e+02 7.106e+02 1.254e+03, threshold=1.200e+03, percent-clipped=0.0 2024-06-20 01:56:20,817 INFO [train.py:1028] (0/2) Epoch 6, batch 7900, loss[loss=0.3534, simple_loss=0.3697, pruned_loss=0.1685, over 13170.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3565, pruned_loss=0.1613, over 2571921.15 frames. ], batch size: 77, lr: 9.45e-03, grad_scale: 8.0 2024-06-20 01:56:20,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=107224.33333333333, ans=0.2 2024-06-20 01:56:30,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=107242.66666666667, ans=0.025 2024-06-20 01:56:33,538 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.533e-03 2024-06-20 01:56:34,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=107261.0, ans=0.125 2024-06-20 01:56:56,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=107316.0, ans=0.0 2024-06-20 01:56:56,785 INFO [train.py:1028] (0/2) Epoch 6, batch 7950, loss[loss=0.3534, simple_loss=0.3557, pruned_loss=0.1756, over 10747.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3566, pruned_loss=0.1611, over 2575890.04 frames. ], batch size: 304, lr: 9.45e-03, grad_scale: 8.0 2024-06-20 01:56:57,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=107316.0, ans=0.1 2024-06-20 01:57:12,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107352.66666666667, ans=0.1 2024-06-20 01:57:12,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=107352.66666666667, ans=0.025 2024-06-20 01:57:24,887 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 5.199e+02 5.932e+02 7.051e+02 1.141e+03, threshold=1.186e+03, percent-clipped=0.0 2024-06-20 01:57:29,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=107407.66666666667, ans=0.125 2024-06-20 01:57:29,785 INFO [train.py:1028] (0/2) Epoch 6, batch 8000, loss[loss=0.3339, simple_loss=0.3523, pruned_loss=0.1577, over 12735.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3567, pruned_loss=0.1607, over 2573681.97 frames. 
], batch size: 29, lr: 9.44e-03, grad_scale: 16.0 2024-06-20 01:57:33,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=107407.66666666667, ans=0.0 2024-06-20 01:57:34,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=107407.66666666667, ans=0.0 2024-06-20 01:57:38,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=107426.0, ans=0.125 2024-06-20 01:57:40,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=107426.0, ans=0.125 2024-06-20 01:57:40,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=107426.0, ans=0.125 2024-06-20 01:57:41,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=107426.0, ans=0.2 2024-06-20 01:57:44,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=107444.33333333333, ans=0.125 2024-06-20 01:57:45,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.23 vs. limit=22.5 2024-06-20 01:57:53,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.11 vs. limit=10.0 2024-06-20 01:58:03,384 INFO [train.py:1028] (0/2) Epoch 6, batch 8050, loss[loss=0.314, simple_loss=0.339, pruned_loss=0.1445, over 13196.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3567, pruned_loss=0.1604, over 2572662.00 frames. ], batch size: 83, lr: 9.44e-03, grad_scale: 8.0 2024-06-20 01:58:07,648 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=12.0 2024-06-20 01:58:19,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.97 vs. limit=10.0 2024-06-20 01:58:20,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.80 vs. limit=22.5 2024-06-20 01:58:23,457 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.19 vs. limit=12.0 2024-06-20 01:58:26,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=107554.33333333333, ans=0.0 2024-06-20 01:58:35,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.233e+02 6.502e+02 7.648e+02 8.723e+02 1.499e+03, threshold=1.530e+03, percent-clipped=4.0 2024-06-20 01:58:35,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=107572.66666666667, ans=0.125 2024-06-20 01:58:38,869 INFO [train.py:1028] (0/2) Epoch 6, batch 8100, loss[loss=0.3328, simple_loss=0.3531, pruned_loss=0.1563, over 13208.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3569, pruned_loss=0.1605, over 2576084.34 frames. 
], batch size: 112, lr: 9.44e-03, grad_scale: 8.0 2024-06-20 01:58:47,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=107591.0, ans=0.125 2024-06-20 01:58:51,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=107609.33333333333, ans=0.025 2024-06-20 01:58:53,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=107609.33333333333, ans=0.05 2024-06-20 01:58:53,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=107609.33333333333, ans=0.0 2024-06-20 01:59:03,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.03 vs. limit=15.0 2024-06-20 01:59:04,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.29 vs. limit=15.0 2024-06-20 01:59:06,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=107646.0, ans=0.0 2024-06-20 01:59:15,376 INFO [train.py:1028] (0/2) Epoch 6, batch 8150, loss[loss=0.3365, simple_loss=0.3505, pruned_loss=0.1613, over 13145.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3567, pruned_loss=0.1601, over 2579883.02 frames. ], batch size: 121, lr: 9.43e-03, grad_scale: 2.0 2024-06-20 01:59:30,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=107719.33333333333, ans=0.0 2024-06-20 01:59:45,687 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 4.222e+02 5.928e+02 6.980e+02 8.576e+02 2.219e+03, threshold=1.396e+03, percent-clipped=5.0 2024-06-20 01:59:47,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=107774.33333333333, ans=0.07 2024-06-20 01:59:48,422 INFO [train.py:1028] (0/2) Epoch 6, batch 8200, loss[loss=0.3285, simple_loss=0.3562, pruned_loss=0.1504, over 13135.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3575, pruned_loss=0.1606, over 2583909.49 frames. ], batch size: 112, lr: 9.43e-03, grad_scale: 4.0 2024-06-20 01:59:53,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.71 vs. limit=12.0 2024-06-20 02:00:11,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=107829.33333333333, ans=0.125 2024-06-20 02:00:25,904 INFO [train.py:1028] (0/2) Epoch 6, batch 8250, loss[loss=0.3563, simple_loss=0.386, pruned_loss=0.1633, over 13256.00 frames. ], tot_loss[loss=0.3405, simple_loss=0.3588, pruned_loss=0.1611, over 2583802.38 frames. ], batch size: 52, lr: 9.42e-03, grad_scale: 2.0 2024-06-20 02:00:26,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.38 vs. limit=15.0 2024-06-20 02:00:33,978 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.608e+00 2024-06-20 02:00:36,236 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. 
limit=10.0 2024-06-20 02:00:46,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=107921.0, ans=0.0 2024-06-20 02:00:59,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.74 vs. limit=22.5 2024-06-20 02:01:00,041 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.857e+02 5.541e+02 6.226e+02 7.589e+02 1.070e+03, threshold=1.245e+03, percent-clipped=0.0 2024-06-20 02:01:02,091 INFO [train.py:1028] (0/2) Epoch 6, batch 8300, loss[loss=0.3254, simple_loss=0.3444, pruned_loss=0.1532, over 13096.00 frames. ], tot_loss[loss=0.3393, simple_loss=0.3578, pruned_loss=0.1604, over 2579971.24 frames. ], batch size: 103, lr: 9.42e-03, grad_scale: 4.0 2024-06-20 02:01:29,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=108031.0, ans=0.0 2024-06-20 02:01:30,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=108031.0, ans=0.125 2024-06-20 02:01:35,718 INFO [train.py:1028] (0/2) Epoch 6, batch 8350, loss[loss=0.33, simple_loss=0.3482, pruned_loss=0.156, over 13219.00 frames. ], tot_loss[loss=0.3383, simple_loss=0.3574, pruned_loss=0.1596, over 2580955.65 frames. ], batch size: 112, lr: 9.42e-03, grad_scale: 4.0 2024-06-20 02:01:36,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. limit=6.0 2024-06-20 02:01:40,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=108049.33333333333, ans=0.125 2024-06-20 02:01:42,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=108067.66666666667, ans=0.125 2024-06-20 02:01:42,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108067.66666666667, ans=0.125 2024-06-20 02:01:58,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=108104.33333333333, ans=0.0 2024-06-20 02:01:58,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=108104.33333333333, ans=0.95 2024-06-20 02:02:00,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108104.33333333333, ans=0.125 2024-06-20 02:02:04,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108122.66666666667, ans=0.125 2024-06-20 02:02:07,087 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.170e+02 4.932e+02 5.610e+02 6.475e+02 9.613e+02, threshold=1.122e+03, percent-clipped=0.0 2024-06-20 02:02:09,201 INFO [train.py:1028] (0/2) Epoch 6, batch 8400, loss[loss=0.3249, simple_loss=0.3449, pruned_loss=0.1525, over 12983.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3578, pruned_loss=0.1598, over 2578101.57 frames. 
], batch size: 39, lr: 9.41e-03, grad_scale: 8.0 2024-06-20 02:02:13,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=108141.0, ans=0.2 2024-06-20 02:02:14,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. limit=10.0 2024-06-20 02:02:25,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108177.66666666667, ans=0.125 2024-06-20 02:02:33,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=108196.0, ans=0.1 2024-06-20 02:02:44,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=108214.33333333333, ans=0.125 2024-06-20 02:02:45,739 INFO [train.py:1028] (0/2) Epoch 6, batch 8450, loss[loss=0.3206, simple_loss=0.343, pruned_loss=0.1491, over 13166.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3579, pruned_loss=0.1596, over 2580228.11 frames. ], batch size: 112, lr: 9.41e-03, grad_scale: 8.0 2024-06-20 02:02:55,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=108251.0, ans=0.1 2024-06-20 02:03:13,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108287.66666666667, ans=0.1 2024-06-20 02:03:20,868 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.895e+02 5.939e+02 6.806e+02 7.908e+02 1.112e+03, threshold=1.361e+03, percent-clipped=0.0 2024-06-20 02:03:21,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=108306.0, ans=0.025 2024-06-20 02:03:22,853 INFO [train.py:1028] (0/2) Epoch 6, batch 8500, loss[loss=0.3118, simple_loss=0.3413, pruned_loss=0.1411, over 12623.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3582, pruned_loss=0.1596, over 2579627.54 frames. ], batch size: 29, lr: 9.41e-03, grad_scale: 8.0 2024-06-20 02:03:25,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=108324.33333333333, ans=0.125 2024-06-20 02:03:31,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=108342.66666666667, ans=0.2 2024-06-20 02:03:32,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=108342.66666666667, ans=0.5 2024-06-20 02:03:46,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=15.0 2024-06-20 02:03:50,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=108397.66666666667, ans=0.125 2024-06-20 02:03:56,162 INFO [train.py:1028] (0/2) Epoch 6, batch 8550, loss[loss=0.3362, simple_loss=0.3619, pruned_loss=0.1552, over 12591.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3571, pruned_loss=0.1586, over 2578378.40 frames. 
], batch size: 22, lr: 9.40e-03, grad_scale: 8.0 2024-06-20 02:04:08,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=108452.66666666667, ans=0.0 2024-06-20 02:04:11,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=108452.66666666667, ans=0.0 2024-06-20 02:04:14,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108452.66666666667, ans=0.125 2024-06-20 02:04:26,996 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.304e+02 4.970e+02 5.873e+02 6.995e+02 1.256e+03, threshold=1.175e+03, percent-clipped=0.0 2024-06-20 02:04:30,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.40 vs. limit=12.0 2024-06-20 02:04:32,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=108507.66666666667, ans=0.125 2024-06-20 02:04:32,530 INFO [train.py:1028] (0/2) Epoch 6, batch 8600, loss[loss=0.3371, simple_loss=0.3527, pruned_loss=0.1608, over 13163.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3572, pruned_loss=0.1586, over 2575893.92 frames. ], batch size: 112, lr: 9.40e-03, grad_scale: 8.0 2024-06-20 02:04:33,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108507.66666666667, ans=0.125 2024-06-20 02:04:47,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=108544.33333333333, ans=0.025 2024-06-20 02:04:48,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.22 vs. limit=15.0 2024-06-20 02:04:55,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=108562.66666666667, ans=0.125 2024-06-20 02:04:58,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2024-06-20 02:05:08,939 INFO [train.py:1028] (0/2) Epoch 6, batch 8650, loss[loss=0.314, simple_loss=0.3386, pruned_loss=0.1447, over 13045.00 frames. ], tot_loss[loss=0.3366, simple_loss=0.3571, pruned_loss=0.1581, over 2577657.75 frames. ], batch size: 102, lr: 9.39e-03, grad_scale: 8.0 2024-06-20 02:05:18,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=108617.66666666667, ans=0.025 2024-06-20 02:05:22,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.37 vs. limit=15.0 2024-06-20 02:05:22,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=108636.0, ans=0.0 2024-06-20 02:05:24,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=108636.0, ans=0.125 2024-06-20 02:05:25,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.28 vs. 
limit=15.0 2024-06-20 02:05:32,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=108654.33333333333, ans=0.125 2024-06-20 02:05:39,576 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 3.415e+02 4.584e+02 5.124e+02 5.861e+02 9.283e+02, threshold=1.025e+03, percent-clipped=0.0 2024-06-20 02:05:41,814 INFO [train.py:1028] (0/2) Epoch 6, batch 8700, loss[loss=0.3571, simple_loss=0.3772, pruned_loss=0.1686, over 13216.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3578, pruned_loss=0.1591, over 2575122.89 frames. ], batch size: 59, lr: 9.39e-03, grad_scale: 8.0 2024-06-20 02:05:48,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.01 vs. limit=15.0 2024-06-20 02:05:49,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=108709.33333333333, ans=0.02 2024-06-20 02:05:50,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=108709.33333333333, ans=0.05 2024-06-20 02:05:51,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108709.33333333333, ans=0.125 2024-06-20 02:05:57,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.25 vs. limit=15.0 2024-06-20 02:05:59,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=108727.66666666667, ans=0.2 2024-06-20 02:06:13,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=12.0 2024-06-20 02:06:15,342 INFO [train.py:1028] (0/2) Epoch 6, batch 8750, loss[loss=0.3425, simple_loss=0.3606, pruned_loss=0.1622, over 13085.00 frames. ], tot_loss[loss=0.3397, simple_loss=0.3593, pruned_loss=0.16, over 2571747.32 frames. ], batch size: 121, lr: 9.39e-03, grad_scale: 8.0 2024-06-20 02:06:15,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108782.66666666667, ans=0.1 2024-06-20 02:06:32,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=108819.33333333333, ans=0.025 2024-06-20 02:06:40,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=108837.66666666667, ans=0.125 2024-06-20 02:06:43,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=108837.66666666667, ans=0.2 2024-06-20 02:06:54,408 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.692e+02 3.906e+02 4.666e+02 5.529e+02 8.216e+02, threshold=9.332e+02, percent-clipped=0.0 2024-06-20 02:06:54,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=108856.0, ans=0.125 2024-06-20 02:06:56,332 INFO [train.py:1028] (0/2) Epoch 6, batch 8800, loss[loss=0.3138, simple_loss=0.3439, pruned_loss=0.1418, over 13168.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3591, pruned_loss=0.1599, over 2576737.93 frames. 
], batch size: 72, lr: 9.38e-03, grad_scale: 16.0 2024-06-20 02:07:00,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=108874.33333333333, ans=0.0 2024-06-20 02:07:05,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108892.66666666667, ans=0.1 2024-06-20 02:07:12,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=108911.0, ans=0.0 2024-06-20 02:07:17,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.50 vs. limit=15.0 2024-06-20 02:07:22,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=108947.66666666667, ans=0.0 2024-06-20 02:07:24,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=108947.66666666667, ans=0.125 2024-06-20 02:07:25,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2024-06-20 02:07:29,765 INFO [train.py:1028] (0/2) Epoch 6, batch 8850, loss[loss=0.382, simple_loss=0.3832, pruned_loss=0.1904, over 12561.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.359, pruned_loss=0.16, over 2563935.75 frames. ], batch size: 202, lr: 9.38e-03, grad_scale: 16.0 2024-06-20 02:07:30,795 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.03 vs. limit=12.0 2024-06-20 02:07:37,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=108984.33333333333, ans=0.0 2024-06-20 02:07:39,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=108984.33333333333, ans=0.125 2024-06-20 02:07:39,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=108984.33333333333, ans=0.04949747468305833 2024-06-20 02:07:43,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109002.66666666667, ans=0.1 2024-06-20 02:07:46,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109002.66666666667, ans=0.1 2024-06-20 02:07:49,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.83 vs. limit=15.0 2024-06-20 02:07:50,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=109021.0, ans=0.0 2024-06-20 02:08:00,549 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.671e+02 3.803e+02 4.332e+02 4.868e+02 1.024e+03, threshold=8.664e+02, percent-clipped=1.0 2024-06-20 02:08:02,456 INFO [train.py:1028] (0/2) Epoch 6, batch 8900, loss[loss=0.3147, simple_loss=0.352, pruned_loss=0.1387, over 12976.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.3596, pruned_loss=0.1606, over 2561629.42 frames. 
], batch size: 33, lr: 9.37e-03, grad_scale: 16.0 2024-06-20 02:08:03,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=109057.66666666667, ans=0.1 2024-06-20 02:08:06,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=109057.66666666667, ans=0.0 2024-06-20 02:08:13,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=109076.0, ans=0.1 2024-06-20 02:08:17,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=109094.33333333333, ans=0.0 2024-06-20 02:08:28,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=109112.66666666667, ans=0.125 2024-06-20 02:08:31,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=109131.0, ans=0.04949747468305833 2024-06-20 02:08:33,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.39 vs. limit=15.0 2024-06-20 02:08:37,830 INFO [train.py:1028] (0/2) Epoch 6, batch 8950, loss[loss=0.3872, simple_loss=0.3931, pruned_loss=0.1907, over 12523.00 frames. ], tot_loss[loss=0.339, simple_loss=0.359, pruned_loss=0.1595, over 2560990.01 frames. ], batch size: 202, lr: 9.37e-03, grad_scale: 8.0 2024-06-20 02:08:45,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=109149.33333333333, ans=0.0 2024-06-20 02:08:51,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=109167.66666666667, ans=0.2 2024-06-20 02:08:56,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109186.0, ans=0.1 2024-06-20 02:09:00,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=109204.33333333333, ans=0.2 2024-06-20 02:09:06,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.80 vs. limit=15.0 2024-06-20 02:09:09,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=109222.66666666667, ans=10.0 2024-06-20 02:09:12,654 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.779e+02 3.551e+02 3.960e+02 4.545e+02 7.311e+02, threshold=7.920e+02, percent-clipped=0.0 2024-06-20 02:09:13,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=15.0 2024-06-20 02:09:14,082 INFO [train.py:1028] (0/2) Epoch 6, batch 9000, loss[loss=0.3158, simple_loss=0.3441, pruned_loss=0.1437, over 13340.00 frames. ], tot_loss[loss=0.3389, simple_loss=0.3594, pruned_loss=0.1592, over 2567391.59 frames. ], batch size: 46, lr: 9.37e-03, grad_scale: 8.0 2024-06-20 02:09:14,083 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 02:09:21,754 INFO [train.py:1060] (0/2) Epoch 6, validation: loss=0.223, simple_loss=0.2821, pruned_loss=0.08192, over 351949.00 frames. 
2024-06-20 02:09:21,754 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-20 02:09:33,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=109259.33333333333, ans=0.025 2024-06-20 02:09:39,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=109277.66666666667, ans=0.125 2024-06-20 02:09:40,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=109277.66666666667, ans=0.07 2024-06-20 02:09:52,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.23 vs. limit=15.0 2024-06-20 02:09:53,791 INFO [train.py:1028] (0/2) Epoch 6, batch 9050, loss[loss=0.3144, simple_loss=0.336, pruned_loss=0.1464, over 11072.00 frames. ], tot_loss[loss=0.3389, simple_loss=0.3595, pruned_loss=0.1592, over 2566509.65 frames. ], batch size: 16, lr: 9.36e-03, grad_scale: 8.0 2024-06-20 02:09:54,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=109332.66666666667, ans=0.07 2024-06-20 02:09:54,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=109332.66666666667, ans=0.0 2024-06-20 02:09:59,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=109351.0, ans=0.05 2024-06-20 02:10:07,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=109369.33333333333, ans=0.0 2024-06-20 02:10:21,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=22.5 2024-06-20 02:10:24,432 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.085e+02 3.564e+02 4.071e+02 1.282e+03, threshold=7.128e+02, percent-clipped=1.0 2024-06-20 02:10:25,831 INFO [train.py:1028] (0/2) Epoch 6, batch 9100, loss[loss=0.315, simple_loss=0.3498, pruned_loss=0.1401, over 13234.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3584, pruned_loss=0.1582, over 2566569.71 frames. ], batch size: 72, lr: 9.36e-03, grad_scale: 8.0 2024-06-20 02:10:26,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=109424.33333333333, ans=0.2 2024-06-20 02:10:35,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=109442.66666666667, ans=0.025 2024-06-20 02:10:36,955 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.296e+01 2024-06-20 02:10:55,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=109497.66666666667, ans=0.0 2024-06-20 02:10:57,922 INFO [train.py:1028] (0/2) Epoch 6, batch 9150, loss[loss=0.3308, simple_loss=0.3606, pruned_loss=0.1505, over 13092.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3593, pruned_loss=0.1588, over 2569154.57 frames. 
], batch size: 77, lr: 9.35e-03, grad_scale: 8.0 2024-06-20 02:11:04,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=109534.33333333333, ans=0.125 2024-06-20 02:11:22,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.41 vs. limit=15.0 2024-06-20 02:11:27,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=109589.33333333333, ans=0.025 2024-06-20 02:11:31,420 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.582e+02 4.005e+02 4.544e+02 9.909e+02, threshold=8.010e+02, percent-clipped=1.0 2024-06-20 02:11:32,793 INFO [train.py:1028] (0/2) Epoch 6, batch 9200, loss[loss=0.3401, simple_loss=0.3613, pruned_loss=0.1594, over 12968.00 frames. ], tot_loss[loss=0.3365, simple_loss=0.3581, pruned_loss=0.1574, over 2572688.48 frames. ], batch size: 36, lr: 9.35e-03, grad_scale: 16.0 2024-06-20 02:11:43,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=109626.0, ans=0.2 2024-06-20 02:11:44,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=109626.0, ans=0.025 2024-06-20 02:11:47,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.75 vs. limit=22.5 2024-06-20 02:11:52,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0 2024-06-20 02:12:08,404 INFO [train.py:1028] (0/2) Epoch 6, batch 9250, loss[loss=0.335, simple_loss=0.3636, pruned_loss=0.1532, over 13202.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3579, pruned_loss=0.1572, over 2575469.63 frames. ], batch size: 67, lr: 9.35e-03, grad_scale: 16.0 2024-06-20 02:12:12,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=109699.33333333333, ans=0.0 2024-06-20 02:12:13,675 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=9.566e+02 2024-06-20 02:12:14,123 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=15.0 2024-06-20 02:12:26,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=109736.0, ans=0.1 2024-06-20 02:12:36,028 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.40 vs. limit=22.5 2024-06-20 02:12:36,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0 2024-06-20 02:12:39,322 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.310e+02 3.758e+02 4.156e+02 5.924e+02, threshold=7.517e+02, percent-clipped=0.0 2024-06-20 02:12:40,645 INFO [train.py:1028] (0/2) Epoch 6, batch 9300, loss[loss=0.3076, simple_loss=0.3326, pruned_loss=0.1413, over 13190.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3579, pruned_loss=0.1567, over 2572095.61 frames. 
], batch size: 40, lr: 9.34e-03, grad_scale: 16.0 2024-06-20 02:12:42,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.52 vs. limit=10.0 2024-06-20 02:12:42,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=109791.0, ans=0.0 2024-06-20 02:12:42,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=109791.0, ans=0.125 2024-06-20 02:12:45,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=109791.0, ans=0.125 2024-06-20 02:12:45,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=109791.0, ans=0.125 2024-06-20 02:12:57,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109827.66666666667, ans=0.1 2024-06-20 02:13:11,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=109882.66666666667, ans=15.0 2024-06-20 02:13:12,101 INFO [train.py:1028] (0/2) Epoch 6, batch 9350, loss[loss=0.2905, simple_loss=0.3275, pruned_loss=0.1267, over 12585.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3575, pruned_loss=0.1565, over 2569023.58 frames. ], batch size: 22, lr: 9.34e-03, grad_scale: 8.0 2024-06-20 02:13:13,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=109882.66666666667, ans=0.0 2024-06-20 02:13:16,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=109882.66666666667, ans=0.0 2024-06-20 02:13:20,092 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.09 vs. limit=15.0 2024-06-20 02:13:20,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=109901.0, ans=0.0 2024-06-20 02:13:21,227 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.382e+00 2024-06-20 02:13:32,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=109937.66666666667, ans=0.2 2024-06-20 02:13:34,270 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.10 vs. limit=6.0 2024-06-20 02:13:39,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=109956.0, ans=0.2 2024-06-20 02:13:40,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=109956.0, ans=0.025 2024-06-20 02:13:42,406 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 4.042e+02 4.663e+02 5.220e+02 7.888e+02, threshold=9.325e+02, percent-clipped=1.0 2024-06-20 02:13:43,071 INFO [train.py:1028] (0/2) Epoch 6, batch 9400, loss[loss=0.3371, simple_loss=0.3598, pruned_loss=0.1573, over 13345.00 frames. ], tot_loss[loss=0.3369, simple_loss=0.3586, pruned_loss=0.1576, over 2568918.40 frames. 
], batch size: 52, lr: 9.34e-03, grad_scale: 8.0 2024-06-20 02:13:51,299 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-60000.pt 2024-06-20 02:14:00,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=8.0 2024-06-20 02:14:02,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=110011.0, ans=0.2 2024-06-20 02:14:02,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=110011.0, ans=0.0 2024-06-20 02:14:07,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=110029.33333333333, ans=0.125 2024-06-20 02:14:08,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=110029.33333333333, ans=0.125 2024-06-20 02:14:12,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.56 vs. limit=15.0 2024-06-20 02:14:16,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.61 vs. limit=10.0 2024-06-20 02:14:17,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=110047.66666666667, ans=0.125 2024-06-20 02:14:18,944 INFO [train.py:1028] (0/2) Epoch 6, batch 9450, loss[loss=0.3347, simple_loss=0.3592, pruned_loss=0.1552, over 12391.00 frames. ], tot_loss[loss=0.3375, simple_loss=0.3589, pruned_loss=0.1581, over 2568009.53 frames. ], batch size: 22, lr: 9.33e-03, grad_scale: 8.0 2024-06-20 02:14:24,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=110066.0, ans=0.0 2024-06-20 02:14:25,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=110084.33333333333, ans=0.2 2024-06-20 02:14:28,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.64 vs. limit=15.0 2024-06-20 02:14:35,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0 2024-06-20 02:14:37,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110121.0, ans=0.125 2024-06-20 02:14:41,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=110121.0, ans=0.125 2024-06-20 02:14:46,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110139.33333333333, ans=0.1 2024-06-20 02:14:46,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.17 vs. 
limit=10.0 2024-06-20 02:14:48,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=110139.33333333333, ans=0.125 2024-06-20 02:14:48,774 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.528e+02 3.809e+02 4.296e+02 5.234e+02 7.435e+02, threshold=8.591e+02, percent-clipped=0.0 2024-06-20 02:14:49,660 INFO [train.py:1028] (0/2) Epoch 6, batch 9500, loss[loss=0.313, simple_loss=0.344, pruned_loss=0.141, over 13266.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3576, pruned_loss=0.1569, over 2577556.30 frames. ], batch size: 43, lr: 9.33e-03, grad_scale: 8.0 2024-06-20 02:15:03,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=110176.0, ans=0.2 2024-06-20 02:15:10,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110194.33333333333, ans=0.0 2024-06-20 02:15:21,021 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.40 vs. limit=12.0 2024-06-20 02:15:24,918 INFO [train.py:1028] (0/2) Epoch 6, batch 9550, loss[loss=0.3236, simple_loss=0.351, pruned_loss=0.1481, over 12894.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3569, pruned_loss=0.1565, over 2572148.14 frames. ], batch size: 39, lr: 9.32e-03, grad_scale: 8.0 2024-06-20 02:15:30,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=110249.33333333333, ans=0.2 2024-06-20 02:15:39,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110286.0, ans=0.1 2024-06-20 02:15:42,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=110286.0, ans=0.125 2024-06-20 02:15:50,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=110322.66666666667, ans=0.125 2024-06-20 02:15:50,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=110322.66666666667, ans=0.025 2024-06-20 02:15:50,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=110322.66666666667, ans=0.125 2024-06-20 02:15:55,689 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.605e+02 4.134e+02 4.891e+02 1.184e+03, threshold=8.268e+02, percent-clipped=1.0 2024-06-20 02:15:56,432 INFO [train.py:1028] (0/2) Epoch 6, batch 9600, loss[loss=0.3364, simple_loss=0.342, pruned_loss=0.1654, over 10495.00 frames. ], tot_loss[loss=0.3333, simple_loss=0.3555, pruned_loss=0.1556, over 2569875.84 frames. 
], batch size: 304, lr: 9.32e-03, grad_scale: 16.0 2024-06-20 02:15:58,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=110341.0, ans=0.125 2024-06-20 02:16:02,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=110359.33333333333, ans=0.125 2024-06-20 02:16:11,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=110377.66666666667, ans=0.125 2024-06-20 02:16:17,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=110396.0, ans=0.125 2024-06-20 02:16:21,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110414.33333333333, ans=0.1 2024-06-20 02:16:27,779 INFO [train.py:1028] (0/2) Epoch 6, batch 9650, loss[loss=0.3251, simple_loss=0.3427, pruned_loss=0.1537, over 13052.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3558, pruned_loss=0.1566, over 2561089.01 frames. ], batch size: 132, lr: 9.32e-03, grad_scale: 16.0 2024-06-20 02:16:36,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=110451.0, ans=0.035 2024-06-20 02:16:40,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=110469.33333333333, ans=0.025 2024-06-20 02:16:59,496 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.340e+02 3.943e+02 4.395e+02 7.325e+02, threshold=7.886e+02, percent-clipped=0.0 2024-06-20 02:17:00,239 INFO [train.py:1028] (0/2) Epoch 6, batch 9700, loss[loss=0.3324, simple_loss=0.3583, pruned_loss=0.1532, over 13097.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3553, pruned_loss=0.1568, over 2556130.33 frames. ], batch size: 144, lr: 9.31e-03, grad_scale: 16.0 2024-06-20 02:17:02,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=110524.33333333333, ans=0.2 2024-06-20 02:17:17,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=110561.0, ans=0.035 2024-06-20 02:17:24,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=110579.33333333333, ans=0.1 2024-06-20 02:17:30,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110597.66666666667, ans=0.0 2024-06-20 02:17:32,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110616.0, ans=0.125 2024-06-20 02:17:32,731 INFO [train.py:1028] (0/2) Epoch 6, batch 9750, loss[loss=0.3168, simple_loss=0.327, pruned_loss=0.1533, over 13091.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3537, pruned_loss=0.1554, over 2552947.99 frames. 
], batch size: 132, lr: 9.31e-03, grad_scale: 16.0 2024-06-20 02:17:40,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=110634.33333333333, ans=0.05 2024-06-20 02:17:43,152 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.632e+02 2024-06-20 02:17:43,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110634.33333333333, ans=0.1 2024-06-20 02:17:50,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.65 vs. limit=10.0 2024-06-20 02:17:52,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=110671.0, ans=0.125 2024-06-20 02:18:02,933 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.463e+02 3.995e+02 4.578e+02 6.727e+02, threshold=7.990e+02, percent-clipped=0.0 2024-06-20 02:18:02,961 INFO [train.py:1028] (0/2) Epoch 6, batch 9800, loss[loss=0.304, simple_loss=0.3255, pruned_loss=0.1412, over 12872.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3536, pruned_loss=0.1554, over 2545962.05 frames. ], batch size: 39, lr: 9.31e-03, grad_scale: 8.0 2024-06-20 02:18:07,028 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2024-06-20 02:18:07,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110707.66666666667, ans=0.125 2024-06-20 02:18:08,689 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.28 vs. limit=15.0 2024-06-20 02:18:10,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110726.0, ans=0.0 2024-06-20 02:18:17,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=110744.33333333333, ans=0.0 2024-06-20 02:18:21,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=110762.66666666667, ans=0.0 2024-06-20 02:18:22,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=110762.66666666667, ans=0.125 2024-06-20 02:18:25,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.95 vs. limit=10.0 2024-06-20 02:18:30,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=110781.0, ans=0.1 2024-06-20 02:18:33,750 INFO [train.py:1028] (0/2) Epoch 6, batch 9850, loss[loss=0.3071, simple_loss=0.3321, pruned_loss=0.141, over 13064.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3521, pruned_loss=0.1542, over 2538390.70 frames. ], batch size: 102, lr: 9.30e-03, grad_scale: 8.0 2024-06-20 02:18:36,438 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.39 vs. 
limit=15.0 2024-06-20 02:18:38,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=110799.33333333333, ans=0.125 2024-06-20 02:18:39,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.60 vs. limit=15.0 2024-06-20 02:18:53,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.88 vs. limit=15.0 2024-06-20 02:19:06,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.882e+02 3.754e+02 4.247e+02 4.914e+02 1.020e+03, threshold=8.494e+02, percent-clipped=2.0 2024-06-20 02:19:06,731 INFO [train.py:1028] (0/2) Epoch 6, batch 9900, loss[loss=0.3167, simple_loss=0.3444, pruned_loss=0.1445, over 12950.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3515, pruned_loss=0.1542, over 2531861.71 frames. ], batch size: 39, lr: 9.30e-03, grad_scale: 8.0 2024-06-20 02:19:17,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110909.33333333333, ans=0.125 2024-06-20 02:19:22,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=110927.66666666667, ans=0.025 2024-06-20 02:19:24,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=110946.0, ans=0.0 2024-06-20 02:19:28,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=110946.0, ans=0.0 2024-06-20 02:19:32,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110964.33333333333, ans=0.1 2024-06-20 02:19:32,883 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:19:35,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=15.0 2024-06-20 02:19:36,814 INFO [train.py:1028] (0/2) Epoch 6, batch 9950, loss[loss=0.3513, simple_loss=0.3728, pruned_loss=0.1649, over 12555.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3511, pruned_loss=0.1544, over 2526171.75 frames. ], batch size: 29, lr: 9.29e-03, grad_scale: 8.0 2024-06-20 02:19:52,049 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. 
limit=6.0 2024-06-20 02:19:56,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=111037.66666666667, ans=0.0 2024-06-20 02:19:59,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=111037.66666666667, ans=0.0 2024-06-20 02:20:07,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=111056.0, ans=0.125 2024-06-20 02:20:10,180 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.724e+02 3.391e+02 3.752e+02 4.643e+02 7.647e+02, threshold=7.505e+02, percent-clipped=0.0 2024-06-20 02:20:10,208 INFO [train.py:1028] (0/2) Epoch 6, batch 10000, loss[loss=0.2964, simple_loss=0.317, pruned_loss=0.1379, over 12438.00 frames. ], tot_loss[loss=0.3308, simple_loss=0.3516, pruned_loss=0.155, over 2486830.80 frames. ], batch size: 22, lr: 9.29e-03, grad_scale: 16.0 2024-06-20 02:20:26,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=111111.0, ans=0.125 2024-06-20 02:20:30,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=111129.33333333333, ans=0.125 2024-06-20 02:20:36,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=111147.66666666667, ans=0.05 2024-06-20 02:20:41,876 INFO [train.py:1028] (0/2) Epoch 6, batch 10050, loss[loss=0.299, simple_loss=0.3461, pruned_loss=0.126, over 12628.00 frames. ], tot_loss[loss=0.3329, simple_loss=0.3527, pruned_loss=0.1566, over 2444184.33 frames. ], batch size: 22, lr: 9.29e-03, grad_scale: 16.0 2024-06-20 02:20:45,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=111166.0, ans=0.125 2024-06-20 02:20:54,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.89 vs. limit=15.0 2024-06-20 02:20:56,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.45 vs. limit=22.5 2024-06-20 02:20:58,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=111221.0, ans=0.125 2024-06-20 02:21:11,718 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.198e+02 3.690e+02 4.365e+02 5.854e+02, threshold=7.381e+02, percent-clipped=0.0 2024-06-20 02:21:11,748 INFO [train.py:1028] (0/2) Epoch 6, batch 10100, loss[loss=0.3347, simple_loss=0.3348, pruned_loss=0.1673, over 11675.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3502, pruned_loss=0.1543, over 2423146.65 frames. ], batch size: 16, lr: 9.28e-03, grad_scale: 16.0 2024-06-20 02:21:19,712 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=21.19 vs. limit=15.0 2024-06-20 02:21:25,126 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-6.pt 2024-06-20 02:23:26,012 INFO [train.py:1028] (0/2) Epoch 7, batch 0, loss[loss=0.275, simple_loss=0.3039, pruned_loss=0.123, over 12952.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3039, pruned_loss=0.123, over 12952.00 frames. 
], batch size: 36, lr: 8.70e-03, grad_scale: 32.0 2024-06-20 02:23:26,013 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 02:23:33,070 INFO [train.py:1060] (0/2) Epoch 7, validation: loss=0.2255, simple_loss=0.2848, pruned_loss=0.08313, over 351949.00 frames. 2024-06-20 02:23:33,070 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16816MB 2024-06-20 02:23:34,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=111288.83333333333, ans=0.125 2024-06-20 02:23:38,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0 2024-06-20 02:23:51,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=111325.5, ans=0.025 2024-06-20 02:24:04,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=111362.16666666667, ans=0.125 2024-06-20 02:24:06,394 INFO [train.py:1028] (0/2) Epoch 7, batch 50, loss[loss=0.2792, simple_loss=0.3106, pruned_loss=0.1239, over 12628.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3315, pruned_loss=0.1456, over 574339.50 frames. ], batch size: 29, lr: 8.70e-03, grad_scale: 32.0 2024-06-20 02:24:26,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.11 vs. limit=15.0 2024-06-20 02:24:30,719 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.573e+02 3.685e+02 4.120e+02 4.852e+02 6.556e+02, threshold=8.240e+02, percent-clipped=0.0 2024-06-20 02:24:30,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=111435.5, ans=0.0 2024-06-20 02:24:32,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.86 vs. limit=6.0 2024-06-20 02:24:43,949 INFO [train.py:1028] (0/2) Epoch 7, batch 100, loss[loss=0.2806, simple_loss=0.3128, pruned_loss=0.1242, over 13276.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.328, pruned_loss=0.1428, over 1016933.97 frames. ], batch size: 46, lr: 8.69e-03, grad_scale: 16.0 2024-06-20 02:24:47,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=111472.16666666667, ans=0.025 2024-06-20 02:24:51,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=111490.5, ans=0.125 2024-06-20 02:24:59,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111508.83333333333, ans=0.125 2024-06-20 02:25:00,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.89 vs. limit=15.0 2024-06-20 02:25:02,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111527.16666666667, ans=0.1 2024-06-20 02:25:03,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.40 vs. 
limit=22.5 2024-06-20 02:25:11,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.14 vs. limit=22.5 2024-06-20 02:25:11,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=111545.5, ans=0.09899494936611666 2024-06-20 02:25:15,667 INFO [train.py:1028] (0/2) Epoch 7, batch 150, loss[loss=0.3163, simple_loss=0.3412, pruned_loss=0.1457, over 12563.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3258, pruned_loss=0.1399, over 1364123.95 frames. ], batch size: 29, lr: 8.69e-03, grad_scale: 16.0 2024-06-20 02:25:24,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=111582.16666666667, ans=0.0 2024-06-20 02:25:33,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=111600.5, ans=0.125 2024-06-20 02:25:36,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111618.83333333333, ans=0.1 2024-06-20 02:25:37,308 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.375e+02 3.884e+02 4.529e+02 7.630e+02, threshold=7.768e+02, percent-clipped=0.0 2024-06-20 02:25:46,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=111637.16666666667, ans=0.125 2024-06-20 02:25:47,226 INFO [train.py:1028] (0/2) Epoch 7, batch 200, loss[loss=0.3571, simple_loss=0.3571, pruned_loss=0.1785, over 12541.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3265, pruned_loss=0.1398, over 1634308.84 frames. ], batch size: 202, lr: 8.69e-03, grad_scale: 16.0 2024-06-20 02:25:47,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=111655.5, ans=0.025 2024-06-20 02:25:52,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=111655.5, ans=0.0 2024-06-20 02:26:03,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=111692.16666666667, ans=10.0 2024-06-20 02:26:04,431 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2024-06-20 02:26:06,391 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.46 vs. limit=15.0 2024-06-20 02:26:09,791 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.58 vs. limit=22.5 2024-06-20 02:26:16,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.26 vs. limit=15.0 2024-06-20 02:26:22,916 INFO [train.py:1028] (0/2) Epoch 7, batch 250, loss[loss=0.284, simple_loss=0.2952, pruned_loss=0.1364, over 13027.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3254, pruned_loss=0.1388, over 1845665.22 frames. 
], batch size: 144, lr: 8.68e-03, grad_scale: 16.0 2024-06-20 02:26:25,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=111747.16666666667, ans=0.0 2024-06-20 02:26:27,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=111747.16666666667, ans=0.125 2024-06-20 02:26:33,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=111765.5, ans=0.125 2024-06-20 02:26:39,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.35 vs. limit=22.5 2024-06-20 02:26:47,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=111802.16666666667, ans=0.125 2024-06-20 02:26:48,007 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.525e+02 3.839e+02 4.348e+02 5.982e+02, threshold=7.677e+02, percent-clipped=0.0 2024-06-20 02:26:48,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=111802.16666666667, ans=0.125 2024-06-20 02:26:49,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=111802.16666666667, ans=0.2 2024-06-20 02:26:53,328 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:26:55,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=111820.5, ans=0.0 2024-06-20 02:26:58,402 INFO [train.py:1028] (0/2) Epoch 7, batch 300, loss[loss=0.2891, simple_loss=0.31, pruned_loss=0.1341, over 13171.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3256, pruned_loss=0.1387, over 2009072.92 frames. ], batch size: 112, lr: 8.68e-03, grad_scale: 16.0 2024-06-20 02:27:01,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=15.0 2024-06-20 02:27:25,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-20 02:27:29,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.35 vs. limit=22.5 2024-06-20 02:27:30,169 INFO [train.py:1028] (0/2) Epoch 7, batch 350, loss[loss=0.2904, simple_loss=0.3228, pruned_loss=0.129, over 12837.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3238, pruned_loss=0.1373, over 2137952.31 frames. 
], batch size: 33, lr: 8.68e-03, grad_scale: 16.0 2024-06-20 02:27:32,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=111930.5, ans=0.125 2024-06-20 02:27:51,840 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.164e+02 3.644e+02 4.068e+02 7.617e+02, threshold=7.289e+02, percent-clipped=0.0 2024-06-20 02:27:59,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=112003.83333333333, ans=0.0 2024-06-20 02:28:02,009 INFO [train.py:1028] (0/2) Epoch 7, batch 400, loss[loss=0.3024, simple_loss=0.3361, pruned_loss=0.1344, over 13271.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3245, pruned_loss=0.1375, over 2238915.92 frames. ], batch size: 63, lr: 8.67e-03, grad_scale: 32.0 2024-06-20 02:28:06,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=112022.16666666667, ans=0.125 2024-06-20 02:28:14,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=112040.5, ans=0.025 2024-06-20 02:28:16,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=112040.5, ans=0.2 2024-06-20 02:28:16,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=112040.5, ans=0.0 2024-06-20 02:28:22,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=112058.83333333333, ans=0.05 2024-06-20 02:28:29,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=112077.16666666667, ans=0.2 2024-06-20 02:28:36,596 INFO [train.py:1028] (0/2) Epoch 7, batch 450, loss[loss=0.2732, simple_loss=0.3135, pruned_loss=0.1165, over 13255.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3245, pruned_loss=0.1368, over 2314402.18 frames. ], batch size: 67, lr: 8.67e-03, grad_scale: 32.0 2024-06-20 02:28:38,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=112113.83333333333, ans=0.0 2024-06-20 02:28:38,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.95 vs. 
limit=15.0 2024-06-20 02:28:39,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=112113.83333333333, ans=0.125 2024-06-20 02:28:42,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=112132.16666666667, ans=0.125 2024-06-20 02:28:50,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=112132.16666666667, ans=0.0 2024-06-20 02:28:50,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=112132.16666666667, ans=0.09899494936611666 2024-06-20 02:28:51,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=112132.16666666667, ans=0.0 2024-06-20 02:28:51,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=112132.16666666667, ans=0.125 2024-06-20 02:28:58,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=112168.83333333333, ans=0.0 2024-06-20 02:28:59,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=112168.83333333333, ans=0.125 2024-06-20 02:29:02,420 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.014e+02 3.303e+02 3.735e+02 5.209e+02, threshold=6.606e+02, percent-clipped=0.0 2024-06-20 02:29:05,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=112187.16666666667, ans=0.125 2024-06-20 02:29:12,047 INFO [train.py:1028] (0/2) Epoch 7, batch 500, loss[loss=0.2549, simple_loss=0.2872, pruned_loss=0.1112, over 13124.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3249, pruned_loss=0.1369, over 2376140.02 frames. ], batch size: 121, lr: 8.66e-03, grad_scale: 16.0 2024-06-20 02:29:13,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=112205.5, ans=0.2 2024-06-20 02:29:18,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=112223.83333333333, ans=0.1 2024-06-20 02:29:28,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.84 vs. limit=22.5 2024-06-20 02:29:29,379 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.31 vs. limit=10.0 2024-06-20 02:29:32,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.83 vs. limit=15.0 2024-06-20 02:29:33,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=112260.5, ans=0.2 2024-06-20 02:29:37,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.99 vs. 
limit=22.5 2024-06-20 02:29:41,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=112278.83333333333, ans=0.125 2024-06-20 02:29:43,379 INFO [train.py:1028] (0/2) Epoch 7, batch 550, loss[loss=0.2953, simple_loss=0.315, pruned_loss=0.1378, over 12930.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3235, pruned_loss=0.1364, over 2421352.11 frames. ], batch size: 158, lr: 8.66e-03, grad_scale: 16.0 2024-06-20 02:29:56,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=112333.83333333333, ans=0.125 2024-06-20 02:29:59,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2024-06-20 02:30:00,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=112333.83333333333, ans=0.125 2024-06-20 02:30:05,442 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 3.220e+02 3.613e+02 4.202e+02 5.786e+02, threshold=7.226e+02, percent-clipped=0.0 2024-06-20 02:30:06,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=112352.16666666667, ans=0.0 2024-06-20 02:30:06,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=112352.16666666667, ans=0.125 2024-06-20 02:30:09,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112370.5, ans=0.0 2024-06-20 02:30:11,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112370.5, ans=0.1 2024-06-20 02:30:13,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=112370.5, ans=0.125 2024-06-20 02:30:14,867 INFO [train.py:1028] (0/2) Epoch 7, batch 600, loss[loss=0.2841, simple_loss=0.3062, pruned_loss=0.131, over 13052.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3239, pruned_loss=0.1366, over 2458844.00 frames. ], batch size: 144, lr: 8.66e-03, grad_scale: 16.0 2024-06-20 02:30:38,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=112443.83333333333, ans=0.0 2024-06-20 02:30:51,494 INFO [train.py:1028] (0/2) Epoch 7, batch 650, loss[loss=0.2764, simple_loss=0.309, pruned_loss=0.1219, over 13146.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3235, pruned_loss=0.1358, over 2490713.87 frames. ], batch size: 59, lr: 8.65e-03, grad_scale: 16.0 2024-06-20 02:30:59,375 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.52 vs. limit=22.5 2024-06-20 02:31:13,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.02 vs. limit=15.0 2024-06-20 02:31:19,300 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.165e+02 3.568e+02 3.998e+02 5.322e+02, threshold=7.136e+02, percent-clipped=0.0 2024-06-20 02:31:28,901 INFO [train.py:1028] (0/2) Epoch 7, batch 700, loss[loss=0.3001, simple_loss=0.3197, pruned_loss=0.1402, over 13324.00 frames. 
], tot_loss[loss=0.2968, simple_loss=0.3225, pruned_loss=0.1356, over 2513131.88 frames. ], batch size: 46, lr: 8.65e-03, grad_scale: 16.0 2024-06-20 02:31:29,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=112572.16666666667, ans=0.125 2024-06-20 02:31:46,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=112608.83333333333, ans=0.0 2024-06-20 02:31:57,344 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:31:57,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.42 vs. limit=15.0 2024-06-20 02:32:00,865 INFO [train.py:1028] (0/2) Epoch 7, batch 750, loss[loss=0.2525, simple_loss=0.2944, pruned_loss=0.1052, over 13264.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3226, pruned_loss=0.1353, over 2528368.58 frames. ], batch size: 63, lr: 8.65e-03, grad_scale: 16.0 2024-06-20 02:32:02,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=112663.83333333333, ans=0.5 2024-06-20 02:32:15,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2024-06-20 02:32:17,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=112700.5, ans=0.025 2024-06-20 02:32:20,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=112718.83333333333, ans=0.125 2024-06-20 02:32:23,189 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.459e+02 3.159e+02 3.471e+02 4.010e+02 6.766e+02, threshold=6.943e+02, percent-clipped=0.0 2024-06-20 02:32:23,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=112718.83333333333, ans=0.2 2024-06-20 02:32:35,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=112755.5, ans=0.2 2024-06-20 02:32:35,774 INFO [train.py:1028] (0/2) Epoch 7, batch 800, loss[loss=0.2508, simple_loss=0.2916, pruned_loss=0.105, over 13019.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3227, pruned_loss=0.1352, over 2541895.84 frames. ], batch size: 36, lr: 8.64e-03, grad_scale: 32.0 2024-06-20 02:32:40,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=112755.5, ans=0.2 2024-06-20 02:32:44,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=112773.83333333333, ans=0.025 2024-06-20 02:32:45,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112773.83333333333, ans=0.1 2024-06-20 02:32:50,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.02 vs. 
limit=15.0 2024-06-20 02:32:52,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=112792.16666666667, ans=0.125 2024-06-20 02:32:54,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=112810.5, ans=0.125 2024-06-20 02:33:08,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=112828.83333333333, ans=0.125 2024-06-20 02:33:10,601 INFO [train.py:1028] (0/2) Epoch 7, batch 850, loss[loss=0.3271, simple_loss=0.3503, pruned_loss=0.1519, over 13102.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3224, pruned_loss=0.135, over 2552477.82 frames. ], batch size: 95, lr: 8.64e-03, grad_scale: 32.0 2024-06-20 02:33:13,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=112847.16666666667, ans=0.0 2024-06-20 02:33:17,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=112865.5, ans=0.2 2024-06-20 02:33:18,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=112865.5, ans=0.1 2024-06-20 02:33:21,468 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.50 vs. limit=15.0 2024-06-20 02:33:27,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=112883.83333333333, ans=0.1 2024-06-20 02:33:29,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=112902.16666666667, ans=0.125 2024-06-20 02:33:33,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=112902.16666666667, ans=0.2 2024-06-20 02:33:33,530 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.292e+02 3.077e+02 3.365e+02 3.860e+02 6.017e+02, threshold=6.730e+02, percent-clipped=0.0 2024-06-20 02:33:42,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=112938.83333333333, ans=0.0 2024-06-20 02:33:43,174 INFO [train.py:1028] (0/2) Epoch 7, batch 900, loss[loss=0.267, simple_loss=0.2992, pruned_loss=0.1174, over 13003.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3219, pruned_loss=0.135, over 2557634.00 frames. ], batch size: 36, lr: 8.64e-03, grad_scale: 16.0 2024-06-20 02:33:49,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.32 vs. 
limit=15.0 2024-06-20 02:33:54,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=112957.16666666667, ans=0.0 2024-06-20 02:33:54,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=112957.16666666667, ans=0.025 2024-06-20 02:34:02,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=112993.83333333333, ans=0.05 2024-06-20 02:34:03,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.66 vs. limit=10.0 2024-06-20 02:34:05,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=112993.83333333333, ans=0.125 2024-06-20 02:34:07,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=112993.83333333333, ans=0.0 2024-06-20 02:34:15,783 INFO [train.py:1028] (0/2) Epoch 7, batch 950, loss[loss=0.3119, simple_loss=0.3363, pruned_loss=0.1437, over 12916.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.323, pruned_loss=0.1357, over 2560470.44 frames. ], batch size: 39, lr: 8.63e-03, grad_scale: 16.0 2024-06-20 02:34:42,586 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.502e+02 3.293e+02 3.623e+02 4.262e+02 9.004e+02, threshold=7.247e+02, percent-clipped=2.0 2024-06-20 02:34:45,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=113103.83333333333, ans=0.125 2024-06-20 02:34:51,571 INFO [train.py:1028] (0/2) Epoch 7, batch 1000, loss[loss=0.2927, simple_loss=0.3244, pruned_loss=0.1305, over 13283.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3223, pruned_loss=0.1354, over 2562552.42 frames. ], batch size: 49, lr: 8.63e-03, grad_scale: 16.0 2024-06-20 02:34:52,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=113122.16666666667, ans=0.125 2024-06-20 02:35:01,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=113140.5, ans=0.0 2024-06-20 02:35:05,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=12.0 2024-06-20 02:35:11,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=113158.83333333333, ans=0.2 2024-06-20 02:35:17,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.84 vs. limit=15.0 2024-06-20 02:35:19,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=113177.16666666667, ans=0.125 2024-06-20 02:35:27,035 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.99 vs. limit=15.0 2024-06-20 02:35:28,722 INFO [train.py:1028] (0/2) Epoch 7, batch 1050, loss[loss=0.2688, simple_loss=0.3062, pruned_loss=0.1157, over 13158.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3225, pruned_loss=0.1353, over 2564310.76 frames. 
], batch size: 77, lr: 8.63e-03, grad_scale: 16.0 2024-06-20 02:35:29,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=113213.83333333333, ans=15.0 2024-06-20 02:35:35,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2024-06-20 02:35:40,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=113232.16666666667, ans=0.125 2024-06-20 02:35:40,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.93 vs. limit=22.5 2024-06-20 02:35:41,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=113250.5, ans=0.2 2024-06-20 02:35:51,957 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.293e+02 3.684e+02 4.272e+02 6.892e+02, threshold=7.367e+02, percent-clipped=0.0 2024-06-20 02:35:56,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=113287.16666666667, ans=0.0 2024-06-20 02:35:58,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=113287.16666666667, ans=0.0 2024-06-20 02:36:01,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=113305.5, ans=0.0 2024-06-20 02:36:02,054 INFO [train.py:1028] (0/2) Epoch 7, batch 1100, loss[loss=0.2786, simple_loss=0.3136, pruned_loss=0.1218, over 13256.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3227, pruned_loss=0.135, over 2569848.79 frames. ], batch size: 52, lr: 8.62e-03, grad_scale: 8.0 2024-06-20 02:36:02,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=113305.5, ans=0.125 2024-06-20 02:36:04,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113305.5, ans=0.1 2024-06-20 02:36:04,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=113305.5, ans=0.2 2024-06-20 02:36:19,328 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=15.0 2024-06-20 02:36:20,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=113342.16666666667, ans=0.035 2024-06-20 02:36:27,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2024-06-20 02:36:32,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=113378.83333333333, ans=0.0 2024-06-20 02:36:35,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=113378.83333333333, ans=0.0 2024-06-20 02:36:37,389 INFO [train.py:1028] (0/2) Epoch 7, batch 1150, loss[loss=0.288, simple_loss=0.3203, pruned_loss=0.1279, over 13205.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3232, pruned_loss=0.1354, over 2571832.63 frames. 
], batch size: 52, lr: 8.62e-03, grad_scale: 8.0 2024-06-20 02:36:41,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=113397.16666666667, ans=0.05 2024-06-20 02:36:45,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=113415.5, ans=0.0 2024-06-20 02:36:47,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=113415.5, ans=0.125 2024-06-20 02:36:49,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=113415.5, ans=0.125 2024-06-20 02:36:49,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=113415.5, ans=0.2 2024-06-20 02:36:59,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.66 vs. limit=15.0 2024-06-20 02:37:04,125 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.082e+02 3.526e+02 4.163e+02 7.819e+02, threshold=7.052e+02, percent-clipped=1.0 2024-06-20 02:37:06,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=113470.5, ans=0.0 2024-06-20 02:37:11,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=113470.5, ans=0.125 2024-06-20 02:37:12,542 INFO [train.py:1028] (0/2) Epoch 7, batch 1200, loss[loss=0.2712, simple_loss=0.3028, pruned_loss=0.1198, over 13164.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.323, pruned_loss=0.1357, over 2574326.46 frames. ], batch size: 77, lr: 8.62e-03, grad_scale: 16.0 2024-06-20 02:37:14,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.50 vs. limit=15.0 2024-06-20 02:37:24,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.66 vs. limit=15.0 2024-06-20 02:37:36,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113543.83333333333, ans=0.125 2024-06-20 02:37:42,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=113562.16666666667, ans=22.5 2024-06-20 02:37:44,934 INFO [train.py:1028] (0/2) Epoch 7, batch 1250, loss[loss=0.2627, simple_loss=0.291, pruned_loss=0.1172, over 13167.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3218, pruned_loss=0.1347, over 2583783.65 frames. ], batch size: 112, lr: 8.61e-03, grad_scale: 16.0 2024-06-20 02:37:50,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=113580.5, ans=0.04949747468305833 2024-06-20 02:37:56,273 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:38:06,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=113635.5, ans=0.0 2024-06-20 02:38:08,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.55 vs. 
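
Most scaling.py:214 lines above are ScheduledFloat reads: regularization constants such as conv_skip_rate, dropout_p and balancer probs are looked up as a function of the global batch_count, which is why the same parameter name prints different ans values as training advances (the skip rates have already annealed to 0.0 by batch_count ~113k). A hedged sketch of such a schedule as piecewise-linear interpolation over (batch_count, value) breakpoints; the breakpoints below are invented for illustration, not the recipe's actual numbers:

def scheduled_float(batch_count: float, points) -> float:
    """Piecewise-linear schedule over sorted (batch_count, value) breakpoints."""
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
    return points[-1][1]

# Illustrative skip rate that starts at 0.2 and anneals to 0.0:
schedule = [(0.0, 0.2), (4000.0, 0.05), (16000.0, 0.0)]
print(scheduled_float(2000.0, schedule))     # 0.125, mid-anneal
print(scheduled_float(113232.0, schedule))   # 0.0, as in the entries above
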
limit=15.0 2024-06-20 02:38:09,277 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.036e+02 3.306e+02 3.625e+02 5.609e+02, threshold=6.611e+02, percent-clipped=0.0 2024-06-20 02:38:14,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113653.83333333333, ans=0.1 2024-06-20 02:38:17,654 INFO [train.py:1028] (0/2) Epoch 7, batch 1300, loss[loss=0.3135, simple_loss=0.3282, pruned_loss=0.1494, over 12722.00 frames. ], tot_loss[loss=0.296, simple_loss=0.322, pruned_loss=0.135, over 2584495.73 frames. ], batch size: 176, lr: 8.61e-03, grad_scale: 8.0 2024-06-20 02:38:20,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=113672.16666666667, ans=0.2 2024-06-20 02:38:23,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=113690.5, ans=0.0 2024-06-20 02:38:36,440 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-20 02:38:45,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=113727.16666666667, ans=0.0 2024-06-20 02:38:46,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=113727.16666666667, ans=0.0 2024-06-20 02:38:47,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=113745.5, ans=0.125 2024-06-20 02:38:51,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113745.5, ans=0.125 2024-06-20 02:38:54,164 INFO [train.py:1028] (0/2) Epoch 7, batch 1350, loss[loss=0.2838, simple_loss=0.3171, pruned_loss=0.1252, over 13234.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3224, pruned_loss=0.1352, over 2586513.92 frames. ], batch size: 59, lr: 8.61e-03, grad_scale: 4.0 2024-06-20 02:39:00,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113782.16666666667, ans=0.1 2024-06-20 02:39:03,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=113782.16666666667, ans=0.0 2024-06-20 02:39:04,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 02:39:15,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=113800.5, ans=0.125 2024-06-20 02:39:15,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=113800.5, ans=0.125 2024-06-20 02:39:22,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=113818.83333333333, ans=0.0 2024-06-20 02:39:25,502 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.135e+02 3.601e+02 4.113e+02 9.413e+02, threshold=7.203e+02, percent-clipped=3.0 2024-06-20 02:39:25,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.59 vs. 
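
The scaling.py:1023 "Whitening: ... metric=M vs. limit=L" lines fire when an activation's channel covariance drifts too far from white; the logged metrics sit between roughly 1 and the channel count, against per-module limits such as 6.0, 15.0 and 22.5. The statistic scaling.py actually computes is not shown in the log, so the eigenvalue-dispersion measure below is an assumption that merely matches those ranges: it is 1.0 for an isotropic covariance and approaches num_channels when one direction dominates.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (frames, channels). 1.0 when the covariance is isotropic ("white"),
    up to num_channels when a single direction carries all the variance."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)        # real eigenvalues, ascending
    return float((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))

white = torch.randn(2000, 384)
spiky = white.clone()
spiky[:, 0] *= 30.0                          # one dominant direction
print(whitening_metric(white))               # ~1.2, comfortably under the limits
print(whitening_metric(spiky))               # ~190, would trip a limit of 15.0
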
limit=22.5 2024-06-20 02:39:26,406 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.688e+01 2024-06-20 02:39:26,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=113837.16666666667, ans=0.0 2024-06-20 02:39:30,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=113837.16666666667, ans=0.09899494936611666 2024-06-20 02:39:31,933 INFO [train.py:1028] (0/2) Epoch 7, batch 1400, loss[loss=0.3154, simple_loss=0.3467, pruned_loss=0.142, over 12831.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.322, pruned_loss=0.1348, over 2588389.52 frames. ], batch size: 26, lr: 8.60e-03, grad_scale: 8.0 2024-06-20 02:39:37,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.43 vs. limit=15.0 2024-06-20 02:39:45,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=113892.16666666667, ans=15.0 2024-06-20 02:39:47,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=113892.16666666667, ans=0.2 2024-06-20 02:39:48,964 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.20 vs. limit=22.5 2024-06-20 02:39:53,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2024-06-20 02:39:55,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=113910.5, ans=0.125 2024-06-20 02:40:05,073 INFO [train.py:1028] (0/2) Epoch 7, batch 1450, loss[loss=0.2888, simple_loss=0.3096, pruned_loss=0.134, over 13109.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3213, pruned_loss=0.1347, over 2587955.81 frames. ], batch size: 121, lr: 8.60e-03, grad_scale: 8.0 2024-06-20 02:40:17,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=113983.83333333333, ans=0.025 2024-06-20 02:40:20,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0 2024-06-20 02:40:33,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.895e+02 3.254e+02 3.707e+02 1.249e+03, threshold=6.509e+02, percent-clipped=1.0 2024-06-20 02:40:33,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114020.5, ans=0.1 2024-06-20 02:40:40,200 INFO [train.py:1028] (0/2) Epoch 7, batch 1500, loss[loss=0.2718, simple_loss=0.2964, pruned_loss=0.1237, over 13248.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3212, pruned_loss=0.1347, over 2589584.70 frames. ], batch size: 83, lr: 8.60e-03, grad_scale: 8.0 2024-06-20 02:40:42,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. 
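
Each train.py:1028 header above pairs a per-batch loss[... over ~13k frames] with a running tot_loss[... over ~2.59M frames]; the huge, slowly growing frame total suggests a frame-weighted average with slow decay, which is why tot_loss barely moves while the per-batch losses jump around. A small sketch under that assumption; the decay constant is invented for illustration:

class FrameWeightedLoss:
    """Decayed, frame-weighted running loss: longer batches count for more."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.weighted_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.weighted_sum = self.decay * self.weighted_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self) -> float:
        return self.weighted_sum / max(self.frames, 1.0)

tracker = FrameWeightedLoss()
tracker.update(0.3154, 12831.0)   # the batch-1400 entry above
tracker.update(0.2888, 13109.0)   # the batch-1450 entry above
print(f"running loss: {tracker.value:.4f}")
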
limit=6.0 2024-06-20 02:40:57,734 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:40:59,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.91 vs. limit=6.0 2024-06-20 02:41:01,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=114093.83333333333, ans=0.125 2024-06-20 02:41:10,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=114112.16666666667, ans=0.025 2024-06-20 02:41:15,782 INFO [train.py:1028] (0/2) Epoch 7, batch 1550, loss[loss=0.317, simple_loss=0.3388, pruned_loss=0.1476, over 13026.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3217, pruned_loss=0.135, over 2584683.05 frames. ], batch size: 102, lr: 8.59e-03, grad_scale: 8.0 2024-06-20 02:41:16,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=114130.5, ans=0.0 2024-06-20 02:41:19,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=114130.5, ans=0.0 2024-06-20 02:41:25,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.67 vs. limit=15.0 2024-06-20 02:41:28,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=12.0 2024-06-20 02:41:31,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=114167.16666666667, ans=0.2 2024-06-20 02:41:31,567 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.049e-01 2024-06-20 02:41:41,698 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.988e+02 3.266e+02 3.782e+02 7.300e+02, threshold=6.532e+02, percent-clipped=1.0 2024-06-20 02:41:45,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=114203.83333333333, ans=0.0 2024-06-20 02:41:47,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=114222.16666666667, ans=0.125 2024-06-20 02:41:47,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=114222.16666666667, ans=0.0 2024-06-20 02:41:48,071 INFO [train.py:1028] (0/2) Epoch 7, batch 1600, loss[loss=0.2803, simple_loss=0.3078, pruned_loss=0.1264, over 13165.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3222, pruned_loss=0.135, over 2579278.41 frames. ], batch size: 77, lr: 8.59e-03, grad_scale: 16.0 2024-06-20 02:41:51,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=114222.16666666667, ans=0.5 2024-06-20 02:41:56,749 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.95 vs. limit=15.0 2024-06-20 02:41:58,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.06 vs. 
limit=22.5 2024-06-20 02:42:06,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=114258.83333333333, ans=0.125 2024-06-20 02:42:07,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=114277.16666666667, ans=0.0 2024-06-20 02:42:12,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114295.5, ans=0.1 2024-06-20 02:42:16,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=114295.5, ans=0.125 2024-06-20 02:42:22,657 INFO [train.py:1028] (0/2) Epoch 7, batch 1650, loss[loss=0.3089, simple_loss=0.3286, pruned_loss=0.1447, over 13167.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.323, pruned_loss=0.1357, over 2575551.61 frames. ], batch size: 95, lr: 8.59e-03, grad_scale: 16.0 2024-06-20 02:42:37,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=114350.5, ans=0.0 2024-06-20 02:42:49,035 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.588e+02 2.836e+02 3.313e+02 4.541e+02, threshold=5.671e+02, percent-clipped=0.0 2024-06-20 02:42:49,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=12.0 2024-06-20 02:42:52,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=114387.16666666667, ans=15.0 2024-06-20 02:42:55,637 INFO [train.py:1028] (0/2) Epoch 7, batch 1700, loss[loss=0.3155, simple_loss=0.3412, pruned_loss=0.1449, over 12861.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3221, pruned_loss=0.1348, over 2581585.90 frames. ], batch size: 26, lr: 8.58e-03, grad_scale: 16.0 2024-06-20 02:43:00,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=114405.5, ans=0.125 2024-06-20 02:43:09,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=114423.83333333333, ans=0.125 2024-06-20 02:43:14,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=114442.16666666667, ans=0.125 2024-06-20 02:43:20,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0 2024-06-20 02:43:22,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114460.5, ans=0.1 2024-06-20 02:43:23,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=114460.5, ans=0.015 2024-06-20 02:43:26,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.22 vs. limit=15.0 2024-06-20 02:43:31,745 INFO [train.py:1028] (0/2) Epoch 7, batch 1750, loss[loss=0.305, simple_loss=0.3416, pruned_loss=0.1342, over 12427.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3224, pruned_loss=0.1348, over 2582043.56 frames. 
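
The batch sizes in these headers swing widely (95, 26, 22 in the entries just above, up to 303 later in the epoch) while the per-batch frame totals stay near 13k: the signature of a duration-capped bucketing sampler, which packs many short cuts or a few long ones into each batch. A hedged usage sketch with lhotse's DynamicBucketingSampler; the manifest path and the max_duration/num_buckets values are illustrative:

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical manifest

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550.0,   # cap on summed seconds per batch; batch size then floats
    num_buckets=30,       # bucket cuts of similar length to limit padding
    shuffle=True,
    drop_last=True,
)
for batch_cuts in sampler:
    print(len(batch_cuts))  # tens of long cuts, or hundreds of short ones
    break
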
], batch size: 22, lr: 8.58e-03, grad_scale: 16.0 2024-06-20 02:43:48,722 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.218e+02 2024-06-20 02:43:51,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=12.0 2024-06-20 02:43:54,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=114552.16666666667, ans=0.125 2024-06-20 02:43:57,838 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.929e+02 3.362e+02 3.859e+02 5.647e+02, threshold=6.723e+02, percent-clipped=0.0 2024-06-20 02:43:57,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=114570.5, ans=0.0 2024-06-20 02:44:04,005 INFO [train.py:1028] (0/2) Epoch 7, batch 1800, loss[loss=0.2855, simple_loss=0.3135, pruned_loss=0.1287, over 13215.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3223, pruned_loss=0.1346, over 2581349.55 frames. ], batch size: 67, lr: 8.58e-03, grad_scale: 16.0 2024-06-20 02:44:04,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=114588.83333333333, ans=0.125 2024-06-20 02:44:06,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=114588.83333333333, ans=0.125 2024-06-20 02:44:14,120 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0 2024-06-20 02:44:14,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.12 vs. limit=22.5 2024-06-20 02:44:18,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=114625.5, ans=0.0 2024-06-20 02:44:22,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.78 vs. limit=22.5 2024-06-20 02:44:29,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=114643.83333333333, ans=0.125 2024-06-20 02:44:36,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.17 vs. limit=6.0 2024-06-20 02:44:39,390 INFO [train.py:1028] (0/2) Epoch 7, batch 1850, loss[loss=0.2976, simple_loss=0.3211, pruned_loss=0.1371, over 13208.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.323, pruned_loss=0.1349, over 2582631.53 frames. 
], batch size: 83, lr: 8.57e-03, grad_scale: 16.0 2024-06-20 02:44:56,785 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.030e+01 2024-06-20 02:45:05,060 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.736e+02 3.067e+02 3.410e+02 4.860e+02, threshold=6.133e+02, percent-clipped=0.0 2024-06-20 02:45:07,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=114753.83333333333, ans=0.0 2024-06-20 02:45:14,300 INFO [train.py:1028] (0/2) Epoch 7, batch 1900, loss[loss=0.3007, simple_loss=0.3248, pruned_loss=0.1383, over 13136.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3221, pruned_loss=0.1347, over 2584507.47 frames. ], batch size: 95, lr: 8.57e-03, grad_scale: 16.0 2024-06-20 02:45:28,429 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:45:47,056 INFO [train.py:1028] (0/2) Epoch 7, batch 1950, loss[loss=0.2952, simple_loss=0.325, pruned_loss=0.1327, over 13282.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3212, pruned_loss=0.1345, over 2590269.48 frames. ], batch size: 52, lr: 8.57e-03, grad_scale: 16.0 2024-06-20 02:45:47,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.74 vs. limit=22.5 2024-06-20 02:45:49,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-20 02:45:52,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.74 vs. limit=22.5 2024-06-20 02:45:53,306 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.178e+02 2024-06-20 02:45:58,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=114882.16666666667, ans=0.125 2024-06-20 02:46:02,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=114900.5, ans=0.125 2024-06-20 02:46:03,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=114900.5, ans=0.125 2024-06-20 02:46:03,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=114900.5, ans=0.0 2024-06-20 02:46:08,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.20 vs. limit=15.0 2024-06-20 02:46:12,789 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.618e+02 2.813e+02 3.234e+02 8.250e+02, threshold=5.625e+02, percent-clipped=1.0 2024-06-20 02:46:20,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=114937.16666666667, ans=0.0 2024-06-20 02:46:21,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.51 vs. limit=22.5 2024-06-20 02:46:22,169 INFO [train.py:1028] (0/2) Epoch 7, batch 2000, loss[loss=0.2659, simple_loss=0.3085, pruned_loss=0.1116, over 12403.00 frames. 
], tot_loss[loss=0.2948, simple_loss=0.3212, pruned_loss=0.1342, over 2586634.43 frames. ], batch size: 22, lr: 8.56e-03, grad_scale: 32.0 2024-06-20 02:46:29,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=114973.83333333333, ans=0.125 2024-06-20 02:46:31,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=114973.83333333333, ans=0.125 2024-06-20 02:46:44,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=115010.5, ans=0.0 2024-06-20 02:46:45,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=115010.5, ans=0.0 2024-06-20 02:46:51,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=115028.83333333333, ans=0.05 2024-06-20 02:46:54,145 INFO [train.py:1028] (0/2) Epoch 7, batch 2050, loss[loss=0.2685, simple_loss=0.3055, pruned_loss=0.1157, over 12718.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3214, pruned_loss=0.1343, over 2582766.84 frames. ], batch size: 29, lr: 8.56e-03, grad_scale: 32.0 2024-06-20 02:46:54,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=115047.16666666667, ans=0.125 2024-06-20 02:46:59,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=115065.5, ans=0.0 2024-06-20 02:47:12,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=115083.83333333333, ans=0.125 2024-06-20 02:47:12,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=115083.83333333333, ans=0.2 2024-06-20 02:47:12,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=115083.83333333333, ans=0.0 2024-06-20 02:47:13,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.58 vs. limit=15.0 2024-06-20 02:47:16,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0 2024-06-20 02:47:19,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=115102.16666666667, ans=0.95 2024-06-20 02:47:21,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=115120.5, ans=0.2 2024-06-20 02:47:22,457 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.871e+02 3.171e+02 3.710e+02 5.065e+02, threshold=6.343e+02, percent-clipped=0.0 2024-06-20 02:47:27,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=115138.83333333333, ans=0.0 2024-06-20 02:47:28,264 INFO [train.py:1028] (0/2) Epoch 7, batch 2100, loss[loss=0.2788, simple_loss=0.3218, pruned_loss=0.1178, over 13151.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3216, pruned_loss=0.1339, over 2585044.50 frames. 
], batch size: 59, lr: 8.56e-03, grad_scale: 16.0 2024-06-20 02:47:35,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=115157.16666666667, ans=0.04949747468305833 2024-06-20 02:47:43,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=115175.5, ans=0.125 2024-06-20 02:47:47,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=115193.83333333333, ans=0.025 2024-06-20 02:47:47,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.23 vs. limit=6.0 2024-06-20 02:47:48,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=115193.83333333333, ans=0.0 2024-06-20 02:47:48,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115193.83333333333, ans=0.125 2024-06-20 02:47:50,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115193.83333333333, ans=0.125 2024-06-20 02:47:55,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=115212.16666666667, ans=0.04949747468305833 2024-06-20 02:47:58,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=115212.16666666667, ans=0.2 2024-06-20 02:47:59,986 INFO [train.py:1028] (0/2) Epoch 7, batch 2150, loss[loss=0.2715, simple_loss=0.3057, pruned_loss=0.1186, over 13231.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3209, pruned_loss=0.1331, over 2587764.62 frames. ], batch size: 52, lr: 8.55e-03, grad_scale: 16.0 2024-06-20 02:48:02,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=115230.5, ans=15.0 2024-06-20 02:48:08,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=115248.83333333333, ans=0.025 2024-06-20 02:48:13,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2024-06-20 02:48:16,841 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:48:25,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115303.83333333333, ans=0.1 2024-06-20 02:48:28,798 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.860e+02 3.126e+02 3.434e+02 5.560e+02, threshold=6.253e+02, percent-clipped=0.0 2024-06-20 02:48:34,790 INFO [train.py:1028] (0/2) Epoch 7, batch 2200, loss[loss=0.2694, simple_loss=0.297, pruned_loss=0.1209, over 13211.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3211, pruned_loss=0.1335, over 2588052.96 frames. 
], batch size: 83, lr: 8.55e-03, grad_scale: 16.0 2024-06-20 02:48:36,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=115322.16666666667, ans=0.125 2024-06-20 02:48:39,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.87 vs. limit=22.5 2024-06-20 02:48:42,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=115340.5, ans=0.0 2024-06-20 02:48:42,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115340.5, ans=0.1 2024-06-20 02:48:54,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=115377.16666666667, ans=0.2 2024-06-20 02:48:56,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2024-06-20 02:49:07,132 INFO [train.py:1028] (0/2) Epoch 7, batch 2250, loss[loss=0.2893, simple_loss=0.3215, pruned_loss=0.1285, over 13224.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3204, pruned_loss=0.1327, over 2587540.00 frames. ], batch size: 63, lr: 8.55e-03, grad_scale: 16.0 2024-06-20 02:49:19,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.88 vs. limit=15.0 2024-06-20 02:49:27,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=115450.5, ans=0.0 2024-06-20 02:49:29,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115468.83333333333, ans=0.1 2024-06-20 02:49:36,964 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.802e+02 3.115e+02 3.434e+02 5.571e+02, threshold=6.230e+02, percent-clipped=0.0 2024-06-20 02:49:39,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=115487.16666666667, ans=0.0 2024-06-20 02:49:43,261 INFO [train.py:1028] (0/2) Epoch 7, batch 2300, loss[loss=0.2817, simple_loss=0.3062, pruned_loss=0.1286, over 12875.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3205, pruned_loss=0.1327, over 2582027.02 frames. ], batch size: 33, lr: 8.54e-03, grad_scale: 16.0 2024-06-20 02:49:45,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.28 vs. limit=15.0 2024-06-20 02:49:48,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.51 vs. limit=22.5 2024-06-20 02:49:48,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=115505.5, ans=0.0 2024-06-20 02:49:48,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. 
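
The grad_scale field in the batch headers steps through 4.0, 8.0, 16.0 and 32.0 across this stretch (16.0 here, 32.0 a few hundred batches earlier and again shortly after), which is characteristic of dynamic loss scaling under mixed precision: the scale is halved when a step produces infs/nans and grown back after a run of clean steps. A generic torch.cuda.amp step illustrating the mechanism; this is standard AMP usage, not train.py itself, and the model/criterion names are placeholders:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=1.0, growth_interval=2000)

def train_step(model, optimizer, inputs, targets, criterion):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales first; skips the update on inf/nan
    scaler.update()                 # halves on overflow, doubles after clean runs
    return loss.detach(), scaler.get_scale()
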
limit=10.0 2024-06-20 02:49:57,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=115542.16666666667, ans=0.2 2024-06-20 02:50:09,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.38 vs. limit=22.5 2024-06-20 02:50:12,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=12.0 2024-06-20 02:50:16,316 INFO [train.py:1028] (0/2) Epoch 7, batch 2350, loss[loss=0.2807, simple_loss=0.3101, pruned_loss=0.1257, over 13221.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3201, pruned_loss=0.1327, over 2585002.22 frames. ], batch size: 67, lr: 8.54e-03, grad_scale: 16.0 2024-06-20 02:50:25,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0 2024-06-20 02:50:32,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=115633.83333333333, ans=0.0 2024-06-20 02:50:33,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=115633.83333333333, ans=0.125 2024-06-20 02:50:46,569 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.439e+02 3.021e+02 3.471e+02 4.112e+02 5.806e+02, threshold=6.941e+02, percent-clipped=0.0 2024-06-20 02:50:46,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.25 vs. limit=15.0 2024-06-20 02:50:52,333 INFO [train.py:1028] (0/2) Epoch 7, batch 2400, loss[loss=0.283, simple_loss=0.3093, pruned_loss=0.1284, over 13356.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3188, pruned_loss=0.1322, over 2587744.26 frames. ], batch size: 46, lr: 8.54e-03, grad_scale: 32.0 2024-06-20 02:50:52,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=115688.83333333333, ans=0.125 2024-06-20 02:50:58,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=115707.16666666667, ans=0.125 2024-06-20 02:51:13,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=15.0 2024-06-20 02:51:13,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.56 vs. limit=22.5 2024-06-20 02:51:17,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=115743.83333333333, ans=0.125 2024-06-20 02:51:20,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=115762.16666666667, ans=0.0 2024-06-20 02:51:26,971 INFO [train.py:1028] (0/2) Epoch 7, batch 2450, loss[loss=0.2702, simple_loss=0.3029, pruned_loss=0.1187, over 13277.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3184, pruned_loss=0.1328, over 2584662.11 frames. 
], batch size: 63, lr: 8.53e-03, grad_scale: 16.0 2024-06-20 02:51:34,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=115798.83333333333, ans=0.5 2024-06-20 02:51:36,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.55 vs. limit=15.0 2024-06-20 02:51:40,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=115817.16666666667, ans=0.125 2024-06-20 02:51:44,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115817.16666666667, ans=0.1 2024-06-20 02:51:53,442 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.262e+02 3.633e+02 4.202e+02 7.763e+02, threshold=7.267e+02, percent-clipped=2.0 2024-06-20 02:51:58,926 INFO [train.py:1028] (0/2) Epoch 7, batch 2500, loss[loss=0.2844, simple_loss=0.3153, pruned_loss=0.1268, over 13230.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.318, pruned_loss=0.1325, over 2588782.27 frames. ], batch size: 83, lr: 8.53e-03, grad_scale: 16.0 2024-06-20 02:51:59,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=115872.16666666667, ans=0.0 2024-06-20 02:52:07,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=115890.5, ans=0.125 2024-06-20 02:52:08,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=115890.5, ans=0.0 2024-06-20 02:52:10,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=115890.5, ans=0.125 2024-06-20 02:52:34,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=115945.5, ans=0.125 2024-06-20 02:52:35,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=115963.83333333333, ans=0.0 2024-06-20 02:52:35,705 INFO [train.py:1028] (0/2) Epoch 7, batch 2550, loss[loss=0.3038, simple_loss=0.337, pruned_loss=0.1353, over 12452.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3174, pruned_loss=0.1323, over 2587533.15 frames. ], batch size: 22, lr: 8.53e-03, grad_scale: 16.0 2024-06-20 02:52:37,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=115963.83333333333, ans=0.0 2024-06-20 02:52:48,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=116000.5, ans=0.125 2024-06-20 02:52:51,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.58 vs. 
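
The lr field decays smoothly within the epoch, from 8.64e-03 at batch 850 down to 8.53e-03 by batch 2550, and it also drops between epochs, so the scheduler anneals on both batch index and epoch rather than in discrete steps. One Eden-style two-factor form is sketched below purely as an assumption; it omits warmup and duration-normalized batch counting, so it will not reproduce the logged values exactly:

def eden_like_lr(base_lr: float, batch: int, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Assumed smooth decay in both batch count and epoch (indicative only)."""
    return (base_lr
            * ((batch / lr_batches) ** 2 + 1.0) ** -0.25
            * ((epoch / lr_epochs) ** 2 + 1.0) ** -0.25)

print(f"{eden_like_lr(0.035, 113000, 7.0):.2e}")  # ~6e-03, same order as the log
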
limit=15.0 2024-06-20 02:52:53,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=116000.5, ans=0.125 2024-06-20 02:53:02,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=116018.83333333333, ans=0.0 2024-06-20 02:53:05,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=116037.16666666667, ans=0.125 2024-06-20 02:53:06,274 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.060e+02 3.371e+02 3.874e+02 6.411e+02, threshold=6.742e+02, percent-clipped=0.0 2024-06-20 02:53:09,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2024-06-20 02:53:11,230 INFO [train.py:1028] (0/2) Epoch 7, batch 2600, loss[loss=0.2731, simple_loss=0.3028, pruned_loss=0.1218, over 13272.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3155, pruned_loss=0.1313, over 2587161.88 frames. ], batch size: 52, lr: 8.52e-03, grad_scale: 16.0 2024-06-20 02:53:15,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=116055.5, ans=0.025 2024-06-20 02:53:15,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=116055.5, ans=0.0 2024-06-20 02:53:18,683 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.87 vs. limit=15.0 2024-06-20 02:53:21,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=116073.83333333333, ans=0.0 2024-06-20 02:53:24,432 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.78 vs. limit=15.0 2024-06-20 02:53:37,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=116128.83333333333, ans=0.0 2024-06-20 02:53:38,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=116128.83333333333, ans=0.125 2024-06-20 02:53:42,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=116128.83333333333, ans=0.125 2024-06-20 02:53:43,815 INFO [train.py:1028] (0/2) Epoch 7, batch 2650, loss[loss=0.268, simple_loss=0.2874, pruned_loss=0.1243, over 13019.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3142, pruned_loss=0.131, over 2587836.00 frames. ], batch size: 144, lr: 8.52e-03, grad_scale: 16.0 2024-06-20 02:53:45,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. 
limit=6.0 2024-06-20 02:53:47,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=116147.16666666667, ans=0.125 2024-06-20 02:53:59,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=116183.83333333333, ans=0.125 2024-06-20 02:54:12,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=116220.5, ans=0.125 2024-06-20 02:54:13,812 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.752e+02 2.964e+02 3.316e+02 4.432e+02, threshold=5.928e+02, percent-clipped=0.0 2024-06-20 02:54:19,131 INFO [train.py:1028] (0/2) Epoch 7, batch 2700, loss[loss=0.2872, simple_loss=0.3089, pruned_loss=0.1328, over 13234.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3127, pruned_loss=0.1308, over 2585600.34 frames. ], batch size: 89, lr: 8.52e-03, grad_scale: 16.0 2024-06-20 02:54:29,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=116257.16666666667, ans=0.125 2024-06-20 02:54:35,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=116275.5, ans=0.125 2024-06-20 02:54:47,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=116312.16666666667, ans=0.2 2024-06-20 02:54:53,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=116312.16666666667, ans=0.09899494936611666 2024-06-20 02:54:54,699 INFO [train.py:1028] (0/2) Epoch 7, batch 2750, loss[loss=0.3074, simple_loss=0.3302, pruned_loss=0.1424, over 13325.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3119, pruned_loss=0.1303, over 2581940.18 frames. ], batch size: 43, lr: 8.51e-03, grad_scale: 16.0 2024-06-20 02:55:07,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116348.83333333333, ans=0.1 2024-06-20 02:55:13,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=116367.16666666667, ans=0.125 2024-06-20 02:55:16,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.65 vs. limit=15.0 2024-06-20 02:55:23,069 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.663e+02 3.101e+02 3.520e+02 6.481e+02, threshold=6.202e+02, percent-clipped=1.0 2024-06-20 02:55:28,083 INFO [train.py:1028] (0/2) Epoch 7, batch 2800, loss[loss=0.312, simple_loss=0.3123, pruned_loss=0.1559, over 10775.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3112, pruned_loss=0.1301, over 2579838.91 frames. ], batch size: 303, lr: 8.51e-03, grad_scale: 32.0 2024-06-20 02:55:35,486 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 02:55:37,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.11 vs. 
limit=22.5 2024-06-20 02:56:01,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=116495.5, ans=0.125 2024-06-20 02:56:03,540 INFO [train.py:1028] (0/2) Epoch 7, batch 2850, loss[loss=0.2614, simple_loss=0.2987, pruned_loss=0.112, over 13301.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3103, pruned_loss=0.1299, over 2577108.73 frames. ], batch size: 49, lr: 8.51e-03, grad_scale: 32.0 2024-06-20 02:56:05,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.75 vs. limit=6.0 2024-06-20 02:56:08,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116513.83333333333, ans=0.1 2024-06-20 02:56:13,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=116532.16666666667, ans=0.1 2024-06-20 02:56:20,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=116550.5, ans=0.2 2024-06-20 02:56:24,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116568.83333333333, ans=0.1 2024-06-20 02:56:31,180 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.737e+02 3.057e+02 3.325e+02 5.006e+02, threshold=6.114e+02, percent-clipped=0.0 2024-06-20 02:56:33,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-20 02:56:36,490 INFO [train.py:1028] (0/2) Epoch 7, batch 2900, loss[loss=0.2809, simple_loss=0.3065, pruned_loss=0.1276, over 13154.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3076, pruned_loss=0.1285, over 2585410.41 frames. ], batch size: 55, lr: 8.50e-03, grad_scale: 32.0 2024-06-20 02:56:45,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=116605.5, ans=0.025 2024-06-20 02:56:56,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=116642.16666666667, ans=0.0 2024-06-20 02:57:00,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.45 vs. limit=22.5 2024-06-20 02:57:04,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.77 vs. limit=15.0 2024-06-20 02:57:12,763 INFO [train.py:1028] (0/2) Epoch 7, batch 2950, loss[loss=0.2522, simple_loss=0.2801, pruned_loss=0.1121, over 13292.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.307, pruned_loss=0.1283, over 2580946.11 frames. ], batch size: 43, lr: 8.50e-03, grad_scale: 32.0 2024-06-20 02:57:24,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=116715.5, ans=0.125 2024-06-20 02:57:29,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=116733.83333333333, ans=0.125 2024-06-20 02:57:30,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.43 vs. 
limit=15.0 2024-06-20 02:57:40,664 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.547e+02 2.899e+02 3.253e+02 4.782e+02, threshold=5.798e+02, percent-clipped=0.0 2024-06-20 02:57:41,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2024-06-20 02:57:44,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=116770.5, ans=0.025 2024-06-20 02:57:45,922 INFO [train.py:1028] (0/2) Epoch 7, batch 3000, loss[loss=0.2733, simple_loss=0.3099, pruned_loss=0.1183, over 13199.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3057, pruned_loss=0.1275, over 2579478.68 frames. ], batch size: 59, lr: 8.50e-03, grad_scale: 32.0 2024-06-20 02:57:45,923 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 02:57:53,713 INFO [train.py:1060] (0/2) Epoch 7, validation: loss=0.2166, simple_loss=0.2775, pruned_loss=0.07786, over 351949.00 frames. 2024-06-20 02:57:53,714 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16965MB 2024-06-20 02:58:10,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=116825.5, ans=0.125 2024-06-20 02:58:26,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=116862.16666666667, ans=0.125 2024-06-20 02:58:28,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=116862.16666666667, ans=0.125 2024-06-20 02:58:30,071 INFO [train.py:1028] (0/2) Epoch 7, batch 3050, loss[loss=0.2743, simple_loss=0.305, pruned_loss=0.1218, over 13257.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3056, pruned_loss=0.1278, over 2579818.48 frames. ], batch size: 46, lr: 8.49e-03, grad_scale: 32.0 2024-06-20 02:58:41,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=116898.83333333333, ans=0.0 2024-06-20 02:58:45,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.13 vs. limit=15.0 2024-06-20 02:58:46,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=116917.16666666667, ans=0.125 2024-06-20 02:58:50,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=116917.16666666667, ans=0.95 2024-06-20 02:58:51,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116935.5, ans=0.1 2024-06-20 02:58:58,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=116953.83333333333, ans=0.0 2024-06-20 02:58:59,365 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.485e+02 2.826e+02 3.204e+02 5.343e+02, threshold=5.653e+02, percent-clipped=0.0 2024-06-20 02:59:04,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=116972.16666666667, ans=0.0 2024-06-20 02:59:04,873 INFO [train.py:1028] (0/2) Epoch 7, batch 3100, loss[loss=0.2756, simple_loss=0.2966, pruned_loss=0.1273, over 13031.00 frames. 
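
At batch 3000 above the trainer pauses, computes a dev-set validation loss (0.2166, against a running train loss near 0.28), and logs the CUDA high-water mark (16965MB) before resuming. A minimal sketch of such a periodic pass; the loader, batch keys and criterion are placeholders, not train.py's actual names:

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, criterion, device):
    model.eval()
    loss_sum, frames = 0.0, 0.0
    for batch in dev_loader:                       # placeholder loader and keys
        inputs = batch["inputs"].to(device)
        targets = batch["targets"].to(device)
        num_frames = float(inputs.shape[0] * inputs.shape[1])
        loss_sum += criterion(model(inputs), targets).item() * num_frames
        frames += num_frames
    model.train()
    mem_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    return loss_sum / max(frames, 1.0), mem_mb
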
], tot_loss[loss=0.2783, simple_loss=0.3039, pruned_loss=0.1264, over 2580603.48 frames. ], batch size: 144, lr: 8.49e-03, grad_scale: 32.0 2024-06-20 02:59:05,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=116972.16666666667, ans=0.0 2024-06-20 02:59:10,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.60 vs. limit=15.0 2024-06-20 02:59:14,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.43 vs. limit=22.5 2024-06-20 02:59:14,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=116990.5, ans=0.0 2024-06-20 02:59:20,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=117008.83333333333, ans=0.125 2024-06-20 02:59:27,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.01 vs. limit=15.0 2024-06-20 02:59:28,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=117027.16666666667, ans=0.1 2024-06-20 02:59:29,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=117027.16666666667, ans=0.2 2024-06-20 02:59:34,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=117045.5, ans=0.125 2024-06-20 02:59:34,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.35 vs. limit=15.0 2024-06-20 02:59:38,169 INFO [train.py:1028] (0/2) Epoch 7, batch 3150, loss[loss=0.2836, simple_loss=0.2991, pruned_loss=0.1341, over 12979.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3024, pruned_loss=0.1254, over 2582722.15 frames. ], batch size: 158, lr: 8.49e-03, grad_scale: 32.0 2024-06-20 02:59:39,756 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.34 vs. limit=15.0 2024-06-20 02:59:46,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=117082.16666666667, ans=0.2 2024-06-20 03:00:06,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.47 vs. limit=15.0 2024-06-20 03:00:08,952 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.386e+02 2.544e+02 2.932e+02 4.184e+02, threshold=5.088e+02, percent-clipped=0.0 2024-06-20 03:00:12,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=117137.16666666667, ans=0.125 2024-06-20 03:00:13,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=117137.16666666667, ans=0.125 2024-06-20 03:00:14,125 INFO [train.py:1028] (0/2) Epoch 7, batch 3200, loss[loss=0.2627, simple_loss=0.2976, pruned_loss=0.1139, over 13098.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3023, pruned_loss=0.1253, over 2582919.20 frames. 
], batch size: 55, lr: 8.48e-03, grad_scale: 32.0 2024-06-20 03:00:17,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=117155.5, ans=0.0 2024-06-20 03:00:26,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=117173.83333333333, ans=0.125 2024-06-20 03:00:34,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=117210.5, ans=0.04949747468305833 2024-06-20 03:00:50,320 INFO [train.py:1028] (0/2) Epoch 7, batch 3250, loss[loss=0.2313, simple_loss=0.2716, pruned_loss=0.09548, over 13247.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3026, pruned_loss=0.126, over 2586054.95 frames. ], batch size: 72, lr: 8.48e-03, grad_scale: 32.0 2024-06-20 03:01:18,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.48 vs. limit=10.0 2024-06-20 03:01:19,529 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.393e+02 2.596e+02 2.860e+02 4.224e+02, threshold=5.192e+02, percent-clipped=0.0 2024-06-20 03:01:21,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=117320.5, ans=0.2 2024-06-20 03:01:22,307 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-64000.pt 2024-06-20 03:01:29,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.47 vs. limit=15.0 2024-06-20 03:01:30,179 INFO [train.py:1028] (0/2) Epoch 7, batch 3300, loss[loss=0.2949, simple_loss=0.3146, pruned_loss=0.1377, over 12823.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3018, pruned_loss=0.1252, over 2582436.79 frames. ], batch size: 177, lr: 8.48e-03, grad_scale: 32.0 2024-06-20 03:01:31,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.70 vs. limit=22.5 2024-06-20 03:01:45,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=117375.5, ans=0.0 2024-06-20 03:01:48,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=117375.5, ans=0.125 2024-06-20 03:01:57,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=117393.83333333333, ans=10.0 2024-06-20 03:02:01,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=117412.16666666667, ans=0.0 2024-06-20 03:02:03,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=117412.16666666667, ans=0.125 2024-06-20 03:02:07,600 INFO [train.py:1028] (0/2) Epoch 7, batch 3350, loss[loss=0.2799, simple_loss=0.3017, pruned_loss=0.129, over 12981.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3013, pruned_loss=0.1253, over 2578011.77 frames. ], batch size: 158, lr: 8.47e-03, grad_scale: 32.0 2024-06-20 03:02:10,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.73 vs. 
limit=15.0 2024-06-20 03:02:32,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=117485.5, ans=0.09899494936611666 2024-06-20 03:02:35,263 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.736e+02 3.017e+02 3.321e+02 4.935e+02, threshold=6.034e+02, percent-clipped=0.0 2024-06-20 03:02:43,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=117522.16666666667, ans=0.125 2024-06-20 03:02:44,476 INFO [train.py:1028] (0/2) Epoch 7, batch 3400, loss[loss=0.2969, simple_loss=0.3161, pruned_loss=0.1388, over 12679.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3007, pruned_loss=0.1252, over 2575017.66 frames. ], batch size: 22, lr: 8.47e-03, grad_scale: 32.0 2024-06-20 03:02:44,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 03:02:47,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=117522.16666666667, ans=0.2 2024-06-20 03:02:49,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117522.16666666667, ans=0.1 2024-06-20 03:02:59,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=117558.83333333333, ans=0.125 2024-06-20 03:03:03,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=117558.83333333333, ans=0.125 2024-06-20 03:03:04,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=117577.16666666667, ans=0.125 2024-06-20 03:03:08,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=117577.16666666667, ans=0.1 2024-06-20 03:03:14,488 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=12.04 vs. limit=12.0 2024-06-20 03:03:14,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=117595.5, ans=0.125 2024-06-20 03:03:17,834 INFO [train.py:1028] (0/2) Epoch 7, batch 3450, loss[loss=0.2791, simple_loss=0.2983, pruned_loss=0.1299, over 12729.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.299, pruned_loss=0.1244, over 2576560.68 frames. ], batch size: 176, lr: 8.47e-03, grad_scale: 32.0 2024-06-20 03:03:21,488 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.81 vs. limit=12.0 2024-06-20 03:03:23,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=117632.16666666667, ans=0.125 2024-06-20 03:03:29,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=117632.16666666667, ans=0.125 2024-06-20 03:03:30,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=12.0 2024-06-20 03:03:45,290 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.567e+02 2.836e+02 3.247e+02 4.313e+02, threshold=5.672e+02, percent-clipped=0.0 2024-06-20 03:03:46,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=117687.16666666667, ans=0.125 2024-06-20 03:03:48,653 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=6.0 2024-06-20 03:03:49,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117687.16666666667, ans=0.1 2024-06-20 03:03:50,959 INFO [train.py:1028] (0/2) Epoch 7, batch 3500, loss[loss=0.2473, simple_loss=0.2758, pruned_loss=0.1095, over 12941.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.2988, pruned_loss=0.124, over 2575503.11 frames. ], batch size: 33, lr: 8.46e-03, grad_scale: 32.0 2024-06-20 03:03:55,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=117705.5, ans=0.125 2024-06-20 03:03:55,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=117705.5, ans=0.0 2024-06-20 03:03:56,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=117705.5, ans=0.0 2024-06-20 03:03:58,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=117705.5, ans=0.0 2024-06-20 03:03:59,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=117705.5, ans=0.025 2024-06-20 03:04:16,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.11 vs. limit=15.0 2024-06-20 03:04:18,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=117760.5, ans=0.0 2024-06-20 03:04:27,507 INFO [train.py:1028] (0/2) Epoch 7, batch 3550, loss[loss=0.2435, simple_loss=0.2696, pruned_loss=0.1086, over 13120.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.2981, pruned_loss=0.1235, over 2576434.46 frames. ], batch size: 95, lr: 8.46e-03, grad_scale: 32.0 2024-06-20 03:04:27,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=117797.16666666667, ans=0.0 2024-06-20 03:04:30,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=117797.16666666667, ans=0.0 2024-06-20 03:04:31,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0 2024-06-20 03:04:36,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.90 vs. 
limit=12.0 2024-06-20 03:04:58,457 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.622e+02 2.898e+02 3.303e+02 5.007e+02, threshold=5.795e+02, percent-clipped=0.0 2024-06-20 03:05:00,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=117870.5, ans=0.2 2024-06-20 03:05:01,620 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2024-06-20 03:05:03,945 INFO [train.py:1028] (0/2) Epoch 7, batch 3600, loss[loss=0.2685, simple_loss=0.3022, pruned_loss=0.1174, over 13041.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.2969, pruned_loss=0.1226, over 2579578.46 frames. ], batch size: 48, lr: 8.46e-03, grad_scale: 32.0 2024-06-20 03:05:16,637 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.14 vs. limit=10.0 2024-06-20 03:05:18,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=117925.5, ans=0.125 2024-06-20 03:05:19,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=117925.5, ans=0.2 2024-06-20 03:05:21,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=117925.5, ans=0.2 2024-06-20 03:05:27,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=117943.83333333333, ans=0.125 2024-06-20 03:05:34,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=117962.16666666667, ans=0.0 2024-06-20 03:05:37,504 INFO [train.py:1028] (0/2) Epoch 7, batch 3650, loss[loss=0.2708, simple_loss=0.2988, pruned_loss=0.1214, over 13047.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.2977, pruned_loss=0.1227, over 2578443.51 frames. ], batch size: 102, lr: 8.45e-03, grad_scale: 32.0 2024-06-20 03:05:38,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=117980.5, ans=0.2 2024-06-20 03:05:42,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=117980.5, ans=0.125 2024-06-20 03:06:08,427 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.591e+02 2.748e+02 3.061e+02 4.305e+02, threshold=5.497e+02, percent-clipped=0.0 2024-06-20 03:06:09,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.66 vs. limit=15.0 2024-06-20 03:06:14,113 INFO [train.py:1028] (0/2) Epoch 7, batch 3700, loss[loss=0.2546, simple_loss=0.2899, pruned_loss=0.1096, over 13240.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.2969, pruned_loss=0.1223, over 2583629.52 frames. ], batch size: 72, lr: 8.45e-03, grad_scale: 32.0 2024-06-20 03:06:42,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=118145.5, ans=0.1 2024-06-20 03:06:46,497 INFO [train.py:1028] (0/2) Epoch 7, batch 3750, loss[loss=0.233, simple_loss=0.2789, pruned_loss=0.0935, over 12720.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.2964, pruned_loss=0.1219, over 2585855.37 frames. 
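The "Saving checkpoint to zipformer/exp/checkpoint-64000.pt" line a little earlier is a batch-indexed snapshot, written whenever the global batch counter crosses a fixed interval. A sketch of that cadence follows; the 4000-batch interval and the saved payload are assumptions for illustration, not read out of checkpoint.py.

```python
from pathlib import Path
import torch

# Sketch of batch-indexed checkpointing, matching the "checkpoint-64000.pt"
# save above. Interval and payload are illustrative assumptions.
def maybe_save(model: torch.nn.Module, batch_idx_train: int,
               exp_dir: Path = Path("zipformer/exp"),
               save_every_n: int = 4000) -> None:
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        exp_dir.mkdir(parents=True, exist_ok=True)
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save({"model": model.state_dict(),
                    "batch_idx_train": batch_idx_train}, path)
```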
], batch size: 22, lr: 8.45e-03, grad_scale: 32.0 2024-06-20 03:07:05,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=118200.5, ans=0.125 2024-06-20 03:07:11,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.63 vs. limit=22.5 2024-06-20 03:07:17,514 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.521e+02 2.707e+02 3.048e+02 4.386e+02, threshold=5.414e+02, percent-clipped=0.0 2024-06-20 03:07:17,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=118237.16666666667, ans=0.125 2024-06-20 03:07:22,780 INFO [train.py:1028] (0/2) Epoch 7, batch 3800, loss[loss=0.2468, simple_loss=0.2746, pruned_loss=0.1095, over 13149.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.2961, pruned_loss=0.1215, over 2583980.98 frames. ], batch size: 83, lr: 8.44e-03, grad_scale: 32.0 2024-06-20 03:07:29,092 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.54 vs. limit=6.0 2024-06-20 03:07:40,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=118292.16666666667, ans=0.0 2024-06-20 03:07:43,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=118310.5, ans=0.0 2024-06-20 03:07:44,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=118310.5, ans=0.025 2024-06-20 03:07:52,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=118328.83333333333, ans=12.0 2024-06-20 03:07:53,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=118328.83333333333, ans=0.125 2024-06-20 03:07:55,864 INFO [train.py:1028] (0/2) Epoch 7, batch 3850, loss[loss=0.2717, simple_loss=0.2902, pruned_loss=0.1266, over 13062.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.2954, pruned_loss=0.1208, over 2583796.39 frames. ], batch size: 144, lr: 8.44e-03, grad_scale: 32.0 2024-06-20 03:07:56,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=118347.16666666667, ans=0.125 2024-06-20 03:07:58,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=118347.16666666667, ans=0.0 2024-06-20 03:08:00,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=118347.16666666667, ans=0.125 2024-06-20 03:08:06,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.52 vs. limit=10.0 2024-06-20 03:08:07,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=118365.5, ans=0.2 2024-06-20 03:08:18,175 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.749e+00 2024-06-20 03:08:23,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.21 vs. 
limit=22.5 2024-06-20 03:08:26,158 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.286e+02 2.504e+02 2.695e+02 3.658e+02, threshold=5.007e+02, percent-clipped=0.0 2024-06-20 03:08:31,440 INFO [train.py:1028] (0/2) Epoch 7, batch 3900, loss[loss=0.2808, simple_loss=0.306, pruned_loss=0.1278, over 13202.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.295, pruned_loss=0.1211, over 2587355.63 frames. ], batch size: 83, lr: 8.44e-03, grad_scale: 32.0 2024-06-20 03:08:32,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=118438.83333333333, ans=0.05 2024-06-20 03:08:35,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=118438.83333333333, ans=0.025 2024-06-20 03:08:49,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=118475.5, ans=0.1 2024-06-20 03:09:08,689 INFO [train.py:1028] (0/2) Epoch 7, batch 3950, loss[loss=0.2577, simple_loss=0.2763, pruned_loss=0.1196, over 13105.00 frames. ], tot_loss[loss=0.267, simple_loss=0.2938, pruned_loss=0.1201, over 2588837.86 frames. ], batch size: 132, lr: 8.43e-03, grad_scale: 32.0 2024-06-20 03:09:12,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118530.5, ans=0.0 2024-06-20 03:09:14,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.86 vs. limit=22.5 2024-06-20 03:09:16,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=15.0 2024-06-20 03:09:18,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.76 vs. limit=15.0 2024-06-20 03:09:31,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.13 vs. limit=6.0 2024-06-20 03:09:36,478 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.205e+02 2.442e+02 2.563e+02 3.872e+02, threshold=4.884e+02, percent-clipped=0.0 2024-06-20 03:09:41,890 INFO [train.py:1028] (0/2) Epoch 7, batch 4000, loss[loss=0.2609, simple_loss=0.2931, pruned_loss=0.1143, over 12900.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.2932, pruned_loss=0.1203, over 2584112.88 frames. ], batch size: 39, lr: 8.43e-03, grad_scale: 32.0 2024-06-20 03:09:44,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118622.16666666667, ans=0.1 2024-06-20 03:09:48,553 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.493e+03 2024-06-20 03:09:55,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.44 vs. limit=22.5 2024-06-20 03:10:12,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.64 vs. 
limit=15.0 2024-06-20 03:10:13,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=118695.5, ans=0.0 2024-06-20 03:10:16,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=118695.5, ans=0.125 2024-06-20 03:10:18,813 INFO [train.py:1028] (0/2) Epoch 7, batch 4050, loss[loss=0.3021, simple_loss=0.3049, pruned_loss=0.1497, over 10948.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.2931, pruned_loss=0.1207, over 2581733.95 frames. ], batch size: 304, lr: 8.43e-03, grad_scale: 32.0 2024-06-20 03:10:25,314 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.425e+03 2024-06-20 03:10:34,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.19 vs. limit=15.0 2024-06-20 03:10:38,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=118768.83333333333, ans=0.07 2024-06-20 03:10:41,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=27.80 vs. limit=22.5 2024-06-20 03:10:46,529 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.320e+02 2.579e+02 2.998e+02 4.314e+02, threshold=5.157e+02, percent-clipped=0.0 2024-06-20 03:10:47,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=118787.16666666667, ans=0.2 2024-06-20 03:10:48,527 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.79 vs. limit=22.5 2024-06-20 03:10:51,742 INFO [train.py:1028] (0/2) Epoch 7, batch 4100, loss[loss=0.2527, simple_loss=0.2714, pruned_loss=0.117, over 13169.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.2934, pruned_loss=0.1212, over 2579181.46 frames. ], batch size: 103, lr: 8.42e-03, grad_scale: 32.0 2024-06-20 03:10:54,201 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.86 vs. limit=6.0 2024-06-20 03:10:54,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=118805.5, ans=0.125 2024-06-20 03:10:59,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=118805.5, ans=0.0 2024-06-20 03:11:03,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=118823.83333333333, ans=0.125 2024-06-20 03:11:05,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.32 vs. 
limit=12.0 2024-06-20 03:11:06,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=118823.83333333333, ans=0.05 2024-06-20 03:11:06,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=118823.83333333333, ans=0.125 2024-06-20 03:11:07,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=118823.83333333333, ans=0.125 2024-06-20 03:11:21,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2024-06-20 03:11:27,854 INFO [train.py:1028] (0/2) Epoch 7, batch 4150, loss[loss=0.2729, simple_loss=0.2976, pruned_loss=0.1241, over 13118.00 frames. ], tot_loss[loss=0.268, simple_loss=0.2936, pruned_loss=0.1212, over 2577890.76 frames. ], batch size: 55, lr: 8.42e-03, grad_scale: 32.0 2024-06-20 03:11:27,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=118897.16666666667, ans=0.0 2024-06-20 03:11:29,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=12.0 2024-06-20 03:11:30,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=118897.16666666667, ans=0.0 2024-06-20 03:11:34,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=118915.5, ans=0.2 2024-06-20 03:11:43,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=118933.83333333333, ans=0.125 2024-06-20 03:11:43,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=118933.83333333333, ans=0.0 2024-06-20 03:11:47,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=118952.16666666667, ans=0.0 2024-06-20 03:11:52,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=118952.16666666667, ans=0.125 2024-06-20 03:11:56,051 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.525e+02 2.820e+02 3.215e+02 4.715e+02, threshold=5.639e+02, percent-clipped=0.0 2024-06-20 03:11:58,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118970.5, ans=0.1 2024-06-20 03:12:00,525 INFO [train.py:1028] (0/2) Epoch 7, batch 4200, loss[loss=0.2536, simple_loss=0.2787, pruned_loss=0.1143, over 12991.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.2929, pruned_loss=0.1205, over 2580504.62 frames. ], batch size: 102, lr: 8.42e-03, grad_scale: 32.0 2024-06-20 03:12:03,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=118988.83333333333, ans=0.0 2024-06-20 03:12:17,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.77 vs. 
limit=22.5 2024-06-20 03:12:20,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=119025.5, ans=0.07 2024-06-20 03:12:24,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=119043.83333333333, ans=0.125 2024-06-20 03:12:29,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=119062.16666666667, ans=10.0 2024-06-20 03:12:35,916 INFO [train.py:1028] (0/2) Epoch 7, batch 4250, loss[loss=0.2443, simple_loss=0.2768, pruned_loss=0.1059, over 13309.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.292, pruned_loss=0.1198, over 2581535.65 frames. ], batch size: 46, lr: 8.41e-03, grad_scale: 32.0 2024-06-20 03:12:40,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119080.5, ans=0.1 2024-06-20 03:12:51,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=119117.16666666667, ans=0.125 2024-06-20 03:13:04,281 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.470e+02 2.746e+02 3.092e+02 5.160e+02, threshold=5.491e+02, percent-clipped=0.0 2024-06-20 03:13:04,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=119153.83333333333, ans=0.0 2024-06-20 03:13:11,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=119172.16666666667, ans=0.1 2024-06-20 03:13:12,459 INFO [train.py:1028] (0/2) Epoch 7, batch 4300, loss[loss=0.2918, simple_loss=0.3218, pruned_loss=0.1309, over 13190.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.292, pruned_loss=0.1196, over 2581668.84 frames. ], batch size: 59, lr: 8.41e-03, grad_scale: 32.0 2024-06-20 03:13:16,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.30 vs. limit=15.0 2024-06-20 03:13:17,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=119172.16666666667, ans=0.2 2024-06-20 03:13:18,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=119190.5, ans=0.95 2024-06-20 03:13:32,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=119227.16666666667, ans=0.0 2024-06-20 03:13:45,043 INFO [train.py:1028] (0/2) Epoch 7, batch 4350, loss[loss=0.2349, simple_loss=0.2679, pruned_loss=0.101, over 13157.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.2911, pruned_loss=0.1191, over 2585837.39 frames. 
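The periodic optim.py warnings summarize the recent distribution of gradient norms as five quantiles (min, 25%, median, 75%, max) and report the clipping threshold next to them. In every warning in this span the threshold is exactly 2.0 times the logged median (e.g. threshold=6.034e+02 = 2.0 * 3.017e+02 above), matching Clipping_scale=2.0, so percent-clipped counts how often a norm exceeded twice the recent median. A sketch under that reading:

```python
import torch

# Sketch: derive a clipping threshold from a window of recent gradient norms,
# following the pattern in the optim.py warnings, where
# threshold = clipping_scale * median (e.g. 6.034e+02 = 2.0 * 3.017e+02).
def clip_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    q = torch.quantile(grad_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                 # scale * median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

norms = 250.0 + 60.0 * torch.randn(1000).abs()        # synthetic norms
quartiles, threshold, pct = clip_stats(norms)
print(quartiles.tolist(), float(threshold), float(pct))
```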
], batch size: 59, lr: 8.41e-03, grad_scale: 16.0 2024-06-20 03:13:55,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=119282.16666666667, ans=0.025 2024-06-20 03:14:04,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=119318.83333333333, ans=0.0 2024-06-20 03:14:18,954 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.488e+02 2.711e+02 3.113e+02 4.480e+02, threshold=5.422e+02, percent-clipped=0.0 2024-06-20 03:14:19,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=119337.16666666667, ans=0.125 2024-06-20 03:14:22,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=119337.16666666667, ans=0.0 2024-06-20 03:14:23,194 INFO [train.py:1028] (0/2) Epoch 7, batch 4400, loss[loss=0.2697, simple_loss=0.2893, pruned_loss=0.1251, over 13227.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.2908, pruned_loss=0.119, over 2584822.97 frames. ], batch size: 83, lr: 8.40e-03, grad_scale: 32.0 2024-06-20 03:14:25,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=119355.5, ans=0.2 2024-06-20 03:14:28,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.99 vs. limit=22.5 2024-06-20 03:14:35,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=119373.83333333333, ans=0.125 2024-06-20 03:14:47,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=119410.5, ans=0.0 2024-06-20 03:14:50,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=119428.83333333333, ans=0.125 2024-06-20 03:14:52,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=119428.83333333333, ans=0.125 2024-06-20 03:14:54,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=119428.83333333333, ans=0.025 2024-06-20 03:14:55,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=119428.83333333333, ans=0.125 2024-06-20 03:14:56,123 INFO [train.py:1028] (0/2) Epoch 7, batch 4450, loss[loss=0.2538, simple_loss=0.2846, pruned_loss=0.1115, over 12833.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.2917, pruned_loss=0.1199, over 2580485.13 frames. ], batch size: 33, lr: 8.40e-03, grad_scale: 16.0 2024-06-20 03:15:00,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=119447.16666666667, ans=0.2 2024-06-20 03:15:00,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=119447.16666666667, ans=0.125 2024-06-20 03:15:08,579 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.68 vs. 
limit=15.0 2024-06-20 03:15:28,421 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.570e+02 2.847e+02 3.179e+02 7.208e+02, threshold=5.693e+02, percent-clipped=1.0 2024-06-20 03:15:30,686 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=4.188e+01 2024-06-20 03:15:30,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=119520.5, ans=0.125 2024-06-20 03:15:31,762 INFO [train.py:1028] (0/2) Epoch 7, batch 4500, loss[loss=0.2597, simple_loss=0.2873, pruned_loss=0.116, over 13270.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.2905, pruned_loss=0.1192, over 2584560.93 frames. ], batch size: 89, lr: 8.40e-03, grad_scale: 16.0 2024-06-20 03:15:39,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=119557.16666666667, ans=0.125 2024-06-20 03:15:40,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=119557.16666666667, ans=0.07 2024-06-20 03:15:53,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=119593.83333333333, ans=0.0 2024-06-20 03:16:04,275 INFO [train.py:1028] (0/2) Epoch 7, batch 4550, loss[loss=0.2335, simple_loss=0.2658, pruned_loss=0.1006, over 13248.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.2901, pruned_loss=0.1186, over 2587987.61 frames. ], batch size: 52, lr: 8.40e-03, grad_scale: 16.0 2024-06-20 03:16:11,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=119630.5, ans=0.0 2024-06-20 03:16:13,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=119630.5, ans=0.035 2024-06-20 03:16:14,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.30 vs. limit=10.0 2024-06-20 03:16:16,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=119648.83333333333, ans=0.125 2024-06-20 03:16:18,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=119648.83333333333, ans=15.0 2024-06-20 03:16:24,306 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.47 vs. 
limit=12.0 2024-06-20 03:16:25,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=119667.16666666667, ans=0.0 2024-06-20 03:16:26,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=119667.16666666667, ans=0.125 2024-06-20 03:16:30,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=119685.5, ans=0.0 2024-06-20 03:16:36,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=119703.83333333333, ans=0.025 2024-06-20 03:16:39,068 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.347e+02 2.548e+02 2.845e+02 6.369e+02, threshold=5.097e+02, percent-clipped=1.0 2024-06-20 03:16:40,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=119703.83333333333, ans=0.125 2024-06-20 03:16:42,466 INFO [train.py:1028] (0/2) Epoch 7, batch 4600, loss[loss=0.2838, simple_loss=0.3043, pruned_loss=0.1316, over 12526.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.29, pruned_loss=0.1185, over 2584257.51 frames. ], batch size: 202, lr: 8.39e-03, grad_scale: 16.0 2024-06-20 03:16:42,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=119722.16666666667, ans=0.0 2024-06-20 03:16:43,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=119722.16666666667, ans=0.0 2024-06-20 03:16:43,559 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.71 vs. limit=15.0 2024-06-20 03:16:44,803 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:16:56,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=119758.83333333333, ans=0.0 2024-06-20 03:16:57,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=119758.83333333333, ans=0.0 2024-06-20 03:16:59,201 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-20 03:17:18,235 INFO [train.py:1028] (0/2) Epoch 7, batch 4650, loss[loss=0.2812, simple_loss=0.2971, pruned_loss=0.1326, over 13107.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.2888, pruned_loss=0.1179, over 2586981.04 frames. ], batch size: 132, lr: 8.39e-03, grad_scale: 16.0 2024-06-20 03:17:31,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119850.5, ans=0.1 2024-06-20 03:17:39,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119868.83333333333, ans=0.125 2024-06-20 03:17:47,545 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.379e+02 2.554e+02 2.861e+02 4.607e+02, threshold=5.109e+02, percent-clipped=0.0 2024-06-20 03:17:51,298 INFO [train.py:1028] (0/2) Epoch 7, batch 4700, loss[loss=0.268, simple_loss=0.3048, pruned_loss=0.1156, over 12415.00 frames. 
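The "Whitening: ... metric=X vs. limit=Y" lines fire when a layer's activation statistics drift far from white, i.e. when the channel covariance is far from a multiple of the identity. One natural metric with the right behaviour is E[lambda^2] / E[lambda]^2 over the covariance eigenvalues: it equals 1.0 for perfectly white features and grows as the spectrum becomes lopsided. The sketch below uses that form as an illustration of the idea; it is not necessarily the exact formula in scaling.py.

```python
import torch

# Hedged sketch of a whitening metric in the spirit of the
# "metric=... vs. limit=..." entries: mean squared covariance eigenvalue
# divided by the squared mean eigenvalue. 1.0 means fully white; larger
# values mean a few directions dominate the activations.
def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    x = x.reshape(-1, x.shape[-1])            # (frames, channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]              # channel covariance
    eigs = torch.linalg.eigvalsh(cov)
    return (eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)

x = torch.randn(2000, 384)                    # near-white data -> small metric
print(float(whitening_metric(x)))
```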
], tot_loss[loss=0.2632, simple_loss=0.2896, pruned_loss=0.1184, over 2582116.13 frames. ], batch size: 25, lr: 8.39e-03, grad_scale: 16.0 2024-06-20 03:18:02,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=119923.83333333333, ans=15.0 2024-06-20 03:18:09,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119960.5, ans=0.1 2024-06-20 03:18:15,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=119960.5, ans=0.025 2024-06-20 03:18:15,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.87 vs. limit=22.5 2024-06-20 03:18:24,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=119978.83333333333, ans=0.125 2024-06-20 03:18:26,788 INFO [train.py:1028] (0/2) Epoch 7, batch 4750, loss[loss=0.2986, simple_loss=0.3097, pruned_loss=0.1437, over 12620.00 frames. ], tot_loss[loss=0.263, simple_loss=0.2893, pruned_loss=0.1183, over 2579624.05 frames. ], batch size: 202, lr: 8.38e-03, grad_scale: 16.0 2024-06-20 03:18:34,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=120015.5, ans=0.125 2024-06-20 03:18:41,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=120033.83333333333, ans=0.125 2024-06-20 03:18:44,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=120033.83333333333, ans=0.125 2024-06-20 03:18:45,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.52 vs. limit=22.5 2024-06-20 03:18:47,918 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.38 vs. limit=15.0 2024-06-20 03:18:54,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=120070.5, ans=0.05 2024-06-20 03:18:56,368 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.311e+02 2.626e+02 3.007e+02 5.190e+02, threshold=5.252e+02, percent-clipped=1.0 2024-06-20 03:19:03,750 INFO [train.py:1028] (0/2) Epoch 7, batch 4800, loss[loss=0.2443, simple_loss=0.2701, pruned_loss=0.1093, over 13211.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.2888, pruned_loss=0.1181, over 2576925.40 frames. ], batch size: 63, lr: 8.38e-03, grad_scale: 32.0 2024-06-20 03:19:12,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=120107.16666666667, ans=0.2 2024-06-20 03:19:13,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=120107.16666666667, ans=0.125 2024-06-20 03:19:16,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=120125.5, ans=0.025 2024-06-20 03:19:17,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.54 vs. 
limit=15.0 2024-06-20 03:19:21,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.14 vs. limit=15.0 2024-06-20 03:19:36,028 INFO [train.py:1028] (0/2) Epoch 7, batch 4850, loss[loss=0.2573, simple_loss=0.2817, pruned_loss=0.1164, over 13226.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.2883, pruned_loss=0.118, over 2574987.34 frames. ], batch size: 89, lr: 8.38e-03, grad_scale: 32.0 2024-06-20 03:19:37,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.06 vs. limit=22.5 2024-06-20 03:19:40,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.47 vs. limit=22.5 2024-06-20 03:19:45,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=120198.83333333333, ans=0.125 2024-06-20 03:19:50,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=120217.16666666667, ans=0.1 2024-06-20 03:19:50,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=47.38 vs. limit=15.0 2024-06-20 03:19:52,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120217.16666666667, ans=0.125 2024-06-20 03:20:06,140 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=8.0 2024-06-20 03:20:09,539 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.372e+02 2.668e+02 2.933e+02 4.071e+02, threshold=5.336e+02, percent-clipped=0.0 2024-06-20 03:20:11,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=120253.83333333333, ans=0.0 2024-06-20 03:20:13,007 INFO [train.py:1028] (0/2) Epoch 7, batch 4900, loss[loss=0.2305, simple_loss=0.2717, pruned_loss=0.09469, over 13201.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.288, pruned_loss=0.1175, over 2575968.90 frames. ], batch size: 59, lr: 8.37e-03, grad_scale: 32.0 2024-06-20 03:20:13,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.69 vs. limit=22.5 2024-06-20 03:20:14,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=120272.16666666667, ans=0.2 2024-06-20 03:20:16,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=120272.16666666667, ans=0.125 2024-06-20 03:20:19,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=120290.5, ans=0.0 2024-06-20 03:20:22,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=120290.5, ans=0.025 2024-06-20 03:20:24,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.22 vs. 
limit=15.0 2024-06-20 03:20:27,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120308.83333333333, ans=0.125 2024-06-20 03:20:30,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.98 vs. limit=10.0 2024-06-20 03:20:33,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=120327.16666666667, ans=0.125 2024-06-20 03:20:38,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=120327.16666666667, ans=0.0 2024-06-20 03:20:38,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=15.0 2024-06-20 03:20:45,972 INFO [train.py:1028] (0/2) Epoch 7, batch 4950, loss[loss=0.2664, simple_loss=0.2763, pruned_loss=0.1283, over 10865.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.2886, pruned_loss=0.1183, over 2570577.02 frames. ], batch size: 303, lr: 8.37e-03, grad_scale: 32.0 2024-06-20 03:20:59,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=120400.5, ans=0.125 2024-06-20 03:21:09,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=120418.83333333333, ans=0.0 2024-06-20 03:21:14,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=120437.16666666667, ans=0.035 2024-06-20 03:21:15,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=120437.16666666667, ans=0.125 2024-06-20 03:21:16,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=120437.16666666667, ans=0.0 2024-06-20 03:21:17,982 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.277e+02 2.597e+02 2.962e+02 4.569e+02, threshold=5.195e+02, percent-clipped=0.0 2024-06-20 03:21:21,333 INFO [train.py:1028] (0/2) Epoch 7, batch 5000, loss[loss=0.2621, simple_loss=0.2863, pruned_loss=0.1189, over 13156.00 frames. ], tot_loss[loss=0.261, simple_loss=0.2875, pruned_loss=0.1172, over 2573867.37 frames. 
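With fp16 training, the grad_scale field tracks the dynamic loss-scaling factor: it sits at 32.0 through most of this span but dips to 16.0 around batches 4350-4750 before settling back at 32.0, the classic halve-on-overflow, grow-back-when-stable pattern. A sketch of that mechanism follows; the growth interval is an assumption for illustration.

```python
# Sketch of dynamic fp16 loss scaling, consistent with grad_scale bouncing
# between 32.0 and 16.0 in the entries above: halve after a non-finite
# gradient, double back after a run of clean steps.
class GradScaleSketch:
    def __init__(self, scale: float = 32.0, growth_interval: int = 400):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            self.scale /= 2.0            # back off after an overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= 2.0        # recover once gradients stay finite
```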
], batch size: 95, lr: 8.37e-03, grad_scale: 32.0 2024-06-20 03:21:21,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120455.5, ans=0.125 2024-06-20 03:21:23,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=120455.5, ans=0.07 2024-06-20 03:21:37,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=120492.16666666667, ans=0.04949747468305833 2024-06-20 03:21:48,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=120528.83333333333, ans=0.0 2024-06-20 03:21:50,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=120528.83333333333, ans=0.0 2024-06-20 03:21:51,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.71 vs. limit=10.0 2024-06-20 03:21:54,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=120547.16666666667, ans=0.125 2024-06-20 03:21:54,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=120547.16666666667, ans=0.125 2024-06-20 03:21:54,753 INFO [train.py:1028] (0/2) Epoch 7, batch 5050, loss[loss=0.2522, simple_loss=0.2861, pruned_loss=0.1091, over 12932.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.2878, pruned_loss=0.117, over 2572164.88 frames. ], batch size: 36, lr: 8.36e-03, grad_scale: 32.0 2024-06-20 03:21:56,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=120547.16666666667, ans=0.2 2024-06-20 03:22:41,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=120602.16666666667, ans=0.1 2024-06-20 03:22:47,525 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.294e+02 2.491e+02 2.842e+02 3.979e+02, threshold=4.982e+02, percent-clipped=0.0 2024-06-20 03:22:52,762 INFO [train.py:1028] (0/2) Epoch 7, batch 5100, loss[loss=0.2857, simple_loss=0.3085, pruned_loss=0.1314, over 12983.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.2878, pruned_loss=0.1175, over 2568363.09 frames. ], batch size: 39, lr: 8.36e-03, grad_scale: 32.0 2024-06-20 03:22:58,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=120638.83333333333, ans=0.0 2024-06-20 03:23:02,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=120657.16666666667, ans=0.125 2024-06-20 03:23:19,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.02 vs. limit=10.0 2024-06-20 03:23:23,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=120712.16666666667, ans=0.125 2024-06-20 03:23:28,123 INFO [train.py:1028] (0/2) Epoch 7, batch 5150, loss[loss=0.2578, simple_loss=0.2788, pruned_loss=0.1184, over 13093.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.2881, pruned_loss=0.1181, over 2570605.21 frames. 
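The lr field shrinks only slightly across thousands of batches (8.50e-03 at batch 3000 down to about 8.36e-03 here) because the scheduler is a smooth function of both the batch counter and the epoch counter. An Eden-style schedule has the shape lr = base_lr * ((b^2 + B^2)/B^2)^(-1/4) * ((e^2 + E^2)/E^2)^(-1/4); the sketch below uses that form with illustrative constants, not this run's settings.

```python
# Sketch of an Eden-style LR schedule, whose slow decay matches the logged lr
# drifting from 8.50e-03 to ~8.36e-03 over this span. base_lr, lr_batches and
# lr_epochs are illustrative constants, not this run's settings.
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 10_000.0, lr_epochs: float = 4.0) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Early on the factors are ~1; late in training they dominate base_lr:
print(eden_lr(0.04, batch=1_000, epoch=0.5))    # close to base_lr
print(eden_lr(0.04, batch=120_000, epoch=6.5))  # much smaller, decaying slowly
```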
], batch size: 132, lr: 8.36e-03, grad_scale: 32.0 2024-06-20 03:23:28,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=120730.5, ans=0.2 2024-06-20 03:23:34,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.60 vs. limit=22.5 2024-06-20 03:23:38,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=120748.83333333333, ans=0.0 2024-06-20 03:23:45,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=120767.16666666667, ans=0.125 2024-06-20 03:23:45,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.77 vs. limit=10.0 2024-06-20 03:23:46,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.27 vs. limit=10.0 2024-06-20 03:23:49,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=120785.5, ans=0.0 2024-06-20 03:23:57,552 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.303e+02 2.458e+02 2.816e+02 3.894e+02, threshold=4.916e+02, percent-clipped=0.0 2024-06-20 03:23:57,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=120803.83333333333, ans=0.2 2024-06-20 03:23:58,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=120803.83333333333, ans=0.07 2024-06-20 03:23:59,986 INFO [train.py:1028] (0/2) Epoch 7, batch 5200, loss[loss=0.262, simple_loss=0.2909, pruned_loss=0.1165, over 13122.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.2874, pruned_loss=0.1176, over 2573352.98 frames. ], batch size: 95, lr: 8.35e-03, grad_scale: 32.0 2024-06-20 03:24:03,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=120822.16666666667, ans=0.0 2024-06-20 03:24:04,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=120822.16666666667, ans=0.2 2024-06-20 03:24:07,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.23 vs. limit=15.0 2024-06-20 03:24:10,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=12.0 2024-06-20 03:24:12,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.12 vs. 
limit=22.5 2024-06-20 03:24:25,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=120877.16666666667, ans=0.0 2024-06-20 03:24:33,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120895.5, ans=0.1 2024-06-20 03:24:33,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=120895.5, ans=0.0 2024-06-20 03:24:35,729 INFO [train.py:1028] (0/2) Epoch 7, batch 5250, loss[loss=0.2556, simple_loss=0.287, pruned_loss=0.1121, over 13290.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.2878, pruned_loss=0.1178, over 2570799.54 frames. ], batch size: 52, lr: 8.35e-03, grad_scale: 16.0 2024-06-20 03:24:38,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=120913.83333333333, ans=0.125 2024-06-20 03:24:44,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=8.0 2024-06-20 03:24:53,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. limit=10.0 2024-06-20 03:24:56,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.63 vs. limit=15.0 2024-06-20 03:24:56,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.77 vs. limit=15.0 2024-06-20 03:25:06,981 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.261e+02 2.482e+02 2.850e+02 4.798e+02, threshold=4.964e+02, percent-clipped=0.0 2024-06-20 03:25:08,909 INFO [train.py:1028] (0/2) Epoch 7, batch 5300, loss[loss=0.2506, simple_loss=0.2716, pruned_loss=0.1148, over 13075.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.2881, pruned_loss=0.1182, over 2568273.37 frames. ], batch size: 144, lr: 8.35e-03, grad_scale: 16.0 2024-06-20 03:25:11,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121005.5, ans=0.1 2024-06-20 03:25:30,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=121042.16666666667, ans=0.2 2024-06-20 03:25:33,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=121060.5, ans=0.125 2024-06-20 03:25:37,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121060.5, ans=0.1 2024-06-20 03:25:45,439 INFO [train.py:1028] (0/2) Epoch 7, batch 5350, loss[loss=0.235, simple_loss=0.2772, pruned_loss=0.0964, over 12016.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.2878, pruned_loss=0.1179, over 2574657.06 frames. 
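The batch size field in these entries swings from the low twenties up to 304 while the frame counts per batch stay comparable, because batches are assembled to a roughly constant total duration rather than a fixed number of utterances: short cuts pack densely, long cuts thinly. A duration-capped batching sketch (the 600-second cap is illustrative, not this run's setting):

```python
from typing import Iterable, Iterator, List

# Sketch of duration-capped batching, which is why the logged "batch size"
# varies so widely while total frames per batch stay comparable.
def duration_batches(durations: Iterable[float],
                     max_duration: float = 600.0) -> Iterator[List[float]]:
    batch: List[float] = []
    total = 0.0
    for d in durations:
        if batch and total + d > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        yield batch

print([len(b) for b in duration_batches([2.0] * 100)])   # one dense batch
print([len(b) for b in duration_batches([30.0] * 100)])  # many small batches
```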
2024-06-20 03:25:50,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121097.16666666667, ans=0.1
2024-06-20 03:25:51,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=121115.5, ans=0.07
2024-06-20 03:25:53,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121115.5, ans=0.125
2024-06-20 03:26:11,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121152.16666666667, ans=0.1
2024-06-20 03:26:12,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=121152.16666666667, ans=0.125
2024-06-20 03:26:12,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=121152.16666666667, ans=0.125
2024-06-20 03:26:19,529 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.219e+02 2.469e+02 2.731e+02 4.565e+02, threshold=4.939e+02, percent-clipped=0.0
2024-06-20 03:26:19,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=121170.5, ans=0.125
2024-06-20 03:26:21,513 INFO [train.py:1028] (0/2) Epoch 7, batch 5400, loss[loss=0.3221, simple_loss=0.3267, pruned_loss=0.1588, over 12236.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.2877, pruned_loss=0.118, over 2567220.25 frames. ], batch size: 241, lr: 8.34e-03, grad_scale: 16.0
2024-06-20 03:26:31,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=121207.16666666667, ans=0.2
2024-06-20 03:26:43,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=121243.83333333333, ans=0.125
2024-06-20 03:26:43,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0
2024-06-20 03:26:43,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.35 vs. limit=22.5
2024-06-20 03:26:55,091 INFO [train.py:1028] (0/2) Epoch 7, batch 5450, loss[loss=0.2486, simple_loss=0.2766, pruned_loss=0.1103, over 12472.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.287, pruned_loss=0.1174, over 2570132.58 frames. ], batch size: 25, lr: 8.34e-03, grad_scale: 16.0
2024-06-20 03:27:15,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=121317.16666666667, ans=0.0
2024-06-20 03:27:16,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.48 vs. limit=10.0
2024-06-20 03:27:19,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=12.0
2024-06-20 03:27:20,439 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.289e+01
2024-06-20 03:27:24,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=121353.83333333333, ans=10.0
2024-06-20 03:27:29,208 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.228e+02 2.507e+02 2.768e+02 3.960e+02, threshold=5.014e+02, percent-clipped=0.0
2024-06-20 03:27:31,311 INFO [train.py:1028] (0/2) Epoch 7, batch 5500, loss[loss=0.317, simple_loss=0.3181, pruned_loss=0.158, over 12217.00 frames. ], tot_loss[loss=0.261, simple_loss=0.2873, pruned_loss=0.1174, over 2564831.66 frames. ], batch size: 241, lr: 8.34e-03, grad_scale: 16.0
2024-06-20 03:27:32,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.75 vs. limit=15.0
2024-06-20 03:27:33,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.31 vs. limit=15.0
2024-06-20 03:27:38,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.11 vs. limit=15.0
2024-06-20 03:27:38,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121390.5, ans=0.1
2024-06-20 03:27:45,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=121408.83333333333, ans=0.0
2024-06-20 03:27:45,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=121408.83333333333, ans=0.125
2024-06-20 03:28:04,119 INFO [train.py:1028] (0/2) Epoch 7, batch 5550, loss[loss=0.2591, simple_loss=0.2882, pruned_loss=0.115, over 13313.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.2868, pruned_loss=0.1169, over 2569840.62 frames. ], batch size: 43, lr: 8.33e-03, grad_scale: 16.0
2024-06-20 03:28:15,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0
2024-06-20 03:28:16,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121482.16666666667, ans=0.1
2024-06-20 03:28:17,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121482.16666666667, ans=0.1
2024-06-20 03:28:23,490 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0
2024-06-20 03:28:38,031 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.185e+02 2.488e+02 2.819e+02 3.782e+02, threshold=4.975e+02, percent-clipped=0.0
2024-06-20 03:28:39,946 INFO [train.py:1028] (0/2) Epoch 7, batch 5600, loss[loss=0.262, simple_loss=0.2871, pruned_loss=0.1185, over 13232.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.2858, pruned_loss=0.1163, over 2571545.85 frames. ], batch size: 89, lr: 8.33e-03, grad_scale: 32.0
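
The recurring optim.py warnings summarize the recent distribution of gradient norms (min, quartiles, max) together with the clipping threshold currently in force; percent-clipped says how often the cap actually bit. A rough re-creation of that bookkeeping, assuming the threshold is tied to the median norm via the logged Clipping_scale (the exact statistic icefall uses may differ):

    # Sketch: track recent grad norms, report quartiles, clip against 2 x median.
    import collections
    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = collections.deque(maxlen=window)

        def clip_(self, parameters) -> None:
            params = [p for p in parameters if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            self.norms.append(float(norm))
            hist = sorted(self.norms)
            q = [hist[int(f * (len(hist) - 1))] for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.clipping_scale * q[2]   # 2.0 x median
            print(f"grad-norm quartiles {q}, threshold={threshold:.3e}")
            if norm > threshold:                     # rescale only the outliers
                for p in params:
                    p.grad.mul_(threshold / norm)
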
2024-06-20 03:28:47,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=121573.83333333333, ans=0.0
2024-06-20 03:28:55,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=121592.16666666667, ans=0.95
2024-06-20 03:28:56,196 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 03:29:15,920 INFO [train.py:1028] (0/2) Epoch 7, batch 5650, loss[loss=0.2941, simple_loss=0.3061, pruned_loss=0.141, over 12601.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.2859, pruned_loss=0.116, over 2574912.98 frames. ], batch size: 202, lr: 8.33e-03, grad_scale: 16.0
2024-06-20 03:29:21,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=121647.16666666667, ans=10.0
2024-06-20 03:29:21,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.47 vs. limit=10.0
2024-06-20 03:29:30,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=121683.83333333333, ans=0.125
2024-06-20 03:29:30,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=121683.83333333333, ans=0.125
2024-06-20 03:29:32,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=121683.83333333333, ans=0.2
2024-06-20 03:29:36,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0
2024-06-20 03:29:47,416 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.213e+02 2.396e+02 2.699e+02 4.366e+02, threshold=4.793e+02, percent-clipped=0.0
2024-06-20 03:29:47,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=121720.5, ans=0.0
2024-06-20 03:29:48,811 INFO [train.py:1028] (0/2) Epoch 7, batch 5700, loss[loss=0.2286, simple_loss=0.2632, pruned_loss=0.09705, over 13231.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.2858, pruned_loss=0.1158, over 2578926.15 frames. ], batch size: 63, lr: 8.32e-03, grad_scale: 16.0
2024-06-20 03:29:53,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.23 vs. limit=15.0
2024-06-20 03:30:00,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=121757.16666666667, ans=0.025
2024-06-20 03:30:03,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=121775.5, ans=0.0
2024-06-20 03:30:13,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=121793.83333333333, ans=0.125
2024-06-20 03:30:15,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=121793.83333333333, ans=0.0
2024-06-20 03:30:22,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.34 vs. limit=22.5
2024-06-20 03:30:25,720 INFO [train.py:1028] (0/2) Epoch 7, batch 5750, loss[loss=0.3, simple_loss=0.312, pruned_loss=0.144, over 12770.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.2875, pruned_loss=0.1169, over 2579886.63 frames. ], batch size: 176, lr: 8.32e-03, grad_scale: 16.0
2024-06-20 03:30:29,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=121830.5, ans=0.0
2024-06-20 03:30:30,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121830.5, ans=0.1
2024-06-20 03:30:31,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=121830.5, ans=0.125
2024-06-20 03:30:58,609 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.499e+02 2.844e+02 3.088e+02 5.345e+02, threshold=5.687e+02, percent-clipped=1.0
2024-06-20 03:30:59,854 INFO [train.py:1028] (0/2) Epoch 7, batch 5800, loss[loss=0.2725, simple_loss=0.2853, pruned_loss=0.1299, over 12759.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.2894, pruned_loss=0.1184, over 2579074.95 frames. ], batch size: 176, lr: 8.32e-03, grad_scale: 16.0
2024-06-20 03:31:06,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=121940.5, ans=0.0
2024-06-20 03:31:08,720 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 03:31:09,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.69 vs. limit=15.0
2024-06-20 03:31:14,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=15.0
2024-06-20 03:31:16,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=121940.5, ans=0.125
2024-06-20 03:31:18,578 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.542e+02
2024-06-20 03:31:37,697 INFO [train.py:1028] (0/2) Epoch 7, batch 5850, loss[loss=0.2801, simple_loss=0.3003, pruned_loss=0.13, over 12538.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.2914, pruned_loss=0.1194, over 2577926.90 frames. ], batch size: 202, lr: 8.31e-03, grad_scale: 16.0
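
In the per-batch lines, loss[...] describes the current batch and tot_loss[...] a running average; both are normalized per frame, so each batch contributes in proportion to its frame count rather than equally. A small sketch of that accounting (illustrative; the actual tracker in train.py aggregates more fields):

    # Sketch: per-frame batch loss folded into a frame-weighted running average.
    class RunningLoss:
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss_per_frame: float, batch_frames: float):
            self.loss_sum += batch_loss_per_frame * batch_frames
            self.frames += batch_frames

        @property
        def per_frame(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tot = RunningLoss()
    tot.update(0.262, 13122.0)   # one batch: loss over 13122.00 frames
    tot.update(0.255, 12900.0)
    print(f"tot_loss={tot.per_frame:.4f}, over {tot.frames:.2f} frames")
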
2024-06-20 03:31:49,301 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=1.016e-02
2024-06-20 03:31:52,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=122050.5, ans=0.0
2024-06-20 03:31:54,726 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.37 vs. limit=12.0
2024-06-20 03:31:59,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=122068.83333333333, ans=0.125
2024-06-20 03:32:01,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=122068.83333333333, ans=0.125
2024-06-20 03:32:12,601 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.369e+02 2.652e+02 3.014e+02 4.273e+02, threshold=5.303e+02, percent-clipped=0.0
2024-06-20 03:32:14,114 INFO [train.py:1028] (0/2) Epoch 7, batch 5900, loss[loss=0.2432, simple_loss=0.2711, pruned_loss=0.1076, over 13139.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.2937, pruned_loss=0.1205, over 2578635.73 frames. ], batch size: 121, lr: 8.31e-03, grad_scale: 16.0
2024-06-20 03:32:18,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=122105.5, ans=0.2
2024-06-20 03:32:20,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=122123.83333333333, ans=10.0
2024-06-20 03:32:22,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=122123.83333333333, ans=0.0
2024-06-20 03:32:25,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.02 vs. limit=22.5
2024-06-20 03:32:26,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.03 vs. limit=22.5
2024-06-20 03:32:28,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122142.16666666667, ans=0.1
2024-06-20 03:32:30,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=122142.16666666667, ans=0.05
2024-06-20 03:32:37,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=122160.5, ans=0.0
2024-06-20 03:32:40,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=122178.83333333333, ans=0.2
2024-06-20 03:32:40,571 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 03:32:40,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122178.83333333333, ans=0.1
2024-06-20 03:32:45,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=122178.83333333333, ans=10.0
2024-06-20 03:32:45,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=122178.83333333333, ans=0.2
2024-06-20 03:32:48,093 INFO [train.py:1028] (0/2) Epoch 7, batch 5950, loss[loss=0.2549, simple_loss=0.2828, pruned_loss=0.1135, over 13135.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.2952, pruned_loss=0.1211, over 2582275.23 frames. ], batch size: 121, lr: 8.31e-03, grad_scale: 16.0
2024-06-20 03:32:53,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=122197.16666666667, ans=0.0
2024-06-20 03:32:56,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=122215.5, ans=0.0
2024-06-20 03:33:01,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=122233.83333333333, ans=0.0
2024-06-20 03:33:21,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=26.53 vs. limit=22.5
2024-06-20 03:33:24,068 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.377e+02 2.580e+02 3.005e+02 4.739e+02, threshold=5.159e+02, percent-clipped=0.0
2024-06-20 03:33:25,328 INFO [train.py:1028] (0/2) Epoch 7, batch 6000, loss[loss=0.3249, simple_loss=0.3353, pruned_loss=0.1573, over 12181.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.2969, pruned_loss=0.1222, over 2575839.35 frames. ], batch size: 240, lr: 8.30e-03, grad_scale: 32.0
2024-06-20 03:33:25,329 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 03:33:31,553 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.4554, 3.7684, 3.2938, 4.6477], device='cuda:0')
2024-06-20 03:33:33,296 INFO [train.py:1060] (0/2) Epoch 7, validation: loss=0.2156, simple_loss=0.277, pruned_loss=0.07715, over 351949.00 frames.
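
During the batch-6000 validation pass, zipformer.py also logs attn_weights_entropy, one entropy value per attention head: near-uniform attention gives high entropy, sharply peaked attention low entropy, so a collapse toward zero flags a degenerate head. A sketch of how such a diagnostic can be computed (the shapes here are assumptions for illustration):

    # Sketch: per-head entropy of attention distributions, averaged over queries.
    import torch

    def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
        # attn: (num_heads, num_queries, num_keys), rows sum to 1
        ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (heads, queries)
        return ent.mean(dim=-1)                           # one value per head

    attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
    print(attn_weights_entropy(attn))  # tensor with one entry per head
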
2024-06-20 03:33:33,296 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16965MB
2024-06-20 03:33:34,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=122288.83333333333, ans=0.125
2024-06-20 03:33:34,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=122288.83333333333, ans=0.2
2024-06-20 03:33:42,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=122307.16666666667, ans=0.0
2024-06-20 03:34:00,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=122362.16666666667, ans=0.0
2024-06-20 03:34:05,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=122362.16666666667, ans=0.125
2024-06-20 03:34:07,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=122380.5, ans=0.125
2024-06-20 03:34:07,642 INFO [train.py:1028] (0/2) Epoch 7, batch 6050, loss[loss=0.2369, simple_loss=0.2719, pruned_loss=0.101, over 12866.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.2984, pruned_loss=0.1225, over 2578486.91 frames. ], batch size: 39, lr: 8.30e-03, grad_scale: 32.0
2024-06-20 03:34:09,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=122380.5, ans=0.09899494936611666
2024-06-20 03:34:11,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=122380.5, ans=0.125
2024-06-20 03:34:15,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=122398.83333333333, ans=0.0
2024-06-20 03:34:21,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=122398.83333333333, ans=0.125
2024-06-20 03:34:22,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=122398.83333333333, ans=0.125
2024-06-20 03:34:44,461 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.234e+02 2.408e+02 2.721e+02 3.814e+02, threshold=4.817e+02, percent-clipped=0.0
2024-06-20 03:34:45,866 INFO [train.py:1028] (0/2) Epoch 7, batch 6100, loss[loss=0.2709, simple_loss=0.2965, pruned_loss=0.1227, over 13098.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.2993, pruned_loss=0.1227, over 2580375.76 frames. ], batch size: 121, lr: 8.30e-03, grad_scale: 32.0
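
The "Maximum memory allocated" line reports the CUDA caching allocator's high-water mark, which PyTorch exposes directly; a minimal sketch of such a logging helper:

    # Sketch: report the peak CUDA memory use in MB, as the train.py:1061 lines do.
    import torch

    def log_max_memory(device: int = 0) -> None:
        if torch.cuda.is_available():
            mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
            print(f"Maximum memory allocated so far is {mb}MB")

    log_max_memory()
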
2024-06-20 03:34:46,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=122472.16666666667, ans=0.125
2024-06-20 03:34:49,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=122472.16666666667, ans=0.1
2024-06-20 03:34:51,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122472.16666666667, ans=0.1
2024-06-20 03:34:53,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=122490.5, ans=0.125
2024-06-20 03:35:00,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122508.83333333333, ans=0.1
2024-06-20 03:35:01,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=122508.83333333333, ans=0.1
2024-06-20 03:35:07,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=122527.16666666667, ans=0.125
2024-06-20 03:35:18,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=122545.5, ans=0.2
2024-06-20 03:35:19,518 INFO [train.py:1028] (0/2) Epoch 7, batch 6150, loss[loss=0.305, simple_loss=0.3156, pruned_loss=0.1471, over 10852.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3008, pruned_loss=0.1237, over 2578267.80 frames. ], batch size: 303, lr: 8.30e-03, grad_scale: 32.0
2024-06-20 03:35:33,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=122582.16666666667, ans=0.025
2024-06-20 03:35:43,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.57 vs. limit=10.0
2024-06-20 03:35:54,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=122637.16666666667, ans=0.04949747468305833
2024-06-20 03:35:57,340 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.462e+02 2.680e+02 2.939e+02 3.804e+02, threshold=5.360e+02, percent-clipped=0.0
2024-06-20 03:35:58,688 INFO [train.py:1028] (0/2) Epoch 7, batch 6200, loss[loss=0.3122, simple_loss=0.3361, pruned_loss=0.1441, over 13261.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3024, pruned_loss=0.1244, over 2575715.46 frames. ], batch size: 89, lr: 8.29e-03, grad_scale: 32.0
2024-06-20 03:36:05,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=122673.83333333333, ans=0.07
2024-06-20 03:36:19,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=122692.16666666667, ans=0.0
2024-06-20 03:36:22,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=122710.5, ans=0.2
2024-06-20 03:36:23,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0
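
The grad_scale column in the batch lines is the fp16 loss scale, and its moves between 16.0 and 32.0 are the usual dynamic-scaling policy: halve when a step produces inf/nan gradients, cautiously double after a long run of clean steps. A minimal re-creation of that policy (PyTorch's GradScaler implements this internally; the growth interval below is an assumed value):

    # Sketch: dynamic fp16 loss scaling, as reflected by the grad_scale column.
    class LossScaler:
        def __init__(self, scale=32.0, growth_interval=2000):
            self.scale = scale
            self.growth_interval = growth_interval
            self._good_steps = 0

        def update(self, found_inf: bool):
            if found_inf:
                self.scale /= 2.0          # back off on overflow
                self._good_steps = 0
            else:
                self._good_steps += 1
                if self._good_steps % self.growth_interval == 0:
                    self.scale *= 2.0      # grow back after a clean stretch

    scaler = LossScaler()
    scaler.update(found_inf=True)
    print(scaler.scale)  # 16.0, as in the "grad_scale: 16.0" entries
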
2024-06-20 03:36:34,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=122747.16666666667, ans=0.0
2024-06-20 03:36:35,391 INFO [train.py:1028] (0/2) Epoch 7, batch 6250, loss[loss=0.2698, simple_loss=0.2995, pruned_loss=0.1201, over 13228.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.304, pruned_loss=0.1252, over 2568664.69 frames. ], batch size: 83, lr: 8.29e-03, grad_scale: 32.0
2024-06-20 03:36:39,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=122747.16666666667, ans=0.125
2024-06-20 03:36:47,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=122765.5, ans=0.125
2024-06-20 03:36:49,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.61 vs. limit=12.0
2024-06-20 03:36:51,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=122783.83333333333, ans=0.0
2024-06-20 03:36:53,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.97 vs. limit=15.0
2024-06-20 03:36:57,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122802.16666666667, ans=0.1
2024-06-20 03:37:01,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=122820.5, ans=0.2
2024-06-20 03:37:04,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122820.5, ans=0.1
2024-06-20 03:37:04,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=122820.5, ans=0.2
2024-06-20 03:37:06,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=122820.5, ans=0.125
2024-06-20 03:37:06,618 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.294e+02 2.622e+02 2.981e+02 5.644e+02, threshold=5.244e+02, percent-clipped=1.0
2024-06-20 03:37:07,924 INFO [train.py:1028] (0/2) Epoch 7, batch 6300, loss[loss=0.2188, simple_loss=0.2611, pruned_loss=0.08818, over 11130.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3057, pruned_loss=0.126, over 2563541.48 frames. ], batch size: 16, lr: 8.29e-03, grad_scale: 32.0
2024-06-20 03:37:09,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=122838.83333333333, ans=0.0
2024-06-20 03:37:13,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=122838.83333333333, ans=0.0
2024-06-20 03:37:44,005 INFO [train.py:1028] (0/2) Epoch 7, batch 6350, loss[loss=0.3358, simple_loss=0.3555, pruned_loss=0.1581, over 12586.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3071, pruned_loss=0.126, over 2573091.65 frames. ], batch size: 202, lr: 8.28e-03, grad_scale: 32.0
2024-06-20 03:37:47,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=122930.5, ans=0.125
2024-06-20 03:37:47,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.25 vs. limit=6.0
2024-06-20 03:37:49,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=122930.5, ans=0.125
2024-06-20 03:38:02,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122967.16666666667, ans=0.125
2024-06-20 03:38:07,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=122985.5, ans=0.125
2024-06-20 03:38:14,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=123003.83333333333, ans=0.0
2024-06-20 03:38:15,890 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.235e+02 2.504e+02 2.773e+02 3.984e+02, threshold=5.008e+02, percent-clipped=0.0
2024-06-20 03:38:17,170 INFO [train.py:1028] (0/2) Epoch 7, batch 6400, loss[loss=0.2824, simple_loss=0.3184, pruned_loss=0.1232, over 13203.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3095, pruned_loss=0.1273, over 2573982.79 frames. ], batch size: 67, lr: 8.28e-03, grad_scale: 32.0
2024-06-20 03:38:26,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=123040.5, ans=0.125
2024-06-20 03:38:33,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=123058.83333333333, ans=0.125
2024-06-20 03:38:34,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=123058.83333333333, ans=0.025
2024-06-20 03:38:34,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=123058.83333333333, ans=0.125
2024-06-20 03:38:40,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123077.16666666667, ans=0.1
2024-06-20 03:38:53,335 INFO [train.py:1028] (0/2) Epoch 7, batch 6450, loss[loss=0.324, simple_loss=0.3385, pruned_loss=0.1548, over 12604.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3116, pruned_loss=0.1282, over 2580491.13 frames. ], batch size: 202, lr: 8.28e-03, grad_scale: 32.0
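
The lr column decays very slowly across these batches, from 8.36e-03 down toward 8.2e-03. Zipformer recipes typically drive the learning rate with an Eden-style schedule that decays in both batch count and epoch; the sketch below uses that published rule, but the base_lr, lr_batches and lr_epochs values are assumptions for illustration, so the printed constant is only indicative:

    # Sketch of an Eden-style LR rule: decays jointly in batch index and epoch.
    def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=3.5):
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    print(eden_lr(0.035, batch=123000, epoch=7))  # ~5.8e-03, same order as the lr column
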
2024-06-20 03:38:53,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=123113.83333333333, ans=0.125
2024-06-20 03:38:56,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=123113.83333333333, ans=0.0
2024-06-20 03:38:58,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=123113.83333333333, ans=0.2
2024-06-20 03:38:58,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=123113.83333333333, ans=0.0
2024-06-20 03:38:58,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=123113.83333333333, ans=0.0
2024-06-20 03:39:00,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=123132.16666666667, ans=0.125
2024-06-20 03:39:06,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=123150.5, ans=0.125
2024-06-20 03:39:10,994 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.25 vs. limit=22.5
2024-06-20 03:39:23,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.39 vs. limit=15.0
2024-06-20 03:39:24,726 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.304e+02 2.497e+02 2.882e+02 5.271e+02, threshold=4.994e+02, percent-clipped=1.0
2024-06-20 03:39:25,972 INFO [train.py:1028] (0/2) Epoch 7, batch 6500, loss[loss=0.3433, simple_loss=0.3474, pruned_loss=0.1696, over 10983.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.314, pruned_loss=0.1294, over 2583274.29 frames. ], batch size: 303, lr: 8.27e-03, grad_scale: 32.0
2024-06-20 03:39:27,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.83 vs. limit=12.0
2024-06-20 03:39:28,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=123205.5, ans=0.125
2024-06-20 03:39:41,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0
2024-06-20 03:39:46,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=123242.16666666667, ans=0.125
2024-06-20 03:40:00,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=123297.16666666667, ans=0.125
2024-06-20 03:40:01,351 INFO [train.py:1028] (0/2) Epoch 7, batch 6550, loss[loss=0.2636, simple_loss=0.2983, pruned_loss=0.1144, over 12714.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3144, pruned_loss=0.1293, over 2588268.74 frames. ], batch size: 22, lr: 8.27e-03, grad_scale: 32.0
2024-06-20 03:40:08,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.85 vs. limit=15.0
2024-06-20 03:40:15,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=123333.83333333333, ans=0.0
2024-06-20 03:40:17,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=123333.83333333333, ans=0.125
2024-06-20 03:40:22,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=123352.16666666667, ans=0.035
2024-06-20 03:40:22,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=123352.16666666667, ans=0.09899494936611666
2024-06-20 03:40:23,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=123352.16666666667, ans=0.2
2024-06-20 03:40:31,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.30 vs. limit=22.5
2024-06-20 03:40:35,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.373e+02 2.639e+02 2.924e+02 4.252e+02, threshold=5.278e+02, percent-clipped=0.0
2024-06-20 03:40:36,913 INFO [train.py:1028] (0/2) Epoch 7, batch 6600, loss[loss=0.2624, simple_loss=0.2941, pruned_loss=0.1153, over 13296.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3146, pruned_loss=0.1292, over 2591188.84 frames. ], batch size: 72, lr: 8.27e-03, grad_scale: 32.0
2024-06-20 03:40:37,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.51 vs. limit=15.0
2024-06-20 03:40:38,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=123388.83333333333, ans=0.125
2024-06-20 03:40:38,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=123388.83333333333, ans=0.0
2024-06-20 03:40:41,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123388.83333333333, ans=0.1
2024-06-20 03:40:44,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=123407.16666666667, ans=0.125
2024-06-20 03:40:48,683 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=8.838e-02
2024-06-20 03:40:49,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=123425.5, ans=0.025
2024-06-20 03:41:06,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=123462.16666666667, ans=0.125
2024-06-20 03:41:09,178 INFO [train.py:1028] (0/2) Epoch 7, batch 6650, loss[loss=0.3196, simple_loss=0.3403, pruned_loss=0.1495, over 12942.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3176, pruned_loss=0.131, over 2585813.46 frames. ], batch size: 158, lr: 8.26e-03, grad_scale: 32.0
2024-06-20 03:41:09,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. limit=10.0
2024-06-20 03:41:10,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=123480.5, ans=0.2
2024-06-20 03:41:13,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.99 vs. limit=15.0
2024-06-20 03:41:23,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=123517.16666666667, ans=0.125
2024-06-20 03:41:33,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.78 vs. limit=22.5
2024-06-20 03:41:33,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=123535.5, ans=0.125
2024-06-20 03:41:36,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0
2024-06-20 03:41:46,466 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.454e+02 2.649e+02 3.041e+02 4.260e+02, threshold=5.297e+02, percent-clipped=0.0
2024-06-20 03:41:47,924 INFO [train.py:1028] (0/2) Epoch 7, batch 6700, loss[loss=0.3372, simple_loss=0.3543, pruned_loss=0.16, over 12787.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3189, pruned_loss=0.1318, over 2584617.24 frames. ], batch size: 176, lr: 8.26e-03, grad_scale: 32.0
2024-06-20 03:41:52,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=123572.16666666667, ans=0.2
2024-06-20 03:41:57,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=123590.5, ans=0.0
2024-06-20 03:41:58,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=123590.5, ans=0.125
2024-06-20 03:42:00,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0
2024-06-20 03:42:07,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=123627.16666666667, ans=0.2
2024-06-20 03:42:12,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=123627.16666666667, ans=0.1
2024-06-20 03:42:18,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=123645.5, ans=0.0
2024-06-20 03:42:19,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.98 vs. limit=15.0
2024-06-20 03:42:21,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123663.83333333333, ans=0.1
2024-06-20 03:42:21,463 INFO [train.py:1028] (0/2) Epoch 7, batch 6750, loss[loss=0.3659, simple_loss=0.3687, pruned_loss=0.1816, over 12209.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3188, pruned_loss=0.1318, over 2577649.21 frames. ], batch size: 241, lr: 8.26e-03, grad_scale: 32.0
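
The batch size in these summaries swings between roughly 16 and 300 because batches are packed to a fixed total duration rather than a fixed utterance count: many short clips fit in one batch, only a few long ones do. A simplified version of that packing (the 550-second cap is an assumed value, and the real mechanism is lhotse's DynamicBucketingSampler):

    # Sketch: greedy duration-capped batching, as a duration-bucketing sampler does.
    def pack_by_duration(durations, max_duration=550.0):
        batches, cur, cur_dur = [], [], 0.0
        for d in sorted(durations):          # bucketing groups similar lengths
            if cur and cur_dur + d > max_duration:
                batches.append(cur)
                cur, cur_dur = [], 0.0
            cur.append(d)
            cur_dur += d
        if cur:
            batches.append(cur)
        return batches

    short = [2.0] * 400                      # 2 s clips -> hundreds per batch
    long_ = [30.0] * 40                      # 30 s clips -> ~18 per batch
    print(len(pack_by_duration(short)[0]), len(pack_by_duration(long_)[0]))
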
2024-06-20 03:42:38,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=123700.5, ans=0.125
2024-06-20 03:42:40,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=123700.5, ans=0.07
2024-06-20 03:42:42,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=123700.5, ans=0.125
2024-06-20 03:42:44,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.11 vs. limit=22.5
2024-06-20 03:42:50,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=123737.16666666667, ans=0.0
2024-06-20 03:42:52,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=123737.16666666667, ans=0.125
2024-06-20 03:42:55,546 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.388e+02 2.766e+02 3.010e+02 4.712e+02, threshold=5.532e+02, percent-clipped=0.0
2024-06-20 03:42:56,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=123755.5, ans=0.125
2024-06-20 03:42:56,799 INFO [train.py:1028] (0/2) Epoch 7, batch 6800, loss[loss=0.2557, simple_loss=0.2899, pruned_loss=0.1107, over 13181.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.32, pruned_loss=0.132, over 2579545.12 frames. ], batch size: 67, lr: 8.26e-03, grad_scale: 32.0
2024-06-20 03:43:06,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=123773.83333333333, ans=0.04949747468305833
2024-06-20 03:43:12,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=123792.16666666667, ans=0.125
2024-06-20 03:43:14,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.69 vs. limit=22.5
2024-06-20 03:43:14,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=123792.16666666667, ans=0.125
2024-06-20 03:43:16,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123810.5, ans=0.1
2024-06-20 03:43:26,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=123828.83333333333, ans=0.0
2024-06-20 03:43:26,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.74 vs. limit=15.0
2024-06-20 03:43:28,947 INFO [train.py:1028] (0/2) Epoch 7, batch 6850, loss[loss=0.3002, simple_loss=0.3406, pruned_loss=0.13, over 13301.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3204, pruned_loss=0.1316, over 2583520.49 frames. ], batch size: 63, lr: 8.25e-03, grad_scale: 32.0
2024-06-20 03:43:29,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=123847.16666666667, ans=0.0
2024-06-20 03:43:31,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=123847.16666666667, ans=0.125
2024-06-20 03:43:35,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=123865.5, ans=0.09899494936611666
2024-06-20 03:43:36,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=12.0
2024-06-20 03:43:38,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=123865.5, ans=0.0
2024-06-20 03:43:53,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=123902.16666666667, ans=0.2
2024-06-20 03:44:04,159 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.286e+02 2.541e+02 2.820e+02 3.805e+02, threshold=5.082e+02, percent-clipped=0.0
2024-06-20 03:44:05,381 INFO [train.py:1028] (0/2) Epoch 7, batch 6900, loss[loss=0.2846, simple_loss=0.3187, pruned_loss=0.1253, over 13024.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3214, pruned_loss=0.132, over 2585007.68 frames. ], batch size: 48, lr: 8.25e-03, grad_scale: 32.0
2024-06-20 03:44:17,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=123957.16666666667, ans=0.125
2024-06-20 03:44:18,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=123975.5, ans=0.025
2024-06-20 03:44:18,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=123975.5, ans=10.0
2024-06-20 03:44:20,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=123975.5, ans=0.2
2024-06-20 03:44:41,554 INFO [train.py:1028] (0/2) Epoch 7, batch 6950, loss[loss=0.2627, simple_loss=0.2966, pruned_loss=0.1144, over 11870.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3212, pruned_loss=0.1316, over 2580381.93 frames. ], batch size: 17, lr: 8.25e-03, grad_scale: 16.0
2024-06-20 03:44:54,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=124067.16666666667, ans=0.125
2024-06-20 03:45:14,031 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.246e+02 2.472e+02 2.850e+02 3.703e+02, threshold=4.943e+02, percent-clipped=0.0
2024-06-20 03:45:14,689 INFO [train.py:1028] (0/2) Epoch 7, batch 7000, loss[loss=0.321, simple_loss=0.3419, pruned_loss=0.15, over 12954.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3204, pruned_loss=0.1306, over 2578382.24 frames. ], batch size: 158, lr: 8.24e-03, grad_scale: 16.0
2024-06-20 03:45:35,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=124177.16666666667, ans=0.09899494936611666
2024-06-20 03:45:35,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=124177.16666666667, ans=0.025
2024-06-20 03:45:41,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=124195.5, ans=0.125
2024-06-20 03:45:48,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.48 vs. limit=15.0
2024-06-20 03:45:48,356 INFO [train.py:1028] (0/2) Epoch 7, batch 7050, loss[loss=0.3181, simple_loss=0.3411, pruned_loss=0.1475, over 12753.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.321, pruned_loss=0.1307, over 2585666.84 frames. ], batch size: 176, lr: 8.24e-03, grad_scale: 16.0
2024-06-20 03:46:05,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=124250.5, ans=0.125
2024-06-20 03:46:06,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=15.0
2024-06-20 03:46:08,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=124250.5, ans=0.0
2024-06-20 03:46:13,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=124268.83333333333, ans=0.07
2024-06-20 03:46:14,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=124268.83333333333, ans=0.125
2024-06-20 03:46:15,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=124268.83333333333, ans=0.0
2024-06-20 03:46:23,421 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.285e+02 2.499e+02 2.793e+02 4.009e+02, threshold=4.999e+02, percent-clipped=0.0
2024-06-20 03:46:24,054 INFO [train.py:1028] (0/2) Epoch 7, batch 7100, loss[loss=0.3104, simple_loss=0.3453, pruned_loss=0.1377, over 13201.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3224, pruned_loss=0.1318, over 2576007.21 frames. ], batch size: 112, lr: 8.24e-03, grad_scale: 16.0
2024-06-20 03:46:38,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.09 vs. limit=15.0
2024-06-20 03:46:39,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=124323.83333333333, ans=0.125
2024-06-20 03:46:39,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=124323.83333333333, ans=0.0
2024-06-20 03:46:41,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=124342.16666666667, ans=0.125
2024-06-20 03:46:51,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=124360.5, ans=0.0
2024-06-20 03:46:58,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=124378.83333333333, ans=0.125
2024-06-20 03:47:00,724 INFO [train.py:1028] (0/2) Epoch 7, batch 7150, loss[loss=0.3668, simple_loss=0.3779, pruned_loss=0.1779, over 12477.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3235, pruned_loss=0.132, over 2575399.76 frames. ], batch size: 202, lr: 8.23e-03, grad_scale: 16.0
2024-06-20 03:47:03,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124397.16666666667, ans=0.1
2024-06-20 03:47:25,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=124452.16666666667, ans=0.025
2024-06-20 03:47:32,574 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.252e+02 2.494e+02 2.765e+02 4.689e+02, threshold=4.988e+02, percent-clipped=0.0
2024-06-20 03:47:32,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=124488.83333333333, ans=0.05
2024-06-20 03:47:33,204 INFO [train.py:1028] (0/2) Epoch 7, batch 7200, loss[loss=0.3017, simple_loss=0.334, pruned_loss=0.1347, over 13152.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3247, pruned_loss=0.1325, over 2579838.04 frames. ], batch size: 112, lr: 8.23e-03, grad_scale: 32.0
2024-06-20 03:47:41,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=124507.16666666667, ans=0.0
2024-06-20 03:47:50,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=124525.5, ans=0.0
2024-06-20 03:47:54,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=124543.83333333333, ans=0.07
2024-06-20 03:47:58,450 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.30 vs. limit=10.0
2024-06-20 03:48:03,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0
2024-06-20 03:48:04,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=124562.16666666667, ans=0.0
2024-06-20 03:48:09,276 INFO [train.py:1028] (0/2) Epoch 7, batch 7250, loss[loss=0.2456, simple_loss=0.2902, pruned_loss=0.1005, over 12871.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3259, pruned_loss=0.1327, over 2580667.92 frames. ], batch size: 36, lr: 8.23e-03, grad_scale: 32.0
2024-06-20 03:48:10,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=124580.5, ans=0.2
2024-06-20 03:48:14,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=12.0
2024-06-20 03:48:15,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124598.83333333333, ans=0.1
2024-06-20 03:48:18,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=124598.83333333333, ans=10.0
2024-06-20 03:48:19,669 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 03:48:23,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=124617.16666666667, ans=0.125
2024-06-20 03:48:37,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=124653.83333333333, ans=0.125
2024-06-20 03:48:44,226 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-68000.pt
2024-06-20 03:48:51,095 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.132e+02 2.298e+02 2.578e+02 3.533e+02, threshold=4.595e+02, percent-clipped=0.0
2024-06-20 03:48:51,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=124672.16666666667, ans=0.025
2024-06-20 03:48:51,745 INFO [train.py:1028] (0/2) Epoch 7, batch 7300, loss[loss=0.3072, simple_loss=0.333, pruned_loss=0.1407, over 12921.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3273, pruned_loss=0.1335, over 2580842.95 frames. ], batch size: 36, lr: 8.23e-03, grad_scale: 32.0
2024-06-20 03:48:55,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124672.16666666667, ans=0.1
2024-06-20 03:48:56,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=124672.16666666667, ans=0.125
2024-06-20 03:49:10,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=124727.16666666667, ans=0.0
2024-06-20 03:49:24,435 INFO [train.py:1028] (0/2) Epoch 7, batch 7350, loss[loss=0.3211, simple_loss=0.3586, pruned_loss=0.1418, over 13390.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3285, pruned_loss=0.1341, over 2582055.33 frames. ], batch size: 46, lr: 8.22e-03, grad_scale: 32.0
2024-06-20 03:49:25,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=124763.83333333333, ans=0.0
2024-06-20 03:49:30,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=19.03 vs. limit=15.0
2024-06-20 03:49:42,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124800.5, ans=0.125
2024-06-20 03:49:49,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.74 vs. limit=22.5
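
The checkpoint.py line writes a snapshot named after the global batch index (checkpoint-68000.pt) at regular batch intervals. A simplified sketch of that behaviour, assuming a save_every_n-style interval; the real helper also persists sampler, scheduler and grad-scaler state:

    # Sketch: batch-count checkpointing into exp_dir/checkpoint-<batch>.pt.
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                              save_every_n=4000, exp_dir="zipformer/exp"):
        if batch_idx_train % save_every_n == 0:
            path = f"{exp_dir}/checkpoint-{batch_idx_train}.pt"
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "batch_idx_train": batch_idx_train,
                },
                path,
            )
            print(f"Saving checkpoint to {path}")
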
limit=22.5 2024-06-20 03:49:54,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=124837.16666666667, ans=0.125 2024-06-20 03:49:57,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.291e+02 2.553e+02 2.865e+02 4.209e+02, threshold=5.106e+02, percent-clipped=0.0 2024-06-20 03:49:57,821 INFO [train.py:1028] (0/2) Epoch 7, batch 7400, loss[loss=0.3013, simple_loss=0.3356, pruned_loss=0.1335, over 13270.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3274, pruned_loss=0.1332, over 2587430.70 frames. ], batch size: 63, lr: 8.22e-03, grad_scale: 32.0 2024-06-20 03:50:03,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=124855.5, ans=0.2 2024-06-20 03:50:04,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-20 03:50:05,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=15.0 2024-06-20 03:50:11,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=124873.83333333333, ans=0.0 2024-06-20 03:50:14,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=124892.16666666667, ans=0.0 2024-06-20 03:50:15,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=124892.16666666667, ans=0.125 2024-06-20 03:50:16,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=15.0 2024-06-20 03:50:31,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=124928.83333333333, ans=0.0 2024-06-20 03:50:34,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0 2024-06-20 03:50:34,805 INFO [train.py:1028] (0/2) Epoch 7, batch 7450, loss[loss=0.2692, simple_loss=0.3095, pruned_loss=0.1144, over 12569.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.328, pruned_loss=0.1335, over 2580057.19 frames. ], batch size: 29, lr: 8.22e-03, grad_scale: 32.0 2024-06-20 03:50:35,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=124947.16666666667, ans=0.0 2024-06-20 03:50:41,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=124965.5, ans=0.2 2024-06-20 03:50:42,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=15.0 2024-06-20 03:50:54,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=124983.83333333333, ans=0.1 2024-06-20 03:51:02,293 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. 
limit=6.0 2024-06-20 03:51:11,255 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.228e+02 2.483e+02 2.678e+02 4.093e+02, threshold=4.966e+02, percent-clipped=0.0 2024-06-20 03:51:11,889 INFO [train.py:1028] (0/2) Epoch 7, batch 7500, loss[loss=0.3298, simple_loss=0.3417, pruned_loss=0.159, over 10697.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3292, pruned_loss=0.1345, over 2577710.77 frames. ], batch size: 304, lr: 8.21e-03, grad_scale: 32.0 2024-06-20 03:51:20,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=125057.16666666667, ans=0.0 2024-06-20 03:51:24,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=125075.5, ans=0.125 2024-06-20 03:51:25,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=125075.5, ans=0.0 2024-06-20 03:51:28,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=125075.5, ans=0.2 2024-06-20 03:51:38,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=125112.16666666667, ans=0.125 2024-06-20 03:51:45,006 INFO [train.py:1028] (0/2) Epoch 7, batch 7550, loss[loss=0.2884, simple_loss=0.3113, pruned_loss=0.1328, over 12965.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3301, pruned_loss=0.1351, over 2576620.60 frames. ], batch size: 158, lr: 8.21e-03, grad_scale: 32.0 2024-06-20 03:51:45,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=125130.5, ans=0.1 2024-06-20 03:51:47,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=15.0 2024-06-20 03:51:48,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=125130.5, ans=0.125 2024-06-20 03:51:54,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=125148.83333333333, ans=0.025 2024-06-20 03:51:55,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125148.83333333333, ans=0.125 2024-06-20 03:52:01,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=125167.16666666667, ans=0.2 2024-06-20 03:52:14,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=125203.83333333333, ans=0.0 2024-06-20 03:52:20,548 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.339e+02 2.532e+02 2.868e+02 4.356e+02, threshold=5.063e+02, percent-clipped=0.0 2024-06-20 03:52:21,384 INFO [train.py:1028] (0/2) Epoch 7, batch 7600, loss[loss=0.2807, simple_loss=0.3164, pruned_loss=0.1225, over 13178.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.33, pruned_loss=0.1349, over 2575977.01 frames. 
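The "Saving checkpoint to zipformer/exp/checkpoint-68000.pt" line above is a periodic batch-level save rather than an end-of-epoch one. A minimal sketch of that pattern follows; the helper name `maybe_save_checkpoint` and the payload keys are illustrative assumptions, not icefall's actual checkpoint format:

```python
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx, exp_dir, save_every_n):
    """Hypothetical sketch of interval-based checkpointing; the payload
    keys below are assumptions, not icefall's actual checkpoint format."""
    if batch_idx > 0 and batch_idx % save_every_n == 0:
        path = f"{exp_dir}/checkpoint-{batch_idx}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx,
            },
            path,
        )
        print(f"Saving checkpoint to {path}")
```

Saving on a fixed batch interval keeps a recoverable state even when a single epoch takes many hours, as it does in this run.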
], batch size: 83, lr: 8.21e-03, grad_scale: 32.0 2024-06-20 03:52:26,827 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 03:52:38,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=125258.83333333333, ans=0.2 2024-06-20 03:52:38,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125258.83333333333, ans=0.125 2024-06-20 03:52:44,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.15 vs. limit=10.0 2024-06-20 03:52:50,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=125295.5, ans=0.125 2024-06-20 03:52:56,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=8.0 2024-06-20 03:52:58,353 INFO [train.py:1028] (0/2) Epoch 7, batch 7650, loss[loss=0.2807, simple_loss=0.3159, pruned_loss=0.1227, over 12815.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3307, pruned_loss=0.1353, over 2572347.87 frames. ], batch size: 33, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:53:11,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125350.5, ans=0.1 2024-06-20 03:53:13,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125350.5, ans=0.125 2024-06-20 03:53:17,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=125350.5, ans=0.0 2024-06-20 03:53:19,098 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.900e+01 2024-06-20 03:53:21,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=125368.83333333333, ans=0.0 2024-06-20 03:53:23,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=125368.83333333333, ans=0.0 2024-06-20 03:53:23,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=125368.83333333333, ans=0.025 2024-06-20 03:53:25,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=125387.16666666667, ans=10.0 2024-06-20 03:53:31,707 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.224e+02 2.553e+02 2.901e+02 4.356e+02, threshold=5.106e+02, percent-clipped=0.0 2024-06-20 03:53:32,346 INFO [train.py:1028] (0/2) Epoch 7, batch 7700, loss[loss=0.3135, simple_loss=0.3518, pruned_loss=0.1376, over 13216.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3318, pruned_loss=0.1358, over 2569818.27 frames. 
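The recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." warnings summarize the distribution of recent gradient norms and how often they exceeded the clipping threshold. A sketch of how such a summary could be produced, assuming hypothetical names (`log_grad_norm_quartiles`, `norm_history`) and a threshold of `clipping_scale` times the median norm:

```python
import statistics
import torch

def log_grad_norm_quartiles(model, norm_history, clipping_scale=2.0,
                            report_every=1000):
    """Hypothetical sketch: record each batch's total gradient norm and
    periodically print min/Q1/median/Q3/max plus how many norms exceeded
    clipping_scale * median (the 'percent-clipped' figure)."""
    # max_norm=inf: compute the total norm here without actually clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                max_norm=float("inf"))
    norm_history.append(float(total_norm))
    if len(norm_history) >= report_every:
        q1, median, q3 = statistics.quantiles(norm_history, n=4)
        threshold = clipping_scale * median
        clipped = sum(n > threshold for n in norm_history)
        print(f"grad-norm quartiles {min(norm_history):.3e} {q1:.3e} "
              f"{median:.3e} {q3:.3e} {max(norm_history):.3e}, "
              f"threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * clipped / len(norm_history):.1f}")
        norm_history.clear()
    return total_norm
```

Reporting quartiles rather than a single mean makes slow drifts and heavy-tailed spikes in the gradient norm easy to spot in the log; the `percent-clipped=0.0` readings throughout this section indicate that essentially no batch exceeded the threshold here.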
], batch size: 63, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:53:48,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125442.16666666667, ans=0.125 2024-06-20 03:53:50,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=125442.16666666667, ans=0.0 2024-06-20 03:53:53,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.95 vs. limit=15.0 2024-06-20 03:53:54,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.11 vs. limit=10.0 2024-06-20 03:53:57,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=125460.5, ans=0.125 2024-06-20 03:54:08,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.04 vs. limit=15.0 2024-06-20 03:54:08,545 INFO [train.py:1028] (0/2) Epoch 7, batch 7750, loss[loss=0.2611, simple_loss=0.302, pruned_loss=0.1101, over 13219.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3328, pruned_loss=0.1368, over 2574630.76 frames. ], batch size: 72, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:54:21,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=125533.83333333333, ans=0.04949747468305833 2024-06-20 03:54:24,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=125533.83333333333, ans=0.0 2024-06-20 03:54:26,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.24 vs. limit=10.0 2024-06-20 03:54:28,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=125552.16666666667, ans=0.07 2024-06-20 03:54:30,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0 2024-06-20 03:54:33,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=125552.16666666667, ans=22.5 2024-06-20 03:54:41,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=125570.5, ans=0.5 2024-06-20 03:54:44,470 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.099e+02 2.254e+02 2.407e+02 3.252e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-20 03:54:45,169 INFO [train.py:1028] (0/2) Epoch 7, batch 7800, loss[loss=0.2967, simple_loss=0.3369, pruned_loss=0.1282, over 13151.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3322, pruned_loss=0.1358, over 2578940.76 frames. ], batch size: 95, lr: 8.20e-03, grad_scale: 32.0 2024-06-20 03:54:46,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.93 vs. 
limit=10.0 2024-06-20 03:54:50,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=125607.16666666667, ans=0.125 2024-06-20 03:54:52,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=125607.16666666667, ans=0.125 2024-06-20 03:55:04,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125625.5, ans=0.1 2024-06-20 03:55:15,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125662.16666666667, ans=0.0 2024-06-20 03:55:16,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=125662.16666666667, ans=0.0 2024-06-20 03:55:18,999 INFO [train.py:1028] (0/2) Epoch 7, batch 7850, loss[loss=0.2789, simple_loss=0.3151, pruned_loss=0.1214, over 11762.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3337, pruned_loss=0.1367, over 2571960.09 frames. ], batch size: 17, lr: 8.19e-03, grad_scale: 32.0 2024-06-20 03:55:21,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=125680.5, ans=0.125 2024-06-20 03:55:23,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.24 vs. limit=22.5 2024-06-20 03:55:24,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=125680.5, ans=0.0 2024-06-20 03:55:33,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125717.16666666667, ans=0.0 2024-06-20 03:55:33,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=125717.16666666667, ans=0.0 2024-06-20 03:55:41,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=125735.5, ans=0.0 2024-06-20 03:55:45,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.06 vs. limit=22.5 2024-06-20 03:55:50,800 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.218e+02 2.493e+02 2.893e+02 4.281e+02, threshold=4.985e+02, percent-clipped=0.0 2024-06-20 03:55:51,423 INFO [train.py:1028] (0/2) Epoch 7, batch 7900, loss[loss=0.2949, simple_loss=0.3275, pruned_loss=0.1312, over 13166.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.334, pruned_loss=0.1371, over 2572234.52 frames. ], batch size: 77, lr: 8.19e-03, grad_scale: 32.0 2024-06-20 03:56:06,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125790.5, ans=0.125 2024-06-20 03:56:07,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125790.5, ans=0.125 2024-06-20 03:56:20,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=125827.16666666667, ans=0.125 2024-06-20 03:56:21,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. 
limit=6.0 2024-06-20 03:56:22,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=15.0 2024-06-20 03:56:27,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=125845.5, ans=0.0 2024-06-20 03:56:29,431 INFO [train.py:1028] (0/2) Epoch 7, batch 7950, loss[loss=0.3072, simple_loss=0.3255, pruned_loss=0.1445, over 10601.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3339, pruned_loss=0.1369, over 2574341.44 frames. ], batch size: 304, lr: 8.19e-03, grad_scale: 32.0 2024-06-20 03:56:30,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=125863.83333333333, ans=0.125 2024-06-20 03:56:34,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=125863.83333333333, ans=0.125 2024-06-20 03:56:42,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=125882.16666666667, ans=0.125 2024-06-20 03:56:43,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=125882.16666666667, ans=0.2 2024-06-20 03:56:43,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=125882.16666666667, ans=0.125 2024-06-20 03:56:46,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.61 vs. limit=6.0 2024-06-20 03:56:50,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=125900.5, ans=0.1 2024-06-20 03:56:59,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=125937.16666666667, ans=0.09899494936611666 2024-06-20 03:57:05,572 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.094e+02 2.302e+02 2.785e+02 4.899e+02, threshold=4.603e+02, percent-clipped=0.0 2024-06-20 03:57:06,251 INFO [train.py:1028] (0/2) Epoch 7, batch 8000, loss[loss=0.2791, simple_loss=0.3189, pruned_loss=0.1197, over 12583.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.335, pruned_loss=0.1373, over 2572325.57 frames. ], batch size: 29, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:57:21,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=125992.16666666667, ans=0.125 2024-06-20 03:57:26,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=126010.5, ans=0.125 2024-06-20 03:57:31,081 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=15.0 2024-06-20 03:57:39,087 INFO [train.py:1028] (0/2) Epoch 7, batch 8050, loss[loss=0.2836, simple_loss=0.3184, pruned_loss=0.1244, over 13139.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3341, pruned_loss=0.1366, over 2571874.50 frames. 
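Most `ScheduledFloat` entries in this log report hyperparameters (skip rates, dropout probabilities, balancer limits) whose current value `ans` is a function of `batch_count`. A minimal sketch of a piecewise-linear schedule in the same spirit; the breakpoints in the usage example are invented, not taken from this run:

```python
def scheduled_float(batch_count, schedule):
    """Hypothetical sketch of a piecewise-linear schedule: `schedule` is a
    sorted list of (batch_count, value) breakpoints; values are linearly
    interpolated between breakpoints and held flat outside them."""
    x0, y0 = schedule[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in schedule[1:]:
        if batch_count <= x1:
            # linear interpolation between the two surrounding breakpoints
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        x0, y0 = x1, y1
    return y0  # past the last breakpoint: hold the final value

# Invented breakpoints: a skip-rate decaying from 0.5 to 0.0 over the first
# 20k batches would be logged as ans=0.0 at the batch_counts (~125k) seen here.
print(scheduled_float(125552.17, [(0.0, 0.5), (20000.0, 0.0)]))  # -> 0.0
```

This would explain why so many of the `*_skip_rate` entries above already report `ans=0.0`: by batch ~125k such schedules have long since reached their final values.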
], batch size: 83, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:57:45,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=126065.5, ans=0.125 2024-06-20 03:57:55,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0 2024-06-20 03:57:59,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=126083.83333333333, ans=0.125 2024-06-20 03:58:04,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.30 vs. limit=15.0 2024-06-20 03:58:05,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=126102.16666666667, ans=0.125 2024-06-20 03:58:06,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0 2024-06-20 03:58:14,221 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.136e+02 2.299e+02 2.519e+02 3.627e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-20 03:58:14,904 INFO [train.py:1028] (0/2) Epoch 7, batch 8100, loss[loss=0.2968, simple_loss=0.3375, pruned_loss=0.128, over 13125.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3345, pruned_loss=0.1365, over 2576299.35 frames. ], batch size: 112, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:58:16,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2024-06-20 03:58:21,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=126157.16666666667, ans=0.025 2024-06-20 03:58:27,888 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=1.74 vs. limit=15.0 2024-06-20 03:58:29,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=126175.5, ans=0.035 2024-06-20 03:58:32,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=126175.5, ans=0.2 2024-06-20 03:58:37,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=126193.83333333333, ans=0.0 2024-06-20 03:58:46,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=126212.16666666667, ans=0.125 2024-06-20 03:58:50,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=126212.16666666667, ans=0.125 2024-06-20 03:58:51,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126212.16666666667, ans=0.1 2024-06-20 03:58:53,419 INFO [train.py:1028] (0/2) Epoch 7, batch 8150, loss[loss=0.3027, simple_loss=0.326, pruned_loss=0.1397, over 13094.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3346, pruned_loss=0.1362, over 2580279.86 frames. 
], batch size: 121, lr: 8.18e-03, grad_scale: 32.0 2024-06-20 03:58:57,306 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=12.0 2024-06-20 03:59:07,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=126267.16666666667, ans=0.125 2024-06-20 03:59:11,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=126267.16666666667, ans=0.0 2024-06-20 03:59:13,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126285.5, ans=0.1 2024-06-20 03:59:16,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=126285.5, ans=0.125 2024-06-20 03:59:25,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 2.102e+02 2.236e+02 2.437e+02 3.028e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-20 03:59:26,322 INFO [train.py:1028] (0/2) Epoch 7, batch 8200, loss[loss=0.293, simple_loss=0.3261, pruned_loss=0.13, over 13114.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3351, pruned_loss=0.136, over 2584099.90 frames. ], batch size: 112, lr: 8.17e-03, grad_scale: 32.0 2024-06-20 03:59:27,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=126322.16666666667, ans=0.125 2024-06-20 03:59:27,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=126322.16666666667, ans=0.125 2024-06-20 03:59:32,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=126340.5, ans=0.2 2024-06-20 03:59:34,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=126340.5, ans=0.125 2024-06-20 03:59:36,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=126340.5, ans=0.0 2024-06-20 03:59:41,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.51 vs. limit=22.5 2024-06-20 03:59:42,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=126358.83333333333, ans=0.1 2024-06-20 04:00:02,874 INFO [train.py:1028] (0/2) Epoch 7, batch 8250, loss[loss=0.2987, simple_loss=0.3372, pruned_loss=0.1301, over 13312.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3357, pruned_loss=0.1365, over 2584613.41 frames. 
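The `Whitening` lines fire when a measured whiteness metric exceeds its limit for a named activation (e.g. `metric=23.51 vs. limit=22.5` above). One plausible metric of this kind compares the eigenvalue spread of the channel covariance; the sketch below illustrates the idea only and is not icefall's exact formula, and it handles only the `num_groups=1` case:

```python
import torch

def whitening_metric(x):
    """One plausible whiteness measure (an illustration, not icefall's
    exact formula): mean squared eigenvalue of the channel covariance
    divided by the squared mean eigenvalue. Equals ~1.0 for perfectly
    white features and grows as the covariance becomes anisotropic.
    x: (num_frames, num_channels) activations; num_groups=1 case only."""
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]              # (C, C) channel covariance
    d = cov.shape[0]
    mean_eig = torch.diagonal(cov).sum() / d  # trace(C) / d
    mean_sq_eig = (cov * cov).sum() / d       # trace(C @ C) / d = mean(eig^2)
    return mean_sq_eig / (mean_eig ** 2 + 1e-20)

x = torch.randn(1000, 192)
print(float(whitening_metric(x)))                          # ~1: nearly white
print(float(whitening_metric(x @ torch.randn(192, 192))))  # much larger
```

Grouped variants (the `num_groups=4` or `num_groups=8` keys above) would apply the same computation per channel group.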
], batch size: 52, lr: 8.17e-03, grad_scale: 32.0 2024-06-20 04:00:04,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=126413.83333333333, ans=0.0 2024-06-20 04:00:21,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=126450.5, ans=0.2 2024-06-20 04:00:22,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=126450.5, ans=0.0 2024-06-20 04:00:27,525 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=4.801e+02 2024-06-20 04:00:31,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=126487.16666666667, ans=0.2 2024-06-20 04:00:32,579 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.74 vs. limit=22.5 2024-06-20 04:00:32,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=126487.16666666667, ans=0.0 2024-06-20 04:00:35,512 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.169e+02 2.350e+02 2.705e+02 3.497e+02, threshold=4.701e+02, percent-clipped=0.0 2024-06-20 04:00:36,267 INFO [train.py:1028] (0/2) Epoch 7, batch 8300, loss[loss=0.3113, simple_loss=0.3369, pruned_loss=0.1429, over 13001.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3343, pruned_loss=0.1355, over 2580626.43 frames. ], batch size: 102, lr: 8.17e-03, grad_scale: 32.0 2024-06-20 04:00:39,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=12.0 2024-06-20 04:00:41,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.82 vs. limit=10.0 2024-06-20 04:00:49,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=126523.83333333333, ans=0.025 2024-06-20 04:01:03,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=126560.5, ans=0.125 2024-06-20 04:01:08,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=126578.83333333333, ans=0.025 2024-06-20 04:01:13,836 INFO [train.py:1028] (0/2) Epoch 7, batch 8350, loss[loss=0.2954, simple_loss=0.3313, pruned_loss=0.1297, over 13183.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3337, pruned_loss=0.1348, over 2580436.10 frames. ], batch size: 112, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:01:32,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=126633.83333333333, ans=0.05 2024-06-20 04:01:40,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.44 vs. limit=15.0 2024-06-20 04:01:40,750 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.56 vs. 
limit=12.0 2024-06-20 04:01:42,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=126670.5, ans=0.0 2024-06-20 04:01:48,107 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.099e+02 2.322e+02 2.571e+02 3.909e+02, threshold=4.644e+02, percent-clipped=0.0 2024-06-20 04:01:48,894 INFO [train.py:1028] (0/2) Epoch 7, batch 8400, loss[loss=0.3108, simple_loss=0.3365, pruned_loss=0.1425, over 12949.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3336, pruned_loss=0.135, over 2577002.57 frames. ], batch size: 39, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:01:49,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.50 vs. limit=15.0 2024-06-20 04:01:55,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=126707.16666666667, ans=0.0 2024-06-20 04:02:01,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=126725.5, ans=0.125 2024-06-20 04:02:02,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=126725.5, ans=0.125 2024-06-20 04:02:12,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2024-06-20 04:02:15,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. limit=15.0 2024-06-20 04:02:17,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-20 04:02:25,810 INFO [train.py:1028] (0/2) Epoch 7, batch 8450, loss[loss=0.3281, simple_loss=0.3634, pruned_loss=0.1464, over 13102.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3343, pruned_loss=0.1352, over 2579394.98 frames. ], batch size: 112, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:02:27,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=126780.5, ans=0.125 2024-06-20 04:02:31,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=126780.5, ans=0.05 2024-06-20 04:02:40,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2024-06-20 04:02:41,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=126817.16666666667, ans=0.0 2024-06-20 04:02:42,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=126817.16666666667, ans=0.125 2024-06-20 04:03:04,469 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.162e+02 2.289e+02 2.556e+02 3.325e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-20 04:03:05,189 INFO [train.py:1028] (0/2) Epoch 7, batch 8500, loss[loss=0.2771, simple_loss=0.3108, pruned_loss=0.1217, over 12543.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3352, pruned_loss=0.1356, over 2576994.67 frames. 
], batch size: 29, lr: 8.16e-03, grad_scale: 32.0 2024-06-20 04:03:05,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=126872.16666666667, ans=0.2 2024-06-20 04:03:07,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=126872.16666666667, ans=0.09899494936611666 2024-06-20 04:03:14,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=126890.5, ans=0.0 2024-06-20 04:03:15,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126890.5, ans=0.1 2024-06-20 04:03:20,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.51 vs. limit=22.5 2024-06-20 04:03:24,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=15.0 2024-06-20 04:03:26,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=126927.16666666667, ans=0.025 2024-06-20 04:03:27,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2024-06-20 04:03:32,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=126945.5, ans=0.2 2024-06-20 04:03:37,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=126945.5, ans=0.025 2024-06-20 04:03:39,002 INFO [train.py:1028] (0/2) Epoch 7, batch 8550, loss[loss=0.2842, simple_loss=0.3261, pruned_loss=0.1212, over 12763.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.335, pruned_loss=0.1353, over 2575691.20 frames. ], batch size: 22, lr: 8.15e-03, grad_scale: 32.0 2024-06-20 04:03:48,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=126982.16666666667, ans=0.0 2024-06-20 04:03:57,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=127000.5, ans=0.0 2024-06-20 04:03:58,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.99 vs. limit=22.5 2024-06-20 04:04:01,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=127018.83333333333, ans=0.125 2024-06-20 04:04:01,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.13 vs. limit=15.0 2024-06-20 04:04:11,482 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.065e+02 2.254e+02 2.560e+02 3.907e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-20 04:04:12,092 INFO [train.py:1028] (0/2) Epoch 7, batch 8600, loss[loss=0.3012, simple_loss=0.3258, pruned_loss=0.1383, over 13139.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3364, pruned_loss=0.1358, over 2574659.83 frames. 
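Each training line pairs a per-batch `loss[... over N frames]` with a `tot_loss[... over ~2.57M frames]`. The fractional frame counts (e.g. `2574659.83`) suggest a decayed, frame-weighted running average rather than a plain sum; a sketch under that assumption, with an invented decay constant:

```python
def update_tot_loss(tot, batch_loss_sums, batch_frames, decay=0.999):
    """Hypothetical sketch: a decayed, frame-weighted running average of
    the loss components. `batch_loss_sums` holds per-batch loss *sums*
    (already multiplied by the frame count); `decay` is invented."""
    for key, value in batch_loss_sums.items():
        tot[key] = decay * tot.get(key, 0.0) + value
    tot["frames"] = decay * tot.get("frames", 0.0) + batch_frames
    # Displayed tot_loss values are per-frame averages of the decayed sums.
    return {key: tot[key] / tot["frames"] for key in batch_loss_sums}
```

A decayed average keeps `tot_loss` responsive to recent batches while smoothing over the large batch-to-batch variance visible in the per-batch `loss[...]` figures.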
], batch size: 112, lr: 8.15e-03, grad_scale: 32.0 2024-06-20 04:04:14,220 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:04:29,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=127092.16666666667, ans=0.125 2024-06-20 04:04:29,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=127092.16666666667, ans=0.04949747468305833 2024-06-20 04:04:32,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=127092.16666666667, ans=0.125 2024-06-20 04:04:39,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=127110.5, ans=0.0 2024-06-20 04:04:40,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=127110.5, ans=0.0 2024-06-20 04:04:41,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.77 vs. limit=10.0 2024-06-20 04:04:50,223 INFO [train.py:1028] (0/2) Epoch 7, batch 8650, loss[loss=0.2794, simple_loss=0.3117, pruned_loss=0.1235, over 13028.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3368, pruned_loss=0.1358, over 2576678.86 frames. ], batch size: 102, lr: 8.15e-03, grad_scale: 32.0 2024-06-20 04:05:00,100 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:05:00,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.99 vs. limit=10.0 2024-06-20 04:05:00,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=127165.5, ans=0.0 2024-06-20 04:05:05,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127165.5, ans=0.1 2024-06-20 04:05:07,802 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.61 vs. limit=22.5 2024-06-20 04:05:11,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=127183.83333333333, ans=0.0 2024-06-20 04:05:12,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=127183.83333333333, ans=0.0 2024-06-20 04:05:17,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=127202.16666666667, ans=0.125 2024-06-20 04:05:19,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=127220.5, ans=0.04949747468305833 2024-06-20 04:05:23,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=15.0 2024-06-20 04:05:25,893 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.093e+02 2.303e+02 2.586e+02 3.647e+02, threshold=4.606e+02, percent-clipped=0.0 2024-06-20 04:05:26,559 INFO [train.py:1028] (0/2) Epoch 7, batch 8700, loss[loss=0.308, simple_loss=0.3488, pruned_loss=0.1336, over 13193.00 frames. 
], tot_loss[loss=0.3049, simple_loss=0.3371, pruned_loss=0.1363, over 2572206.85 frames. ], batch size: 59, lr: 8.14e-03, grad_scale: 32.0 2024-06-20 04:05:38,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=127257.16666666667, ans=0.125 2024-06-20 04:05:44,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=127275.5, ans=0.125 2024-06-20 04:05:50,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=127293.83333333333, ans=0.05 2024-06-20 04:05:52,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=127293.83333333333, ans=0.125 2024-06-20 04:05:59,724 INFO [train.py:1028] (0/2) Epoch 7, batch 8750, loss[loss=0.2856, simple_loss=0.3136, pruned_loss=0.1288, over 13112.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3366, pruned_loss=0.136, over 2567219.20 frames. ], batch size: 121, lr: 8.14e-03, grad_scale: 32.0 2024-06-20 04:06:00,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=127330.5, ans=0.1 2024-06-20 04:06:07,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=127348.83333333333, ans=22.5 2024-06-20 04:06:25,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=127385.5, ans=0.125 2024-06-20 04:06:26,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=127385.5, ans=0.5 2024-06-20 04:06:27,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127385.5, ans=0.1 2024-06-20 04:06:27,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=127385.5, ans=0.1 2024-06-20 04:06:28,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.92 vs. limit=22.5 2024-06-20 04:06:33,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=127403.83333333333, ans=0.125 2024-06-20 04:06:34,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=127403.83333333333, ans=0.0 2024-06-20 04:06:35,852 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.072e+02 2.268e+02 2.504e+02 3.569e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-20 04:06:36,595 INFO [train.py:1028] (0/2) Epoch 7, batch 8800, loss[loss=0.3095, simple_loss=0.3506, pruned_loss=0.1342, over 13275.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3375, pruned_loss=0.1367, over 2572959.50 frames. ], batch size: 72, lr: 8.14e-03, grad_scale: 32.0 2024-06-20 04:06:39,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=127422.16666666667, ans=0.125 2024-06-20 04:06:44,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.61 vs. 
limit=15.0 2024-06-20 04:06:49,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=127440.5, ans=0.07 2024-06-20 04:06:53,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=127458.83333333333, ans=0.0 2024-06-20 04:06:55,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=127458.83333333333, ans=0.2 2024-06-20 04:06:57,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=127458.83333333333, ans=0.0 2024-06-20 04:07:07,243 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.00 vs. limit=22.5 2024-06-20 04:07:08,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127495.5, ans=0.1 2024-06-20 04:07:11,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.24 vs. limit=15.0 2024-06-20 04:07:11,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.32 vs. limit=15.0 2024-06-20 04:07:13,620 INFO [train.py:1028] (0/2) Epoch 7, batch 8850, loss[loss=0.3167, simple_loss=0.3446, pruned_loss=0.1444, over 12629.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3379, pruned_loss=0.1373, over 2562168.71 frames. ], batch size: 202, lr: 8.13e-03, grad_scale: 32.0 2024-06-20 04:07:19,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.79 vs. limit=22.5 2024-06-20 04:07:21,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0 2024-06-20 04:07:23,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=127532.16666666667, ans=0.2 2024-06-20 04:07:36,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.64 vs. limit=22.5 2024-06-20 04:07:39,204 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:07:43,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=127587.16666666667, ans=0.0 2024-06-20 04:07:43,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.59 vs. limit=22.5 2024-06-20 04:07:46,210 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.197e+02 2.424e+02 2.632e+02 3.905e+02, threshold=4.848e+02, percent-clipped=0.0 2024-06-20 04:07:46,894 INFO [train.py:1028] (0/2) Epoch 7, batch 8900, loss[loss=0.3169, simple_loss=0.3516, pruned_loss=0.1411, over 12853.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3387, pruned_loss=0.1379, over 2560083.37 frames. 
], batch size: 33, lr: 8.13e-03, grad_scale: 32.0 2024-06-20 04:07:51,712 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.90 vs. limit=6.0 2024-06-20 04:07:57,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=127623.83333333333, ans=0.0 2024-06-20 04:08:04,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.50 vs. limit=15.0 2024-06-20 04:08:09,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=127660.5, ans=0.2 2024-06-20 04:08:10,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=127660.5, ans=0.0 2024-06-20 04:08:11,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=127660.5, ans=0.09899494936611666 2024-06-20 04:08:14,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.68 vs. limit=12.0 2024-06-20 04:08:23,299 INFO [train.py:1028] (0/2) Epoch 7, batch 8950, loss[loss=0.3501, simple_loss=0.3657, pruned_loss=0.1673, over 12566.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3388, pruned_loss=0.1374, over 2561260.60 frames. ], batch size: 202, lr: 8.13e-03, grad_scale: 64.0 2024-06-20 04:08:55,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=127770.5, ans=0.125 2024-06-20 04:08:59,411 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.147e+02 2.353e+02 2.653e+02 5.180e+02, threshold=4.707e+02, percent-clipped=1.0 2024-06-20 04:09:00,043 INFO [train.py:1028] (0/2) Epoch 7, batch 9000, loss[loss=0.2999, simple_loss=0.3308, pruned_loss=0.1345, over 13306.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3383, pruned_loss=0.1366, over 2567674.32 frames. ], batch size: 46, lr: 8.13e-03, grad_scale: 64.0 2024-06-20 04:09:00,044 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 04:09:04,720 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2139, 1.8331, 1.6531, 1.2780, 1.6687, 1.2769, 1.7031, 1.6444], device='cuda:0') 2024-06-20 04:09:07,801 INFO [train.py:1060] (0/2) Epoch 7, validation: loss=0.2116, simple_loss=0.2733, pruned_loss=0.07497, over 351949.00 frames. 2024-06-20 04:09:07,801 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16965MB 2024-06-20 04:09:12,656 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.90 vs. 
limit=15.0 2024-06-20 04:09:15,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=127807.16666666667, ans=0.0 2024-06-20 04:09:18,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=127807.16666666667, ans=0.0 2024-06-20 04:09:21,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=127825.5, ans=0.0 2024-06-20 04:09:26,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=127825.5, ans=0.2 2024-06-20 04:09:35,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=127862.16666666667, ans=0.125 2024-06-20 04:09:38,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127862.16666666667, ans=0.1 2024-06-20 04:09:40,511 INFO [train.py:1028] (0/2) Epoch 7, batch 9050, loss[loss=0.2416, simple_loss=0.2894, pruned_loss=0.09685, over 11628.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3398, pruned_loss=0.1373, over 2567894.79 frames. ], batch size: 17, lr: 8.12e-03, grad_scale: 64.0 2024-06-20 04:09:41,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0 2024-06-20 04:09:42,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=127880.5, ans=0.125 2024-06-20 04:09:54,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=127917.16666666667, ans=0.09899494936611666 2024-06-20 04:10:04,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0 2024-06-20 04:10:04,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=127935.5, ans=0.125 2024-06-20 04:10:05,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=127953.83333333333, ans=0.0 2024-06-20 04:10:06,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=127953.83333333333, ans=0.025 2024-06-20 04:10:06,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=127953.83333333333, ans=0.125 2024-06-20 04:10:10,559 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2024-06-20 04:10:12,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.022e+02 2.190e+02 2.409e+02 3.025e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-20 04:10:12,909 INFO [train.py:1028] (0/2) Epoch 7, batch 9100, loss[loss=0.2847, simple_loss=0.3308, pruned_loss=0.1193, over 13242.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3387, pruned_loss=0.1363, over 2569079.54 frames. ], batch size: 72, lr: 8.12e-03, grad_scale: 64.0 2024-06-20 04:10:14,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.72 vs. 
limit=22.5 2024-06-20 04:10:19,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.80 vs. limit=15.0 2024-06-20 04:10:35,914 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:10:44,833 INFO [train.py:1028] (0/2) Epoch 7, batch 9150, loss[loss=0.3054, simple_loss=0.3382, pruned_loss=0.1363, over 13160.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3388, pruned_loss=0.1362, over 2569103.93 frames. ], batch size: 77, lr: 8.12e-03, grad_scale: 64.0 2024-06-20 04:10:45,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2024-06-20 04:10:53,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=128082.16666666667, ans=0.125 2024-06-20 04:10:56,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=128082.16666666667, ans=0.2 2024-06-20 04:10:56,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2024-06-20 04:11:10,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=128137.16666666667, ans=0.125 2024-06-20 04:11:18,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128137.16666666667, ans=0.1 2024-06-20 04:11:19,013 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.095e+02 2.387e+02 2.748e+02 4.088e+02, threshold=4.775e+02, percent-clipped=0.0 2024-06-20 04:11:19,748 INFO [train.py:1028] (0/2) Epoch 7, batch 9200, loss[loss=0.3116, simple_loss=0.359, pruned_loss=0.1321, over 12997.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.338, pruned_loss=0.1351, over 2572242.97 frames. ], batch size: 36, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:11:21,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=128155.5, ans=0.07 2024-06-20 04:11:27,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.92 vs. limit=15.0 2024-06-20 04:11:32,021 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.99 vs. limit=22.5 2024-06-20 04:11:37,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=128210.5, ans=0.125 2024-06-20 04:11:43,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.80 vs. limit=22.5 2024-06-20 04:11:50,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=128247.16666666667, ans=0.0 2024-06-20 04:11:51,087 INFO [train.py:1028] (0/2) Epoch 7, batch 9250, loss[loss=0.2947, simple_loss=0.3421, pruned_loss=0.1236, over 13192.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3374, pruned_loss=0.1345, over 2574249.00 frames. 
], batch size: 67, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:11:58,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128265.5, ans=0.125 2024-06-20 04:12:18,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=128320.5, ans=0.0 2024-06-20 04:12:19,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=128320.5, ans=0.2 2024-06-20 04:12:23,037 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 2.026e+02 2.193e+02 2.383e+02 3.728e+02, threshold=4.386e+02, percent-clipped=0.0 2024-06-20 04:12:23,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=128338.83333333333, ans=0.2 2024-06-20 04:12:23,706 INFO [train.py:1028] (0/2) Epoch 7, batch 9300, loss[loss=0.2972, simple_loss=0.3304, pruned_loss=0.132, over 12953.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3369, pruned_loss=0.134, over 2571218.79 frames. ], batch size: 39, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:12:26,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2024-06-20 04:12:40,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=128375.5, ans=0.0 2024-06-20 04:12:42,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=128375.5, ans=0.0 2024-06-20 04:12:43,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=128375.5, ans=0.0 2024-06-20 04:12:57,435 INFO [train.py:1028] (0/2) Epoch 7, batch 9350, loss[loss=0.3055, simple_loss=0.334, pruned_loss=0.1385, over 12498.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.337, pruned_loss=0.1344, over 2567937.89 frames. ], batch size: 22, lr: 8.11e-03, grad_scale: 64.0 2024-06-20 04:12:57,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=128430.5, ans=0.125 2024-06-20 04:13:14,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128467.16666666667, ans=0.1 2024-06-20 04:13:24,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=128503.83333333333, ans=0.0 2024-06-20 04:13:27,763 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.132e+02 2.252e+02 2.529e+02 3.663e+02, threshold=4.504e+02, percent-clipped=0.0 2024-06-20 04:13:28,439 INFO [train.py:1028] (0/2) Epoch 7, batch 9400, loss[loss=0.3181, simple_loss=0.3494, pruned_loss=0.1434, over 13218.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3369, pruned_loss=0.1345, over 2567969.49 frames. ], batch size: 52, lr: 8.10e-03, grad_scale: 64.0 2024-06-20 04:13:42,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.97 vs. 
limit=15.0 2024-06-20 04:13:56,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=128595.5, ans=0.0 2024-06-20 04:13:59,199 INFO [train.py:1028] (0/2) Epoch 7, batch 9450, loss[loss=0.3174, simple_loss=0.3552, pruned_loss=0.1397, over 12665.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3379, pruned_loss=0.1358, over 2567408.46 frames. ], batch size: 22, lr: 8.10e-03, grad_scale: 64.0 2024-06-20 04:14:12,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=128650.5, ans=0.0 2024-06-20 04:14:22,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=128668.83333333333, ans=0.125 2024-06-20 04:14:27,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=128687.16666666667, ans=0.0 2024-06-20 04:14:28,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=128687.16666666667, ans=0.025 2024-06-20 04:14:29,346 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.111e+02 2.313e+02 2.707e+02 3.495e+02, threshold=4.627e+02, percent-clipped=0.0 2024-06-20 04:14:32,294 INFO [train.py:1028] (0/2) Epoch 7, batch 9500, loss[loss=0.2819, simple_loss=0.3273, pruned_loss=0.1182, over 13271.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3369, pruned_loss=0.1345, over 2576862.42 frames. ], batch size: 43, lr: 8.10e-03, grad_scale: 64.0 2024-06-20 04:14:33,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2024-06-20 04:14:33,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=128705.5, ans=0.0 2024-06-20 04:14:40,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=15.0 2024-06-20 04:14:53,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. limit=6.0 2024-06-20 04:15:01,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=128778.83333333333, ans=0.05 2024-06-20 04:15:03,045 INFO [train.py:1028] (0/2) Epoch 7, batch 9550, loss[loss=0.2492, simple_loss=0.2889, pruned_loss=0.1048, over 12957.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3367, pruned_loss=0.1343, over 2571270.43 frames. ], batch size: 39, lr: 8.09e-03, grad_scale: 64.0 2024-06-20 04:15:14,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=128815.5, ans=0.125 2024-06-20 04:15:30,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=128870.5, ans=0.0 2024-06-20 04:15:34,756 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 2.183e+02 2.469e+02 2.901e+02 4.662e+02, threshold=4.938e+02, percent-clipped=1.0 2024-06-20 04:15:35,416 INFO [train.py:1028] (0/2) Epoch 7, batch 9600, loss[loss=0.3154, simple_loss=0.3338, pruned_loss=0.1485, over 10424.00 frames. 
], tot_loss[loss=0.3026, simple_loss=0.3366, pruned_loss=0.1343, over 2570620.38 frames. ], batch size: 304, lr: 8.09e-03, grad_scale: 64.0 2024-06-20 04:15:39,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=128888.83333333333, ans=0.125 2024-06-20 04:15:39,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=128888.83333333333, ans=0.125 2024-06-20 04:15:45,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.00 vs. limit=15.0 2024-06-20 04:15:47,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.74 vs. limit=22.5 2024-06-20 04:15:48,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=128925.5, ans=0.125 2024-06-20 04:16:01,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=128962.16666666667, ans=0.125 2024-06-20 04:16:01,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=128962.16666666667, ans=0.2 2024-06-20 04:16:02,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=16.03 vs. limit=15.0 2024-06-20 04:16:03,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=128962.16666666667, ans=0.0 2024-06-20 04:16:03,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=128962.16666666667, ans=0.0 2024-06-20 04:16:05,832 INFO [train.py:1028] (0/2) Epoch 7, batch 9650, loss[loss=0.3213, simple_loss=0.3474, pruned_loss=0.1476, over 13072.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3367, pruned_loss=0.1349, over 2561218.71 frames. ], batch size: 132, lr: 8.09e-03, grad_scale: 32.0 2024-06-20 04:16:08,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=128980.5, ans=0.05 2024-06-20 04:16:14,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.83 vs. limit=12.0 2024-06-20 04:16:15,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=15.0 2024-06-20 04:16:24,324 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2024-06-20 04:16:25,887 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.781e+01 2024-06-20 04:16:36,132 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.126e+02 2.322e+02 2.631e+02 4.430e+02, threshold=4.643e+02, percent-clipped=0.0 2024-06-20 04:16:36,160 INFO [train.py:1028] (0/2) Epoch 7, batch 9700, loss[loss=0.3036, simple_loss=0.3336, pruned_loss=0.1368, over 13040.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3373, pruned_loss=0.1356, over 2556033.30 frames. 
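
grad_scale is the fp16 loss-scaling factor: the loss is multiplied by it before backward so half-precision gradients do not underflow, and when a scaled gradient overflows the step is skipped and the scale is cut, which is the likely reason it halves from 64.0 to 32.0 between batches 9600 and 9650 above. A sketch of the same dynamics using PyTorch's stock GradScaler (train.py's own loop differs in detail):

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0, growth_interval=2000)

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()   # gradients computed on scale * loss
    scaler.step(optimizer)          # skips the update if grads hit inf/nan
    scaler.update()                 # halves the scale on overflow, grows it slowly otherwise
    return loss.detach(), scaler.get_scale()
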
], batch size: 144, lr: 8.09e-03, grad_scale: 32.0 2024-06-20 04:16:36,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=129072.16666666667, ans=0.0 2024-06-20 04:16:37,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=129072.16666666667, ans=0.0 2024-06-20 04:16:55,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=129127.16666666667, ans=0.125 2024-06-20 04:16:58,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=129127.16666666667, ans=0.0 2024-06-20 04:17:09,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=129163.83333333333, ans=6.0 2024-06-20 04:17:09,933 INFO [train.py:1028] (0/2) Epoch 7, batch 9750, loss[loss=0.2918, simple_loss=0.3172, pruned_loss=0.1332, over 13100.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3348, pruned_loss=0.1343, over 2553006.44 frames. ], batch size: 132, lr: 8.08e-03, grad_scale: 32.0 2024-06-20 04:17:13,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=129163.83333333333, ans=0.125 2024-06-20 04:17:16,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129182.16666666667, ans=0.125 2024-06-20 04:17:17,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=129182.16666666667, ans=0.125 2024-06-20 04:17:23,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=129200.5, ans=0.125 2024-06-20 04:17:26,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=129200.5, ans=0.2 2024-06-20 04:17:28,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129218.83333333333, ans=0.1 2024-06-20 04:17:34,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=129237.16666666667, ans=0.2 2024-06-20 04:17:39,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=129237.16666666667, ans=0.0 2024-06-20 04:17:40,820 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.017e+02 2.158e+02 2.396e+02 3.987e+02, threshold=4.316e+02, percent-clipped=0.0 2024-06-20 04:17:40,848 INFO [train.py:1028] (0/2) Epoch 7, batch 9800, loss[loss=0.2836, simple_loss=0.3198, pruned_loss=0.1237, over 12882.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3337, pruned_loss=0.1335, over 2545790.29 frames. ], batch size: 39, lr: 8.08e-03, grad_scale: 32.0 2024-06-20 04:17:45,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-20 04:17:48,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.19 vs. 
limit=15.0 2024-06-20 04:17:54,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=129292.16666666667, ans=0.125 2024-06-20 04:17:55,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=129292.16666666667, ans=0.125 2024-06-20 04:17:57,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=129292.16666666667, ans=0.125 2024-06-20 04:18:10,584 INFO [train.py:1028] (0/2) Epoch 7, batch 9850, loss[loss=0.3187, simple_loss=0.3423, pruned_loss=0.1476, over 13049.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3331, pruned_loss=0.1332, over 2538487.59 frames. ], batch size: 102, lr: 8.08e-03, grad_scale: 32.0 2024-06-20 04:18:22,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=129365.5, ans=0.0 2024-06-20 04:18:41,647 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.089e+02 2.217e+02 2.392e+02 3.088e+02, threshold=4.434e+02, percent-clipped=0.0 2024-06-20 04:18:41,676 INFO [train.py:1028] (0/2) Epoch 7, batch 9900, loss[loss=0.3071, simple_loss=0.3445, pruned_loss=0.1349, over 12992.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3328, pruned_loss=0.1335, over 2529511.35 frames. ], batch size: 39, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:18:41,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129438.83333333333, ans=0.1 2024-06-20 04:18:47,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2024-06-20 04:18:53,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129457.16666666667, ans=0.125 2024-06-20 04:18:59,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=129475.5, ans=0.125 2024-06-20 04:19:03,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0 2024-06-20 04:19:04,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2024-06-20 04:19:05,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129493.83333333333, ans=0.125 2024-06-20 04:19:09,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=129512.16666666667, ans=0.0 2024-06-20 04:19:10,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=15.0 2024-06-20 04:19:12,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=129530.5, ans=0.0 2024-06-20 04:19:12,878 INFO [train.py:1028] (0/2) Epoch 7, batch 9950, loss[loss=0.3165, simple_loss=0.3531, pruned_loss=0.14, over 12564.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3317, pruned_loss=0.1335, over 2525473.07 frames. 
], batch size: 29, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:19:16,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=129530.5, ans=0.0 2024-06-20 04:19:21,951 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.621e+01 2024-06-20 04:19:23,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. limit=6.0 2024-06-20 04:19:27,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=129567.16666666667, ans=0.025 2024-06-20 04:19:30,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=129567.16666666667, ans=0.0 2024-06-20 04:19:33,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129585.5, ans=0.1 2024-06-20 04:19:34,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=129585.5, ans=0.0 2024-06-20 04:19:44,299 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.092e+02 2.284e+02 2.594e+02 3.631e+02, threshold=4.569e+02, percent-clipped=0.0 2024-06-20 04:19:44,331 INFO [train.py:1028] (0/2) Epoch 7, batch 10000, loss[loss=0.2671, simple_loss=0.3123, pruned_loss=0.111, over 12715.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3323, pruned_loss=0.1341, over 2485516.24 frames. ], batch size: 22, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:19:47,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=129622.16666666667, ans=0.125 2024-06-20 04:19:56,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=129640.5, ans=0.0 2024-06-20 04:20:05,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=129677.16666666667, ans=0.125 2024-06-20 04:20:14,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=129695.5, ans=0.125 2024-06-20 04:20:16,598 INFO [train.py:1028] (0/2) Epoch 7, batch 10050, loss[loss=0.3074, simple_loss=0.345, pruned_loss=0.1349, over 12539.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3329, pruned_loss=0.1356, over 2444939.63 frames. ], batch size: 22, lr: 8.07e-03, grad_scale: 32.0 2024-06-20 04:20:24,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=129732.16666666667, ans=0.125 2024-06-20 04:20:25,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=129732.16666666667, ans=0.125 2024-06-20 04:20:37,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=129768.83333333333, ans=0.95 2024-06-20 04:20:37,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.63 vs. 
limit=15.0 2024-06-20 04:20:41,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=129787.16666666667, ans=0.07 2024-06-20 04:20:42,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=27.34 vs. limit=22.5 2024-06-20 04:20:43,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=129787.16666666667, ans=0.125 2024-06-20 04:20:47,040 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.086e+02 2.257e+02 2.510e+02 3.489e+02, threshold=4.515e+02, percent-clipped=0.0 2024-06-20 04:20:47,070 INFO [train.py:1028] (0/2) Epoch 7, batch 10100, loss[loss=0.2785, simple_loss=0.3154, pruned_loss=0.1208, over 11359.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3321, pruned_loss=0.1347, over 2425013.76 frames. ], batch size: 17, lr: 8.06e-03, grad_scale: 32.0 2024-06-20 04:20:48,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=129805.5, ans=0.0 2024-06-20 04:21:00,513 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-7.pt 2024-06-20 04:23:04,767 INFO [train.py:1028] (0/2) Epoch 8, batch 0, loss[loss=0.2596, simple_loss=0.3023, pruned_loss=0.1085, over 12973.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3023, pruned_loss=0.1085, over 12973.00 frames. ], batch size: 36, lr: 7.60e-03, grad_scale: 32.0 2024-06-20 04:23:04,769 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 04:23:11,863 INFO [train.py:1060] (0/2) Epoch 8, validation: loss=0.2135, simple_loss=0.2755, pruned_loss=0.07574, over 351949.00 frames. 2024-06-20 04:23:11,863 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16965MB 2024-06-20 04:23:22,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=129855.0, ans=0.125 2024-06-20 04:23:42,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=129910.0, ans=0.0 2024-06-20 04:23:43,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=129910.0, ans=0.2 2024-06-20 04:23:45,375 INFO [train.py:1028] (0/2) Epoch 8, batch 50, loss[loss=0.2882, simple_loss=0.3314, pruned_loss=0.1225, over 12674.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3127, pruned_loss=0.1246, over 573730.62 frames. ], batch size: 29, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:23:48,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=129928.33333333333, ans=0.125 2024-06-20 04:23:49,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2024-06-20 04:24:04,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129983.33333333333, ans=0.1 2024-06-20 04:24:06,424 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 1.946e+02 2.137e+02 2.339e+02 3.831e+02, threshold=4.274e+02, percent-clipped=0.0 2024-06-20 04:24:17,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. 
limit=15.0 2024-06-20 04:24:23,255 INFO [train.py:1028] (0/2) Epoch 8, batch 100, loss[loss=0.2583, simple_loss=0.2961, pruned_loss=0.1102, over 13346.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3102, pruned_loss=0.1233, over 1017349.37 frames. ], batch size: 46, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:24:25,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=130020.0, ans=0.0 2024-06-20 04:24:35,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=130056.66666666667, ans=0.125 2024-06-20 04:24:45,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130075.0, ans=0.1 2024-06-20 04:24:46,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130075.0, ans=0.125 2024-06-20 04:24:47,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130075.0, ans=0.1 2024-06-20 04:24:49,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=15.0 2024-06-20 04:24:50,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=130093.33333333333, ans=0.0 2024-06-20 04:24:55,358 INFO [train.py:1028] (0/2) Epoch 8, batch 150, loss[loss=0.2809, simple_loss=0.3173, pruned_loss=0.1222, over 12769.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3088, pruned_loss=0.1211, over 1364719.90 frames. ], batch size: 29, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:25:13,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=130148.33333333333, ans=0.1 2024-06-20 04:25:16,732 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.845e+02 2.009e+02 2.176e+02 2.733e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 04:25:20,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=130166.66666666667, ans=0.125 2024-06-20 04:25:21,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.74 vs. limit=22.5 2024-06-20 04:25:22,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2024-06-20 04:25:27,791 INFO [train.py:1028] (0/2) Epoch 8, batch 200, loss[loss=0.3006, simple_loss=0.3201, pruned_loss=0.1406, over 12668.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3094, pruned_loss=0.1214, over 1633271.09 frames. 
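
tot_loss is a frame-weighted running average: each batch's loss enters weighted by its frame count, which is why the "over N frames" total grows roughly linearly through the epoch (573730 frames at batch 50, 1017349 at batch 100, 1364719 at batch 150 above). The fractional frame counts suggest older batches are also decayed slightly rather than summed exactly; a sketch under that assumption:

class LossTracker:
    """Frame-weighted running loss, a sketch of the tot_loss bookkeeping.

    The fractional frame counts in the log hint that old batches are decayed
    a little each step; decay=1.0 would give a plain cumulative average.
    """
    def __init__(self, decay=0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss_value, num_frames):
        self.loss_sum = self.decay * self.loss_sum + loss_value * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def tot_loss(self):
        return self.loss_sum / max(self.frames, 1.0)
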
], batch size: 202, lr: 7.59e-03, grad_scale: 32.0 2024-06-20 04:25:30,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=130203.33333333333, ans=0.2 2024-06-20 04:25:33,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130221.66666666667, ans=0.1 2024-06-20 04:25:35,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=130221.66666666667, ans=0.125 2024-06-20 04:25:39,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2024-06-20 04:25:41,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-20 04:25:43,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=130240.0, ans=0.125 2024-06-20 04:25:46,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130258.33333333333, ans=0.1 2024-06-20 04:25:59,832 INFO [train.py:1028] (0/2) Epoch 8, batch 250, loss[loss=0.248, simple_loss=0.2769, pruned_loss=0.1096, over 13070.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3085, pruned_loss=0.1207, over 1844636.20 frames. ], batch size: 144, lr: 7.58e-03, grad_scale: 32.0 2024-06-20 04:26:01,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=130295.0, ans=0.0 2024-06-20 04:26:08,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.02 vs. limit=22.5 2024-06-20 04:26:26,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=130350.0, ans=0.125 2024-06-20 04:26:27,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.953e+02 2.197e+02 2.441e+02 3.300e+02, threshold=4.394e+02, percent-clipped=0.0 2024-06-20 04:26:38,019 INFO [train.py:1028] (0/2) Epoch 8, batch 300, loss[loss=0.2824, simple_loss=0.3064, pruned_loss=0.1292, over 13147.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3089, pruned_loss=0.1206, over 2008514.38 frames. ], batch size: 112, lr: 7.58e-03, grad_scale: 32.0 2024-06-20 04:26:41,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.41 vs. 
limit=10.0 2024-06-20 04:26:43,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=130386.66666666667, ans=0.025 2024-06-20 04:26:47,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=130405.0, ans=0.125 2024-06-20 04:26:52,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=130423.33333333333, ans=0.025 2024-06-20 04:26:53,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=130423.33333333333, ans=0.0 2024-06-20 04:26:54,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=130423.33333333333, ans=0.1 2024-06-20 04:27:02,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=130441.66666666667, ans=0.125 2024-06-20 04:27:09,903 INFO [train.py:1028] (0/2) Epoch 8, batch 350, loss[loss=0.2938, simple_loss=0.3339, pruned_loss=0.1268, over 12847.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3084, pruned_loss=0.1201, over 2138186.96 frames. ], batch size: 33, lr: 7.58e-03, grad_scale: 32.0 2024-06-20 04:27:21,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=130496.66666666667, ans=0.0 2024-06-20 04:27:25,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=130515.0, ans=0.0 2024-06-20 04:27:31,389 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.931e+02 2.079e+02 2.331e+02 3.600e+02, threshold=4.157e+02, percent-clipped=0.0 2024-06-20 04:27:34,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=130533.33333333333, ans=0.2 2024-06-20 04:27:38,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.30 vs. limit=22.5 2024-06-20 04:27:42,155 INFO [train.py:1028] (0/2) Epoch 8, batch 400, loss[loss=0.2713, simple_loss=0.3101, pruned_loss=0.1162, over 13318.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.308, pruned_loss=0.1194, over 2239700.25 frames. ], batch size: 63, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:27:42,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=130570.0, ans=0.125 2024-06-20 04:27:42,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=130570.0, ans=0.0 2024-06-20 04:27:44,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=130570.0, ans=0.1 2024-06-20 04:27:53,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=130588.33333333333, ans=0.125 2024-06-20 04:28:10,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=130643.33333333333, ans=0.125 2024-06-20 04:28:17,528 INFO [train.py:1028] (0/2) Epoch 8, batch 450, loss[loss=0.2546, simple_loss=0.2923, pruned_loss=0.1085, over 13251.00 frames. 
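
Every scaling.py:214 line prints one ScheduledFloat: a module hyperparameter (dropout_p, skip rates, bypass scale_min, const_attention_rate, ...) whose value "ans" is a piecewise-linear function of batch_count, so regularization can start strong and relax as training progresses. A sketch of such a schedule (the breakpoints below are made up for illustration; icefall's defaults differ per parameter):

class ScheduledFloatSketch:
    """Piecewise-linear value of batch_count: a sketch of what the
    scaling.py:214 lines are printing (name, batch_count, ans)."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.1)
        self.points = sorted(points)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

dropout = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout(130496.7))   # 0.1: flat once past the last breakpoint
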
], tot_loss[loss=0.2734, simple_loss=0.3079, pruned_loss=0.1195, over 2313492.07 frames. ], batch size: 67, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:28:17,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=130661.66666666667, ans=0.0 2024-06-20 04:28:18,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=130661.66666666667, ans=0.125 2024-06-20 04:28:18,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=130661.66666666667, ans=0.1 2024-06-20 04:28:29,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=130680.0, ans=0.2 2024-06-20 04:28:37,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=130698.33333333333, ans=0.025 2024-06-20 04:28:41,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130716.66666666667, ans=0.125 2024-06-20 04:28:41,945 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.030e+02 2.195e+02 2.445e+02 3.763e+02, threshold=4.391e+02, percent-clipped=0.0 2024-06-20 04:28:52,975 INFO [train.py:1028] (0/2) Epoch 8, batch 500, loss[loss=0.2825, simple_loss=0.3166, pruned_loss=0.1242, over 13115.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3081, pruned_loss=0.1195, over 2376096.11 frames. ], batch size: 121, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:28:56,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=130753.33333333333, ans=0.09899494936611666 2024-06-20 04:28:56,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=130753.33333333333, ans=0.0 2024-06-20 04:29:08,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130790.0, ans=0.125 2024-06-20 04:29:19,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.94 vs. limit=22.5 2024-06-20 04:29:22,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=130826.66666666667, ans=0.125 2024-06-20 04:29:24,573 INFO [train.py:1028] (0/2) Epoch 8, batch 550, loss[loss=0.2693, simple_loss=0.3005, pruned_loss=0.119, over 12888.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3077, pruned_loss=0.1193, over 2420704.89 frames. ], batch size: 158, lr: 7.57e-03, grad_scale: 32.0 2024-06-20 04:29:28,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=130845.0, ans=0.0 2024-06-20 04:29:41,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130881.66666666667, ans=0.1 2024-06-20 04:29:45,675 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.935e+02 2.087e+02 2.296e+02 3.128e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 04:29:47,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.76 vs. 
limit=15.0 2024-06-20 04:29:55,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=15.0 2024-06-20 04:29:56,725 INFO [train.py:1028] (0/2) Epoch 8, batch 600, loss[loss=0.2639, simple_loss=0.2913, pruned_loss=0.1182, over 13040.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3064, pruned_loss=0.1186, over 2458843.30 frames. ], batch size: 144, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:30:23,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=130991.66666666667, ans=0.125 2024-06-20 04:30:30,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131010.0, ans=0.1 2024-06-20 04:30:33,023 INFO [train.py:1028] (0/2) Epoch 8, batch 650, loss[loss=0.2782, simple_loss=0.3247, pruned_loss=0.1159, over 13221.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.306, pruned_loss=0.1179, over 2489414.33 frames. ], batch size: 59, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:30:40,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.97 vs. limit=22.5 2024-06-20 04:30:41,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=131028.33333333333, ans=0.0 2024-06-20 04:30:46,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=131046.66666666667, ans=0.0 2024-06-20 04:30:57,407 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.852e+02 1.968e+02 2.136e+02 2.544e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-20 04:31:02,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131101.66666666666, ans=0.1 2024-06-20 04:31:03,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=131101.66666666666, ans=0.025 2024-06-20 04:31:08,106 INFO [train.py:1028] (0/2) Epoch 8, batch 700, loss[loss=0.2453, simple_loss=0.2922, pruned_loss=0.09916, over 13257.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3055, pruned_loss=0.1178, over 2512177.39 frames. ], batch size: 46, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:31:10,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=131120.0, ans=0.125 2024-06-20 04:31:34,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131193.33333333334, ans=0.1 2024-06-20 04:31:40,208 INFO [train.py:1028] (0/2) Epoch 8, batch 750, loss[loss=0.2702, simple_loss=0.312, pruned_loss=0.1142, over 13248.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3061, pruned_loss=0.1179, over 2527288.43 frames. ], batch size: 63, lr: 7.56e-03, grad_scale: 32.0 2024-06-20 04:31:55,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=13.20 vs. 
limit=15.0 2024-06-20 04:32:01,249 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 1.971e+02 2.173e+02 2.578e+02 4.718e+02, threshold=4.347e+02, percent-clipped=3.0 2024-06-20 04:32:03,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=131266.66666666666, ans=0.0 2024-06-20 04:32:12,325 INFO [train.py:1028] (0/2) Epoch 8, batch 800, loss[loss=0.2411, simple_loss=0.2889, pruned_loss=0.09662, over 12951.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3057, pruned_loss=0.1175, over 2539935.20 frames. ], batch size: 36, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:32:14,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=131303.33333333334, ans=0.125 2024-06-20 04:32:18,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=131321.66666666666, ans=0.125 2024-06-20 04:32:39,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.11 vs. limit=22.5 2024-06-20 04:32:41,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=131358.33333333334, ans=0.125 2024-06-20 04:32:43,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=131376.66666666666, ans=0.0 2024-06-20 04:32:47,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=26.16 vs. limit=22.5 2024-06-20 04:32:51,029 INFO [train.py:1028] (0/2) Epoch 8, batch 850, loss[loss=0.2703, simple_loss=0.3065, pruned_loss=0.1171, over 13123.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3045, pruned_loss=0.1166, over 2550885.66 frames. ], batch size: 95, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:32:51,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=131395.0, ans=0.125 2024-06-20 04:32:56,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=131413.33333333334, ans=10.0 2024-06-20 04:32:58,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=131413.33333333334, ans=0.2 2024-06-20 04:32:58,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=131413.33333333334, ans=0.125 2024-06-20 04:33:00,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=131413.33333333334, ans=0.0 2024-06-20 04:33:11,869 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.878e+02 2.052e+02 2.275e+02 3.212e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 04:33:18,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=12.0 2024-06-20 04:33:22,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.15 vs. limit=22.5 2024-06-20 04:33:23,107 INFO [train.py:1028] (0/2) Epoch 8, batch 900, loss[loss=0.2594, simple_loss=0.2992, pruned_loss=0.1098, over 12898.00 frames. 
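
The scaling.py:1023 lines fire when a Whiten module finds its feature covariance drifting away from isotropic: metric is a scale-invariant statistic that equals 1.0 for perfectly white (identity-covariance) features and grows with the eigenvalue spread, and entries are logged when it approaches or crosses the limit, as in metric=26.16 vs. limit=22.5 for self_attn2.whiten above. Roughly the statistic, as a sketch rather than icefall's exact code:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Sketch of the 'metric' in the Whitening log lines: >= 1.0, with
    equality when the per-group feature covariance is proportional to the
    identity, and larger values for more anisotropic features."""
    n, c = x.shape                                      # (frames, channels)
    g = c // num_groups
    x = x.reshape(n, num_groups, g).permute(1, 0, 2)    # (groups, frames, chans)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n        # (groups, g, g)
    diag_mean = cov.diagonal(dim1=1, dim2=2).mean()
    sq_mean = (cov ** 2).sum(dim=(1, 2)).mean() / g
    return sq_mean / (diag_mean ** 2 + 1e-20)

x = torch.randn(400, 256)
print(whitening_metric(x))   # slightly above 1.0 (sampling noise only)
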
], tot_loss[loss=0.2697, simple_loss=0.3052, pruned_loss=0.1171, over 2555637.81 frames. ], batch size: 36, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:33:24,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=131486.66666666666, ans=0.025 2024-06-20 04:33:25,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=131486.66666666666, ans=0.0 2024-06-20 04:33:25,733 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:33:26,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=131486.66666666666, ans=0.1 2024-06-20 04:33:48,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2024-06-20 04:33:50,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=131560.0, ans=0.0 2024-06-20 04:33:55,908 INFO [train.py:1028] (0/2) Epoch 8, batch 950, loss[loss=0.2533, simple_loss=0.3029, pruned_loss=0.1019, over 12921.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3058, pruned_loss=0.1175, over 2558205.54 frames. ], batch size: 39, lr: 7.55e-03, grad_scale: 32.0 2024-06-20 04:33:59,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=131578.33333333334, ans=0.025 2024-06-20 04:34:16,855 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.900e+02 2.110e+02 2.325e+02 3.310e+02, threshold=4.221e+02, percent-clipped=0.0 2024-06-20 04:34:25,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=131651.66666666666, ans=0.2 2024-06-20 04:34:28,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=131651.66666666666, ans=0.05 2024-06-20 04:34:30,975 INFO [train.py:1028] (0/2) Epoch 8, batch 1000, loss[loss=0.2957, simple_loss=0.33, pruned_loss=0.1307, over 13292.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3058, pruned_loss=0.1181, over 2561099.67 frames. ], batch size: 49, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:34:33,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=131670.0, ans=0.125 2024-06-20 04:34:41,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=131688.33333333334, ans=0.025 2024-06-20 04:34:57,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131725.0, ans=0.1 2024-06-20 04:35:00,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=131743.33333333334, ans=0.125 2024-06-20 04:35:06,773 INFO [train.py:1028] (0/2) Epoch 8, batch 1050, loss[loss=0.294, simple_loss=0.3303, pruned_loss=0.1288, over 13188.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3061, pruned_loss=0.118, over 2564977.68 frames. 
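
The batch size swings an order of magnitude (17 to 304 earlier in the log, 39 to 77 in the entries above) because the sampler packs cuts up to a fixed total duration per batch rather than a fixed count, so batches of short utterances hold many more of them. A sketch of duration-capped batching (the 550 s cap below is an assumption for illustration; lhotse's DynamicBucketingSampler additionally groups cuts of similar length into buckets first):

def duration_capped_batches(cut_durations, max_duration=550.0):
    """Group utterances so each batch stays under max_duration seconds of
    audio: a sketch of why 'batch size' varies so widely in the log."""
    batch, batch_seconds = [], 0.0
    for idx, dur in enumerate(cut_durations):
        if batch and batch_seconds + dur > max_duration:
            yield batch
            batch, batch_seconds = [], 0.0
        batch.append(idx)
        batch_seconds += dur
    if batch:
        yield batch

# 2-second cuts pack ~275 to a batch; 25-second cuts only ~22
short = [2.0] * 1000
print(len(next(duration_capped_batches(short))))   # 275
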
], batch size: 77, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:35:12,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=131761.66666666666, ans=0.125 2024-06-20 04:35:13,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131780.0, ans=0.1 2024-06-20 04:35:14,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.84 vs. limit=6.0 2024-06-20 04:35:15,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=131780.0, ans=0.125 2024-06-20 04:35:27,624 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.910e+02 2.088e+02 2.447e+02 3.403e+02, threshold=4.176e+02, percent-clipped=0.0 2024-06-20 04:35:28,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.95 vs. limit=15.0 2024-06-20 04:35:30,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.74 vs. limit=15.0 2024-06-20 04:35:39,422 INFO [train.py:1028] (0/2) Epoch 8, batch 1100, loss[loss=0.2803, simple_loss=0.3108, pruned_loss=0.1249, over 13236.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3062, pruned_loss=0.118, over 2568853.10 frames. ], batch size: 52, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:35:42,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=131853.33333333334, ans=0.125 2024-06-20 04:35:43,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=131853.33333333334, ans=0.125 2024-06-20 04:35:47,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.76 vs. limit=15.0 2024-06-20 04:35:58,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=131908.33333333334, ans=0.0 2024-06-20 04:36:09,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.42 vs. limit=22.5 2024-06-20 04:36:11,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=131945.0, ans=0.0 2024-06-20 04:36:11,958 INFO [train.py:1028] (0/2) Epoch 8, batch 1150, loss[loss=0.2801, simple_loss=0.3118, pruned_loss=0.1242, over 13248.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.307, pruned_loss=0.1187, over 2570709.02 frames. ], batch size: 52, lr: 7.54e-03, grad_scale: 32.0 2024-06-20 04:36:23,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=131963.33333333334, ans=0.0 2024-06-20 04:36:29,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=131981.66666666666, ans=0.0 2024-06-20 04:36:32,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.38 vs. 
limit=12.0 2024-06-20 04:36:33,244 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-72000.pt 2024-06-20 04:36:44,050 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.893e+02 2.037e+02 2.311e+02 3.157e+02, threshold=4.073e+02, percent-clipped=0.0 2024-06-20 04:36:45,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=132000.0, ans=0.125 2024-06-20 04:36:53,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=132018.33333333334, ans=0.125 2024-06-20 04:36:55,113 INFO [train.py:1028] (0/2) Epoch 8, batch 1200, loss[loss=0.287, simple_loss=0.32, pruned_loss=0.1271, over 13164.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3069, pruned_loss=0.1187, over 2573581.01 frames. ], batch size: 77, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:37:26,805 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=8.713e+01 2024-06-20 04:37:26,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=132128.33333333334, ans=0.0 2024-06-20 04:37:27,365 INFO [train.py:1028] (0/2) Epoch 8, batch 1250, loss[loss=0.2541, simple_loss=0.2926, pruned_loss=0.1078, over 13206.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3061, pruned_loss=0.1181, over 2582532.31 frames. ], batch size: 112, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:37:35,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=132146.66666666666, ans=0.125 2024-06-20 04:37:43,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=132165.0, ans=0.2 2024-06-20 04:37:49,085 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.880e+02 1.982e+02 2.228e+02 3.428e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 04:37:55,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=132201.66666666666, ans=0.2 2024-06-20 04:37:57,204 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=7.514e+01 2024-06-20 04:37:59,694 INFO [train.py:1028] (0/2) Epoch 8, batch 1300, loss[loss=0.3027, simple_loss=0.3259, pruned_loss=0.1398, over 12732.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3057, pruned_loss=0.1178, over 2582667.73 frames. ], batch size: 176, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:38:09,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.32 vs. 
limit=12.0 2024-06-20 04:38:22,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=132275.0, ans=0.125 2024-06-20 04:38:22,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=132275.0, ans=0.09899494936611666 2024-06-20 04:38:33,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=132293.33333333334, ans=0.07 2024-06-20 04:38:34,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=132293.33333333334, ans=0.125 2024-06-20 04:38:36,286 INFO [train.py:1028] (0/2) Epoch 8, batch 1350, loss[loss=0.2663, simple_loss=0.3116, pruned_loss=0.1105, over 13156.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3066, pruned_loss=0.1181, over 2585586.85 frames. ], batch size: 59, lr: 7.53e-03, grad_scale: 32.0 2024-06-20 04:38:37,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=132311.66666666666, ans=0.0 2024-06-20 04:38:43,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=132330.0, ans=0.125 2024-06-20 04:38:52,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=15.0 2024-06-20 04:38:56,108 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:39:01,737 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.826e+02 2.016e+02 2.228e+02 3.319e+02, threshold=4.032e+02, percent-clipped=0.0 2024-06-20 04:39:02,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132366.66666666666, ans=0.1 2024-06-20 04:39:07,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2024-06-20 04:39:09,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=132385.0, ans=0.025 2024-06-20 04:39:10,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=132385.0, ans=0.025 2024-06-20 04:39:13,075 INFO [train.py:1028] (0/2) Epoch 8, batch 1400, loss[loss=0.2638, simple_loss=0.3119, pruned_loss=0.1079, over 12547.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3064, pruned_loss=0.118, over 2587414.73 frames. ], batch size: 25, lr: 7.52e-03, grad_scale: 32.0 2024-06-20 04:39:14,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=132403.33333333334, ans=0.0 2024-06-20 04:39:17,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.04 vs. 
limit=15.0 2024-06-20 04:39:18,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=132421.66666666666, ans=0.09899494936611666 2024-06-20 04:39:19,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132421.66666666666, ans=0.1 2024-06-20 04:39:23,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132421.66666666666, ans=0.1 2024-06-20 04:39:27,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=132440.0, ans=0.5 2024-06-20 04:39:34,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.22 vs. limit=15.0 2024-06-20 04:39:38,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=132476.66666666666, ans=0.0 2024-06-20 04:39:41,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.78 vs. limit=22.5 2024-06-20 04:39:44,570 INFO [train.py:1028] (0/2) Epoch 8, batch 1450, loss[loss=0.2676, simple_loss=0.3004, pruned_loss=0.1174, over 13067.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3059, pruned_loss=0.1178, over 2586722.75 frames. ], batch size: 121, lr: 7.52e-03, grad_scale: 32.0 2024-06-20 04:40:00,031 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2024-06-20 04:40:05,495 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 1.890e+02 2.006e+02 2.301e+02 3.397e+02, threshold=4.012e+02, percent-clipped=0.0 2024-06-20 04:40:07,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132550.0, ans=0.1 2024-06-20 04:40:08,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=132550.0, ans=0.0 2024-06-20 04:40:13,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=12.0 2024-06-20 04:40:16,333 INFO [train.py:1028] (0/2) Epoch 8, batch 1500, loss[loss=0.3004, simple_loss=0.3355, pruned_loss=0.1327, over 13249.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3058, pruned_loss=0.118, over 2588595.98 frames. 
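
The learning rate decays smoothly inside an epoch (8.11e-03 down to 8.06e-03 late in epoch 7) and then drops discontinuously at the boundary (7.60e-03 at epoch 8, batch 0), which matches Zipformer's Eden scheduler: the rate is a product of a step-count factor and an epoch-count factor. A sketch of the formula (the exact step and epoch indexing in icefall may be offset slightly from what is assumed here):

def eden_lr(base_lr, step, epoch, lr_batches=7500.0, lr_epochs=3.5):
    """Eden-style learning rate (sketch): separate smooth decay factors in
    the step count and the epoch count, matching the slow in-epoch drift
    and the jump at the epoch boundary seen in the log."""
    step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * step_factor * epoch_factor

# the jump at an epoch boundary comes purely from epoch_factor:
print(eden_lr(0.035, step=72000, epoch=6))   # ~8.0e-03, near late-epoch-7 values
print(eden_lr(0.035, step=72000, epoch=7))   # ~7.5e-03: a step down, as logged
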
], batch size: 83, lr: 7.52e-03, grad_scale: 64.0 2024-06-20 04:40:19,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=132586.66666666666, ans=0.0 2024-06-20 04:40:28,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=132605.0, ans=0.125 2024-06-20 04:40:32,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132623.33333333334, ans=0.0 2024-06-20 04:40:34,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=132623.33333333334, ans=0.125 2024-06-20 04:40:37,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=132623.33333333334, ans=0.1 2024-06-20 04:40:54,456 INFO [train.py:1028] (0/2) Epoch 8, batch 1550, loss[loss=0.2556, simple_loss=0.2941, pruned_loss=0.1085, over 13141.00 frames. ], tot_loss[loss=0.271, simple_loss=0.306, pruned_loss=0.118, over 2584397.63 frames. ], batch size: 103, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:40:57,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=132678.33333333334, ans=0.125 2024-06-20 04:41:01,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132696.66666666666, ans=0.125 2024-06-20 04:41:06,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=132696.66666666666, ans=0.0 2024-06-20 04:41:07,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132715.0, ans=0.1 2024-06-20 04:41:09,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=132715.0, ans=0.025 2024-06-20 04:41:11,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=132715.0, ans=0.125 2024-06-20 04:41:15,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=132733.33333333334, ans=0.2 2024-06-20 04:41:16,202 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.869e+02 2.006e+02 2.177e+02 2.963e+02, threshold=4.011e+02, percent-clipped=0.0 2024-06-20 04:41:25,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=132751.66666666666, ans=0.125 2024-06-20 04:41:26,729 INFO [train.py:1028] (0/2) Epoch 8, batch 1600, loss[loss=0.2866, simple_loss=0.322, pruned_loss=0.1256, over 13101.00 frames. ], tot_loss[loss=0.271, simple_loss=0.306, pruned_loss=0.118, over 2579806.96 frames. 
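
Two kinds of checkpoints appear in the log: epoch-7.pt written at the epoch boundary, and checkpoint-72000.pt written mid-epoch once the global batch counter reaches a fixed save interval (72000 is consistent with saving every few thousand steps). A sketch of the interval saves with pruning of old files; save_every_n and keep_last_k mirror the usual icefall flag names, but their values here are assumptions:

from pathlib import Path
import torch

def maybe_save(model, exp_dir, batch_idx_train, save_every_n=4000, keep_last_k=30):
    """Save checkpoint-<N>.pt every save_every_n training batches and prune
    the oldest ones (sketch; epoch-<E>.pt files are written separately at
    epoch ends and are not pruned here)."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    exp_dir = Path(exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)
    torch.save({"model": model.state_dict(),
                "batch_idx_train": batch_idx_train},
               exp_dir / f"checkpoint-{batch_idx_train}.pt")
    old = sorted(exp_dir.glob("checkpoint-*.pt"),
                 key=lambda p: int(p.stem.split("-")[1]))
    for p in old[:-keep_last_k]:   # keep only the newest keep_last_k saves
        p.unlink()
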
], batch size: 77, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:41:34,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132788.33333333334, ans=0.125 2024-06-20 04:41:42,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=132806.66666666666, ans=15.0 2024-06-20 04:41:43,834 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.05 vs. limit=15.0 2024-06-20 04:41:51,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=132843.33333333334, ans=0.2 2024-06-20 04:41:52,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.42 vs. limit=15.0 2024-06-20 04:41:53,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=132843.33333333334, ans=0.05 2024-06-20 04:41:53,247 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.48 vs. limit=15.0 2024-06-20 04:41:58,173 INFO [train.py:1028] (0/2) Epoch 8, batch 1650, loss[loss=0.267, simple_loss=0.2986, pruned_loss=0.1177, over 13150.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3064, pruned_loss=0.1184, over 2576406.89 frames. ], batch size: 95, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:42:02,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=132861.66666666666, ans=0.1 2024-06-20 04:42:22,976 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 2.000e+02 2.347e+02 2.779e+02 4.375e+02, threshold=4.694e+02, percent-clipped=3.0 2024-06-20 04:42:34,092 INFO [train.py:1028] (0/2) Epoch 8, batch 1700, loss[loss=0.2572, simple_loss=0.2983, pruned_loss=0.1081, over 12859.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3064, pruned_loss=0.1178, over 2581498.47 frames. ], batch size: 26, lr: 7.51e-03, grad_scale: 64.0 2024-06-20 04:42:48,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=132971.66666666666, ans=0.125 2024-06-20 04:42:59,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=133008.33333333334, ans=0.0 2024-06-20 04:43:06,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=133026.66666666666, ans=0.125 2024-06-20 04:43:08,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=133045.0, ans=0.125 2024-06-20 04:43:09,331 INFO [train.py:1028] (0/2) Epoch 8, batch 1750, loss[loss=0.2719, simple_loss=0.3116, pruned_loss=0.1161, over 12416.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3064, pruned_loss=0.1179, over 2582288.48 frames. ], batch size: 22, lr: 7.50e-03, grad_scale: 64.0 2024-06-20 04:43:19,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=133063.33333333334, ans=0.0 2024-06-20 04:43:29,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.43 vs. 
2024-06-20 04:43:30,761 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 1.879e+02 2.009e+02 2.239e+02 3.803e+02, threshold=4.019e+02, percent-clipped=0.0
2024-06-20 04:43:32,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133100.0, ans=0.1
2024-06-20 04:43:35,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=133118.33333333334, ans=0.125
2024-06-20 04:43:37,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=133118.33333333334, ans=0.2
2024-06-20 04:43:38,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=15.0
2024-06-20 04:43:41,292 INFO [train.py:1028] (0/2) Epoch 8, batch 1800, loss[loss=0.2315, simple_loss=0.2765, pruned_loss=0.09326, over 13199.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3066, pruned_loss=0.1182, over 2583498.18 frames. ], batch size: 67, lr: 7.50e-03, grad_scale: 64.0
2024-06-20 04:43:45,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133136.66666666666, ans=0.1
2024-06-20 04:44:01,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.15 vs. limit=15.0
2024-06-20 04:44:13,798 INFO [train.py:1028] (0/2) Epoch 8, batch 1850, loss[loss=0.2649, simple_loss=0.2957, pruned_loss=0.117, over 13210.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3063, pruned_loss=0.1181, over 2584248.73 frames. ], batch size: 83, lr: 7.50e-03, grad_scale: 64.0
2024-06-20 04:44:18,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=133228.33333333334, ans=0.07
2024-06-20 04:44:31,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=133265.0, ans=0.05
2024-06-20 04:44:36,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=133283.33333333334, ans=0.0
2024-06-20 04:44:38,414 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.898e+02 2.086e+02 2.250e+02 3.242e+02, threshold=4.173e+02, percent-clipped=0.0
2024-06-20 04:44:39,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=133283.33333333334, ans=0.5
2024-06-20 04:44:42,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0
2024-06-20 04:44:51,817 INFO [train.py:1028] (0/2) Epoch 8, batch 1900, loss[loss=0.2578, simple_loss=0.2881, pruned_loss=0.1138, over 13215.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3063, pruned_loss=0.1184, over 2586629.40 frames. ], batch size: 95, lr: 7.50e-03, grad_scale: 64.0
2024-06-20 04:44:52,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=133320.0, ans=0.125
2024-06-20 04:44:52,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.32 vs. limit=12.0
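In the WARNING [optim.py:487] lines, the optimizer summarizes recent gradient norms as five quantiles (min, 25%, median, 75%, max). In every warning in this span the threshold is exactly Clipping_scale times the reported median (e.g. 2.0 x 2.009e+02 = 4.019e+02 just above), and percent-clipped is presumably the share of recent steps whose norm exceeded that threshold. A sketch that reproduces the report under those assumptions:

    import torch

    def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0) -> None:
        # grad_norms: 1-D tensor of gradient norms from recent optimizer steps.
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]   # clipping_scale x median, as observed
        pct = 100.0 * (grad_norms > threshold).float().mean()
        print(f"Clipping_scale={clipping_scale}, grad-norm quartiles "
              + " ".join(f"{v.item():.3e}" for v in q)
              + f", threshold={threshold.item():.3e}, percent-clipped={pct.item():.1f}")

    # Hypothetical usage on a window of 200 recent grad norms:
    clipping_report(torch.randn(200).abs() * 200.0)

percent-clipped=0.0 in most of these warnings says the threshold was rarely hit; the warning is mostly an informational summary rather than a sign of instability.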
2024-06-20 04:45:03,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=133338.33333333334, ans=0.125
2024-06-20 04:45:09,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=133356.66666666666, ans=0.0
2024-06-20 04:45:13,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=133375.0, ans=0.2
2024-06-20 04:45:16,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=133375.0, ans=0.125
2024-06-20 04:45:16,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=133375.0, ans=0.125
2024-06-20 04:45:22,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=133393.33333333334, ans=0.0
2024-06-20 04:45:24,147 INFO [train.py:1028] (0/2) Epoch 8, batch 1950, loss[loss=0.2748, simple_loss=0.3148, pruned_loss=0.1174, over 13248.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.306, pruned_loss=0.1183, over 2592401.60 frames. ], batch size: 52, lr: 7.49e-03, grad_scale: 64.0
2024-06-20 04:45:29,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=15.0
2024-06-20 04:45:32,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.14 vs. limit=15.0
2024-06-20 04:45:33,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133430.0, ans=0.125
2024-06-20 04:45:45,350 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.965e+02 2.146e+02 2.380e+02 3.098e+02, threshold=4.291e+02, percent-clipped=0.0
2024-06-20 04:45:56,515 INFO [train.py:1028] (0/2) Epoch 8, batch 2000, loss[loss=0.2834, simple_loss=0.3294, pruned_loss=0.1187, over 12640.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3065, pruned_loss=0.1186, over 2588559.75 frames. ], batch size: 22, lr: 7.49e-03, grad_scale: 64.0
2024-06-20 04:45:59,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=133503.33333333334, ans=0.07
2024-06-20 04:46:06,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.27 vs. limit=6.0
2024-06-20 04:46:06,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=133521.66666666666, ans=0.07
2024-06-20 04:46:09,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.97 vs.
limit=15.0 2024-06-20 04:46:16,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=133540.0, ans=10.0 2024-06-20 04:46:17,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=133540.0, ans=0.2 2024-06-20 04:46:24,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=133576.66666666666, ans=0.125 2024-06-20 04:46:26,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133576.66666666666, ans=0.1 2024-06-20 04:46:26,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=133576.66666666666, ans=0.1 2024-06-20 04:46:28,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=133576.66666666666, ans=0.0 2024-06-20 04:46:31,792 INFO [train.py:1028] (0/2) Epoch 8, batch 2050, loss[loss=0.2502, simple_loss=0.2977, pruned_loss=0.1014, over 12563.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3066, pruned_loss=0.1186, over 2584532.04 frames. ], batch size: 29, lr: 7.49e-03, grad_scale: 64.0 2024-06-20 04:46:44,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=133613.33333333334, ans=0.025 2024-06-20 04:46:45,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133613.33333333334, ans=0.1 2024-06-20 04:46:47,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=133613.33333333334, ans=0.0 2024-06-20 04:46:54,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.09 vs. limit=10.0 2024-06-20 04:46:57,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.843e+02 1.980e+02 2.155e+02 2.736e+02, threshold=3.959e+02, percent-clipped=0.0 2024-06-20 04:47:06,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=133668.33333333334, ans=0.125 2024-06-20 04:47:09,145 INFO [train.py:1028] (0/2) Epoch 8, batch 2100, loss[loss=0.2364, simple_loss=0.2793, pruned_loss=0.09675, over 13235.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3063, pruned_loss=0.1181, over 2586620.45 frames. ], batch size: 59, lr: 7.49e-03, grad_scale: 64.0 2024-06-20 04:47:26,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=133723.33333333334, ans=0.2 2024-06-20 04:47:32,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=133741.66666666666, ans=0.0 2024-06-20 04:47:35,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=133760.0, ans=0.0 2024-06-20 04:47:41,092 INFO [train.py:1028] (0/2) Epoch 8, batch 2150, loss[loss=0.2623, simple_loss=0.3154, pruned_loss=0.1046, over 13347.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3064, pruned_loss=0.1178, over 2589204.98 frames. 
], batch size: 52, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:47:47,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2024-06-20 04:48:02,683 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.825e+02 1.965e+02 2.203e+02 3.024e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 04:48:04,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=133833.33333333334, ans=0.025 2024-06-20 04:48:09,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.67 vs. limit=12.0 2024-06-20 04:48:11,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=133851.66666666666, ans=0.025 2024-06-20 04:48:12,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=133851.66666666666, ans=0.0 2024-06-20 04:48:13,764 INFO [train.py:1028] (0/2) Epoch 8, batch 2200, loss[loss=0.2968, simple_loss=0.3202, pruned_loss=0.1367, over 13164.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3062, pruned_loss=0.1176, over 2589425.27 frames. ], batch size: 83, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:48:20,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133888.33333333334, ans=0.1 2024-06-20 04:48:21,300 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-20 04:48:21,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=133888.33333333334, ans=0.125 2024-06-20 04:48:23,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0 2024-06-20 04:48:38,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=133925.0, ans=0.09899494936611666 2024-06-20 04:48:42,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=133943.33333333334, ans=0.1 2024-06-20 04:48:46,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=133943.33333333334, ans=0.2 2024-06-20 04:48:47,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=133943.33333333334, ans=0.125 2024-06-20 04:48:48,861 INFO [train.py:1028] (0/2) Epoch 8, batch 2250, loss[loss=0.2834, simple_loss=0.3151, pruned_loss=0.1258, over 13219.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3061, pruned_loss=0.1175, over 2587860.40 frames. 
], batch size: 63, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:48:57,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=133980.0, ans=0.0 2024-06-20 04:49:01,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=133980.0, ans=0.125 2024-06-20 04:49:13,133 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.937e+02 2.149e+02 2.614e+02 4.057e+02, threshold=4.298e+02, percent-clipped=1.0 2024-06-20 04:49:16,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=134016.66666666666, ans=0.025 2024-06-20 04:49:17,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.15 vs. limit=15.0 2024-06-20 04:49:17,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=134035.0, ans=0.5 2024-06-20 04:49:18,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=134035.0, ans=0.125 2024-06-20 04:49:24,212 INFO [train.py:1028] (0/2) Epoch 8, batch 2300, loss[loss=0.2542, simple_loss=0.2926, pruned_loss=0.1078, over 12999.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3061, pruned_loss=0.1177, over 2582292.08 frames. ], batch size: 33, lr: 7.48e-03, grad_scale: 64.0 2024-06-20 04:49:26,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134053.33333333334, ans=0.1 2024-06-20 04:49:56,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.88 vs. limit=15.0 2024-06-20 04:49:57,043 INFO [train.py:1028] (0/2) Epoch 8, batch 2350, loss[loss=0.2708, simple_loss=0.302, pruned_loss=0.1198, over 13226.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3057, pruned_loss=0.1177, over 2585255.99 frames. ], batch size: 67, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:49:57,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=134145.0, ans=0.0 2024-06-20 04:50:00,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134145.0, ans=0.1 2024-06-20 04:50:11,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=134181.66666666666, ans=0.125 2024-06-20 04:50:23,213 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.881e+02 2.019e+02 2.216e+02 2.953e+02, threshold=4.038e+02, percent-clipped=0.0 2024-06-20 04:50:34,375 INFO [train.py:1028] (0/2) Epoch 8, batch 2400, loss[loss=0.2662, simple_loss=0.3069, pruned_loss=0.1128, over 13269.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3049, pruned_loss=0.1174, over 2588235.34 frames. 
], batch size: 46, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:50:43,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=134255.0, ans=0.125 2024-06-20 04:51:01,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=134310.0, ans=0.0 2024-06-20 04:51:02,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=134310.0, ans=0.07 2024-06-20 04:51:04,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0 2024-06-20 04:51:08,841 INFO [train.py:1028] (0/2) Epoch 8, batch 2450, loss[loss=0.2732, simple_loss=0.3113, pruned_loss=0.1175, over 13277.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3041, pruned_loss=0.1175, over 2583934.14 frames. ], batch size: 63, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:51:12,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=134328.33333333334, ans=0.04949747468305833 2024-06-20 04:51:16,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=134346.66666666666, ans=0.125 2024-06-20 04:51:17,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=12.0 2024-06-20 04:51:18,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=134346.66666666666, ans=0.125 2024-06-20 04:51:30,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=134383.33333333334, ans=0.035 2024-06-20 04:51:30,652 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.874e+02 1.988e+02 2.256e+02 3.124e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 04:51:40,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=134420.0, ans=0.2 2024-06-20 04:51:41,470 INFO [train.py:1028] (0/2) Epoch 8, batch 2500, loss[loss=0.2723, simple_loss=0.3077, pruned_loss=0.1185, over 13177.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3034, pruned_loss=0.117, over 2587175.64 frames. ], batch size: 83, lr: 7.47e-03, grad_scale: 64.0 2024-06-20 04:52:12,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=134493.33333333334, ans=0.0 2024-06-20 04:52:14,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2024-06-20 04:52:15,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=134493.33333333334, ans=10.0 2024-06-20 04:52:15,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=134493.33333333334, ans=0.125 2024-06-20 04:52:17,176 INFO [train.py:1028] (0/2) Epoch 8, batch 2550, loss[loss=0.2549, simple_loss=0.2965, pruned_loss=0.1066, over 12472.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.302, pruned_loss=0.1161, over 2586277.66 frames. 
], batch size: 22, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:52:40,930 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2024-06-20 04:52:41,912 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.867e+02 2.044e+02 2.234e+02 2.958e+02, threshold=4.088e+02, percent-clipped=0.0 2024-06-20 04:52:42,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-20 04:52:45,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.45 vs. limit=15.0 2024-06-20 04:52:48,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=134585.0, ans=0.125 2024-06-20 04:52:48,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=134585.0, ans=0.125 2024-06-20 04:52:49,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.64 vs. limit=22.5 2024-06-20 04:52:50,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=134585.0, ans=0.125 2024-06-20 04:52:53,207 INFO [train.py:1028] (0/2) Epoch 8, batch 2600, loss[loss=0.2594, simple_loss=0.2968, pruned_loss=0.111, over 13251.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3008, pruned_loss=0.1159, over 2585860.44 frames. ], batch size: 52, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:53:09,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=134640.0, ans=0.125 2024-06-20 04:53:09,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=134640.0, ans=0.0 2024-06-20 04:53:19,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-20 04:53:20,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=134676.66666666666, ans=0.125 2024-06-20 04:53:25,929 INFO [train.py:1028] (0/2) Epoch 8, batch 2650, loss[loss=0.2379, simple_loss=0.2667, pruned_loss=0.1046, over 13018.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.2992, pruned_loss=0.1152, over 2587239.47 frames. 
], batch size: 144, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:53:28,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=134695.0, ans=0.125 2024-06-20 04:53:33,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=134713.33333333334, ans=0.125 2024-06-20 04:53:34,570 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 04:53:35,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=134713.33333333334, ans=0.125 2024-06-20 04:53:37,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.38 vs. limit=10.0 2024-06-20 04:53:37,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=134713.33333333334, ans=0.125 2024-06-20 04:53:37,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=134713.33333333334, ans=0.125 2024-06-20 04:53:45,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=134750.0, ans=0.125 2024-06-20 04:53:45,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=134750.0, ans=0.0 2024-06-20 04:53:46,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0 2024-06-20 04:53:47,043 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.870e+02 1.982e+02 2.217e+02 3.323e+02, threshold=3.965e+02, percent-clipped=0.0 2024-06-20 04:53:56,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=134768.33333333334, ans=0.0 2024-06-20 04:53:58,360 INFO [train.py:1028] (0/2) Epoch 8, batch 2700, loss[loss=0.2546, simple_loss=0.2845, pruned_loss=0.1123, over 13245.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.2976, pruned_loss=0.1147, over 2586779.13 frames. ], batch size: 89, lr: 7.46e-03, grad_scale: 64.0 2024-06-20 04:54:00,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=134786.66666666666, ans=0.0 2024-06-20 04:54:00,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=134786.66666666666, ans=10.0 2024-06-20 04:54:05,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=2.59 vs. 
limit=15.0 2024-06-20 04:54:07,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=134805.0, ans=0.2 2024-06-20 04:54:11,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=134805.0, ans=0.0 2024-06-20 04:54:15,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=134823.33333333334, ans=0.025 2024-06-20 04:54:21,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=134841.66666666666, ans=0.2 2024-06-20 04:54:30,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.80 vs. limit=15.0 2024-06-20 04:54:32,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.84 vs. limit=22.5 2024-06-20 04:54:35,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=134860.0, ans=0.125 2024-06-20 04:54:35,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=134860.0, ans=0.5 2024-06-20 04:54:36,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=134878.33333333334, ans=0.0 2024-06-20 04:54:37,050 INFO [train.py:1028] (0/2) Epoch 8, batch 2750, loss[loss=0.2751, simple_loss=0.3037, pruned_loss=0.1233, over 13223.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.2964, pruned_loss=0.1137, over 2583740.24 frames. ], batch size: 43, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:54:37,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=134878.33333333334, ans=0.05 2024-06-20 04:54:50,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=134915.0, ans=0.125 2024-06-20 04:54:57,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=134933.33333333334, ans=0.125 2024-06-20 04:54:58,715 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.772e+02 1.968e+02 2.085e+02 2.881e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-20 04:55:00,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=134933.33333333334, ans=0.0 2024-06-20 04:55:04,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=134951.66666666666, ans=0.0 2024-06-20 04:55:06,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=134951.66666666666, ans=0.0 2024-06-20 04:55:08,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.81 vs. limit=15.0 2024-06-20 04:55:09,800 INFO [train.py:1028] (0/2) Epoch 8, batch 2800, loss[loss=0.2668, simple_loss=0.2877, pruned_loss=0.1229, over 10845.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.2958, pruned_loss=0.1136, over 2581375.02 frames. 
], batch size: 304, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:55:11,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=134970.0, ans=0.125 2024-06-20 04:55:13,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=134970.0, ans=0.2 2024-06-20 04:55:15,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=134988.33333333334, ans=0.2 2024-06-20 04:55:26,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135006.66666666666, ans=0.1 2024-06-20 04:55:27,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.79 vs. limit=15.0 2024-06-20 04:55:39,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=135043.33333333334, ans=0.0 2024-06-20 04:55:41,617 INFO [train.py:1028] (0/2) Epoch 8, batch 2850, loss[loss=0.2905, simple_loss=0.3225, pruned_loss=0.1292, over 13313.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.2953, pruned_loss=0.1135, over 2579499.18 frames. ], batch size: 49, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:55:49,328 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2024-06-20 04:55:49,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=135061.66666666666, ans=0.125 2024-06-20 04:55:57,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=135098.33333333334, ans=0.125 2024-06-20 04:56:05,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=135116.66666666666, ans=0.125 2024-06-20 04:56:05,526 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.884e+02 2.077e+02 2.389e+02 3.806e+02, threshold=4.155e+02, percent-clipped=0.0 2024-06-20 04:56:09,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0 2024-06-20 04:56:13,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=135135.0, ans=0.0 2024-06-20 04:56:19,160 INFO [train.py:1028] (0/2) Epoch 8, batch 2900, loss[loss=0.254, simple_loss=0.2939, pruned_loss=0.1071, over 13135.00 frames. ], tot_loss[loss=0.259, simple_loss=0.293, pruned_loss=0.1125, over 2586860.24 frames. ], batch size: 55, lr: 7.45e-03, grad_scale: 64.0 2024-06-20 04:56:40,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.57 vs. limit=22.5 2024-06-20 04:56:41,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=135208.33333333334, ans=0.125 2024-06-20 04:56:49,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.90 vs. 
limit=6.0
2024-06-20 04:56:51,975 INFO [train.py:1028] (0/2) Epoch 8, batch 2950, loss[loss=0.2513, simple_loss=0.2829, pruned_loss=0.1099, over 13293.00 frames. ], tot_loss[loss=0.259, simple_loss=0.2931, pruned_loss=0.1125, over 2579987.70 frames. ], batch size: 43, lr: 7.44e-03, grad_scale: 64.0
2024-06-20 04:56:53,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=135245.0, ans=0.0
2024-06-20 04:56:56,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=135245.0, ans=0.125
2024-06-20 04:57:05,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=135281.66666666666, ans=0.0
2024-06-20 04:57:09,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=135281.66666666666, ans=0.125
2024-06-20 04:57:13,752 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.806e+02 1.932e+02 2.126e+02 3.138e+02, threshold=3.865e+02, percent-clipped=0.0
2024-06-20 04:57:14,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=135300.0, ans=0.2
2024-06-20 04:57:16,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=135300.0, ans=0.0
2024-06-20 04:57:24,562 INFO [train.py:1028] (0/2) Epoch 8, batch 3000, loss[loss=0.2643, simple_loss=0.2971, pruned_loss=0.1157, over 13216.00 frames. ], tot_loss[loss=0.257, simple_loss=0.2913, pruned_loss=0.1114, over 2578557.57 frames. ], batch size: 59, lr: 7.44e-03, grad_scale: 64.0
2024-06-20 04:57:24,563 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 04:57:28,907 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.0040, 1.8453, 1.7654, 1.4944], device='cuda:0')
2024-06-20 04:57:32,459 INFO [train.py:1060] (0/2) Epoch 8, validation: loss=0.208, simple_loss=0.2702, pruned_loss=0.07292, over 351949.00 frames.
2024-06-20 04:57:32,459 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 16965MB
2024-06-20 04:57:33,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=135336.66666666666, ans=0.2
2024-06-20 04:57:37,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=135336.66666666666, ans=0.07
2024-06-20 04:57:45,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135355.0, ans=0.1
2024-06-20 04:57:54,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=135373.33333333334, ans=0.125
2024-06-20 04:58:05,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.70 vs. limit=15.0
2024-06-20 04:58:06,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=135410.0, ans=0.0
2024-06-20 04:58:08,940 INFO [train.py:1028] (0/2) Epoch 8, batch 3050, loss[loss=0.2294, simple_loss=0.2658, pruned_loss=0.09649, over 13309.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2905, pruned_loss=0.1114, over 2578894.76 frames. ], batch size: 46, lr: 7.44e-03, grad_scale: 64.0
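During the validation pass above, zipformer.py:1858 prints attn_weights_entropy with one value per attention head (four here, matching the four heads of that encoder stack). Entropy of the attention distribution is a standard diagnostic: near 0 means each query attends to a single key, while log(key_len) means uniform attention. A sketch of how such per-head entropies can be computed; the exact batch/query reduction used in zipformer.py is an assumption:

    import torch

    def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
        # attn: (num_heads, query_len, key_len), each row a distribution.
        # Returns the mean entropy (in nats) per head.
        ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (heads, queries)
        return ent.mean(dim=-1)                           # one value per head

    attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
    print(attn_weights_entropy(attn))  # four values, like the tensor above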
2024-06-20 04:58:11,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=135428.33333333334, ans=0.0
2024-06-20 04:58:16,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=135428.33333333334, ans=0.125
2024-06-20 04:58:17,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=135428.33333333334, ans=0.2
2024-06-20 04:58:33,722 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.797e+02 1.907e+02 2.114e+02 2.900e+02, threshold=3.814e+02, percent-clipped=0.0
2024-06-20 04:58:35,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=135483.33333333334, ans=0.2
2024-06-20 04:58:36,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=135483.33333333334, ans=0.125
2024-06-20 04:58:44,694 INFO [train.py:1028] (0/2) Epoch 8, batch 3100, loss[loss=0.2501, simple_loss=0.2806, pruned_loss=0.1098, over 13022.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.2892, pruned_loss=0.1105, over 2579154.07 frames. ], batch size: 144, lr: 7.44e-03, grad_scale: 64.0
2024-06-20 04:58:48,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=135520.0, ans=0.0
2024-06-20 04:58:49,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=135520.0, ans=0.0
2024-06-20 04:58:49,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=135520.0, ans=0.0
2024-06-20 04:58:52,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=135538.33333333334, ans=0.2
2024-06-20 04:59:17,061 INFO [train.py:1028] (0/2) Epoch 8, batch 3150, loss[loss=0.2841, simple_loss=0.304, pruned_loss=0.1321, over 12960.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.2886, pruned_loss=0.1103, over 2581197.72 frames. ], batch size: 158, lr: 7.43e-03, grad_scale: 64.0
2024-06-20 04:59:38,541 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.909e+02 2.068e+02 2.341e+02 3.439e+02, threshold=4.137e+02, percent-clipped=0.0
2024-06-20 04:59:38,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=135666.66666666666, ans=0.125
2024-06-20 04:59:39,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=135666.66666666666, ans=0.0
2024-06-20 04:59:51,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135685.0, ans=0.1
2024-06-20 04:59:52,832 INFO [train.py:1028] (0/2) Epoch 8, batch 3200, loss[loss=0.2385, simple_loss=0.2802, pruned_loss=0.09843, over 13176.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.2872, pruned_loss=0.1096, over 2581448.87 frames. ], batch size: 55, lr: 7.43e-03, grad_scale: 64.0
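The [train.py:1028] lines carry the quantities one usually wants to track: epoch, batch index, per-batch loss, running tot_loss, batch size and learning rate. With one entry per line, a small parser is enough to extract a loss curve; the regex below simply mirrors the line format shown above.

    import re

    # Matches the "[train.py:1028]" lines once the log is one entry per line.
    PAT = re.compile(
        r"INFO \[train\.py:1028\].*?Epoch (\d+), batch (\d+), .*?"
        r"tot_loss\[loss=([\d.]+)"
    )

    def tot_loss_curve(log_path: str):
        points = []
        with open(log_path) as f:
            for line in f:
                m = PAT.search(line)
                if m:
                    epoch, batch, loss = m.groups()
                    points.append((int(epoch), int(batch), float(loss)))
        return points  # e.g. (8, 3200, 0.2532) for the entry just above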
2024-06-20 05:00:04,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135721.66666666666, ans=0.1
2024-06-20 05:00:04,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0
2024-06-20 05:00:15,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.02 vs. limit=10.0
2024-06-20 05:00:23,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=135776.66666666666, ans=0.125
2024-06-20 05:00:29,059 INFO [train.py:1028] (0/2) Epoch 8, batch 3250, loss[loss=0.2531, simple_loss=0.2932, pruned_loss=0.1065, over 13096.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.2862, pruned_loss=0.1093, over 2585490.02 frames. ], batch size: 71, lr: 7.43e-03, grad_scale: 64.0
2024-06-20 05:00:32,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=135795.0, ans=0.125
2024-06-20 05:00:36,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.96 vs. limit=12.0
2024-06-20 05:00:42,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=135813.33333333334, ans=0.07
2024-06-20 05:00:42,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=135813.33333333334, ans=0.0
2024-06-20 05:00:47,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=135831.66666666666, ans=0.04949747468305833
2024-06-20 05:00:52,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=135850.0, ans=0.125
2024-06-20 05:00:52,678 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.820e+02 2.058e+02 2.250e+02 4.292e+02, threshold=4.116e+02, percent-clipped=1.0
2024-06-20 05:00:52,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=135850.0, ans=0.125
2024-06-20 05:00:58,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=135868.33333333334, ans=0.05
2024-06-20 05:01:02,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=135868.33333333334, ans=0.09899494936611666
2024-06-20 05:01:03,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0
2024-06-20 05:01:04,571 INFO [train.py:1028] (0/2) Epoch 8, batch 3300, loss[loss=0.2546, simple_loss=0.2879, pruned_loss=0.1107, over 12764.00 frames. ], tot_loss[loss=0.252, simple_loss=0.2859, pruned_loss=0.1091, over 2584172.83 frames. ], batch size: 176, lr: 7.43e-03, grad_scale: 64.0
2024-06-20 05:01:16,109 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.04 vs.
limit=15.0 2024-06-20 05:01:20,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=135923.33333333334, ans=10.0 2024-06-20 05:01:26,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=1.98 vs. limit=15.0 2024-06-20 05:01:27,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=135941.66666666666, ans=0.0 2024-06-20 05:01:30,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=135960.0, ans=0.125 2024-06-20 05:01:31,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.28 vs. limit=22.5 2024-06-20 05:01:31,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=135960.0, ans=0.2 2024-06-20 05:01:40,414 INFO [train.py:1028] (0/2) Epoch 8, batch 3350, loss[loss=0.2485, simple_loss=0.2768, pruned_loss=0.1101, over 12967.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.2853, pruned_loss=0.109, over 2578451.06 frames. ], batch size: 158, lr: 7.42e-03, grad_scale: 64.0 2024-06-20 05:01:50,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=135996.66666666666, ans=0.125 2024-06-20 05:01:54,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=136015.0, ans=0.125 2024-06-20 05:01:54,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=136015.0, ans=0.2 2024-06-20 05:02:01,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=136033.33333333334, ans=0.125 2024-06-20 05:02:02,193 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.848e+02 1.987e+02 2.217e+02 3.289e+02, threshold=3.973e+02, percent-clipped=0.0 2024-06-20 05:02:15,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136070.0, ans=0.1 2024-06-20 05:02:15,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=136070.0, ans=0.125 2024-06-20 05:02:16,192 INFO [train.py:1028] (0/2) Epoch 8, batch 3400, loss[loss=0.3077, simple_loss=0.3362, pruned_loss=0.1396, over 12345.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.2853, pruned_loss=0.1095, over 2576581.19 frames. 
], batch size: 22, lr: 7.42e-03, grad_scale: 64.0 2024-06-20 05:02:18,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=136070.0, ans=0.125 2024-06-20 05:02:20,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=136070.0, ans=0.025 2024-06-20 05:02:22,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=136088.33333333334, ans=0.125 2024-06-20 05:02:23,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=136088.33333333334, ans=0.025 2024-06-20 05:02:30,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=136106.66666666666, ans=0.5 2024-06-20 05:02:36,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136125.0, ans=0.0 2024-06-20 05:02:37,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.35 vs. limit=15.0 2024-06-20 05:02:47,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=136143.33333333334, ans=0.0 2024-06-20 05:02:49,118 INFO [train.py:1028] (0/2) Epoch 8, batch 3450, loss[loss=0.2722, simple_loss=0.2954, pruned_loss=0.1245, over 12745.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.2845, pruned_loss=0.1091, over 2576833.12 frames. ], batch size: 176, lr: 7.42e-03, grad_scale: 64.0 2024-06-20 05:02:51,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136161.66666666666, ans=0.0 2024-06-20 05:02:52,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=136161.66666666666, ans=0.5 2024-06-20 05:03:07,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136198.33333333334, ans=0.1 2024-06-20 05:03:10,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.781e+02 1.932e+02 2.153e+02 2.844e+02, threshold=3.864e+02, percent-clipped=0.0 2024-06-20 05:03:22,687 INFO [train.py:1028] (0/2) Epoch 8, batch 3500, loss[loss=0.2403, simple_loss=0.2808, pruned_loss=0.09993, over 12833.00 frames. ], tot_loss[loss=0.25, simple_loss=0.2835, pruned_loss=0.1083, over 2576629.98 frames. ], batch size: 33, lr: 7.42e-03, grad_scale: 128.0 2024-06-20 05:03:24,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=136253.33333333334, ans=0.125 2024-06-20 05:03:28,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.22 vs. limit=22.5 2024-06-20 05:03:33,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=136271.66666666666, ans=15.0 2024-06-20 05:03:41,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. 
limit=15.0 2024-06-20 05:03:43,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=136290.0, ans=0.0 2024-06-20 05:03:46,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.83 vs. limit=22.5 2024-06-20 05:03:48,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.76 vs. limit=15.0 2024-06-20 05:03:59,374 INFO [train.py:1028] (0/2) Epoch 8, batch 3550, loss[loss=0.2281, simple_loss=0.2597, pruned_loss=0.09827, over 13152.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.2832, pruned_loss=0.1077, over 2577013.94 frames. ], batch size: 95, lr: 7.41e-03, grad_scale: 128.0 2024-06-20 05:04:00,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=136345.0, ans=0.125 2024-06-20 05:04:05,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.46 vs. limit=15.0 2024-06-20 05:04:11,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=136363.33333333334, ans=0.125 2024-06-20 05:04:16,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.23 vs. limit=15.0 2024-06-20 05:04:17,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=12.0 2024-06-20 05:04:18,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=136381.66666666666, ans=0.125 2024-06-20 05:04:23,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=136400.0, ans=0.0 2024-06-20 05:04:24,122 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.786e+02 1.910e+02 2.334e+02 3.313e+02, threshold=3.819e+02, percent-clipped=0.0 2024-06-20 05:04:34,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.84 vs. limit=15.0 2024-06-20 05:04:35,467 INFO [train.py:1028] (0/2) Epoch 8, batch 3600, loss[loss=0.2366, simple_loss=0.2735, pruned_loss=0.0998, over 12986.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2828, pruned_loss=0.1076, over 2580569.44 frames. ], batch size: 48, lr: 7.41e-03, grad_scale: 128.0 2024-06-20 05:04:47,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136455.0, ans=0.125 2024-06-20 05:04:53,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=136473.33333333334, ans=0.2 2024-06-20 05:05:06,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.81 vs. 
limit=22.5 2024-06-20 05:05:07,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=136510.0, ans=0.125 2024-06-20 05:05:08,159 INFO [train.py:1028] (0/2) Epoch 8, batch 3650, loss[loss=0.2321, simple_loss=0.2581, pruned_loss=0.103, over 13054.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.2833, pruned_loss=0.1078, over 2578726.97 frames. ], batch size: 102, lr: 7.41e-03, grad_scale: 128.0 2024-06-20 05:05:09,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.90 vs. limit=15.0 2024-06-20 05:05:09,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=136528.33333333334, ans=0.125 2024-06-20 05:05:10,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=136528.33333333334, ans=0.125 2024-06-20 05:05:25,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=136565.0, ans=0.125 2024-06-20 05:05:27,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=136583.33333333334, ans=0.09899494936611666 2024-06-20 05:05:30,570 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.720e+02 1.874e+02 2.055e+02 2.444e+02, threshold=3.749e+02, percent-clipped=0.0 2024-06-20 05:05:34,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.71 vs. limit=15.0 2024-06-20 05:05:44,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=136620.0, ans=0.125 2024-06-20 05:05:45,028 INFO [train.py:1028] (0/2) Epoch 8, batch 3700, loss[loss=0.2422, simple_loss=0.287, pruned_loss=0.09866, over 13279.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.2823, pruned_loss=0.1069, over 2585144.83 frames. ], batch size: 72, lr: 7.41e-03, grad_scale: 64.0 2024-06-20 05:05:46,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.91 vs. limit=22.5 2024-06-20 05:05:48,457 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:05:54,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=136638.33333333334, ans=0.125 2024-06-20 05:06:07,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=136675.0, ans=0.025 2024-06-20 05:06:11,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=136675.0, ans=0.1 2024-06-20 05:06:21,503 INFO [train.py:1028] (0/2) Epoch 8, batch 3750, loss[loss=0.2702, simple_loss=0.3052, pruned_loss=0.1176, over 12585.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.2816, pruned_loss=0.1064, over 2587156.60 frames. 
], batch size: 22, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:06:22,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136711.66666666666, ans=0.1 2024-06-20 05:06:27,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=15.0 2024-06-20 05:06:30,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=136730.0, ans=0.2 2024-06-20 05:06:31,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136730.0, ans=0.1 2024-06-20 05:06:31,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=136730.0, ans=0.0 2024-06-20 05:06:32,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=136730.0, ans=0.125 2024-06-20 05:06:36,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=136748.33333333334, ans=0.2 2024-06-20 05:06:42,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=136766.66666666666, ans=0.125 2024-06-20 05:06:43,741 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.818e+02 1.938e+02 2.146e+02 2.930e+02, threshold=3.875e+02, percent-clipped=0.0 2024-06-20 05:06:44,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0 2024-06-20 05:06:54,359 INFO [train.py:1028] (0/2) Epoch 8, batch 3800, loss[loss=0.2208, simple_loss=0.2618, pruned_loss=0.08992, over 13215.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2812, pruned_loss=0.1063, over 2584794.77 frames. ], batch size: 83, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:07:02,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136821.66666666666, ans=0.1 2024-06-20 05:07:04,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=136821.66666666666, ans=0.125 2024-06-20 05:07:14,361 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:07:18,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.32 vs. limit=22.5 2024-06-20 05:07:20,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=136876.66666666666, ans=0.125 2024-06-20 05:07:22,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136876.66666666666, ans=0.1 2024-06-20 05:07:26,626 INFO [train.py:1028] (0/2) Epoch 8, batch 3850, loss[loss=0.2389, simple_loss=0.2674, pruned_loss=0.1052, over 13021.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.2799, pruned_loss=0.1054, over 2584628.55 frames. 
], batch size: 144, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:07:44,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=136931.66666666666, ans=0.125 2024-06-20 05:07:51,550 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.737e+02 1.900e+02 2.204e+02 3.164e+02, threshold=3.801e+02, percent-clipped=0.0 2024-06-20 05:07:58,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=136968.33333333334, ans=0.0 2024-06-20 05:07:59,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136968.33333333334, ans=0.125 2024-06-20 05:08:01,767 INFO [train.py:1028] (0/2) Epoch 8, batch 3900, loss[loss=0.2537, simple_loss=0.2813, pruned_loss=0.113, over 13268.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.2799, pruned_loss=0.1057, over 2587329.89 frames. ], batch size: 83, lr: 7.40e-03, grad_scale: 64.0 2024-06-20 05:08:37,734 INFO [train.py:1028] (0/2) Epoch 8, batch 3950, loss[loss=0.2384, simple_loss=0.2664, pruned_loss=0.1052, over 13087.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.2789, pruned_loss=0.1048, over 2588210.27 frames. ], batch size: 132, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:08:38,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.75 vs. limit=22.5 2024-06-20 05:08:40,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=137078.33333333334, ans=0.125 2024-06-20 05:08:43,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=137096.66666666666, ans=0.0 2024-06-20 05:08:49,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=137096.66666666666, ans=0.025 2024-06-20 05:09:00,223 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.776e+02 1.920e+02 2.146e+02 2.812e+02, threshold=3.840e+02, percent-clipped=0.0 2024-06-20 05:09:01,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=137133.33333333334, ans=0.0 2024-06-20 05:09:08,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=137151.66666666666, ans=0.0 2024-06-20 05:09:10,419 INFO [train.py:1028] (0/2) Epoch 8, batch 4000, loss[loss=0.2354, simple_loss=0.2712, pruned_loss=0.09979, over 12907.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.2785, pruned_loss=0.1049, over 2582962.47 frames. 
], batch size: 39, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:09:15,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=137170.0, ans=0.0 2024-06-20 05:09:24,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=137206.66666666666, ans=0.125 2024-06-20 05:09:32,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=137225.0, ans=0.125 2024-06-20 05:09:33,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=137225.0, ans=0.125 2024-06-20 05:09:41,820 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.65 vs. limit=15.0 2024-06-20 05:09:47,230 INFO [train.py:1028] (0/2) Epoch 8, batch 4050, loss[loss=0.2576, simple_loss=0.2788, pruned_loss=0.1181, over 10957.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2785, pruned_loss=0.1051, over 2580173.26 frames. ], batch size: 305, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:09:54,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2024-06-20 05:10:09,390 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.844e+02 2.040e+02 2.255e+02 2.893e+02, threshold=4.079e+02, percent-clipped=0.0 2024-06-20 05:10:11,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=137316.66666666666, ans=0.0 2024-06-20 05:10:20,080 INFO [train.py:1028] (0/2) Epoch 8, batch 4100, loss[loss=0.2627, simple_loss=0.2899, pruned_loss=0.1177, over 13063.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.2792, pruned_loss=0.1059, over 2577958.76 frames. ], batch size: 102, lr: 7.39e-03, grad_scale: 64.0 2024-06-20 05:10:20,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=137353.33333333334, ans=0.2 2024-06-20 05:10:35,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=137390.0, ans=0.125 2024-06-20 05:10:37,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=137390.0, ans=0.2 2024-06-20 05:10:40,794 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.56 vs. limit=22.5 2024-06-20 05:10:53,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=22.5 2024-06-20 05:10:56,817 INFO [train.py:1028] (0/2) Epoch 8, batch 4150, loss[loss=0.243, simple_loss=0.2813, pruned_loss=0.1023, over 13193.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2783, pruned_loss=0.1053, over 2576230.80 frames. ], batch size: 55, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:10:57,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=137445.0, ans=0.0 2024-06-20 05:11:11,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.76 vs. 
limit=12.0 2024-06-20 05:11:17,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=137500.0, ans=0.2 2024-06-20 05:11:19,358 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.819e+02 1.960e+02 2.131e+02 3.025e+02, threshold=3.921e+02, percent-clipped=0.0 2024-06-20 05:11:21,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=137500.0, ans=0.0 2024-06-20 05:11:29,927 INFO [train.py:1028] (0/2) Epoch 8, batch 4200, loss[loss=0.2418, simple_loss=0.2712, pruned_loss=0.1062, over 13108.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.2781, pruned_loss=0.1052, over 2579387.55 frames. ], batch size: 103, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:11:32,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=137536.66666666666, ans=0.0 2024-06-20 05:11:32,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=137536.66666666666, ans=0.09899494936611666 2024-06-20 05:11:36,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=18.11 vs. limit=15.0 2024-06-20 05:11:38,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=137555.0, ans=0.125 2024-06-20 05:11:52,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=137591.66666666666, ans=0.07 2024-06-20 05:11:52,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=137591.66666666666, ans=0.2 2024-06-20 05:12:06,719 INFO [train.py:1028] (0/2) Epoch 8, batch 4250, loss[loss=0.2437, simple_loss=0.2882, pruned_loss=0.09967, over 13324.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2776, pruned_loss=0.1048, over 2581810.03 frames. ], batch size: 46, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:12:22,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2024-06-20 05:12:25,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=137683.33333333334, ans=0.2 2024-06-20 05:12:28,897 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.736e+02 1.867e+02 2.078e+02 2.917e+02, threshold=3.733e+02, percent-clipped=0.0 2024-06-20 05:12:32,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.51 vs. limit=22.5 2024-06-20 05:12:38,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137701.66666666666, ans=0.1 2024-06-20 05:12:42,929 INFO [train.py:1028] (0/2) Epoch 8, batch 4300, loss[loss=0.2402, simple_loss=0.2773, pruned_loss=0.1015, over 13212.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2771, pruned_loss=0.1046, over 2581550.60 frames. 
], batch size: 59, lr: 7.38e-03, grad_scale: 64.0 2024-06-20 05:12:54,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=137738.33333333334, ans=0.125 2024-06-20 05:12:55,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.76 vs. limit=15.0 2024-06-20 05:12:59,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.15 vs. limit=15.0 2024-06-20 05:13:14,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=137793.33333333334, ans=0.0 2024-06-20 05:13:15,291 INFO [train.py:1028] (0/2) Epoch 8, batch 4350, loss[loss=0.2391, simple_loss=0.2764, pruned_loss=0.1009, over 13210.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.2765, pruned_loss=0.1044, over 2586240.87 frames. ], batch size: 59, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:13:17,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.26 vs. limit=15.0 2024-06-20 05:13:21,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=12.0 2024-06-20 05:13:27,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=137848.33333333334, ans=0.0 2024-06-20 05:13:32,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=137848.33333333334, ans=0.125 2024-06-20 05:13:33,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=137848.33333333334, ans=0.2 2024-06-20 05:13:37,321 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.788e+02 1.998e+02 2.264e+02 3.018e+02, threshold=3.995e+02, percent-clipped=0.0 2024-06-20 05:13:50,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=137903.33333333334, ans=0.125 2024-06-20 05:13:50,881 INFO [train.py:1028] (0/2) Epoch 8, batch 4400, loss[loss=0.2337, simple_loss=0.2706, pruned_loss=0.09836, over 13256.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.276, pruned_loss=0.1042, over 2586586.95 frames. ], batch size: 83, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:13:52,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=137903.33333333334, ans=0.125 2024-06-20 05:13:56,438 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0 2024-06-20 05:13:56,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.86 vs. limit=15.0 2024-06-20 05:13:59,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=22.5 2024-06-20 05:14:00,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.17 vs. 
limit=15.0 2024-06-20 05:14:10,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=137958.33333333334, ans=0.0 2024-06-20 05:14:13,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.18 vs. limit=6.0 2024-06-20 05:14:17,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=137976.66666666666, ans=0.05 2024-06-20 05:14:19,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.78 vs. limit=15.0 2024-06-20 05:14:20,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=137976.66666666666, ans=0.2 2024-06-20 05:14:24,593 INFO [train.py:1028] (0/2) Epoch 8, batch 4450, loss[loss=0.2479, simple_loss=0.285, pruned_loss=0.1054, over 12864.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.2764, pruned_loss=0.1045, over 2581307.73 frames. ], batch size: 33, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:14:36,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=138013.33333333334, ans=0.025 2024-06-20 05:14:49,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=138050.0, ans=0.025 2024-06-20 05:14:49,459 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.796e+02 2.028e+02 2.357e+02 4.135e+02, threshold=4.056e+02, percent-clipped=2.0 2024-06-20 05:14:50,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=138050.0, ans=0.015 2024-06-20 05:14:51,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=138050.0, ans=0.0 2024-06-20 05:14:52,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=138068.33333333334, ans=0.025 2024-06-20 05:14:56,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=138068.33333333334, ans=0.125 2024-06-20 05:14:59,664 INFO [train.py:1028] (0/2) Epoch 8, batch 4500, loss[loss=0.2292, simple_loss=0.2638, pruned_loss=0.09727, over 13224.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2758, pruned_loss=0.1043, over 2585199.63 frames. ], batch size: 89, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:15:03,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=138086.66666666666, ans=0.125 2024-06-20 05:15:07,243 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=15.0 2024-06-20 05:15:07,717 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.64 vs. 
limit=15.0 2024-06-20 05:15:17,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=138123.33333333334, ans=0.1 2024-06-20 05:15:24,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. limit=6.0 2024-06-20 05:15:25,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.98 vs. limit=15.0 2024-06-20 05:15:27,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=138160.0, ans=0.0 2024-06-20 05:15:32,262 INFO [train.py:1028] (0/2) Epoch 8, batch 4550, loss[loss=0.2399, simple_loss=0.2796, pruned_loss=0.1001, over 13293.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.2763, pruned_loss=0.1044, over 2588596.18 frames. ], batch size: 52, lr: 7.37e-03, grad_scale: 64.0 2024-06-20 05:15:32,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=138178.33333333334, ans=0.125 2024-06-20 05:15:55,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.38 vs. limit=15.0 2024-06-20 05:15:59,807 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.708e+02 1.866e+02 2.088e+02 3.348e+02, threshold=3.731e+02, percent-clipped=0.0 2024-06-20 05:16:03,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2024-06-20 05:16:10,328 INFO [train.py:1028] (0/2) Epoch 8, batch 4600, loss[loss=0.2529, simple_loss=0.2771, pruned_loss=0.1144, over 12515.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.2772, pruned_loss=0.1047, over 2583705.03 frames. ], batch size: 202, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:16:13,132 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.857e+02 2024-06-20 05:16:14,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138270.0, ans=0.1 2024-06-20 05:16:14,865 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2024-06-20 05:16:21,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138288.33333333334, ans=0.1 2024-06-20 05:16:25,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=138306.66666666666, ans=0.0 2024-06-20 05:16:27,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=138306.66666666666, ans=0.0 2024-06-20 05:16:35,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=12.0 2024-06-20 05:16:47,226 INFO [train.py:1028] (0/2) Epoch 8, batch 4650, loss[loss=0.2533, simple_loss=0.2827, pruned_loss=0.1119, over 13094.00 frames. ], tot_loss[loss=0.243, simple_loss=0.2769, pruned_loss=0.1045, over 2586397.50 frames. 
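Note: the dense scaling.py:214 stream shows ScheduledFloat values. Many regularization knobs (dropout rates, balancer probabilities, skip rates, bypass scale minimums) are piecewise-linear functions of batch_count rather than constants. A minimal stand-in with the same idea (icefall's real class also supports defaults and arithmetic on schedules):

```python
class ScheduledFloat:
    """Piecewise-linear schedule over batch_count, e.g.
    ScheduledFloat((0.0, 0.3), (20000.0, 0.1)) decays 0.3 -> 0.1."""

    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) breakpoints
        self.batch_count = 0.0

    def value(self) -> float:
        pts = self.points
        if self.batch_count <= pts[0][0]:
            return pts[0][1]
        if self.batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= self.batch_count <= x1:
                frac = (self.batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)

dropout = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
dropout.batch_count = 138178.0   # as in the log lines above
print(dropout.value())           # -> 0.1 (past the final breakpoint)
```

By batch_count ≈ 138k most of these schedules have passed their final breakpoint, which is why the same ans= values (0.125, 0.1, 0.025, ...) repeat throughout this part of the log.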
], batch size: 132, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:16:52,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=138361.66666666666, ans=0.125 2024-06-20 05:16:53,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=138380.0, ans=0.125 2024-06-20 05:16:54,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.21 vs. limit=10.0 2024-06-20 05:17:03,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=138398.33333333334, ans=0.125 2024-06-20 05:17:03,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 05:17:09,876 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.740e+02 1.898e+02 2.187e+02 3.144e+02, threshold=3.796e+02, percent-clipped=0.0 2024-06-20 05:17:14,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=138435.0, ans=0.0 2024-06-20 05:17:20,908 INFO [train.py:1028] (0/2) Epoch 8, batch 4700, loss[loss=0.2371, simple_loss=0.2766, pruned_loss=0.09882, over 12420.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2771, pruned_loss=0.1045, over 2583161.74 frames. ], batch size: 25, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:17:28,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=138471.66666666666, ans=0.025 2024-06-20 05:17:43,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=138508.33333333334, ans=0.125 2024-06-20 05:17:56,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=138526.66666666666, ans=0.125 2024-06-20 05:17:58,747 INFO [train.py:1028] (0/2) Epoch 8, batch 4750, loss[loss=0.2519, simple_loss=0.2783, pruned_loss=0.1127, over 12556.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.277, pruned_loss=0.1048, over 2580028.95 frames. 
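Note: each train.py:1028 line pairs the current batch's loss (over ~13k frames) with a running tot_loss over ~2.58 million frames. The frame totals hold roughly constant through the epoch rather than growing, which is consistent with exponential forgetting: if the run's reset_interval of 200 drives the decay, the steady-state window is about 200 batches × ~13k frames ≈ 2.6e6 frames, matching the totals above. A sketch under that assumption (icefall's exact update rule may differ):

```python
class RunningLoss:
    """Frame-weighted running loss with exponential forgetting, in the
    spirit of the tot_loss[... over ~2.58e6 frames] fields above."""

    def __init__(self, reset_interval: int = 200):  # assumed link to the config value
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # Old batches fade with weight decay**age; the steady-state frame
        # mass is roughly reset_interval * batch_frames.
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```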
], batch size: 202, lr: 7.36e-03, grad_scale: 64.0 2024-06-20 05:18:03,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138545.0, ans=0.125 2024-06-20 05:18:05,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138563.33333333334, ans=0.1 2024-06-20 05:18:09,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=138563.33333333334, ans=0.125 2024-06-20 05:18:17,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138581.66666666666, ans=0.1 2024-06-20 05:18:21,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.740e+02 1.921e+02 2.083e+02 3.210e+02, threshold=3.841e+02, percent-clipped=0.0 2024-06-20 05:18:25,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=138618.33333333334, ans=0.125 2024-06-20 05:18:35,901 INFO [train.py:1028] (0/2) Epoch 8, batch 4800, loss[loss=0.2294, simple_loss=0.2722, pruned_loss=0.09328, over 13274.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.2765, pruned_loss=0.1043, over 2577220.79 frames. ], batch size: 63, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:18:47,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=138655.0, ans=0.125 2024-06-20 05:18:54,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.79 vs. limit=10.0 2024-06-20 05:18:59,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.82 vs. limit=15.0 2024-06-20 05:19:00,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=138691.66666666666, ans=0.05 2024-06-20 05:19:08,481 INFO [train.py:1028] (0/2) Epoch 8, batch 4850, loss[loss=0.2443, simple_loss=0.2736, pruned_loss=0.1075, over 13233.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2761, pruned_loss=0.1042, over 2575607.57 frames. ], batch size: 89, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:19:08,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=138728.33333333334, ans=0.0 2024-06-20 05:19:09,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.51 vs. limit=22.5 2024-06-20 05:19:11,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=138728.33333333334, ans=0.0 2024-06-20 05:19:11,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=138728.33333333334, ans=0.0 2024-06-20 05:19:11,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=138728.33333333334, ans=10.0 2024-06-20 05:19:15,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.52 vs. 
limit=5.0 2024-06-20 05:19:16,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.14 vs. limit=22.5 2024-06-20 05:19:17,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=12.0 2024-06-20 05:19:20,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.44 vs. limit=22.5 2024-06-20 05:19:21,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=138765.0, ans=0.025 2024-06-20 05:19:28,986 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.05 vs. limit=15.0 2024-06-20 05:19:31,437 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.849e+02 2.019e+02 2.279e+02 3.031e+02, threshold=4.038e+02, percent-clipped=0.0 2024-06-20 05:19:32,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=138783.33333333334, ans=0.125 2024-06-20 05:19:32,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=138783.33333333334, ans=0.125 2024-06-20 05:19:42,208 INFO [train.py:1028] (0/2) Epoch 8, batch 4900, loss[loss=0.2338, simple_loss=0.2703, pruned_loss=0.09863, over 13156.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.2756, pruned_loss=0.1039, over 2577531.09 frames. ], batch size: 59, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:19:42,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=138820.0, ans=0.125 2024-06-20 05:19:50,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=138820.0, ans=0.5 2024-06-20 05:19:58,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.53 vs. limit=22.5 2024-06-20 05:20:04,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=138856.66666666666, ans=0.125 2024-06-20 05:20:10,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=138875.0, ans=0.0 2024-06-20 05:20:14,234 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:20:18,491 INFO [train.py:1028] (0/2) Epoch 8, batch 4950, loss[loss=0.2716, simple_loss=0.2891, pruned_loss=0.127, over 11094.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.2763, pruned_loss=0.1045, over 2573314.43 frames. ], batch size: 304, lr: 7.35e-03, grad_scale: 64.0 2024-06-20 05:20:20,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=138911.66666666666, ans=0.0 2024-06-20 05:20:26,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=138930.0, ans=0.125 2024-06-20 05:20:27,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.57 vs. 
limit=10.0 2024-06-20 05:20:44,319 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.741e+02 1.903e+02 2.103e+02 2.533e+02, threshold=3.806e+02, percent-clipped=0.0 2024-06-20 05:20:54,750 INFO [train.py:1028] (0/2) Epoch 8, batch 5000, loss[loss=0.2523, simple_loss=0.2814, pruned_loss=0.1116, over 13225.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.2751, pruned_loss=0.1035, over 2576374.16 frames. ], batch size: 95, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:20:55,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=139003.33333333334, ans=0.125 2024-06-20 05:20:56,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=139003.33333333334, ans=0.0 2024-06-20 05:21:00,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=139003.33333333334, ans=0.0 2024-06-20 05:21:04,334 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.068e+02 2024-06-20 05:21:20,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=12.0 2024-06-20 05:21:23,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=139076.66666666666, ans=0.0 2024-06-20 05:21:24,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=139076.66666666666, ans=0.0 2024-06-20 05:21:26,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=139076.66666666666, ans=0.04949747468305833 2024-06-20 05:21:28,521 INFO [train.py:1028] (0/2) Epoch 8, batch 5050, loss[loss=0.2619, simple_loss=0.2875, pruned_loss=0.1182, over 12841.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.2747, pruned_loss=0.1031, over 2576294.37 frames. ], batch size: 36, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:21:29,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139095.0, ans=0.1 2024-06-20 05:21:30,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=139095.0, ans=0.125 2024-06-20 05:21:33,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=139095.0, ans=0.1 2024-06-20 05:21:38,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=139113.33333333334, ans=0.125 2024-06-20 05:21:41,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=139113.33333333334, ans=0.0 2024-06-20 05:21:43,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.07 vs. 
limit=10.0 2024-06-20 05:21:44,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=139113.33333333334, ans=0.125 2024-06-20 05:21:45,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=139131.66666666666, ans=0.05 2024-06-20 05:21:48,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=139131.66666666666, ans=0.125 2024-06-20 05:21:54,740 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.734e+02 1.926e+02 2.133e+02 3.181e+02, threshold=3.853e+02, percent-clipped=0.0 2024-06-20 05:21:58,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=139168.33333333334, ans=0.125 2024-06-20 05:22:05,348 INFO [train.py:1028] (0/2) Epoch 8, batch 5100, loss[loss=0.2434, simple_loss=0.283, pruned_loss=0.1018, over 12977.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2751, pruned_loss=0.1036, over 2572567.61 frames. ], batch size: 39, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:22:34,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=139260.0, ans=0.025 2024-06-20 05:22:36,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=139260.0, ans=0.0 2024-06-20 05:22:41,122 INFO [train.py:1028] (0/2) Epoch 8, batch 5150, loss[loss=0.2314, simple_loss=0.2601, pruned_loss=0.1014, over 13079.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2746, pruned_loss=0.1039, over 2573647.39 frames. ], batch size: 132, lr: 7.34e-03, grad_scale: 64.0 2024-06-20 05:22:41,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139278.33333333334, ans=0.1 2024-06-20 05:22:43,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=139278.33333333334, ans=0.125 2024-06-20 05:22:49,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=139296.66666666666, ans=0.125 2024-06-20 05:22:54,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139315.0, ans=0.1 2024-06-20 05:22:58,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139315.0, ans=0.1 2024-06-20 05:23:00,148 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-76000.pt 2024-06-20 05:23:08,287 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.791e+02 1.973e+02 2.192e+02 3.813e+02, threshold=3.947e+02, percent-clipped=0.0 2024-06-20 05:23:18,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139351.66666666666, ans=0.1 2024-06-20 05:23:19,085 INFO [train.py:1028] (0/2) Epoch 8, batch 5200, loss[loss=0.2348, simple_loss=0.2706, pruned_loss=0.09951, over 13134.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2741, pruned_loss=0.1036, over 2577056.09 frames. 
], batch size: 95, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:23:19,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=139370.0, ans=0.125 2024-06-20 05:23:22,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139370.0, ans=0.1 2024-06-20 05:23:54,774 INFO [train.py:1028] (0/2) Epoch 8, batch 5250, loss[loss=0.2377, simple_loss=0.273, pruned_loss=0.1012, over 13239.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2744, pruned_loss=0.1038, over 2571576.08 frames. ], batch size: 52, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:23:54,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=139461.66666666666, ans=0.125 2024-06-20 05:23:56,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=139461.66666666666, ans=0.1 2024-06-20 05:23:57,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=139461.66666666666, ans=0.2 2024-06-20 05:24:14,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=139516.66666666666, ans=0.025 2024-06-20 05:24:15,812 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:24:16,941 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.725e+02 1.843e+02 1.998e+02 2.562e+02, threshold=3.687e+02, percent-clipped=0.0 2024-06-20 05:24:17,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139516.66666666666, ans=0.0 2024-06-20 05:24:18,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.04 vs. limit=6.0 2024-06-20 05:24:29,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.92 vs. limit=15.0 2024-06-20 05:24:29,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0 2024-06-20 05:24:30,507 INFO [train.py:1028] (0/2) Epoch 8, batch 5300, loss[loss=0.2371, simple_loss=0.2674, pruned_loss=0.1034, over 13044.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.274, pruned_loss=0.1036, over 2567329.85 frames. ], batch size: 144, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:24:31,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=139553.33333333334, ans=0.125 2024-06-20 05:24:35,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=139553.33333333334, ans=0.0 2024-06-20 05:24:53,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.17 vs. 
limit=15.0 2024-06-20 05:24:57,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=139626.66666666666, ans=0.125 2024-06-20 05:24:58,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=139626.66666666666, ans=0.125 2024-06-20 05:24:59,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=139626.66666666666, ans=0.125 2024-06-20 05:24:59,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0 2024-06-20 05:25:02,996 INFO [train.py:1028] (0/2) Epoch 8, batch 5350, loss[loss=0.2788, simple_loss=0.3038, pruned_loss=0.1269, over 11494.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2742, pruned_loss=0.1037, over 2573655.89 frames. ], batch size: 16, lr: 7.33e-03, grad_scale: 64.0 2024-06-20 05:25:08,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.38 vs. limit=22.5 2024-06-20 05:25:08,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=15.0 2024-06-20 05:25:09,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=139663.33333333334, ans=0.04949747468305833 2024-06-20 05:25:14,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=139663.33333333334, ans=0.125 2024-06-20 05:25:14,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=139663.33333333334, ans=0.125 2024-06-20 05:25:20,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.16 vs. limit=6.0 2024-06-20 05:25:21,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=139681.66666666666, ans=0.125 2024-06-20 05:25:24,749 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.717e+02 1.878e+02 2.102e+02 2.786e+02, threshold=3.757e+02, percent-clipped=0.0 2024-06-20 05:25:27,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=139700.0, ans=0.125 2024-06-20 05:25:38,469 INFO [train.py:1028] (0/2) Epoch 8, batch 5400, loss[loss=0.2487, simple_loss=0.2731, pruned_loss=0.1121, over 12221.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.2745, pruned_loss=0.1044, over 2566227.75 frames. ], batch size: 240, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:25:40,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139736.66666666666, ans=0.0 2024-06-20 05:25:47,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. 
limit=15.0 2024-06-20 05:25:47,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139755.0, ans=0.1 2024-06-20 05:26:00,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=139791.66666666666, ans=0.125 2024-06-20 05:26:07,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=139810.0, ans=0.125 2024-06-20 05:26:11,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2024-06-20 05:26:12,270 INFO [train.py:1028] (0/2) Epoch 8, batch 5450, loss[loss=0.2291, simple_loss=0.2686, pruned_loss=0.09481, over 12886.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2748, pruned_loss=0.1041, over 2569972.22 frames. ], batch size: 26, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:26:13,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=139828.33333333334, ans=0.125 2024-06-20 05:26:13,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=139828.33333333334, ans=0.0 2024-06-20 05:26:26,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=139846.66666666666, ans=0.125 2024-06-20 05:26:37,353 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.762e+02 1.942e+02 2.178e+02 3.493e+02, threshold=3.884e+02, percent-clipped=0.0 2024-06-20 05:26:42,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=12.0 2024-06-20 05:26:45,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139901.66666666666, ans=0.1 2024-06-20 05:26:45,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.50 vs. limit=22.5 2024-06-20 05:26:45,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=139901.66666666666, ans=15.0 2024-06-20 05:26:47,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=139920.0, ans=0.0 2024-06-20 05:26:48,110 INFO [train.py:1028] (0/2) Epoch 8, batch 5500, loss[loss=0.2632, simple_loss=0.2849, pruned_loss=0.1207, over 12204.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2747, pruned_loss=0.1042, over 2564304.25 frames. 
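Note: the grad_scale field (64.0 through most of this stretch, 128.0 both earlier and later in the section) is the dynamic loss-scaling factor of the fp16 run: the scale doubles after a long run of overflow-free steps and is cut back when gradients overflow. icefall maintains its own scaler; the stock torch.cuda.amp one below behaves analogously and is only a sketch of the mechanism:

```python
import torch

# Small initial scale, doubled every 2000 clean steps (stock defaults differ).
scaler = torch.cuda.amp.GradScaler(init_scale=1.0, growth_interval=2000)

def train_step(model, batch, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(batch)          # forward in fp16
    scaler.scale(loss).backward()    # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)           # unscales grads, skips the step on inf/nan
    scaler.update()                  # grows or shrinks the scale dynamically
    return loss.detach(), scaler.get_scale()  # cf. grad_scale: 64.0 in the log
```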
], batch size: 240, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:26:56,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=139938.33333333334, ans=0.125 2024-06-20 05:27:10,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=139975.0, ans=0.025 2024-06-20 05:27:11,639 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=2.605e-03 2024-06-20 05:27:12,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=139975.0, ans=0.125 2024-06-20 05:27:21,074 INFO [train.py:1028] (0/2) Epoch 8, batch 5550, loss[loss=0.2502, simple_loss=0.2891, pruned_loss=0.1056, over 13221.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2745, pruned_loss=0.1039, over 2567535.74 frames. ], batch size: 43, lr: 7.32e-03, grad_scale: 64.0 2024-06-20 05:27:21,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=140011.66666666666, ans=0.125 2024-06-20 05:27:25,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=140011.66666666666, ans=0.125 2024-06-20 05:27:42,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=140048.33333333334, ans=0.0 2024-06-20 05:27:45,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=140066.66666666666, ans=0.125 2024-06-20 05:27:45,359 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=1.872e+02 2024-06-20 05:27:46,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140066.66666666666, ans=0.125 2024-06-20 05:27:47,892 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.818e+02 2.048e+02 2.320e+02 3.177e+02, threshold=4.095e+02, percent-clipped=0.0 2024-06-20 05:27:49,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=140066.66666666666, ans=22.5 2024-06-20 05:27:52,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=140085.0, ans=0.1 2024-06-20 05:27:57,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=140103.33333333334, ans=0.125 2024-06-20 05:27:58,304 INFO [train.py:1028] (0/2) Epoch 8, batch 5600, loss[loss=0.2372, simple_loss=0.2666, pruned_loss=0.1039, over 13257.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2736, pruned_loss=0.1032, over 2569723.76 frames. 
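Note: the slowly decaying lr field (7.41e-03 at batch 3650 down to 7.31e-03 by batch 5650) comes from the Eden schedule used by the zipformer recipes, driven by this run's base_lr=0.035, lr_batches=7500, and lr_epochs=3.5. The form below is reproduced from the recipe and should be treated as an approximation if the upstream code has changed:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Eden schedule: smooth inverse-quartic decay in both batch and epoch."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Around this point in the log: global batch ~76000, and Eden sees a
# fractional epoch (~7.5 during epoch 8).
print(f"{eden_lr(0.035, 76000, 7.5):.2e}")  # ~7.1e-03, close to the printed 7.3e-03
```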
], batch size: 89, lr: 7.31e-03, grad_scale: 64.0 2024-06-20 05:28:02,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140103.33333333334, ans=0.125 2024-06-20 05:28:14,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140140.0, ans=0.1 2024-06-20 05:28:19,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=140158.33333333334, ans=0.07 2024-06-20 05:28:20,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.15 vs. limit=15.0 2024-06-20 05:28:22,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=140158.33333333334, ans=0.125 2024-06-20 05:28:32,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=140176.66666666666, ans=0.125 2024-06-20 05:28:34,387 INFO [train.py:1028] (0/2) Epoch 8, batch 5650, loss[loss=0.2698, simple_loss=0.2929, pruned_loss=0.1234, over 12604.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2734, pruned_loss=0.103, over 2575597.76 frames. ], batch size: 202, lr: 7.31e-03, grad_scale: 64.0 2024-06-20 05:28:41,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=15.0 2024-06-20 05:28:51,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140231.66666666666, ans=0.125 2024-06-20 05:28:54,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=140250.0, ans=0.125 2024-06-20 05:28:54,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.65 vs. limit=22.5 2024-06-20 05:28:55,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140250.0, ans=0.125 2024-06-20 05:28:56,975 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.746e+02 1.903e+02 2.194e+02 3.581e+02, threshold=3.805e+02, percent-clipped=0.0 2024-06-20 05:28:57,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=140250.0, ans=0.125 2024-06-20 05:28:59,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=140250.0, ans=0.125 2024-06-20 05:29:07,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2024-06-20 05:29:07,392 INFO [train.py:1028] (0/2) Epoch 8, batch 5700, loss[loss=0.2205, simple_loss=0.2669, pruned_loss=0.08711, over 13265.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.2732, pruned_loss=0.1027, over 2578977.35 frames. 
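Note: the scaling.py:1023 "Whitening: ... metric=X vs. limit=Y" lines fire when a module's activation covariance is markedly less white than its scheduled limit allows; the Whiten modules then nudge activations back toward an identity-like covariance. Below is one standard whiteness statistic with the right fixed point, close in spirit to (but not necessarily identical in normalization with) icefall's:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """num_channels * ||C||_F^2 / trace(C)^2 for the covariance C of x.

    Equals 1.0 iff C is a multiple of the identity (perfectly white) and
    grows as energy concentrates in few directions; compare the
    metric=... vs. limit=... pairs in the log.
    """
    x = x.reshape(-1, x.shape[-1])          # (frames, channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    num_channels = cov.shape[0]
    return (num_channels * (cov ** 2).sum() / cov.trace() ** 2).item()

x = torch.randn(1000, 4) @ torch.randn(4, 384)  # rank-4 (heavily correlated)
print(whitening_metric(x))  # ~96: far above limits like 7.5 or 15.0
```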
], batch size: 63, lr: 7.31e-03, grad_scale: 128.0 2024-06-20 05:29:15,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=140305.0, ans=0.125 2024-06-20 05:29:16,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140305.0, ans=0.1 2024-06-20 05:29:44,155 INFO [train.py:1028] (0/2) Epoch 8, batch 5750, loss[loss=0.2485, simple_loss=0.2707, pruned_loss=0.1132, over 12731.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2738, pruned_loss=0.1029, over 2579013.26 frames. ], batch size: 176, lr: 7.31e-03, grad_scale: 128.0 2024-06-20 05:29:44,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=140378.33333333334, ans=0.0 2024-06-20 05:29:48,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=140378.33333333334, ans=0.07 2024-06-20 05:29:54,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140396.66666666666, ans=0.1 2024-06-20 05:30:03,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=140433.33333333334, ans=0.025 2024-06-20 05:30:06,492 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.788e+02 1.951e+02 2.216e+02 3.537e+02, threshold=3.903e+02, percent-clipped=0.0 2024-06-20 05:30:12,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=140451.66666666666, ans=0.2 2024-06-20 05:30:20,682 INFO [train.py:1028] (0/2) Epoch 8, batch 5800, loss[loss=0.2487, simple_loss=0.287, pruned_loss=0.1051, over 12739.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.2754, pruned_loss=0.1041, over 2578699.25 frames. ], batch size: 176, lr: 7.31e-03, grad_scale: 128.0 2024-06-20 05:30:35,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=140506.66666666666, ans=0.04949747468305833 2024-06-20 05:30:45,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=140525.0, ans=0.125 2024-06-20 05:30:46,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=140525.0, ans=0.025 2024-06-20 05:30:49,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=140543.33333333334, ans=0.2 2024-06-20 05:30:54,237 INFO [train.py:1028] (0/2) Epoch 8, batch 5850, loss[loss=0.2767, simple_loss=0.2987, pruned_loss=0.1273, over 12477.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.2776, pruned_loss=0.1054, over 2576857.18 frames. ], batch size: 202, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:31:05,559 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=17.21 vs. limit=15.0 2024-06-20 05:31:15,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.20 vs. 
limit=22.5 2024-06-20 05:31:16,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=140616.66666666666, ans=0.2 2024-06-20 05:31:17,533 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.965e+02 2.248e+02 2.647e+02 4.334e+02, threshold=4.497e+02, percent-clipped=1.0 2024-06-20 05:31:26,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=140635.0, ans=0.0 2024-06-20 05:31:27,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=140653.33333333334, ans=0.04949747468305833 2024-06-20 05:31:28,250 INFO [train.py:1028] (0/2) Epoch 8, batch 5900, loss[loss=0.239, simple_loss=0.2673, pruned_loss=0.1053, over 13094.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2804, pruned_loss=0.1066, over 2576990.88 frames. ], batch size: 121, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:31:35,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-20 05:31:44,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.90 vs. limit=6.0 2024-06-20 05:31:44,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.08 vs. limit=22.5 2024-06-20 05:31:45,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.12 vs. limit=15.0 2024-06-20 05:31:51,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=140708.33333333334, ans=0.125 2024-06-20 05:31:53,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=140708.33333333334, ans=0.125 2024-06-20 05:31:57,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140726.66666666666, ans=0.125 2024-06-20 05:32:04,698 INFO [train.py:1028] (0/2) Epoch 8, batch 5950, loss[loss=0.2417, simple_loss=0.2685, pruned_loss=0.1075, over 13114.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.2817, pruned_loss=0.1071, over 2581644.06 frames. 
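Note: the wide swings in batch size across this section (16 cuts at batch 5350, 240 at batch 5400, up to 305 elsewhere) are the duration-capped bucketing sampler at work: each batch is filled to roughly max_duration=550 seconds, so batches drawn from short-utterance buckets contain many cuts and batches from long-utterance buckets few. A toy version of the packing rule; lhotse's DynamicBucketingSampler additionally shuffles, buckets by duration quantile, and handles distributed ranks:

```python
def duration_capped_batches(durations, max_duration=550.0):
    """Greedily pack utterances so each batch stays under max_duration seconds.

    Sorting by length first mimics bucketing: similar-length cuts land
    together, padding waste stays low, and batch sizes vary widely.
    """
    batches, batch, total = [], [], 0.0
    for dur in sorted(durations):
        if total + dur > max_duration and batch:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(dur)
        total += dur
    if batch:
        batches.append(batch)
    return batches

batches = duration_capped_batches([2.0] * 300 + [30.0] * 40)
print([len(b) for b in batches])  # [275, 41, 18, 6]: many short cuts, few long ones
```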
], batch size: 121, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:32:15,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=140763.33333333334, ans=0.0 2024-06-20 05:32:30,703 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.882e+02 2.019e+02 2.236e+02 3.501e+02, threshold=4.039e+02, percent-clipped=0.0 2024-06-20 05:32:38,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=140818.33333333334, ans=0.0 2024-06-20 05:32:38,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=140818.33333333334, ans=0.125 2024-06-20 05:32:40,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140818.33333333334, ans=0.0 2024-06-20 05:32:41,572 INFO [train.py:1028] (0/2) Epoch 8, batch 6000, loss[loss=0.3024, simple_loss=0.3209, pruned_loss=0.1419, over 12269.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.2832, pruned_loss=0.1079, over 2574270.93 frames. ], batch size: 241, lr: 7.30e-03, grad_scale: 128.0 2024-06-20 05:32:41,572 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 05:32:49,447 INFO [train.py:1060] (0/2) Epoch 8, validation: loss=0.2064, simple_loss=0.2689, pruned_loss=0.07195, over 351949.00 frames. 2024-06-20 05:32:49,447 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17148MB 2024-06-20 05:32:49,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=140836.66666666666, ans=0.2 2024-06-20 05:32:52,104 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2024-06-20 05:32:52,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0 2024-06-20 05:32:55,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=140836.66666666666, ans=0.2 2024-06-20 05:33:24,052 INFO [train.py:1028] (0/2) Epoch 8, batch 6050, loss[loss=0.2172, simple_loss=0.2618, pruned_loss=0.08631, over 12958.00 frames. ], tot_loss[loss=0.251, simple_loss=0.2848, pruned_loss=0.1086, over 2577442.53 frames. ], batch size: 39, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:33:36,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.99 vs. limit=6.0 2024-06-20 05:33:43,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.92 vs. limit=15.0 2024-06-20 05:33:45,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. 
limit=15.0 2024-06-20 05:33:46,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=140983.33333333334, ans=0.125 2024-06-20 05:33:50,089 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.839e+02 2.010e+02 2.266e+02 3.304e+02, threshold=4.020e+02, percent-clipped=0.0 2024-06-20 05:33:54,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141001.66666666666, ans=0.1 2024-06-20 05:34:00,724 INFO [train.py:1028] (0/2) Epoch 8, batch 6100, loss[loss=0.2341, simple_loss=0.2646, pruned_loss=0.1018, over 13091.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.2858, pruned_loss=0.1088, over 2579084.88 frames. ], batch size: 121, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:34:08,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=141038.33333333334, ans=0.125 2024-06-20 05:34:12,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=141038.33333333334, ans=0.035 2024-06-20 05:34:14,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=141056.66666666666, ans=10.0 2024-06-20 05:34:21,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=141075.0, ans=0.125 2024-06-20 05:34:26,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=141075.0, ans=0.125 2024-06-20 05:34:26,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.17 vs. limit=22.5 2024-06-20 05:34:26,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=141075.0, ans=0.0 2024-06-20 05:34:37,545 INFO [train.py:1028] (0/2) Epoch 8, batch 6150, loss[loss=0.2739, simple_loss=0.2895, pruned_loss=0.1291, over 10813.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.287, pruned_loss=0.1092, over 2577568.46 frames. ], batch size: 303, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:34:47,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.82 vs. limit=22.5 2024-06-20 05:34:54,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.79 vs. limit=15.0 2024-06-20 05:35:00,505 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.965e+02 2.194e+02 2.508e+02 4.834e+02, threshold=4.388e+02, percent-clipped=2.0 2024-06-20 05:35:09,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=141185.0, ans=0.025 2024-06-20 05:35:11,078 INFO [train.py:1028] (0/2) Epoch 8, batch 6200, loss[loss=0.2869, simple_loss=0.3233, pruned_loss=0.1252, over 13288.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.2897, pruned_loss=0.1108, over 2574802.54 frames. 
], batch size: 89, lr: 7.29e-03, grad_scale: 128.0 2024-06-20 05:35:13,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=141203.33333333334, ans=0.025 2024-06-20 05:35:27,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0 2024-06-20 05:35:28,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=141240.0, ans=0.125 2024-06-20 05:35:36,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=141258.33333333334, ans=0.0 2024-06-20 05:35:37,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=141258.33333333334, ans=0.125 2024-06-20 05:35:41,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=141276.66666666666, ans=0.035 2024-06-20 05:35:42,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.52 vs. limit=10.0 2024-06-20 05:35:48,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=15.0 2024-06-20 05:35:48,651 INFO [train.py:1028] (0/2) Epoch 8, batch 6250, loss[loss=0.2538, simple_loss=0.287, pruned_loss=0.1103, over 13205.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2908, pruned_loss=0.1113, over 2567669.58 frames. ], batch size: 83, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:35:49,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=141295.0, ans=0.125 2024-06-20 05:35:50,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=141295.0, ans=0.0 2024-06-20 05:36:04,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=141331.66666666666, ans=0.95 2024-06-20 05:36:06,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=15.0 2024-06-20 05:36:10,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=141350.0, ans=0.1 2024-06-20 05:36:11,464 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.893e+02 2.016e+02 2.226e+02 3.019e+02, threshold=4.032e+02, percent-clipped=0.0 2024-06-20 05:36:11,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=141350.0, ans=0.0 2024-06-20 05:36:21,855 INFO [train.py:1028] (0/2) Epoch 8, batch 6300, loss[loss=0.2664, simple_loss=0.3071, pruned_loss=0.1128, over 11064.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2917, pruned_loss=0.1114, over 2563911.98 frames. 
], batch size: 16, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:36:38,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=141423.33333333334, ans=0.0 2024-06-20 05:36:46,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.67 vs. limit=15.0 2024-06-20 05:36:49,531 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-20 05:36:54,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=141460.0, ans=0.0 2024-06-20 05:36:58,083 INFO [train.py:1028] (0/2) Epoch 8, batch 6350, loss[loss=0.2802, simple_loss=0.3115, pruned_loss=0.1245, over 12630.00 frames. ], tot_loss[loss=0.258, simple_loss=0.293, pruned_loss=0.1115, over 2574585.88 frames. ], batch size: 202, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:37:01,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=12.0 2024-06-20 05:37:04,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=141496.66666666666, ans=0.0 2024-06-20 05:37:04,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=141496.66666666666, ans=0.125 2024-06-20 05:37:09,971 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:37:16,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141515.0, ans=0.125 2024-06-20 05:37:20,363 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.860e+02 2.051e+02 2.229e+02 3.594e+02, threshold=4.103e+02, percent-clipped=0.0 2024-06-20 05:37:29,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=141551.66666666666, ans=0.0 2024-06-20 05:37:30,998 INFO [train.py:1028] (0/2) Epoch 8, batch 6400, loss[loss=0.2607, simple_loss=0.2997, pruned_loss=0.1108, over 13168.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.295, pruned_loss=0.1123, over 2575722.52 frames. ], batch size: 67, lr: 7.28e-03, grad_scale: 128.0 2024-06-20 05:37:31,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=141570.0, ans=0.125 2024-06-20 05:37:33,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.00 vs. limit=15.0 2024-06-20 05:37:34,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141570.0, ans=0.125 2024-06-20 05:37:36,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.91 vs. 
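limit=12.0

Two ScheduledFloat records just below (the ...whitening_limit entries with ans=15.0) show that even the whitening limits are scheduled quantities. As far as I can tell from the zipformer recipe's scaling.py, each ScheduledFloat is a piecewise-linear function of batch_count, and the ans= field logs its current interpolated value; by batch_count ~141600 most schedules appear to have flattened to their final constant, which would explain why the same ans= values recur unchanged from record to record. A simplified stand-in (illustrative, not the real scaling.ScheduledFloat):

    import bisect

    class ScheduledFloatSketch:
        """Piecewise-linear in batch_count, e.g. (0, 0.3) -> (20000, 0.1)."""
        def __init__(self, *points):
            self.xs = [x for x, _ in points]
            self.ys = [y for _, y in points]

        def value(self, batch_count):
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # Breakpoints here are made up for illustration.
    dropout = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
    assert dropout.value(141588.0) == 0.1  # long past the last breakpoint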
2024-06-20 05:37:42,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=141588.33333333334, ans=0.1 2024-06-20 05:37:43,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=141588.33333333334, ans=0.0 2024-06-20 05:37:44,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.29 vs. limit=15.0 2024-06-20 05:37:44,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=141588.33333333334, ans=0.125 2024-06-20 05:37:49,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=141606.66666666666, ans=15.0 2024-06-20 05:37:56,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141625.0, ans=0.125 2024-06-20 05:38:03,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=141643.33333333334, ans=0.125 2024-06-20 05:38:06,406 INFO [train.py:1028] (0/2) Epoch 8, batch 6450, loss[loss=0.3168, simple_loss=0.3406, pruned_loss=0.1466, over 12578.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.2969, pruned_loss=0.1132, over 2581003.26 frames. ], batch size: 202, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:38:12,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.17 vs. limit=15.0 2024-06-20 05:38:21,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=141698.33333333334, ans=0.125 2024-06-20 05:38:22,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. limit=10.0 2024-06-20 05:38:23,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=141698.33333333334, ans=0.0 2024-06-20 05:38:28,718 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.978e+02 2.153e+02 2.645e+02 4.398e+02, threshold=4.307e+02, percent-clipped=1.0 2024-06-20 05:38:30,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=141716.66666666666, ans=0.95 2024-06-20 05:38:31,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=141716.66666666666, ans=15.0 2024-06-20 05:38:42,534 INFO [train.py:1028] (0/2) Epoch 8, batch 6500, loss[loss=0.3054, simple_loss=0.3235, pruned_loss=0.1436, over 10665.00 frames. ], tot_loss[loss=0.264, simple_loss=0.2994, pruned_loss=0.1143, over 2583424.38 frames.
], batch size: 303, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:38:47,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141753.33333333334, ans=0.1 2024-06-20 05:38:51,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141771.66666666666, ans=0.1 2024-06-20 05:39:02,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=141808.33333333334, ans=0.125 2024-06-20 05:39:09,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.01 vs. limit=15.0 2024-06-20 05:39:14,916 INFO [train.py:1028] (0/2) Epoch 8, batch 6550, loss[loss=0.2401, simple_loss=0.2797, pruned_loss=0.1003, over 12632.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3003, pruned_loss=0.1147, over 2587441.67 frames. ], batch size: 22, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:39:20,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=141863.33333333334, ans=0.125 2024-06-20 05:39:24,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0 2024-06-20 05:39:28,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=141881.66666666666, ans=0.0 2024-06-20 05:39:31,504 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:39:36,693 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.899e+02 2.049e+02 2.219e+02 3.326e+02, threshold=4.097e+02, percent-clipped=0.0 2024-06-20 05:39:38,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=141900.0, ans=0.125 2024-06-20 05:39:52,240 INFO [train.py:1028] (0/2) Epoch 8, batch 6600, loss[loss=0.2461, simple_loss=0.2863, pruned_loss=0.1029, over 13258.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.2999, pruned_loss=0.1142, over 2590133.35 frames. ], batch size: 72, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:39:53,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=141936.66666666666, ans=0.2 2024-06-20 05:40:04,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141955.0, ans=0.1 2024-06-20 05:40:06,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=141973.33333333334, ans=0.0 2024-06-20 05:40:15,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141991.66666666666, ans=0.125 2024-06-20 05:40:16,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=141991.66666666666, ans=0.125 2024-06-20 05:40:16,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.57 vs. 
limit=15.0 2024-06-20 05:40:19,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=142010.0, ans=0.125 2024-06-20 05:40:23,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142010.0, ans=0.125 2024-06-20 05:40:25,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=12.0 2024-06-20 05:40:26,262 INFO [train.py:1028] (0/2) Epoch 8, batch 6650, loss[loss=0.2857, simple_loss=0.3143, pruned_loss=0.1285, over 12955.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3023, pruned_loss=0.1154, over 2583213.50 frames. ], batch size: 158, lr: 7.27e-03, grad_scale: 128.0 2024-06-20 05:40:29,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=142028.33333333334, ans=22.5 2024-06-20 05:40:40,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=142065.0, ans=0.125 2024-06-20 05:40:40,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142065.0, ans=0.1 2024-06-20 05:40:52,989 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.959e+02 2.113e+02 2.309e+02 3.321e+02, threshold=4.226e+02, percent-clipped=0.0 2024-06-20 05:41:03,822 INFO [train.py:1028] (0/2) Epoch 8, batch 6700, loss[loss=0.2982, simple_loss=0.3222, pruned_loss=0.1371, over 12710.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3046, pruned_loss=0.1166, over 2583206.20 frames. ], batch size: 176, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:41:12,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=15.0 2024-06-20 05:41:17,614 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:41:18,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=142156.66666666666, ans=0.04949747468305833 2024-06-20 05:41:19,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=142156.66666666666, ans=0.125 2024-06-20 05:41:26,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=142175.0, ans=0.125 2024-06-20 05:41:26,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=142175.0, ans=0.0 2024-06-20 05:41:29,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2024-06-20 05:41:34,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=142193.33333333334, ans=0.0 2024-06-20 05:41:37,445 INFO [train.py:1028] (0/2) Epoch 8, batch 6750, loss[loss=0.3964, simple_loss=0.3872, pruned_loss=0.2028, over 12103.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3057, pruned_loss=0.1178, over 2576714.72 frames. 
], batch size: 240, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:41:40,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=142211.66666666666, ans=0.125 2024-06-20 05:41:44,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=142230.0, ans=0.125 2024-06-20 05:42:02,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.939e+02 2.153e+02 2.424e+02 3.567e+02, threshold=4.305e+02, percent-clipped=0.0 2024-06-20 05:42:05,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=142266.66666666666, ans=0.125 2024-06-20 05:42:09,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=142285.0, ans=0.125 2024-06-20 05:42:13,162 INFO [train.py:1028] (0/2) Epoch 8, batch 6800, loss[loss=0.254, simple_loss=0.2928, pruned_loss=0.1076, over 13168.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.306, pruned_loss=0.1176, over 2577909.60 frames. ], batch size: 67, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:42:26,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=142340.0, ans=0.1 2024-06-20 05:42:30,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=142340.0, ans=0.125 2024-06-20 05:42:30,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=142340.0, ans=0.125 2024-06-20 05:42:45,577 INFO [train.py:1028] (0/2) Epoch 8, batch 6850, loss[loss=0.2888, simple_loss=0.3418, pruned_loss=0.1179, over 13265.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3065, pruned_loss=0.1174, over 2582104.68 frames. ], batch size: 63, lr: 7.26e-03, grad_scale: 128.0 2024-06-20 05:42:57,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=142413.33333333334, ans=0.125 2024-06-20 05:43:07,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.03 vs. limit=22.5 2024-06-20 05:43:12,879 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.885e+02 2.030e+02 2.236e+02 2.788e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-20 05:43:17,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.45 vs. limit=15.0 2024-06-20 05:43:23,556 INFO [train.py:1028] (0/2) Epoch 8, batch 6900, loss[loss=0.2727, simple_loss=0.3138, pruned_loss=0.1158, over 13328.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3074, pruned_loss=0.1178, over 2584245.80 frames. ], batch size: 49, lr: 7.25e-03, grad_scale: 128.0 2024-06-20 05:43:25,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.02 vs. 
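limit=15.0

One detail worth flagging here: every per-batch record in this excerpt up through batch 6950 logs grad_scale: 128.0, but from the batch 7000 record just below onward the value is 64.0 and stays there. With mixed-precision training, that pattern is the signature of dynamic loss scaling: a non-finite gradient causes the optimizer step to be skipped and the scale to be halved, while a long run of stable steps lets it grow again. A rough sketch of the standard mechanism (the constants are illustrative, not icefall's actual GradScaler settings):

    # Illustrative dynamic loss scaling, in the spirit of torch.cuda.amp.GradScaler.
    class LossScaleSketch:
        def __init__(self, scale=128.0, growth_factor=2.0, backoff_factor=0.5,
                     growth_interval=1000):
            self.scale = scale
            self.growth_factor = growth_factor
            self.backoff_factor = backoff_factor
            self.growth_interval = growth_interval
            self._good_steps = 0

        def update(self, found_inf):
            if found_inf:
                # Overflow: skip this optimizer step and halve the scale.
                self.scale *= self.backoff_factor
                self._good_steps = 0
            else:
                self._good_steps += 1
                if self._good_steps % self.growth_interval == 0:
                    self.scale *= self.growth_factor

    # A single overflow near batch 7000 takes the scale from 128.0 to 64.0,
    # matching the grad_scale field logged below.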
2024-06-20 05:43:35,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=142505.0, ans=0.2 2024-06-20 05:43:37,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=15.0 2024-06-20 05:43:40,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=142523.33333333334, ans=0.125 2024-06-20 05:43:43,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=142541.66666666666, ans=0.2 2024-06-20 05:43:46,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.68 vs. limit=15.0 2024-06-20 05:44:00,532 INFO [train.py:1028] (0/2) Epoch 8, batch 6950, loss[loss=0.2866, simple_loss=0.3146, pruned_loss=0.1293, over 11279.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3081, pruned_loss=0.1179, over 2579153.87 frames. ], batch size: 16, lr: 7.25e-03, grad_scale: 128.0 2024-06-20 05:44:03,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=142578.33333333334, ans=0.125 2024-06-20 05:44:05,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=142578.33333333334, ans=0.125 2024-06-20 05:44:18,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=142615.0, ans=0.0 2024-06-20 05:44:19,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142633.33333333334, ans=0.125 2024-06-20 05:44:21,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.99 vs. limit=10.0 2024-06-20 05:44:22,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142633.33333333334, ans=0.1 2024-06-20 05:44:22,717 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.895e+02 2.030e+02 2.279e+02 3.825e+02, threshold=4.060e+02, percent-clipped=0.0 2024-06-20 05:44:33,449 INFO [train.py:1028] (0/2) Epoch 8, batch 7000, loss[loss=0.2945, simple_loss=0.331, pruned_loss=0.129, over 12966.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3082, pruned_loss=0.1175, over 2574858.96 frames. ], batch size: 158, lr: 7.25e-03, grad_scale: 64.0 2024-06-20 05:44:42,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=142688.33333333334, ans=0.125 2024-06-20 05:45:00,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=142725.0, ans=0.0 2024-06-20 05:45:10,812 INFO [train.py:1028] (0/2) Epoch 8, batch 7050, loss[loss=0.3271, simple_loss=0.354, pruned_loss=0.15, over 12791.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3096, pruned_loss=0.118, over 2582413.58 frames.
], batch size: 176, lr: 7.25e-03, grad_scale: 64.0 2024-06-20 05:45:21,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=142780.0, ans=0.025 2024-06-20 05:45:21,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=142780.0, ans=0.125 2024-06-20 05:45:22,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=142780.0, ans=10.0 2024-06-20 05:45:29,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2024-06-20 05:45:33,348 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.986e+02 2.162e+02 2.563e+02 3.817e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-20 05:45:43,115 INFO [train.py:1028] (0/2) Epoch 8, batch 7100, loss[loss=0.3073, simple_loss=0.3446, pruned_loss=0.1351, over 13193.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3105, pruned_loss=0.1186, over 2574328.98 frames. ], batch size: 112, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:45:59,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.24 vs. limit=22.5 2024-06-20 05:46:10,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=142908.33333333334, ans=0.0 2024-06-20 05:46:19,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=142945.0, ans=0.5 2024-06-20 05:46:19,670 INFO [train.py:1028] (0/2) Epoch 8, batch 7150, loss[loss=0.331, simple_loss=0.3498, pruned_loss=0.1561, over 12529.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3109, pruned_loss=0.1186, over 2572220.23 frames. ], batch size: 203, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:46:21,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=142945.0, ans=0.0 2024-06-20 05:46:22,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2024-06-20 05:46:25,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=142963.33333333334, ans=0.5 2024-06-20 05:46:34,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=142981.66666666666, ans=0.125 2024-06-20 05:46:42,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=143000.0, ans=0.0 2024-06-20 05:46:43,682 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.903e+02 2.096e+02 2.339e+02 3.206e+02, threshold=4.191e+02, percent-clipped=0.0 2024-06-20 05:46:49,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2024-06-20 05:46:53,292 INFO [train.py:1028] (0/2) Epoch 8, batch 7200, loss[loss=0.2812, simple_loss=0.3171, pruned_loss=0.1226, over 13188.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3122, pruned_loss=0.1188, over 2578422.34 frames. 
], batch size: 112, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:46:56,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=143036.66666666666, ans=0.0 2024-06-20 05:47:06,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=143055.0, ans=0.125 2024-06-20 05:47:27,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143110.0, ans=0.1 2024-06-20 05:47:30,571 INFO [train.py:1028] (0/2) Epoch 8, batch 7250, loss[loss=0.2376, simple_loss=0.2816, pruned_loss=0.09685, over 12944.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3123, pruned_loss=0.1186, over 2579408.86 frames. ], batch size: 36, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:47:33,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=12.0 2024-06-20 05:47:33,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=143128.33333333334, ans=0.0 2024-06-20 05:47:36,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=143146.66666666666, ans=0.0 2024-06-20 05:47:53,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.876e+02 2.052e+02 2.258e+02 3.578e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 05:48:00,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=143201.66666666666, ans=0.125 2024-06-20 05:48:03,619 INFO [train.py:1028] (0/2) Epoch 8, batch 7300, loss[loss=0.2705, simple_loss=0.3067, pruned_loss=0.1171, over 13034.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3135, pruned_loss=0.1193, over 2579858.65 frames. ], batch size: 36, lr: 7.24e-03, grad_scale: 64.0 2024-06-20 05:48:04,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=143220.0, ans=0.0 2024-06-20 05:48:10,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=143220.0, ans=0.125 2024-06-20 05:48:14,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=143238.33333333334, ans=0.125 2024-06-20 05:48:16,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143238.33333333334, ans=0.1 2024-06-20 05:48:19,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=143238.33333333334, ans=0.0 2024-06-20 05:48:19,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143256.66666666666, ans=0.1 2024-06-20 05:48:22,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143256.66666666666, ans=0.125 2024-06-20 05:48:27,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.96 vs. 
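limit=22.5

The lr field is also worth reading alongside the losses: it decays within the epoch from 7.31e-03 near batch_count 140378 (top of this excerpt) to 7.23e-03 by the batch 7350 record just below (batch_count around 143330). That ratio is consistent with a learning rate proportional to batch_count**-0.5, which is what an Eden-style scheduler reduces to once the batch count is far past its knee (the epoch-dependent factor is constant within an epoch, so it drops out of the comparison). A quick arithmetic check:

    # lr ~ batch_count ** -0.5 within an epoch, checked on two records above/below.
    lr_a, n_a = 7.31e-3, 140378.0  # batch 5750 record
    lr_b, n_b = 7.23e-3, 143330.0  # batch 7350 record
    predicted = lr_a * (n_a / n_b) ** 0.5
    print(f"{predicted:.2e}")      # -> 7.23e-03, matching the logged lr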
2024-06-20 05:48:33,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=143293.33333333334, ans=0.125 2024-06-20 05:48:39,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=143311.66666666666, ans=0.0 2024-06-20 05:48:40,318 INFO [train.py:1028] (0/2) Epoch 8, batch 7350, loss[loss=0.2764, simple_loss=0.3204, pruned_loss=0.1162, over 13355.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3139, pruned_loss=0.1195, over 2581847.47 frames. ], batch size: 46, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:48:46,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=143330.0, ans=0.0 2024-06-20 05:48:47,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.15 vs. limit=12.0 2024-06-20 05:48:54,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.98 vs. limit=15.0 2024-06-20 05:48:57,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=143348.33333333334, ans=0.0 2024-06-20 05:49:00,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=143366.66666666666, ans=0.125 2024-06-20 05:49:01,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=143366.66666666666, ans=0.0 2024-06-20 05:49:03,056 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 1.924e+02 2.048e+02 2.252e+02 3.780e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-20 05:49:05,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=143385.0, ans=0.125 2024-06-20 05:49:10,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=143385.0, ans=0.025 2024-06-20 05:49:12,404 INFO [train.py:1028] (0/2) Epoch 8, batch 7400, loss[loss=0.2745, simple_loss=0.3194, pruned_loss=0.1148, over 13263.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3132, pruned_loss=0.1191, over 2587150.42 frames. ], batch size: 63, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:49:13,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=143403.33333333334, ans=0.125 2024-06-20 05:49:39,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=143458.33333333334, ans=0.0 2024-06-20 05:49:49,662 INFO [train.py:1028] (0/2) Epoch 8, batch 7450, loss[loss=0.2679, simple_loss=0.3094, pruned_loss=0.1131, over 12552.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3132, pruned_loss=0.1189, over 2581162.55 frames.
], batch size: 29, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:49:52,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143495.0, ans=0.1 2024-06-20 05:50:00,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=143513.33333333334, ans=0.125 2024-06-20 05:50:02,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=143513.33333333334, ans=0.125 2024-06-20 05:50:04,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=143531.66666666666, ans=0.0 2024-06-20 05:50:18,897 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.965e+02 2.100e+02 2.447e+02 3.091e+02, threshold=4.201e+02, percent-clipped=0.0 2024-06-20 05:50:28,784 INFO [train.py:1028] (0/2) Epoch 8, batch 7500, loss[loss=0.3008, simple_loss=0.3224, pruned_loss=0.1396, over 10574.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.315, pruned_loss=0.1198, over 2578129.81 frames. ], batch size: 303, lr: 7.23e-03, grad_scale: 64.0 2024-06-20 05:50:34,129 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=2.294e+01 2024-06-20 05:50:50,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.42 vs. limit=15.0 2024-06-20 05:50:52,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=143641.66666666666, ans=0.125 2024-06-20 05:51:01,361 INFO [train.py:1028] (0/2) Epoch 8, batch 7550, loss[loss=0.2463, simple_loss=0.2846, pruned_loss=0.104, over 12933.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3164, pruned_loss=0.1208, over 2577340.00 frames. ], batch size: 158, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:51:02,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143678.33333333334, ans=0.1 2024-06-20 05:51:10,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=143696.66666666666, ans=0.0 2024-06-20 05:51:17,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.78 vs. limit=22.5 2024-06-20 05:51:21,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=143715.0, ans=0.125 2024-06-20 05:51:21,856 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.042e+01 2024-06-20 05:51:22,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=15.0 2024-06-20 05:51:27,937 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.902e+02 2.095e+02 2.248e+02 3.118e+02, threshold=4.190e+02, percent-clipped=0.0 2024-06-20 05:51:31,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=143751.66666666666, ans=0.125 2024-06-20 05:51:37,612 INFO [train.py:1028] (0/2) Epoch 8, batch 7600, loss[loss=0.2914, simple_loss=0.3293, pruned_loss=0.1267, over 13209.00 frames. 
], tot_loss[loss=0.2793, simple_loss=0.3171, pruned_loss=0.1208, over 2575512.62 frames. ], batch size: 83, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:51:40,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=143770.0, ans=0.025 2024-06-20 05:51:46,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=143788.33333333334, ans=0.0 2024-06-20 05:52:01,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. limit=6.0 2024-06-20 05:52:14,611 INFO [train.py:1028] (0/2) Epoch 8, batch 7650, loss[loss=0.2905, simple_loss=0.333, pruned_loss=0.124, over 12991.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3174, pruned_loss=0.1209, over 2572154.87 frames. ], batch size: 33, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:52:16,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=143861.66666666666, ans=0.125 2024-06-20 05:52:18,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=143861.66666666666, ans=0.125 2024-06-20 05:52:22,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=143880.0, ans=0.1 2024-06-20 05:52:24,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=143880.0, ans=0.2 2024-06-20 05:52:25,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=143880.0, ans=0.2 2024-06-20 05:52:30,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=143898.33333333334, ans=0.125 2024-06-20 05:52:31,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.24 vs. limit=15.0 2024-06-20 05:52:35,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=143916.66666666666, ans=0.0 2024-06-20 05:52:37,624 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.005e+02 2.196e+02 2.544e+02 4.080e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-20 05:52:38,817 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2024-06-20 05:52:47,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=143953.33333333334, ans=0.025 2024-06-20 05:52:47,642 INFO [train.py:1028] (0/2) Epoch 8, batch 7700, loss[loss=0.2946, simple_loss=0.3392, pruned_loss=0.125, over 13239.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3183, pruned_loss=0.1214, over 2569124.43 frames. ], batch size: 63, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:52:58,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.57 vs. 
limit=15.0 2024-06-20 05:53:00,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=143990.0, ans=0.125 2024-06-20 05:53:04,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=143990.0, ans=0.0 2024-06-20 05:53:10,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144008.33333333334, ans=0.1 2024-06-20 05:53:17,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.17 vs. limit=22.5 2024-06-20 05:53:23,411 INFO [train.py:1028] (0/2) Epoch 8, batch 7750, loss[loss=0.2616, simple_loss=0.3066, pruned_loss=0.1083, over 13217.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3194, pruned_loss=0.1225, over 2574112.11 frames. ], batch size: 72, lr: 7.22e-03, grad_scale: 64.0 2024-06-20 05:53:25,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=35.08 vs. limit=15.0 2024-06-20 05:53:30,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=144063.33333333334, ans=0.025 2024-06-20 05:53:31,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=144063.33333333334, ans=0.1 2024-06-20 05:53:31,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144063.33333333334, ans=0.1 2024-06-20 05:53:46,685 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.988e+02 2.328e+02 2.652e+02 4.025e+02, threshold=4.655e+02, percent-clipped=0.0 2024-06-20 05:53:53,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144118.33333333334, ans=0.1 2024-06-20 05:53:56,356 INFO [train.py:1028] (0/2) Epoch 8, batch 7800, loss[loss=0.2733, simple_loss=0.3109, pruned_loss=0.1178, over 13176.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3198, pruned_loss=0.1224, over 2578047.43 frames. ], batch size: 95, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:53:57,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=144136.66666666666, ans=0.1 2024-06-20 05:53:57,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=144136.66666666666, ans=0.0 2024-06-20 05:53:59,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=144136.66666666666, ans=0.0 2024-06-20 05:54:01,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=144136.66666666666, ans=0.0 2024-06-20 05:54:07,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.58 vs. limit=15.0 2024-06-20 05:54:09,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. 
limit=15.0 2024-06-20 05:54:10,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=15.0 2024-06-20 05:54:17,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=144173.33333333334, ans=0.0 2024-06-20 05:54:22,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.32 vs. limit=10.0 2024-06-20 05:54:23,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144191.66666666666, ans=0.1 2024-06-20 05:54:32,960 INFO [train.py:1028] (0/2) Epoch 8, batch 7850, loss[loss=0.2455, simple_loss=0.2861, pruned_loss=0.1024, over 10642.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3205, pruned_loss=0.1227, over 2572111.63 frames. ], batch size: 16, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:54:45,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=144265.0, ans=0.125 2024-06-20 05:54:55,146 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 1.930e+02 2.140e+02 2.417e+02 3.663e+02, threshold=4.281e+02, percent-clipped=0.0 2024-06-20 05:54:55,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=144283.33333333334, ans=0.125 2024-06-20 05:55:07,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=144301.66666666666, ans=0.2 2024-06-20 05:55:08,582 INFO [train.py:1028] (0/2) Epoch 8, batch 7900, loss[loss=0.2852, simple_loss=0.3259, pruned_loss=0.1222, over 13160.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3214, pruned_loss=0.1236, over 2571880.69 frames. ], batch size: 77, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:55:12,440 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.66 vs. limit=8.0 2024-06-20 05:55:15,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=144338.33333333334, ans=0.0 2024-06-20 05:55:21,170 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 05:55:21,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=144356.66666666666, ans=0.0 2024-06-20 05:55:32,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=144375.0, ans=0.125 2024-06-20 05:55:39,734 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.99 vs. limit=22.5 2024-06-20 05:55:41,322 INFO [train.py:1028] (0/2) Epoch 8, batch 7950, loss[loss=0.2762, simple_loss=0.3008, pruned_loss=0.1258, over 10734.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3223, pruned_loss=0.1237, over 2575978.35 frames. ], batch size: 303, lr: 7.21e-03, grad_scale: 64.0 2024-06-20 05:55:43,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. 
limit=6.0 2024-06-20 05:56:09,631 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.001e+02 2.236e+02 2.541e+02 3.791e+02, threshold=4.473e+02, percent-clipped=0.0 2024-06-20 05:56:15,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=144485.0, ans=0.2 2024-06-20 05:56:17,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=144485.0, ans=0.2 2024-06-20 05:56:17,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=144485.0, ans=0.125 2024-06-20 05:56:18,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=144485.0, ans=0.025 2024-06-20 05:56:19,853 INFO [train.py:1028] (0/2) Epoch 8, batch 8000, loss[loss=0.2601, simple_loss=0.3082, pruned_loss=0.106, over 12582.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3226, pruned_loss=0.1237, over 2572482.86 frames. ], batch size: 29, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:56:22,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=144503.33333333334, ans=0.0 2024-06-20 05:56:25,929 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=6.522e+01 2024-06-20 05:56:35,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=144540.0, ans=0.125 2024-06-20 05:56:50,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.67 vs. limit=22.5 2024-06-20 05:56:51,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=144576.66666666666, ans=0.2 2024-06-20 05:56:52,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=144576.66666666666, ans=0.2 2024-06-20 05:56:53,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144595.0, ans=0.125 2024-06-20 05:56:53,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.81 vs. limit=15.0 2024-06-20 05:56:53,482 INFO [train.py:1028] (0/2) Epoch 8, batch 8050, loss[loss=0.2827, simple_loss=0.3131, pruned_loss=0.1262, over 13180.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3223, pruned_loss=0.1237, over 2571919.19 frames. 
], batch size: 83, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:56:56,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=144595.0, ans=0.125 2024-06-20 05:57:07,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=144613.33333333334, ans=0.0 2024-06-20 05:57:07,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=144613.33333333334, ans=0.125 2024-06-20 05:57:08,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144613.33333333334, ans=0.1 2024-06-20 05:57:09,293 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.914e+01 2024-06-20 05:57:13,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.15 vs. limit=15.0 2024-06-20 05:57:15,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=144650.0, ans=0.0 2024-06-20 05:57:16,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=144650.0, ans=0.2 2024-06-20 05:57:19,407 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.012e+02 2.211e+02 2.540e+02 3.507e+02, threshold=4.423e+02, percent-clipped=0.0 2024-06-20 05:57:27,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=144668.33333333334, ans=0.0 2024-06-20 05:57:28,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=144686.66666666666, ans=0.1 2024-06-20 05:57:28,980 INFO [train.py:1028] (0/2) Epoch 8, batch 8100, loss[loss=0.302, simple_loss=0.3418, pruned_loss=0.1311, over 13112.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3225, pruned_loss=0.1235, over 2576523.33 frames. ], batch size: 112, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:57:33,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=144686.66666666666, ans=0.125 2024-06-20 05:57:40,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=144705.0, ans=0.0 2024-06-20 05:57:56,356 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.16 vs. limit=15.0 2024-06-20 05:57:56,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5 2024-06-20 05:57:59,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144760.0, ans=0.125 2024-06-20 05:57:59,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.50 vs. limit=15.0 2024-06-20 05:58:02,705 INFO [train.py:1028] (0/2) Epoch 8, batch 8150, loss[loss=0.2703, simple_loss=0.307, pruned_loss=0.1168, over 13109.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3223, pruned_loss=0.1229, over 2579508.89 frames. 
], batch size: 121, lr: 7.20e-03, grad_scale: 64.0 2024-06-20 05:58:08,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=144778.33333333334, ans=0.125 2024-06-20 05:58:17,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=144796.66666666666, ans=0.0 2024-06-20 05:58:23,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144815.0, ans=0.1 2024-06-20 05:58:29,267 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.010e+02 2.176e+02 2.502e+02 3.519e+02, threshold=4.353e+02, percent-clipped=0.0 2024-06-20 05:58:29,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=144833.33333333334, ans=0.035 2024-06-20 05:58:39,210 INFO [train.py:1028] (0/2) Epoch 8, batch 8200, loss[loss=0.3144, simple_loss=0.3418, pruned_loss=0.1435, over 13138.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3223, pruned_loss=0.1227, over 2583296.38 frames. ], batch size: 112, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 05:58:43,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=144870.0, ans=0.0 2024-06-20 05:58:45,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=144888.33333333334, ans=0.0 2024-06-20 05:58:59,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=144925.0, ans=0.025 2024-06-20 05:59:10,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=144943.33333333334, ans=0.125 2024-06-20 05:59:11,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.08 vs. limit=6.0 2024-06-20 05:59:15,742 INFO [train.py:1028] (0/2) Epoch 8, batch 8250, loss[loss=0.3076, simple_loss=0.3511, pruned_loss=0.1321, over 13277.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.323, pruned_loss=0.1229, over 2583630.53 frames. ], batch size: 52, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 05:59:18,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.54 vs. limit=10.0 2024-06-20 05:59:30,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.39 vs. limit=22.5 2024-06-20 05:59:38,274 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 1.889e+02 2.044e+02 2.251e+02 2.932e+02, threshold=4.088e+02, percent-clipped=0.0 2024-06-20 05:59:48,326 INFO [train.py:1028] (0/2) Epoch 8, batch 8300, loss[loss=0.2793, simple_loss=0.3129, pruned_loss=0.1229, over 12991.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.322, pruned_loss=0.1221, over 2580223.83 frames. ], batch size: 102, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 05:59:51,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=145053.33333333334, ans=0.07 2024-06-20 05:59:53,207 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.51 vs. 
limit=15.0 2024-06-20 05:59:55,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.59 vs. limit=15.0 2024-06-20 05:59:56,420 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=9.182e+01 2024-06-20 05:59:58,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2024-06-20 06:00:11,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=145108.33333333334, ans=0.2 2024-06-20 06:00:15,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=15.0 2024-06-20 06:00:17,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=12.0 2024-06-20 06:00:24,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=145126.66666666666, ans=0.0 2024-06-20 06:00:25,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=145126.66666666666, ans=0.125 2024-06-20 06:00:26,563 INFO [train.py:1028] (0/2) Epoch 8, batch 8350, loss[loss=0.2708, simple_loss=0.313, pruned_loss=0.1143, over 13159.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3218, pruned_loss=0.1218, over 2581172.73 frames. ], batch size: 112, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 06:00:29,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=145145.0, ans=0.025 2024-06-20 06:00:40,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=145181.66666666666, ans=0.125 2024-06-20 06:00:44,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=145181.66666666666, ans=0.0 2024-06-20 06:00:50,503 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.942e+02 2.124e+02 2.391e+02 3.448e+02, threshold=4.248e+02, percent-clipped=0.0 2024-06-20 06:00:53,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.94 vs. limit=10.0 2024-06-20 06:00:54,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=145218.33333333334, ans=0.125 2024-06-20 06:01:00,623 INFO [train.py:1028] (0/2) Epoch 8, batch 8400, loss[loss=0.2572, simple_loss=0.3015, pruned_loss=0.1065, over 12959.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3214, pruned_loss=0.1218, over 2578177.33 frames. ], batch size: 39, lr: 7.19e-03, grad_scale: 64.0 2024-06-20 06:01:10,792 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.18 vs. 
limit=22.5 2024-06-20 06:01:17,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=145273.33333333334, ans=0.0 2024-06-20 06:01:29,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=145291.66666666666, ans=0.2 2024-06-20 06:01:32,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=145310.0, ans=0.025 2024-06-20 06:01:33,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=145310.0, ans=0.125 2024-06-20 06:01:33,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=145310.0, ans=0.1 2024-06-20 06:01:35,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=145310.0, ans=0.125 2024-06-20 06:01:37,402 INFO [train.py:1028] (0/2) Epoch 8, batch 8450, loss[loss=0.3163, simple_loss=0.3559, pruned_loss=0.1383, over 13196.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3225, pruned_loss=0.1224, over 2579742.56 frames. ], batch size: 112, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:01:58,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145383.33333333334, ans=0.125 2024-06-20 06:01:59,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=15.0 2024-06-20 06:02:00,646 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.986e+02 2.160e+02 2.375e+02 3.461e+02, threshold=4.320e+02, percent-clipped=0.0 2024-06-20 06:02:14,247 INFO [train.py:1028] (0/2) Epoch 8, batch 8500, loss[loss=0.2635, simple_loss=0.3091, pruned_loss=0.109, over 12804.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3232, pruned_loss=0.1229, over 2577626.04 frames. ], batch size: 29, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:02:21,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.91 vs. limit=22.5 2024-06-20 06:02:30,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=145456.66666666666, ans=0.09899494936611666 2024-06-20 06:02:48,773 INFO [train.py:1028] (0/2) Epoch 8, batch 8550, loss[loss=0.2677, simple_loss=0.3094, pruned_loss=0.113, over 12550.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3232, pruned_loss=0.1228, over 2576092.23 frames. 
], batch size: 22, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:02:59,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=145530.0, ans=0.015 2024-06-20 06:03:06,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=145548.33333333334, ans=0.0 2024-06-20 06:03:12,938 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 2.171e+02 2.366e+02 2.808e+02 3.872e+02, threshold=4.732e+02, percent-clipped=0.0 2024-06-20 06:03:19,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=145585.0, ans=0.2 2024-06-20 06:03:22,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=145585.0, ans=0.125 2024-06-20 06:03:26,099 INFO [train.py:1028] (0/2) Epoch 8, batch 8600, loss[loss=0.3005, simple_loss=0.331, pruned_loss=0.1351, over 13172.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3236, pruned_loss=0.1231, over 2574015.76 frames. ], batch size: 112, lr: 7.18e-03, grad_scale: 64.0 2024-06-20 06:03:26,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=145603.33333333334, ans=0.0 2024-06-20 06:03:36,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=145621.66666666666, ans=0.125 2024-06-20 06:03:39,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=145640.0, ans=0.0 2024-06-20 06:03:44,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145640.0, ans=0.125 2024-06-20 06:03:46,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145658.33333333334, ans=0.1 2024-06-20 06:03:49,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145658.33333333334, ans=0.1 2024-06-20 06:03:56,820 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=15.0 2024-06-20 06:03:58,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.92 vs. limit=15.0 2024-06-20 06:03:59,848 INFO [train.py:1028] (0/2) Epoch 8, batch 8650, loss[loss=0.2627, simple_loss=0.3064, pruned_loss=0.1095, over 13066.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3245, pruned_loss=0.1233, over 2577777.45 frames. 
], batch size: 102, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:04:07,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=145713.33333333334, ans=0.125 2024-06-20 06:04:12,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=145713.33333333334, ans=0.125 2024-06-20 06:04:21,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=145750.0, ans=0.125 2024-06-20 06:04:25,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=145750.0, ans=0.125 2024-06-20 06:04:25,810 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.098e+02 2.285e+02 2.659e+02 3.625e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-20 06:04:30,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.32 vs. limit=15.0 2024-06-20 06:04:32,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=145768.33333333334, ans=0.0 2024-06-20 06:04:32,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=145768.33333333334, ans=0.0 2024-06-20 06:04:35,707 INFO [train.py:1028] (0/2) Epoch 8, batch 8700, loss[loss=0.2884, simple_loss=0.3359, pruned_loss=0.1204, over 13232.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3246, pruned_loss=0.1235, over 2573984.12 frames. ], batch size: 59, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:04:37,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=145786.66666666666, ans=0.025 2024-06-20 06:04:44,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=145805.0, ans=0.125 2024-06-20 06:04:52,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=145823.33333333334, ans=0.0 2024-06-20 06:05:06,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=145860.0, ans=0.0 2024-06-20 06:05:12,001 INFO [train.py:1028] (0/2) Epoch 8, batch 8750, loss[loss=0.2861, simple_loss=0.3166, pruned_loss=0.1278, over 13074.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3243, pruned_loss=0.1234, over 2569704.66 frames. 
], batch size: 121, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:05:21,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145896.66666666666, ans=0.1 2024-06-20 06:05:27,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=145915.0, ans=0.125 2024-06-20 06:05:35,342 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 1.980e+02 2.148e+02 2.405e+02 3.825e+02, threshold=4.296e+02, percent-clipped=0.0 2024-06-20 06:05:42,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=145951.66666666666, ans=0.125 2024-06-20 06:05:43,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=145951.66666666666, ans=0.2 2024-06-20 06:05:46,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.92 vs. limit=22.5 2024-06-20 06:05:46,353 INFO [train.py:1028] (0/2) Epoch 8, batch 8800, loss[loss=0.2762, simple_loss=0.3155, pruned_loss=0.1184, over 13223.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3246, pruned_loss=0.1235, over 2575567.91 frames. ], batch size: 72, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:05:46,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=145970.0, ans=0.0 2024-06-20 06:05:49,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145970.0, ans=0.1 2024-06-20 06:05:49,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.14 vs. limit=15.0 2024-06-20 06:05:53,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=145988.33333333334, ans=0.2 2024-06-20 06:05:53,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.74 vs. limit=22.5 2024-06-20 06:05:55,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145988.33333333334, ans=0.1 2024-06-20 06:06:09,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.69 vs. limit=12.0 2024-06-20 06:06:15,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2024-06-20 06:06:19,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=146043.33333333334, ans=0.125 2024-06-20 06:06:24,086 INFO [train.py:1028] (0/2) Epoch 8, batch 8850, loss[loss=0.3262, simple_loss=0.358, pruned_loss=0.1472, over 12490.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3253, pruned_loss=0.1242, over 2564673.45 frames. 
], batch size: 202, lr: 7.17e-03, grad_scale: 64.0 2024-06-20 06:06:37,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=146098.33333333334, ans=0.0 2024-06-20 06:06:38,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=146098.33333333334, ans=0.125 2024-06-20 06:06:39,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=15.0 2024-06-20 06:06:47,436 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.971e+02 2.144e+02 2.478e+02 3.434e+02, threshold=4.288e+02, percent-clipped=0.0 2024-06-20 06:06:49,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=146116.66666666666, ans=0.125 2024-06-20 06:06:52,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=146135.0, ans=10.0 2024-06-20 06:06:54,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.98 vs. limit=15.0 2024-06-20 06:06:57,478 INFO [train.py:1028] (0/2) Epoch 8, batch 8900, loss[loss=0.2976, simple_loss=0.335, pruned_loss=0.1301, over 12867.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3259, pruned_loss=0.1248, over 2562878.89 frames. ], batch size: 33, lr: 7.16e-03, grad_scale: 64.0 2024-06-20 06:06:58,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.66 vs. limit=22.5 2024-06-20 06:07:01,563 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.756e+01 2024-06-20 06:07:03,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=146171.66666666666, ans=0.125 2024-06-20 06:07:04,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=146171.66666666666, ans=0.125 2024-06-20 06:07:13,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.95 vs. limit=10.0 2024-06-20 06:07:36,446 INFO [train.py:1028] (0/2) Epoch 8, batch 8950, loss[loss=0.3039, simple_loss=0.3362, pruned_loss=0.1358, over 12620.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3254, pruned_loss=0.124, over 2563052.54 frames. ], batch size: 202, lr: 7.16e-03, grad_scale: 64.0 2024-06-20 06:07:46,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.73 vs. limit=22.5 2024-06-20 06:07:48,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=146263.33333333334, ans=0.2 2024-06-20 06:07:51,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.91 vs. 
limit=15.0
2024-06-20 06:07:58,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=146300.0, ans=0.0
2024-06-20 06:07:59,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=146300.0, ans=15.0
2024-06-20 06:08:00,095 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.017e+02 2.283e+02 2.688e+02 4.509e+02, threshold=4.567e+02, percent-clipped=1.0
2024-06-20 06:08:11,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=146318.33333333334, ans=0.0
2024-06-20 06:08:13,451 INFO [train.py:1028] (0/2) Epoch 8, batch 9000, loss[loss=0.2817, simple_loss=0.3335, pruned_loss=0.115, over 13292.00 frames. ], tot_loss[loss=0.287, simple_loss=0.326, pruned_loss=0.124, over 2568779.08 frames. ], batch size: 46, lr: 7.16e-03, grad_scale: 128.0
2024-06-20 06:08:13,452 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 06:08:21,243 INFO [train.py:1060] (0/2) Epoch 8, validation: loss=0.2055, simple_loss=0.2684, pruned_loss=0.07135, over 351949.00 frames.
2024-06-20 06:08:21,243 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17148MB
2024-06-20 06:08:23,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=146336.66666666666, ans=0.0
2024-06-20 06:08:24,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=146336.66666666666, ans=0.0
2024-06-20 06:08:24,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=146336.66666666666, ans=0.025
2024-06-20 06:08:30,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.01 vs. limit=22.5
2024-06-20 06:08:36,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.73 vs. limit=22.5
2024-06-20 06:08:45,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=146391.66666666666, ans=0.125
2024-06-20 06:08:49,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=146410.0, ans=0.125
2024-06-20 06:08:53,446 INFO [train.py:1028] (0/2) Epoch 8, batch 9050, loss[loss=0.2303, simple_loss=0.2728, pruned_loss=0.09397, over 11556.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3253, pruned_loss=0.1235, over 2567692.20 frames. ], batch size: 17, lr: 7.16e-03, grad_scale: 128.0
2024-06-20 06:09:07,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=146465.0, ans=0.125
2024-06-20 06:09:08,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=146465.0, ans=0.0
2024-06-20 06:09:10,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.87 vs. limit=15.0
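
The train.py:1028 entries pair a per-batch loss, loss[... over N frames. ], with a running total, tot_loss[... over roughly 2.5M frames. ], and the train.py:1051/1060 entries above report the same statistics averaged over the fixed dev set (always 351949.00 frames). Below is a minimal sketch of frame-weighted loss tracking consistent with those numbers; the exponential decay that makes tot_loss favor recent batches is an assumption, and decay=0.999 is an invented value (the actual tracker lives in icefall's train.py).

class RunningLoss:
    """Frame-weighted running averages; summary() mimics the tot_loss format."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay  # assumed; decay=1.0 gives a plain average (validation)
        self.frames = 0.0
        self.sums = {}  # metric name -> decayed sum of (per-frame loss x frames)

    def update(self, batch_frames: float, **losses: float) -> None:
        self.frames = self.decay * self.frames + batch_frames
        for name, value in losses.items():
            self.sums[name] = self.decay * self.sums.get(name, 0.0) + value * batch_frames

    def summary(self) -> str:
        fields = ", ".join(f"{k}={v / self.frames:.4g}" for k, v in self.sums.items())
        return f"tot_loss[{fields}, over {self.frames:.2f} frames. ]"

tracker = RunningLoss()
tracker.update(13292.0, loss=0.2817, simple_loss=0.3335, pruned_loss=0.115)
print(tracker.summary())  # after one batch; the real window spans millions of frames
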
2024-06-20 06:09:11,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=146465.0, ans=0.2
2024-06-20 06:09:13,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=146483.33333333334, ans=0.0
2024-06-20 06:09:13,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146483.33333333334, ans=0.1
2024-06-20 06:09:14,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.34 vs. limit=15.0
2024-06-20 06:09:16,524 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.663e+02 1.982e+02 2.156e+02 2.383e+02 3.049e+02, threshold=4.313e+02, percent-clipped=0.0
2024-06-20 06:09:26,581 INFO [train.py:1028] (0/2) Epoch 8, batch 9100, loss[loss=0.2693, simple_loss=0.3191, pruned_loss=0.1098, over 13240.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3246, pruned_loss=0.1226, over 2569273.47 frames. ], batch size: 72, lr: 7.15e-03, grad_scale: 128.0
2024-06-20 06:09:34,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.11 vs. limit=15.0
2024-06-20 06:09:40,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0
2024-06-20 06:09:41,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=146556.66666666666, ans=0.125
2024-06-20 06:09:45,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=146575.0, ans=0.0
2024-06-20 06:09:58,271 INFO [train.py:1028] (0/2) Epoch 8, batch 9150, loss[loss=0.257, simple_loss=0.3101, pruned_loss=0.102, over 13162.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3242, pruned_loss=0.1224, over 2570409.36 frames. ], batch size: 77, lr: 7.15e-03, grad_scale: 64.0
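
Most scaling.py:214 entries record a ScheduledFloat: a module hyperparameter (a dropout probability, skip rate, scale_min, and so on) whose logged value, ans, is a function of batch_count rather than a constant. Below is a minimal sketch of such a schedule, assuming piecewise-linear interpolation between breakpoints with clamping at both ends; the breakpoints are invented for illustration, and the real class is defined in icefall's scaling.py.

class ScheduledFloat:
    """Piecewise-linear function of batch_count, clamped at the endpoints."""

    def __init__(self, *points: tuple):
        self.points = sorted(points)  # (batch_count, value) breakpoints

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)

dropout = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))  # hypothetical dropout_p schedule
print(f"ans={dropout.value(146465.0)}")  # long past the last breakpoint -> clamped to 0.1
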
2024-06-20 06:10:14,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=146648.33333333334, ans=0.125
2024-06-20 06:10:15,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=146648.33333333334, ans=0.125
2024-06-20 06:10:15,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=146648.33333333334, ans=0.125
2024-06-20 06:10:20,287 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-80000.pt
2024-06-20 06:10:28,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=146666.66666666666, ans=0.125
2024-06-20 06:10:29,866 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.894e+02 2.037e+02 2.253e+02 2.783e+02, threshold=4.075e+02, percent-clipped=0.0
2024-06-20 06:10:30,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=146666.66666666666, ans=10.0
2024-06-20 06:10:38,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146703.33333333334, ans=0.1
2024-06-20 06:10:38,783 INFO [train.py:1028] (0/2) Epoch 8, batch 9200, loss[loss=0.2849, simple_loss=0.3334, pruned_loss=0.1181, over 12926.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3239, pruned_loss=0.1219, over 2574292.23 frames. ], batch size: 36, lr: 7.15e-03, grad_scale: 64.0
2024-06-20 06:10:55,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0
2024-06-20 06:11:10,283 INFO [train.py:1028] (0/2) Epoch 8, batch 9250, loss[loss=0.2679, simple_loss=0.3098, pruned_loss=0.113, over 13212.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.324, pruned_loss=0.1221, over 2575850.26 frames. ], batch size: 67, lr: 7.15e-03, grad_scale: 64.0
2024-06-20 06:11:10,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=146795.0, ans=0.0
2024-06-20 06:11:27,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=146831.66666666666, ans=0.0
2024-06-20 06:11:32,510 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.923e+02 2.143e+02 2.473e+02 3.321e+02, threshold=4.286e+02, percent-clipped=0.0
2024-06-20 06:11:40,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=146886.66666666666, ans=0.07
2024-06-20 06:11:41,440 INFO [train.py:1028] (0/2) Epoch 8, batch 9300, loss[loss=0.2457, simple_loss=0.3001, pruned_loss=0.09565, over 12885.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3244, pruned_loss=0.1226, over 2572598.75 frames.
], batch size: 39, lr: 7.15e-03, grad_scale: 64.0 2024-06-20 06:11:41,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=146886.66666666666, ans=0.125 2024-06-20 06:11:44,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=146886.66666666666, ans=0.125 2024-06-20 06:11:48,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=146905.0, ans=0.125 2024-06-20 06:11:58,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146923.33333333334, ans=0.1 2024-06-20 06:12:01,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=146923.33333333334, ans=0.125 2024-06-20 06:12:14,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=146978.33333333334, ans=0.0 2024-06-20 06:12:15,361 INFO [train.py:1028] (0/2) Epoch 8, batch 9350, loss[loss=0.3025, simple_loss=0.3405, pruned_loss=0.1323, over 12727.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3252, pruned_loss=0.1226, over 2569060.33 frames. ], batch size: 22, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:12:18,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=146978.33333333334, ans=0.025 2024-06-20 06:12:20,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=146978.33333333334, ans=0.125 2024-06-20 06:12:23,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=146996.66666666666, ans=0.125 2024-06-20 06:12:23,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2024-06-20 06:12:31,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2024-06-20 06:12:37,909 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 1.948e+02 2.046e+02 2.282e+02 3.279e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 06:12:43,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=147051.66666666666, ans=0.2 2024-06-20 06:12:46,386 INFO [train.py:1028] (0/2) Epoch 8, batch 9400, loss[loss=0.2914, simple_loss=0.3347, pruned_loss=0.124, over 13250.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3261, pruned_loss=0.1233, over 2568278.14 frames. ], batch size: 52, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:12:49,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.57 vs. 
limit=15.0 2024-06-20 06:12:50,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=147070.0, ans=0.125 2024-06-20 06:12:58,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147106.66666666666, ans=0.0 2024-06-20 06:13:13,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=147143.33333333334, ans=0.0 2024-06-20 06:13:16,734 INFO [train.py:1028] (0/2) Epoch 8, batch 9450, loss[loss=0.2946, simple_loss=0.3336, pruned_loss=0.1278, over 12689.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3272, pruned_loss=0.1241, over 2567654.00 frames. ], batch size: 22, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:13:17,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147161.66666666666, ans=0.0 2024-06-20 06:13:21,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=147161.66666666666, ans=0.025 2024-06-20 06:13:28,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=147198.33333333334, ans=0.015 2024-06-20 06:13:33,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=147198.33333333334, ans=0.05 2024-06-20 06:13:41,518 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.961e+02 2.150e+02 2.399e+02 3.185e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-20 06:13:41,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=147216.66666666666, ans=0.025 2024-06-20 06:13:47,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=147235.0, ans=0.05 2024-06-20 06:13:48,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147235.0, ans=0.125 2024-06-20 06:13:50,334 INFO [train.py:1028] (0/2) Epoch 8, batch 9500, loss[loss=0.2775, simple_loss=0.3234, pruned_loss=0.1158, over 13206.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3262, pruned_loss=0.1233, over 2577341.30 frames. ], batch size: 43, lr: 7.14e-03, grad_scale: 64.0 2024-06-20 06:14:10,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147308.33333333334, ans=0.125 2024-06-20 06:14:10,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=147308.33333333334, ans=0.025 2024-06-20 06:14:12,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=147308.33333333334, ans=0.0 2024-06-20 06:14:18,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=147326.66666666666, ans=0.5 2024-06-20 06:14:21,414 INFO [train.py:1028] (0/2) Epoch 8, batch 9550, loss[loss=0.2498, simple_loss=0.2957, pruned_loss=0.102, over 12923.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3256, pruned_loss=0.1231, over 2573687.85 frames. 
], batch size: 39, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:14:25,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.46 vs. limit=15.0 2024-06-20 06:14:37,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=147381.66666666666, ans=0.2 2024-06-20 06:14:40,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147400.0, ans=0.125 2024-06-20 06:14:41,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.90 vs. limit=10.0 2024-06-20 06:14:44,022 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.945e+02 2.109e+02 2.331e+02 3.710e+02, threshold=4.217e+02, percent-clipped=0.0 2024-06-20 06:14:44,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147400.0, ans=0.125 2024-06-20 06:14:47,325 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:14:53,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147418.33333333334, ans=0.0 2024-06-20 06:14:54,865 INFO [train.py:1028] (0/2) Epoch 8, batch 9600, loss[loss=0.285, simple_loss=0.3136, pruned_loss=0.1282, over 10627.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3251, pruned_loss=0.1228, over 2572710.39 frames. ], batch size: 303, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:14:57,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=147436.66666666666, ans=0.125 2024-06-20 06:15:03,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=147455.0, ans=0.125 2024-06-20 06:15:05,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=147455.0, ans=0.125 2024-06-20 06:15:10,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=147473.33333333334, ans=0.0 2024-06-20 06:15:10,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=147473.33333333334, ans=0.0 2024-06-20 06:15:11,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=147473.33333333334, ans=0.125 2024-06-20 06:15:15,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.73 vs. limit=22.5 2024-06-20 06:15:26,039 INFO [train.py:1028] (0/2) Epoch 8, batch 9650, loss[loss=0.2859, simple_loss=0.3159, pruned_loss=0.128, over 13106.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3255, pruned_loss=0.1235, over 2561751.67 frames. 
], batch size: 132, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:15:29,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=147528.33333333334, ans=0.125 2024-06-20 06:15:30,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147528.33333333334, ans=0.1 2024-06-20 06:15:33,407 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=12.0 2024-06-20 06:15:37,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.12 vs. limit=15.0 2024-06-20 06:15:41,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147565.0, ans=0.1 2024-06-20 06:15:44,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=147583.33333333334, ans=0.2 2024-06-20 06:15:48,375 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.918e+02 2.087e+02 2.275e+02 3.787e+02, threshold=4.175e+02, percent-clipped=0.0 2024-06-20 06:15:54,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147601.66666666666, ans=0.0 2024-06-20 06:15:59,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.93 vs. limit=10.0 2024-06-20 06:15:59,886 INFO [train.py:1028] (0/2) Epoch 8, batch 9700, loss[loss=0.3123, simple_loss=0.3346, pruned_loss=0.145, over 13044.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3256, pruned_loss=0.124, over 2557096.98 frames. ], batch size: 144, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:16:18,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147675.0, ans=0.125 2024-06-20 06:16:21,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2024-06-20 06:16:30,494 INFO [train.py:1028] (0/2) Epoch 8, batch 9750, loss[loss=0.2773, simple_loss=0.3144, pruned_loss=0.1201, over 13090.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3249, pruned_loss=0.1238, over 2553340.02 frames. ], batch size: 132, lr: 7.13e-03, grad_scale: 64.0 2024-06-20 06:16:41,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=147730.0, ans=0.0 2024-06-20 06:16:41,444 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.12 vs. 
limit=22.5
2024-06-20 06:16:43,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147748.33333333334, ans=0.125
2024-06-20 06:16:46,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147748.33333333334, ans=0.1
2024-06-20 06:16:54,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.011e+02 2.168e+02 2.550e+02 3.783e+02, threshold=4.336e+02, percent-clipped=0.0
2024-06-20 06:16:57,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0
2024-06-20 06:17:02,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=147803.33333333334, ans=0.0
2024-06-20 06:17:02,870 INFO [train.py:1028] (0/2) Epoch 8, batch 9800, loss[loss=0.2499, simple_loss=0.2956, pruned_loss=0.1021, over 12901.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3241, pruned_loss=0.1232, over 2545666.34 frames. ], batch size: 39, lr: 7.12e-03, grad_scale: 64.0
2024-06-20 06:17:09,655 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 06:17:13,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147821.66666666666, ans=0.1
2024-06-20 06:17:22,462 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 06:17:22,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=147858.33333333334, ans=15.0
2024-06-20 06:17:29,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=147876.66666666666, ans=0.125
2024-06-20 06:17:33,693 INFO [train.py:1028] (0/2) Epoch 8, batch 9850, loss[loss=0.3044, simple_loss=0.3388, pruned_loss=0.1351, over 13065.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3225, pruned_loss=0.1223, over 2538474.24 frames. ], batch size: 102, lr: 7.12e-03, grad_scale: 64.0
2024-06-20 06:17:45,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=147913.33333333334, ans=0.0
2024-06-20 06:17:51,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=147931.66666666666, ans=0.125
2024-06-20 06:17:57,081 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.905e+02 2.078e+02 2.393e+02 3.804e+02, threshold=4.156e+02, percent-clipped=0.0
2024-06-20 06:17:58,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.99 vs. limit=10.0
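
The scaling.py:1023 entries compare a per-module whitening metric against a limit, flagging activations whose covariance is far from isotropic. One plausible definition of such a metric (an assumption for illustration; the exact formula is in icefall's scaling.py) is D * tr(C^2) / tr(C)^2 for the D x D feature covariance C: this equals 1.0 when the covariance is a multiple of the identity, i.e. fully "white", and approaches D when the variance collapses onto a single direction.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels); returns a value in [1, channels/num_groups]."""
    frames, channels = x.shape
    assert channels % num_groups == 0
    x = x.reshape(frames, num_groups, channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)  # per-group zero-mean features
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        cov = (xg.T @ xg) / frames  # per-group covariance estimate
        d = cov.shape[0]
        metrics.append(d * torch.trace(cov @ cov) / torch.trace(cov) ** 2)
    return float(torch.stack(metrics).mean())

feats = torch.randn(1000, 384)  # near-white features -> metric close to 1.0
print(f"metric={whitening_metric(feats):.2f} vs. limit=15.0")
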
2024-06-20 06:18:01,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=147968.33333333334, ans=10.0
2024-06-20 06:18:03,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=147968.33333333334, ans=0.125
2024-06-20 06:18:05,418 INFO [train.py:1028] (0/2) Epoch 8, batch 9900, loss[loss=0.2441, simple_loss=0.2946, pruned_loss=0.09677, over 12921.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3218, pruned_loss=0.1223, over 2529882.22 frames. ], batch size: 39, lr: 7.12e-03, grad_scale: 64.0
2024-06-20 06:18:05,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.99 vs. limit=10.0
2024-06-20 06:18:08,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=15.0
2024-06-20 06:18:11,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.96 vs. limit=15.0
2024-06-20 06:18:15,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=148005.0, ans=0.025
2024-06-20 06:18:23,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.82 vs. limit=12.0
2024-06-20 06:18:26,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148041.66666666666, ans=0.1
2024-06-20 06:18:31,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=148060.0, ans=0.1
2024-06-20 06:18:32,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.51 vs. limit=15.0
2024-06-20 06:18:37,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.58 vs. limit=15.0
2024-06-20 06:18:37,924 INFO [train.py:1028] (0/2) Epoch 8, batch 9950, loss[loss=0.2915, simple_loss=0.3231, pruned_loss=0.1299, over 12691.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3205, pruned_loss=0.1221, over 2525650.68 frames. ], batch size: 29, lr: 7.12e-03, grad_scale: 64.0
2024-06-20 06:18:39,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=148078.33333333334, ans=0.0
2024-06-20 06:18:46,030 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 06:18:49,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=18.95 vs.
limit=15.0 2024-06-20 06:18:55,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=148115.0, ans=0.0 2024-06-20 06:18:55,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=148115.0, ans=0.125 2024-06-20 06:18:55,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2024-06-20 06:18:55,778 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.55 vs. limit=22.5 2024-06-20 06:18:57,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=148133.33333333334, ans=0.0 2024-06-20 06:19:01,372 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 1.917e+02 2.109e+02 2.316e+02 3.279e+02, threshold=4.218e+02, percent-clipped=0.0 2024-06-20 06:19:10,923 INFO [train.py:1028] (0/2) Epoch 8, batch 10000, loss[loss=0.2815, simple_loss=0.3291, pruned_loss=0.1169, over 12719.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3213, pruned_loss=0.1231, over 2488187.38 frames. ], batch size: 22, lr: 7.11e-03, grad_scale: 64.0 2024-06-20 06:19:13,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=148170.0, ans=0.125 2024-06-20 06:19:37,233 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2024-06-20 06:19:38,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=148243.33333333334, ans=0.125 2024-06-20 06:19:41,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148243.33333333334, ans=0.1 2024-06-20 06:19:43,428 INFO [train.py:1028] (0/2) Epoch 8, batch 10050, loss[loss=0.2961, simple_loss=0.3367, pruned_loss=0.1278, over 12661.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3211, pruned_loss=0.1241, over 2447999.32 frames. ], batch size: 22, lr: 7.11e-03, grad_scale: 64.0 2024-06-20 06:19:53,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=148280.0, ans=0.125 2024-06-20 06:20:00,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=148316.66666666666, ans=0.125 2024-06-20 06:20:04,542 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 2.090e+02 2.344e+02 2.693e+02 3.716e+02, threshold=4.688e+02, percent-clipped=0.0 2024-06-20 06:20:13,390 INFO [train.py:1028] (0/2) Epoch 8, batch 10100, loss[loss=0.2478, simple_loss=0.2922, pruned_loss=0.1018, over 11777.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3193, pruned_loss=0.1224, over 2427330.37 frames. 
], batch size: 17, lr: 7.11e-03, grad_scale: 64.0 2024-06-20 06:20:13,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=148353.33333333334, ans=0.0 2024-06-20 06:20:16,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=148353.33333333334, ans=0.125 2024-06-20 06:20:27,152 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-8.pt 2024-06-20 06:22:26,788 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:22:27,266 INFO [train.py:1028] (0/2) Epoch 9, batch 0, loss[loss=0.2557, simple_loss=0.3045, pruned_loss=0.1035, over 12944.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3045, pruned_loss=0.1035, over 12944.00 frames. ], batch size: 36, lr: 6.73e-03, grad_scale: 64.0 2024-06-20 06:22:27,267 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 06:22:34,312 INFO [train.py:1060] (0/2) Epoch 9, validation: loss=0.2068, simple_loss=0.27, pruned_loss=0.07179, over 351949.00 frames. 2024-06-20 06:22:34,313 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17148MB 2024-06-20 06:22:36,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=148386.33333333334, ans=0.2 2024-06-20 06:22:52,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=148423.0, ans=0.025 2024-06-20 06:23:04,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2024-06-20 06:23:04,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=148459.66666666666, ans=0.05 2024-06-20 06:23:07,383 INFO [train.py:1028] (0/2) Epoch 9, batch 50, loss[loss=0.2531, simple_loss=0.2965, pruned_loss=0.1049, over 12731.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.2999, pruned_loss=0.1135, over 574763.65 frames. ], batch size: 29, lr: 6.73e-03, grad_scale: 64.0 2024-06-20 06:23:10,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=148478.0, ans=0.0 2024-06-20 06:23:18,708 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.820e+02 1.981e+02 2.167e+02 3.074e+02, threshold=3.962e+02, percent-clipped=0.0 2024-06-20 06:23:20,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=148514.66666666666, ans=0.04949747468305833 2024-06-20 06:23:21,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.90 vs. limit=15.0 2024-06-20 06:23:23,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=148514.66666666666, ans=0.0 2024-06-20 06:23:25,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=148533.0, ans=15.0 2024-06-20 06:23:31,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.55 vs. 
limit=15.0 2024-06-20 06:23:33,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=148551.33333333334, ans=0.125 2024-06-20 06:23:41,288 INFO [train.py:1028] (0/2) Epoch 9, batch 100, loss[loss=0.2605, simple_loss=0.3121, pruned_loss=0.1045, over 13286.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.2991, pruned_loss=0.1122, over 1017418.75 frames. ], batch size: 46, lr: 6.73e-03, grad_scale: 64.0 2024-06-20 06:23:42,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.76 vs. limit=22.5 2024-06-20 06:23:43,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=148569.66666666666, ans=0.125 2024-06-20 06:23:46,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=15.0 2024-06-20 06:23:53,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=148588.0, ans=0.125 2024-06-20 06:24:00,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148606.33333333334, ans=0.1 2024-06-20 06:24:01,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=148606.33333333334, ans=0.0 2024-06-20 06:24:02,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.55 vs. limit=15.0 2024-06-20 06:24:07,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=148624.66666666666, ans=0.125 2024-06-20 06:24:15,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.66 vs. limit=15.0 2024-06-20 06:24:16,681 INFO [train.py:1028] (0/2) Epoch 9, batch 150, loss[loss=0.2617, simple_loss=0.2954, pruned_loss=0.114, over 12539.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.2977, pruned_loss=0.1104, over 1365655.15 frames. ], batch size: 29, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:24:27,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=148679.66666666666, ans=0.95 2024-06-20 06:24:28,107 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.804e+02 1.922e+02 2.111e+02 3.094e+02, threshold=3.845e+02, percent-clipped=0.0 2024-06-20 06:24:48,752 INFO [train.py:1028] (0/2) Epoch 9, batch 200, loss[loss=0.2743, simple_loss=0.3043, pruned_loss=0.1221, over 12566.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.2971, pruned_loss=0.1101, over 1635102.36 frames. 
], batch size: 202, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:25:00,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=148771.33333333334, ans=0.0 2024-06-20 06:25:00,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=148771.33333333334, ans=0.125 2024-06-20 06:25:18,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=148826.33333333334, ans=0.125 2024-06-20 06:25:19,772 INFO [train.py:1028] (0/2) Epoch 9, batch 250, loss[loss=0.2611, simple_loss=0.2849, pruned_loss=0.1186, over 13000.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.2974, pruned_loss=0.11, over 1846350.21 frames. ], batch size: 144, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:25:22,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=148844.66666666666, ans=0.125 2024-06-20 06:25:31,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.812e+02 1.972e+02 2.224e+02 4.192e+02, threshold=3.945e+02, percent-clipped=1.0 2024-06-20 06:25:42,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=148899.66666666666, ans=0.125 2024-06-20 06:25:47,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=148899.66666666666, ans=0.125 2024-06-20 06:25:48,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2024-06-20 06:25:52,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=148918.0, ans=0.0 2024-06-20 06:25:58,236 INFO [train.py:1028] (0/2) Epoch 9, batch 300, loss[loss=0.2691, simple_loss=0.3009, pruned_loss=0.1186, over 13149.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.2976, pruned_loss=0.1101, over 2008974.83 frames. ], batch size: 112, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:26:03,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=148936.33333333334, ans=0.2 2024-06-20 06:26:06,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=148954.66666666666, ans=0.125 2024-06-20 06:26:08,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=148954.66666666666, ans=0.125 2024-06-20 06:26:12,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.93 vs. limit=15.0 2024-06-20 06:26:15,375 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.31 vs. 
limit=15.0 2024-06-20 06:26:19,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=148991.33333333334, ans=0.125 2024-06-20 06:26:19,085 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:26:19,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2024-06-20 06:26:30,741 INFO [train.py:1028] (0/2) Epoch 9, batch 350, loss[loss=0.2323, simple_loss=0.2836, pruned_loss=0.09052, over 12836.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.2972, pruned_loss=0.1095, over 2137710.67 frames. ], batch size: 33, lr: 6.72e-03, grad_scale: 64.0 2024-06-20 06:26:31,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=149028.0, ans=0.2 2024-06-20 06:26:35,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=149028.0, ans=0.125 2024-06-20 06:26:35,825 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.28 vs. limit=10.0 2024-06-20 06:26:36,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.53 vs. limit=15.0 2024-06-20 06:26:42,787 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.873e+02 2.049e+02 2.269e+02 3.318e+02, threshold=4.098e+02, percent-clipped=0.0 2024-06-20 06:26:44,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149064.66666666666, ans=0.1 2024-06-20 06:26:51,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=149083.0, ans=0.0 2024-06-20 06:26:56,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=149101.33333333334, ans=0.0 2024-06-20 06:26:57,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.36 vs. limit=15.0 2024-06-20 06:26:57,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=149101.33333333334, ans=0.0 2024-06-20 06:27:03,010 INFO [train.py:1028] (0/2) Epoch 9, batch 400, loss[loss=0.2553, simple_loss=0.2982, pruned_loss=0.1062, over 13242.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2966, pruned_loss=0.1089, over 2240015.59 frames. 
], batch size: 63, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:27:05,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=149119.66666666666, ans=0.0 2024-06-20 06:27:10,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=149138.0, ans=0.125 2024-06-20 06:27:23,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=149174.66666666666, ans=0.07 2024-06-20 06:27:34,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=149211.33333333334, ans=0.025 2024-06-20 06:27:34,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=149211.33333333334, ans=22.5 2024-06-20 06:27:34,681 INFO [train.py:1028] (0/2) Epoch 9, batch 450, loss[loss=0.2442, simple_loss=0.2958, pruned_loss=0.09629, over 13224.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.2975, pruned_loss=0.1095, over 2314665.27 frames. ], batch size: 67, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:27:34,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=149211.33333333334, ans=10.0 2024-06-20 06:27:49,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.911e+02 2.029e+02 2.241e+02 3.321e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-20 06:28:00,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=149266.33333333334, ans=0.125 2024-06-20 06:28:08,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=149284.66666666666, ans=0.125 2024-06-20 06:28:08,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=149284.66666666666, ans=0.125 2024-06-20 06:28:09,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.12 vs. limit=12.0 2024-06-20 06:28:13,159 INFO [train.py:1028] (0/2) Epoch 9, batch 500, loss[loss=0.2506, simple_loss=0.2853, pruned_loss=0.1079, over 13140.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.2976, pruned_loss=0.1093, over 2377145.81 frames. ], batch size: 121, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:28:18,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=149303.0, ans=0.2 2024-06-20 06:28:38,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2024-06-20 06:28:41,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.91 vs. limit=22.5 2024-06-20 06:28:45,076 INFO [train.py:1028] (0/2) Epoch 9, batch 550, loss[loss=0.2498, simple_loss=0.2884, pruned_loss=0.1056, over 12940.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.2971, pruned_loss=0.1089, over 2422784.52 frames. 
], batch size: 158, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:28:56,387 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.823e+02 1.969e+02 2.175e+02 2.892e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-20 06:29:03,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=149449.66666666666, ans=0.0 2024-06-20 06:29:07,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=149449.66666666666, ans=0.0 2024-06-20 06:29:16,434 INFO [train.py:1028] (0/2) Epoch 9, batch 600, loss[loss=0.2395, simple_loss=0.2728, pruned_loss=0.103, over 13023.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.297, pruned_loss=0.1088, over 2459231.11 frames. ], batch size: 144, lr: 6.71e-03, grad_scale: 64.0 2024-06-20 06:29:18,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.01 vs. limit=15.0 2024-06-20 06:29:32,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=149523.0, ans=0.2 2024-06-20 06:29:46,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=149559.66666666666, ans=0.0 2024-06-20 06:29:48,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149559.66666666666, ans=0.1 2024-06-20 06:29:51,641 INFO [train.py:1028] (0/2) Epoch 9, batch 650, loss[loss=0.2573, simple_loss=0.3017, pruned_loss=0.1064, over 13186.00 frames. ], tot_loss[loss=0.257, simple_loss=0.2969, pruned_loss=0.1085, over 2490545.19 frames. ], batch size: 59, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:29:52,082 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.57 vs. limit=22.5 2024-06-20 06:30:06,487 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.877e+02 1.985e+02 2.088e+02 2.810e+02, threshold=3.971e+02, percent-clipped=0.0 2024-06-20 06:30:12,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=149614.66666666666, ans=0.125 2024-06-20 06:30:21,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149651.33333333334, ans=0.1 2024-06-20 06:30:27,726 INFO [train.py:1028] (0/2) Epoch 9, batch 700, loss[loss=0.2236, simple_loss=0.2694, pruned_loss=0.08891, over 13360.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.2969, pruned_loss=0.1087, over 2512388.91 frames. ], batch size: 46, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:30:31,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=149669.66666666666, ans=0.125 2024-06-20 06:30:36,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.63 vs. 
limit=12.0 2024-06-20 06:30:39,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=149706.33333333334, ans=0.125 2024-06-20 06:30:51,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.93 vs. limit=22.5 2024-06-20 06:30:52,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=149724.66666666666, ans=0.025 2024-06-20 06:30:58,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=149743.0, ans=0.125 2024-06-20 06:30:59,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=149761.33333333334, ans=0.025 2024-06-20 06:30:59,927 INFO [train.py:1028] (0/2) Epoch 9, batch 750, loss[loss=0.25, simple_loss=0.2953, pruned_loss=0.1024, over 13273.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.2968, pruned_loss=0.1087, over 2528990.96 frames. ], batch size: 63, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:31:01,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149761.33333333334, ans=0.0 2024-06-20 06:31:04,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=149761.33333333334, ans=0.2 2024-06-20 06:31:07,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=149779.66666666666, ans=0.125 2024-06-20 06:31:11,607 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.806e+02 1.953e+02 2.153e+02 3.078e+02, threshold=3.906e+02, percent-clipped=0.0 2024-06-20 06:31:18,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=149798.0, ans=0.025 2024-06-20 06:31:24,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=149816.33333333334, ans=0.125 2024-06-20 06:31:28,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.55 vs. limit=15.0 2024-06-20 06:31:28,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=149834.66666666666, ans=0.125 2024-06-20 06:31:31,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=149853.0, ans=0.125 2024-06-20 06:31:32,441 INFO [train.py:1028] (0/2) Epoch 9, batch 800, loss[loss=0.2418, simple_loss=0.2838, pruned_loss=0.09987, over 12981.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2965, pruned_loss=0.1085, over 2541331.92 frames. ], batch size: 36, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:31:39,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149871.33333333334, ans=0.125 2024-06-20 06:31:40,587 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.66 vs. 
limit=22.5 2024-06-20 06:32:01,825 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0 2024-06-20 06:32:10,891 INFO [train.py:1028] (0/2) Epoch 9, batch 850, loss[loss=0.2389, simple_loss=0.2768, pruned_loss=0.1005, over 13108.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2966, pruned_loss=0.1084, over 2551410.10 frames. ], batch size: 95, lr: 6.70e-03, grad_scale: 64.0 2024-06-20 06:32:11,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=149944.66666666666, ans=0.0 2024-06-20 06:32:11,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=149944.66666666666, ans=0.125 2024-06-20 06:32:15,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.39 vs. limit=15.0 2024-06-20 06:32:16,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=149963.0, ans=0.125 2024-06-20 06:32:19,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=149963.0, ans=0.125 2024-06-20 06:32:22,344 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.805e+02 1.965e+02 2.176e+02 3.046e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 06:32:31,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149999.66666666666, ans=0.1 2024-06-20 06:32:42,840 INFO [train.py:1028] (0/2) Epoch 9, batch 900, loss[loss=0.2738, simple_loss=0.3077, pruned_loss=0.1199, over 12952.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.296, pruned_loss=0.1083, over 2556790.96 frames. ], batch size: 36, lr: 6.69e-03, grad_scale: 64.0 2024-06-20 06:32:48,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=150054.66666666666, ans=0.0 2024-06-20 06:33:02,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=150091.33333333334, ans=0.2 2024-06-20 06:33:08,247 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.235e+01 2024-06-20 06:33:12,918 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5 2024-06-20 06:33:14,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=150128.0, ans=0.0 2024-06-20 06:33:14,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=150128.0, ans=0.0 2024-06-20 06:33:15,224 INFO [train.py:1028] (0/2) Epoch 9, batch 950, loss[loss=0.2605, simple_loss=0.3038, pruned_loss=0.1086, over 12957.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.2969, pruned_loss=0.109, over 2560230.55 frames. 
], batch size: 39, lr: 6.69e-03, grad_scale: 64.0 2024-06-20 06:33:15,359 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:33:16,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=150128.0, ans=0.2 2024-06-20 06:33:23,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=150146.33333333334, ans=0.125 2024-06-20 06:33:26,409 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.854e+02 2.047e+02 2.609e+02 4.175e+02, threshold=4.093e+02, percent-clipped=3.0 2024-06-20 06:33:29,324 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.06 vs. limit=15.0 2024-06-20 06:33:32,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=150164.66666666666, ans=0.125 2024-06-20 06:33:46,553 INFO [train.py:1028] (0/2) Epoch 9, batch 1000, loss[loss=0.2606, simple_loss=0.3049, pruned_loss=0.1082, over 13352.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2962, pruned_loss=0.1086, over 2563021.95 frames. ], batch size: 49, lr: 6.69e-03, grad_scale: 64.0 2024-06-20 06:33:58,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.30 vs. limit=10.0 2024-06-20 06:34:11,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=150274.66666666666, ans=0.125 2024-06-20 06:34:13,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=150274.66666666666, ans=0.0 2024-06-20 06:34:24,558 INFO [train.py:1028] (0/2) Epoch 9, batch 1050, loss[loss=0.2617, simple_loss=0.3021, pruned_loss=0.1106, over 13158.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.2972, pruned_loss=0.1091, over 2566694.75 frames. 
], batch size: 77, lr: 6.69e-03, grad_scale: 128.0 2024-06-20 06:34:26,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=150311.33333333334, ans=0.125 2024-06-20 06:34:29,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=150311.33333333334, ans=0.125 2024-06-20 06:34:36,131 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.819e+02 1.994e+02 2.168e+02 2.832e+02, threshold=3.989e+02, percent-clipped=0.0 2024-06-20 06:34:37,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=150348.0, ans=0.0 2024-06-20 06:34:44,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=150366.33333333334, ans=0.125 2024-06-20 06:34:51,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150384.66666666666, ans=0.1 2024-06-20 06:34:51,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=150384.66666666666, ans=0.0 2024-06-20 06:34:57,185 INFO [train.py:1028] (0/2) Epoch 9, batch 1100, loss[loss=0.2363, simple_loss=0.278, pruned_loss=0.09728, over 13239.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.2973, pruned_loss=0.1088, over 2570482.36 frames. ], batch size: 52, lr: 6.69e-03, grad_scale: 128.0 2024-06-20 06:34:57,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2024-06-20 06:35:05,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=150421.33333333334, ans=0.0 2024-06-20 06:35:16,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=150458.0, ans=0.125 2024-06-20 06:35:18,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=150458.0, ans=0.125 2024-06-20 06:35:22,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=150476.33333333334, ans=0.125 2024-06-20 06:35:29,681 INFO [train.py:1028] (0/2) Epoch 9, batch 1150, loss[loss=0.2479, simple_loss=0.2942, pruned_loss=0.1008, over 13277.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.2977, pruned_loss=0.1093, over 2570774.12 frames. 
], batch size: 52, lr: 6.68e-03, grad_scale: 128.0 2024-06-20 06:35:29,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=150494.66666666666, ans=0.1 2024-06-20 06:35:29,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=150494.66666666666, ans=0.125 2024-06-20 06:35:34,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=150494.66666666666, ans=0.125 2024-06-20 06:35:38,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=150513.0, ans=0.2 2024-06-20 06:35:40,582 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.821e+02 1.969e+02 2.222e+02 2.999e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-20 06:35:51,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=150531.33333333334, ans=15.0 2024-06-20 06:35:59,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=150549.66666666666, ans=0.125 2024-06-20 06:36:02,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=22.5 2024-06-20 06:36:04,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=150568.0, ans=0.0 2024-06-20 06:36:07,617 INFO [train.py:1028] (0/2) Epoch 9, batch 1200, loss[loss=0.2509, simple_loss=0.294, pruned_loss=0.1039, over 13144.00 frames. ], tot_loss[loss=0.258, simple_loss=0.2973, pruned_loss=0.1093, over 2572795.30 frames. ], batch size: 77, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:36:08,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=150586.33333333334, ans=0.125 2024-06-20 06:36:20,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150623.0, ans=0.1 2024-06-20 06:36:27,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=150641.33333333334, ans=0.125 2024-06-20 06:36:28,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=150641.33333333334, ans=0.125 2024-06-20 06:36:35,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=150659.66666666666, ans=0.07 2024-06-20 06:36:39,355 INFO [train.py:1028] (0/2) Epoch 9, batch 1250, loss[loss=0.2373, simple_loss=0.2778, pruned_loss=0.09839, over 13201.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2966, pruned_loss=0.1089, over 2583098.03 frames. 
], batch size: 112, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:36:50,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=150696.33333333334, ans=22.5 2024-06-20 06:36:51,519 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.857e+02 1.957e+02 2.192e+02 3.105e+02, threshold=3.915e+02, percent-clipped=0.0 2024-06-20 06:36:52,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.83 vs. limit=10.0 2024-06-20 06:37:01,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=12.0 2024-06-20 06:37:06,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=150751.33333333334, ans=0.125 2024-06-20 06:37:08,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150751.33333333334, ans=0.1 2024-06-20 06:37:11,578 INFO [train.py:1028] (0/2) Epoch 9, batch 1300, loss[loss=0.296, simple_loss=0.3253, pruned_loss=0.1333, over 12795.00 frames. ], tot_loss[loss=0.258, simple_loss=0.2973, pruned_loss=0.1093, over 2583426.59 frames. ], batch size: 176, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:37:13,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=150769.66666666666, ans=15.0 2024-06-20 06:37:13,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2024-06-20 06:37:27,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=150806.33333333334, ans=0.125 2024-06-20 06:37:34,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150824.66666666666, ans=0.1 2024-06-20 06:37:42,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=150843.0, ans=0.125 2024-06-20 06:37:44,157 INFO [train.py:1028] (0/2) Epoch 9, batch 1350, loss[loss=0.246, simple_loss=0.2902, pruned_loss=0.1009, over 13225.00 frames. ], tot_loss[loss=0.256, simple_loss=0.2956, pruned_loss=0.1082, over 2586143.75 frames. ], batch size: 59, lr: 6.68e-03, grad_scale: 64.0 2024-06-20 06:37:48,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=150861.33333333334, ans=0.125 2024-06-20 06:37:49,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.74 vs. 
limit=6.0 2024-06-20 06:37:59,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=150879.66666666666, ans=0.125 2024-06-20 06:38:02,654 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 1.857e+02 2.030e+02 2.265e+02 2.900e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-20 06:38:03,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=150898.0, ans=0.0 2024-06-20 06:38:05,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=150898.0, ans=0.125 2024-06-20 06:38:08,030 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.34 vs. limit=15.0 2024-06-20 06:38:08,061 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.08 vs. limit=15.0 2024-06-20 06:38:13,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=150916.33333333334, ans=0.2 2024-06-20 06:38:18,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=150934.66666666666, ans=0.125 2024-06-20 06:38:23,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2024-06-20 06:38:23,567 INFO [train.py:1028] (0/2) Epoch 9, batch 1400, loss[loss=0.2639, simple_loss=0.3002, pruned_loss=0.1139, over 12845.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.2962, pruned_loss=0.1088, over 2588008.98 frames. ], batch size: 26, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:38:23,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=150953.0, ans=0.0 2024-06-20 06:38:24,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150953.0, ans=0.1 2024-06-20 06:38:51,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=151026.33333333334, ans=0.5 2024-06-20 06:38:56,338 INFO [train.py:1028] (0/2) Epoch 9, batch 1450, loss[loss=0.2731, simple_loss=0.3005, pruned_loss=0.1228, over 13124.00 frames. ], tot_loss[loss=0.257, simple_loss=0.2961, pruned_loss=0.1089, over 2588141.30 frames. ], batch size: 121, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:39:01,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.09 vs. limit=15.0 2024-06-20 06:39:09,221 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.860e+02 1.990e+02 2.165e+02 3.156e+02, threshold=3.980e+02, percent-clipped=0.0 2024-06-20 06:39:12,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=151081.33333333334, ans=0.95 2024-06-20 06:39:12,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.41 vs. 
limit=15.0 2024-06-20 06:39:17,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2024-06-20 06:39:22,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151118.0, ans=0.125 2024-06-20 06:39:27,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=151118.0, ans=0.125 2024-06-20 06:39:29,075 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=20.90 vs. limit=15.0 2024-06-20 06:39:30,009 INFO [train.py:1028] (0/2) Epoch 9, batch 1500, loss[loss=0.2574, simple_loss=0.2984, pruned_loss=0.1082, over 13269.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.2966, pruned_loss=0.1092, over 2590715.34 frames. ], batch size: 83, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:39:36,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151154.66666666666, ans=0.1 2024-06-20 06:39:49,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151173.0, ans=0.1 2024-06-20 06:40:09,693 INFO [train.py:1028] (0/2) Epoch 9, batch 1550, loss[loss=0.2558, simple_loss=0.2879, pruned_loss=0.1119, over 13069.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.2972, pruned_loss=0.1096, over 2586210.01 frames. ], batch size: 102, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:40:12,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=151228.0, ans=0.125 2024-06-20 06:40:15,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151228.0, ans=0.1 2024-06-20 06:40:15,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.98 vs. limit=10.0 2024-06-20 06:40:22,115 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.829e+02 1.950e+02 2.121e+02 3.084e+02, threshold=3.900e+02, percent-clipped=0.0 2024-06-20 06:40:24,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.24 vs. limit=10.0 2024-06-20 06:40:26,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=151264.66666666666, ans=0.125 2024-06-20 06:40:29,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.49 vs. limit=22.5 2024-06-20 06:40:33,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=151283.0, ans=0.2 2024-06-20 06:40:36,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2024-06-20 06:40:42,016 INFO [train.py:1028] (0/2) Epoch 9, batch 1600, loss[loss=0.268, simple_loss=0.3073, pruned_loss=0.1144, over 13144.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.2973, pruned_loss=0.1098, over 2580768.82 frames. 
], batch size: 77, lr: 6.67e-03, grad_scale: 64.0 2024-06-20 06:40:52,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=151338.0, ans=0.125 2024-06-20 06:40:54,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2024-06-20 06:40:56,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2024-06-20 06:41:03,646 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:41:07,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.35 vs. limit=15.0 2024-06-20 06:41:11,081 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2024-06-20 06:41:11,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2024-06-20 06:41:13,829 INFO [train.py:1028] (0/2) Epoch 9, batch 1650, loss[loss=0.2647, simple_loss=0.3032, pruned_loss=0.1131, over 13204.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.2976, pruned_loss=0.11, over 2576147.04 frames. ], batch size: 95, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:41:17,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=151411.33333333334, ans=0.2 2024-06-20 06:41:24,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=151429.66666666666, ans=0.025 2024-06-20 06:41:25,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.11 vs. limit=15.0 2024-06-20 06:41:26,066 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 1.848e+02 1.997e+02 2.191e+02 2.862e+02, threshold=3.994e+02, percent-clipped=0.0 2024-06-20 06:41:27,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=151448.0, ans=0.0 2024-06-20 06:41:34,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=151466.33333333334, ans=0.2 2024-06-20 06:41:48,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=151484.66666666666, ans=0.125 2024-06-20 06:41:49,674 INFO [train.py:1028] (0/2) Epoch 9, batch 1700, loss[loss=0.24, simple_loss=0.2884, pruned_loss=0.09576, over 12493.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.2975, pruned_loss=0.1092, over 2581138.84 frames. 
], batch size: 25, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:41:49,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=151503.0, ans=0.125 2024-06-20 06:41:51,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=151503.0, ans=0.2 2024-06-20 06:41:52,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=151503.0, ans=0.0 2024-06-20 06:42:06,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=151539.66666666666, ans=0.125 2024-06-20 06:42:10,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=22.5 2024-06-20 06:42:16,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151558.0, ans=0.125 2024-06-20 06:42:16,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=151558.0, ans=0.125 2024-06-20 06:42:24,882 INFO [train.py:1028] (0/2) Epoch 9, batch 1750, loss[loss=0.2418, simple_loss=0.2954, pruned_loss=0.09405, over 12373.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.2971, pruned_loss=0.1088, over 2581499.71 frames. ], batch size: 22, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:42:28,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.21 vs. limit=10.0 2024-06-20 06:42:29,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=151594.66666666666, ans=0.025 2024-06-20 06:42:37,185 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.790e+02 1.965e+02 2.187e+02 2.836e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 06:42:45,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=151649.66666666666, ans=0.125 2024-06-20 06:42:53,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=151668.0, ans=0.2 2024-06-20 06:42:54,382 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.87 vs. limit=22.5 2024-06-20 06:42:55,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=151668.0, ans=0.1 2024-06-20 06:42:57,354 INFO [train.py:1028] (0/2) Epoch 9, batch 1800, loss[loss=0.2526, simple_loss=0.2966, pruned_loss=0.1043, over 13192.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.2973, pruned_loss=0.1088, over 2581236.42 frames. 
], batch size: 67, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:42:58,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=151686.33333333334, ans=0.1 2024-06-20 06:43:01,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=151686.33333333334, ans=0.0 2024-06-20 06:43:04,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=151704.66666666666, ans=0.0 2024-06-20 06:43:06,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0 2024-06-20 06:43:11,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=12.0 2024-06-20 06:43:13,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=151723.0, ans=0.125 2024-06-20 06:43:26,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=151759.66666666666, ans=0.0 2024-06-20 06:43:29,827 INFO [train.py:1028] (0/2) Epoch 9, batch 1850, loss[loss=0.2554, simple_loss=0.2934, pruned_loss=0.1086, over 13176.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.2973, pruned_loss=0.1088, over 2582876.82 frames. ], batch size: 83, lr: 6.66e-03, grad_scale: 64.0 2024-06-20 06:43:30,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=151778.0, ans=0.2 2024-06-20 06:43:36,057 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:43:42,631 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.882e+02 2.028e+02 2.295e+02 3.513e+02, threshold=4.056e+02, percent-clipped=0.0 2024-06-20 06:43:43,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.65 vs. limit=15.0 2024-06-20 06:43:53,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=151833.0, ans=0.2 2024-06-20 06:44:06,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=151851.33333333334, ans=0.125 2024-06-20 06:44:08,309 INFO [train.py:1028] (0/2) Epoch 9, batch 1900, loss[loss=0.2448, simple_loss=0.2835, pruned_loss=0.1031, over 13190.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.2966, pruned_loss=0.1088, over 2585642.76 frames. ], batch size: 95, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:44:09,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=151869.66666666666, ans=0.125 2024-06-20 06:44:10,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.02 vs. 
limit=15.0 2024-06-20 06:44:16,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=151888.0, ans=0.07 2024-06-20 06:44:16,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=151888.0, ans=0.125 2024-06-20 06:44:23,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=151906.33333333334, ans=0.0 2024-06-20 06:44:31,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=151924.66666666666, ans=0.125 2024-06-20 06:44:37,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151943.0, ans=0.125 2024-06-20 06:44:40,817 INFO [train.py:1028] (0/2) Epoch 9, batch 1950, loss[loss=0.2437, simple_loss=0.2882, pruned_loss=0.09955, over 13254.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.296, pruned_loss=0.1088, over 2591286.37 frames. ], batch size: 52, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:44:44,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=151961.33333333334, ans=0.0 2024-06-20 06:44:48,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=151979.66666666666, ans=10.0 2024-06-20 06:44:50,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=12.0 2024-06-20 06:44:53,324 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.799e+02 1.909e+02 2.051e+02 3.642e+02, threshold=3.819e+02, percent-clipped=0.0 2024-06-20 06:45:02,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=152016.33333333334, ans=0.125 2024-06-20 06:45:06,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=152034.66666666666, ans=0.025 2024-06-20 06:45:11,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152034.66666666666, ans=0.1 2024-06-20 06:45:11,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=152034.66666666666, ans=0.125 2024-06-20 06:45:13,321 INFO [train.py:1028] (0/2) Epoch 9, batch 2000, loss[loss=0.3045, simple_loss=0.3423, pruned_loss=0.1334, over 12601.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2963, pruned_loss=0.1091, over 2587434.25 frames. ], batch size: 22, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:45:13,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152053.0, ans=0.1 2024-06-20 06:45:14,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=152053.0, ans=0.125 2024-06-20 06:45:34,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.76 vs. 
limit=15.0 2024-06-20 06:45:40,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=152108.0, ans=0.0 2024-06-20 06:45:40,484 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.90 vs. limit=10.0 2024-06-20 06:45:44,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152126.33333333334, ans=0.1 2024-06-20 06:45:48,999 INFO [train.py:1028] (0/2) Epoch 9, batch 2050, loss[loss=0.307, simple_loss=0.3482, pruned_loss=0.1329, over 12601.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.2967, pruned_loss=0.1095, over 2583605.46 frames. ], batch size: 29, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:46:03,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.06 vs. limit=6.0 2024-06-20 06:46:05,256 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.831e+02 1.980e+02 2.192e+02 2.951e+02, threshold=3.960e+02, percent-clipped=0.0 2024-06-20 06:46:25,262 INFO [train.py:1028] (0/2) Epoch 9, batch 2100, loss[loss=0.244, simple_loss=0.2901, pruned_loss=0.09893, over 13192.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.2967, pruned_loss=0.1089, over 2586788.47 frames. ], batch size: 59, lr: 6.65e-03, grad_scale: 64.0 2024-06-20 06:46:25,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152236.33333333334, ans=0.1 2024-06-20 06:46:29,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=152236.33333333334, ans=0.5 2024-06-20 06:46:34,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.85 vs. limit=22.5 2024-06-20 06:46:35,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2024-06-20 06:46:37,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.78 vs. limit=12.0 2024-06-20 06:46:45,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=152291.33333333334, ans=0.125 2024-06-20 06:46:47,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152291.33333333334, ans=0.1 2024-06-20 06:46:52,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152309.66666666666, ans=0.1 2024-06-20 06:46:57,909 INFO [train.py:1028] (0/2) Epoch 9, batch 2150, loss[loss=0.2618, simple_loss=0.3032, pruned_loss=0.1102, over 13199.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.2961, pruned_loss=0.1085, over 2590481.80 frames. 
], batch size: 52, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:46:58,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=152328.0, ans=10.0 2024-06-20 06:46:59,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=152328.0, ans=0.0 2024-06-20 06:47:02,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.05 vs. limit=15.0 2024-06-20 06:47:10,482 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.802e+02 2.027e+02 2.254e+02 3.096e+02, threshold=4.054e+02, percent-clipped=0.0 2024-06-20 06:47:13,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=152364.66666666666, ans=0.125 2024-06-20 06:47:13,437 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 06:47:18,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=152383.0, ans=0.125 2024-06-20 06:47:23,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152383.0, ans=0.1 2024-06-20 06:47:23,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=152383.0, ans=0.2 2024-06-20 06:47:30,706 INFO [train.py:1028] (0/2) Epoch 9, batch 2200, loss[loss=0.2543, simple_loss=0.2857, pruned_loss=0.1115, over 13200.00 frames. ], tot_loss[loss=0.257, simple_loss=0.2965, pruned_loss=0.1087, over 2590452.57 frames. ], batch size: 83, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:47:37,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=19.77 vs. limit=15.0 2024-06-20 06:47:40,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=152438.0, ans=0.0 2024-06-20 06:48:01,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152493.0, ans=0.125 2024-06-20 06:48:07,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=152493.0, ans=0.09899494936611666 2024-06-20 06:48:08,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=152511.33333333334, ans=0.125 2024-06-20 06:48:08,924 INFO [train.py:1028] (0/2) Epoch 9, batch 2250, loss[loss=0.2453, simple_loss=0.2894, pruned_loss=0.1006, over 13258.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.2964, pruned_loss=0.1082, over 2589463.30 frames. ], batch size: 63, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:48:09,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.59 vs. limit=22.5 2024-06-20 06:48:21,381 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.782e+02 1.903e+02 2.055e+02 2.931e+02, threshold=3.805e+02, percent-clipped=0.0 2024-06-20 06:48:42,122 INFO [train.py:1028] (0/2) Epoch 9, batch 2300, loss[loss=0.2558, simple_loss=0.2958, pruned_loss=0.1079, over 12859.00 frames. 
], tot_loss[loss=0.2569, simple_loss=0.2969, pruned_loss=0.1084, over 2582690.23 frames. ], batch size: 33, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:48:42,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.30 vs. limit=15.0 2024-06-20 06:49:14,832 INFO [train.py:1028] (0/2) Epoch 9, batch 2350, loss[loss=0.2583, simple_loss=0.2957, pruned_loss=0.1105, over 13158.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.2969, pruned_loss=0.1088, over 2586628.48 frames. ], batch size: 67, lr: 6.64e-03, grad_scale: 64.0 2024-06-20 06:49:22,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=152713.0, ans=0.125 2024-06-20 06:49:23,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=152713.0, ans=0.2 2024-06-20 06:49:24,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=152713.0, ans=0.0 2024-06-20 06:49:27,026 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.895e+02 2.085e+02 2.312e+02 2.820e+02, threshold=4.170e+02, percent-clipped=0.0 2024-06-20 06:49:28,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0 2024-06-20 06:49:33,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=152749.66666666666, ans=0.1 2024-06-20 06:49:36,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=152749.66666666666, ans=0.0 2024-06-20 06:49:41,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=152768.0, ans=0.125 2024-06-20 06:49:42,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=152768.0, ans=0.0 2024-06-20 06:49:50,077 INFO [train.py:1028] (0/2) Epoch 9, batch 2400, loss[loss=0.2578, simple_loss=0.3024, pruned_loss=0.1066, over 13284.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.2957, pruned_loss=0.1081, over 2588823.07 frames. ], batch size: 46, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:50:11,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=152823.0, ans=0.0 2024-06-20 06:50:24,771 INFO [train.py:1028] (0/2) Epoch 9, batch 2450, loss[loss=0.261, simple_loss=0.3055, pruned_loss=0.1083, over 13263.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.2951, pruned_loss=0.1082, over 2585522.30 frames. 
], batch size: 63, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:50:29,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=152878.0, ans=0.025 2024-06-20 06:50:36,992 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 1.896e+02 2.041e+02 2.242e+02 2.984e+02, threshold=4.081e+02, percent-clipped=0.0 2024-06-20 06:50:38,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=152914.66666666666, ans=0.1 2024-06-20 06:50:44,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.74 vs. limit=15.0 2024-06-20 06:50:57,373 INFO [train.py:1028] (0/2) Epoch 9, batch 2500, loss[loss=0.2358, simple_loss=0.2725, pruned_loss=0.09957, over 13174.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.2937, pruned_loss=0.1075, over 2588116.74 frames. ], batch size: 83, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:50:58,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=152969.66666666666, ans=0.025 2024-06-20 06:51:03,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=152969.66666666666, ans=0.0 2024-06-20 06:51:03,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=152988.0, ans=0.125 2024-06-20 06:51:04,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=152988.0, ans=0.2 2024-06-20 06:51:04,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2024-06-20 06:51:12,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=153006.33333333334, ans=0.04949747468305833 2024-06-20 06:51:12,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.67 vs. limit=22.5 2024-06-20 06:51:14,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.89 vs. limit=15.0 2024-06-20 06:51:23,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153043.0, ans=0.1 2024-06-20 06:51:29,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.30 vs. limit=22.5 2024-06-20 06:51:30,216 INFO [train.py:1028] (0/2) Epoch 9, batch 2550, loss[loss=0.2376, simple_loss=0.2877, pruned_loss=0.09372, over 12729.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.2931, pruned_loss=0.1074, over 2587244.44 frames. 
], batch size: 22, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:51:43,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=153079.66666666666, ans=0.125 2024-06-20 06:51:45,466 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.891e+02 2.094e+02 2.348e+02 3.499e+02, threshold=4.189e+02, percent-clipped=0.0 2024-06-20 06:51:54,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=153098.0, ans=0.0 2024-06-20 06:51:57,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153116.33333333334, ans=0.1 2024-06-20 06:52:01,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=153116.33333333334, ans=0.0 2024-06-20 06:52:07,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153134.66666666666, ans=0.0 2024-06-20 06:52:09,496 INFO [train.py:1028] (0/2) Epoch 9, batch 2600, loss[loss=0.2493, simple_loss=0.294, pruned_loss=0.1023, over 13236.00 frames. ], tot_loss[loss=0.253, simple_loss=0.2919, pruned_loss=0.107, over 2586689.40 frames. ], batch size: 52, lr: 6.63e-03, grad_scale: 64.0 2024-06-20 06:52:19,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=153171.33333333334, ans=0.125 2024-06-20 06:52:23,778 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.04 vs. limit=15.0 2024-06-20 06:52:33,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153208.0, ans=0.1 2024-06-20 06:52:39,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=153226.33333333334, ans=0.125 2024-06-20 06:52:42,333 INFO [train.py:1028] (0/2) Epoch 9, batch 2650, loss[loss=0.2383, simple_loss=0.2725, pruned_loss=0.1021, over 13033.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.2899, pruned_loss=0.1064, over 2587034.40 frames. ], batch size: 144, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:52:54,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2024-06-20 06:52:54,523 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.864e+02 2.033e+02 2.277e+02 2.772e+02, threshold=4.065e+02, percent-clipped=0.0 2024-06-20 06:52:56,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=153281.33333333334, ans=0.125 2024-06-20 06:53:00,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=153299.66666666666, ans=0.2 2024-06-20 06:53:04,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=153299.66666666666, ans=0.0 2024-06-20 06:53:09,011 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.39 vs. 
limit=15.0 2024-06-20 06:53:12,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=153318.0, ans=0.2 2024-06-20 06:53:14,633 INFO [train.py:1028] (0/2) Epoch 9, batch 2700, loss[loss=0.2522, simple_loss=0.2865, pruned_loss=0.109, over 13274.00 frames. ], tot_loss[loss=0.25, simple_loss=0.2883, pruned_loss=0.1058, over 2584235.73 frames. ], batch size: 89, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:53:19,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.99 vs. limit=15.0 2024-06-20 06:53:23,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=153354.66666666666, ans=0.2 2024-06-20 06:53:25,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2024-06-20 06:53:25,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=153354.66666666666, ans=0.125 2024-06-20 06:53:26,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=153354.66666666666, ans=0.2 2024-06-20 06:53:41,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=153391.33333333334, ans=12.0 2024-06-20 06:53:50,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=153409.66666666666, ans=0.125 2024-06-20 06:53:51,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.61 vs. limit=10.0 2024-06-20 06:53:53,537 INFO [train.py:1028] (0/2) Epoch 9, batch 2750, loss[loss=0.2444, simple_loss=0.2794, pruned_loss=0.1047, over 13273.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2875, pruned_loss=0.1051, over 2581028.74 frames. ], batch size: 43, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:54:05,988 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.774e+02 1.902e+02 2.055e+02 2.972e+02, threshold=3.803e+02, percent-clipped=0.0 2024-06-20 06:54:08,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=153464.66666666666, ans=0.125 2024-06-20 06:54:26,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=153519.66666666666, ans=0.125 2024-06-20 06:54:26,640 INFO [train.py:1028] (0/2) Epoch 9, batch 2800, loss[loss=0.263, simple_loss=0.2867, pruned_loss=0.1196, over 10718.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.287, pruned_loss=0.1051, over 2578926.17 frames. 
], batch size: 303, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:54:34,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153538.0, ans=0.1 2024-06-20 06:54:40,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153556.33333333334, ans=0.1 2024-06-20 06:54:48,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=153574.66666666666, ans=0.125 2024-06-20 06:54:50,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=153574.66666666666, ans=0.09899494936611666 2024-06-20 06:54:58,297 INFO [train.py:1028] (0/2) Epoch 9, batch 2850, loss[loss=0.2434, simple_loss=0.2866, pruned_loss=0.1001, over 13313.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.286, pruned_loss=0.1048, over 2576539.55 frames. ], batch size: 49, lr: 6.62e-03, grad_scale: 64.0 2024-06-20 06:55:10,599 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.766e+02 1.919e+02 2.055e+02 2.610e+02, threshold=3.839e+02, percent-clipped=0.0 2024-06-20 06:55:12,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=153648.0, ans=0.0 2024-06-20 06:55:23,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=153666.33333333334, ans=0.025 2024-06-20 06:55:25,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=153666.33333333334, ans=0.0 2024-06-20 06:55:26,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=153684.66666666666, ans=0.125 2024-06-20 06:55:27,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=153684.66666666666, ans=0.125 2024-06-20 06:55:28,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=15.0 2024-06-20 06:55:28,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=153684.66666666666, ans=0.2 2024-06-20 06:55:33,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=153684.66666666666, ans=0.1 2024-06-20 06:55:36,254 INFO [train.py:1028] (0/2) Epoch 9, batch 2900, loss[loss=0.2343, simple_loss=0.2781, pruned_loss=0.09519, over 13212.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.2834, pruned_loss=0.1035, over 2585255.47 frames. ], batch size: 55, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:55:37,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.86 vs. limit=22.5 2024-06-20 06:55:40,300 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.83 vs. 
limit=15.0 2024-06-20 06:55:43,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=153721.33333333334, ans=0.0 2024-06-20 06:55:53,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=153739.66666666666, ans=0.0 2024-06-20 06:55:54,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=153739.66666666666, ans=0.125 2024-06-20 06:55:57,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=153758.0, ans=0.125 2024-06-20 06:56:09,433 INFO [train.py:1028] (0/2) Epoch 9, batch 2950, loss[loss=0.2548, simple_loss=0.2909, pruned_loss=0.1093, over 13213.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.2835, pruned_loss=0.1036, over 2579438.57 frames. ], batch size: 43, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:56:12,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=153794.66666666666, ans=0.125 2024-06-20 06:56:15,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=153813.0, ans=0.2 2024-06-20 06:56:15,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=153813.0, ans=0.0 2024-06-20 06:56:22,635 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.756e+02 1.864e+02 2.050e+02 3.041e+02, threshold=3.727e+02, percent-clipped=0.0 2024-06-20 06:56:25,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=153831.33333333334, ans=0.09899494936611666 2024-06-20 06:56:30,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=153849.66666666666, ans=0.125 2024-06-20 06:56:30,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=153849.66666666666, ans=0.0 2024-06-20 06:56:33,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=153849.66666666666, ans=0.0 2024-06-20 06:56:35,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=153849.66666666666, ans=0.2 2024-06-20 06:56:43,377 INFO [train.py:1028] (0/2) Epoch 9, batch 3000, loss[loss=0.2319, simple_loss=0.2706, pruned_loss=0.09662, over 13161.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2817, pruned_loss=0.1026, over 2579651.07 frames. ], batch size: 59, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:56:43,378 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 06:56:51,193 INFO [train.py:1060] (0/2) Epoch 9, validation: loss=0.2027, simple_loss=0.2657, pruned_loss=0.06989, over 351949.00 frames. 
2024-06-20 06:56:51,194 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17148MB 2024-06-20 06:56:56,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=153886.33333333334, ans=0.025 2024-06-20 06:56:56,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=153886.33333333334, ans=0.025 2024-06-20 06:57:03,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=153904.66666666666, ans=0.125 2024-06-20 06:57:03,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=153923.0, ans=0.125 2024-06-20 06:57:04,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2024-06-20 06:57:10,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=153941.33333333334, ans=0.125 2024-06-20 06:57:15,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=153941.33333333334, ans=0.0 2024-06-20 06:57:16,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=153941.33333333334, ans=0.125 2024-06-20 06:57:17,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153941.33333333334, ans=0.1 2024-06-20 06:57:19,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=153941.33333333334, ans=0.025 2024-06-20 06:57:20,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=153941.33333333334, ans=0.07 2024-06-20 06:57:27,855 INFO [train.py:1028] (0/2) Epoch 9, batch 3050, loss[loss=0.2147, simple_loss=0.254, pruned_loss=0.08776, over 13323.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2805, pruned_loss=0.1025, over 2579580.43 frames. ], batch size: 46, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:57:28,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.82 vs. limit=22.5 2024-06-20 06:57:32,486 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.08 vs. limit=22.5 2024-06-20 06:57:35,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=153978.0, ans=0.015 2024-06-20 06:57:37,899 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-84000.pt 2024-06-20 06:57:44,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=153996.33333333334, ans=0.2 2024-06-20 06:57:46,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.03 vs. 
limit=22.5 2024-06-20 06:57:48,358 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.787e+02 1.932e+02 2.125e+02 2.848e+02, threshold=3.865e+02, percent-clipped=0.0 2024-06-20 06:57:56,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.12 vs. limit=15.0 2024-06-20 06:57:58,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=154033.0, ans=0.125 2024-06-20 06:57:59,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=154033.0, ans=0.2 2024-06-20 06:57:59,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=154033.0, ans=22.5 2024-06-20 06:58:08,573 INFO [train.py:1028] (0/2) Epoch 9, batch 3100, loss[loss=0.2453, simple_loss=0.2779, pruned_loss=0.1063, over 13051.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.2801, pruned_loss=0.1022, over 2580338.40 frames. ], batch size: 144, lr: 6.61e-03, grad_scale: 64.0 2024-06-20 06:58:22,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154106.33333333334, ans=0.125 2024-06-20 06:58:24,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.79 vs. limit=15.0 2024-06-20 06:58:38,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=154143.0, ans=0.5 2024-06-20 06:58:41,504 INFO [train.py:1028] (0/2) Epoch 9, batch 3150, loss[loss=0.2229, simple_loss=0.2607, pruned_loss=0.09253, over 12934.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2788, pruned_loss=0.1013, over 2582752.96 frames. ], batch size: 158, lr: 6.60e-03, grad_scale: 64.0 2024-06-20 06:58:48,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=154179.66666666666, ans=0.0 2024-06-20 06:58:48,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=154179.66666666666, ans=0.0 2024-06-20 06:58:53,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=154179.66666666666, ans=0.0 2024-06-20 06:58:54,073 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.782e+02 1.917e+02 2.096e+02 3.057e+02, threshold=3.834e+02, percent-clipped=0.0 2024-06-20 06:58:59,791 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.29 vs. limit=15.0 2024-06-20 06:59:14,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-20 06:59:14,387 INFO [train.py:1028] (0/2) Epoch 9, batch 3200, loss[loss=0.2626, simple_loss=0.2963, pruned_loss=0.1145, over 13194.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2789, pruned_loss=0.1015, over 2583000.05 frames. 
], batch size: 55, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 06:59:48,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=154326.33333333334, ans=0.0 2024-06-20 06:59:52,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=154326.33333333334, ans=0.2 2024-06-20 06:59:53,738 INFO [train.py:1028] (0/2) Epoch 9, batch 3250, loss[loss=0.2479, simple_loss=0.2868, pruned_loss=0.1045, over 13182.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.2787, pruned_loss=0.1017, over 2586698.41 frames. ], batch size: 72, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 06:59:57,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=154344.66666666666, ans=0.0 2024-06-20 06:59:59,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.73 vs. limit=22.5 2024-06-20 07:00:06,854 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.788e+02 1.947e+02 2.164e+02 2.876e+02, threshold=3.894e+02, percent-clipped=0.0 2024-06-20 07:00:07,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=154381.33333333334, ans=0.015 2024-06-20 07:00:13,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=154381.33333333334, ans=0.125 2024-06-20 07:00:25,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.69 vs. limit=15.0 2024-06-20 07:00:26,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154418.0, ans=0.125 2024-06-20 07:00:28,889 INFO [train.py:1028] (0/2) Epoch 9, batch 3300, loss[loss=0.2531, simple_loss=0.2888, pruned_loss=0.1087, over 12801.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2782, pruned_loss=0.1011, over 2582522.92 frames. ], batch size: 176, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 07:00:32,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=154436.33333333334, ans=0.125 2024-06-20 07:00:32,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=154436.33333333334, ans=0.125 2024-06-20 07:00:35,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=154454.66666666666, ans=0.5 2024-06-20 07:00:40,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=154454.66666666666, ans=0.1 2024-06-20 07:00:44,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=154473.0, ans=0.125 2024-06-20 07:00:45,968 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=2.800e+01 2024-06-20 07:00:50,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. 
limit=15.0 2024-06-20 07:00:52,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=154491.33333333334, ans=0.0 2024-06-20 07:01:00,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.73 vs. limit=15.0 2024-06-20 07:01:02,627 INFO [train.py:1028] (0/2) Epoch 9, batch 3350, loss[loss=0.238, simple_loss=0.2727, pruned_loss=0.1016, over 12906.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2779, pruned_loss=0.1014, over 2576896.32 frames. ], batch size: 158, lr: 6.60e-03, grad_scale: 128.0 2024-06-20 07:01:18,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=154546.33333333334, ans=0.025 2024-06-20 07:01:19,744 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.827e+02 1.974e+02 2.272e+02 3.567e+02, threshold=3.947e+02, percent-clipped=0.0 2024-06-20 07:01:27,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=154564.66666666666, ans=0.025 2024-06-20 07:01:35,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=154583.0, ans=0.2 2024-06-20 07:01:36,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=154583.0, ans=0.0 2024-06-20 07:01:44,282 INFO [train.py:1028] (0/2) Epoch 9, batch 3400, loss[loss=0.2293, simple_loss=0.273, pruned_loss=0.09284, over 12660.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2776, pruned_loss=0.1015, over 2574519.51 frames. ], batch size: 22, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:01:44,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=154619.66666666666, ans=0.2 2024-06-20 07:01:56,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=154638.0, ans=0.025 2024-06-20 07:02:17,452 INFO [train.py:1028] (0/2) Epoch 9, batch 3450, loss[loss=0.2486, simple_loss=0.285, pruned_loss=0.1061, over 12727.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2772, pruned_loss=0.1013, over 2576205.05 frames. ], batch size: 176, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:02:22,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.92 vs. 
limit=15.0 2024-06-20 07:02:27,913 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:02:29,708 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.772e+02 1.848e+02 2.103e+02 3.399e+02, threshold=3.695e+02, percent-clipped=0.0 2024-06-20 07:02:34,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=154748.0, ans=0.125 2024-06-20 07:02:36,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=154766.33333333334, ans=0.125 2024-06-20 07:02:40,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=154766.33333333334, ans=0.09899494936611666 2024-06-20 07:02:48,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=154784.66666666666, ans=0.07 2024-06-20 07:02:50,489 INFO [train.py:1028] (0/2) Epoch 9, batch 3500, loss[loss=0.2301, simple_loss=0.2768, pruned_loss=0.09169, over 12983.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2765, pruned_loss=0.1006, over 2576053.36 frames. ], batch size: 33, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:03:04,511 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.161e+02 2024-06-20 07:03:25,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.20 vs. limit=15.0 2024-06-20 07:03:27,519 INFO [train.py:1028] (0/2) Epoch 9, batch 3550, loss[loss=0.2238, simple_loss=0.2596, pruned_loss=0.09401, over 13137.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.2752, pruned_loss=0.1001, over 2577525.07 frames. ], batch size: 95, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:03:38,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=154913.0, ans=0.125 2024-06-20 07:03:43,312 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.748e+02 1.921e+02 2.140e+02 2.883e+02, threshold=3.841e+02, percent-clipped=0.0 2024-06-20 07:03:47,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=154931.33333333334, ans=0.0 2024-06-20 07:03:53,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2024-06-20 07:03:54,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=19.42 vs. limit=15.0 2024-06-20 07:04:04,284 INFO [train.py:1028] (0/2) Epoch 9, batch 3600, loss[loss=0.2025, simple_loss=0.2591, pruned_loss=0.07295, over 13284.00 frames. ], tot_loss[loss=0.237, simple_loss=0.2745, pruned_loss=0.09972, over 2580609.14 frames. 
], batch size: 49, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:04:10,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=155004.66666666666, ans=0.125 2024-06-20 07:04:19,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=155023.0, ans=0.125 2024-06-20 07:04:19,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.86 vs. limit=12.0 2024-06-20 07:04:28,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.00 vs. limit=15.0 2024-06-20 07:04:30,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=155059.66666666666, ans=0.125 2024-06-20 07:04:32,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=155059.66666666666, ans=0.125 2024-06-20 07:04:34,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=155059.66666666666, ans=0.125 2024-06-20 07:04:37,474 INFO [train.py:1028] (0/2) Epoch 9, batch 3650, loss[loss=0.2475, simple_loss=0.277, pruned_loss=0.109, over 13036.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2743, pruned_loss=0.09942, over 2580352.29 frames. ], batch size: 102, lr: 6.59e-03, grad_scale: 128.0 2024-06-20 07:04:46,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.37 vs. limit=15.0 2024-06-20 07:04:49,819 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.717e+02 1.876e+02 2.016e+02 2.652e+02, threshold=3.753e+02, percent-clipped=0.0 2024-06-20 07:05:10,460 INFO [train.py:1028] (0/2) Epoch 9, batch 3700, loss[loss=0.1985, simple_loss=0.2474, pruned_loss=0.07481, over 13238.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2736, pruned_loss=0.09903, over 2584189.45 frames. ], batch size: 72, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:05:13,465 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.11 vs. limit=22.5 2024-06-20 07:05:48,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=155243.0, ans=0.2 2024-06-20 07:05:49,318 INFO [train.py:1028] (0/2) Epoch 9, batch 3750, loss[loss=0.2404, simple_loss=0.2729, pruned_loss=0.1039, over 12513.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.273, pruned_loss=0.09861, over 2586598.92 frames. 
], batch size: 22, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:05:54,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=155261.33333333334, ans=0.125 2024-06-20 07:05:54,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=155261.33333333334, ans=0.0 2024-06-20 07:06:01,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=155279.66666666666, ans=0.0 2024-06-20 07:06:01,808 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.807e+02 1.963e+02 2.203e+02 3.308e+02, threshold=3.925e+02, percent-clipped=0.0 2024-06-20 07:06:08,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=155316.33333333334, ans=0.125 2024-06-20 07:06:09,108 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.909e+01 2024-06-20 07:06:10,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=155316.33333333334, ans=0.0 2024-06-20 07:06:18,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=155334.66666666666, ans=0.125 2024-06-20 07:06:21,888 INFO [train.py:1028] (0/2) Epoch 9, batch 3800, loss[loss=0.2171, simple_loss=0.2518, pruned_loss=0.09119, over 13226.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2732, pruned_loss=0.09882, over 2584825.06 frames. ], batch size: 83, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:06:34,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155371.33333333334, ans=0.125 2024-06-20 07:06:36,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=155389.66666666666, ans=0.0 2024-06-20 07:06:38,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.42 vs. limit=22.5 2024-06-20 07:06:41,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=155408.0, ans=0.0 2024-06-20 07:06:43,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=155408.0, ans=0.125 2024-06-20 07:06:48,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=155426.33333333334, ans=0.125 2024-06-20 07:06:55,123 INFO [train.py:1028] (0/2) Epoch 9, batch 3850, loss[loss=0.2397, simple_loss=0.271, pruned_loss=0.1042, over 13044.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2724, pruned_loss=0.09835, over 2583947.99 frames. 
], batch size: 144, lr: 6.58e-03, grad_scale: 128.0 2024-06-20 07:07:04,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155463.0, ans=0.125 2024-06-20 07:07:07,492 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.769e+02 1.948e+02 2.324e+02 3.753e+02, threshold=3.897e+02, percent-clipped=0.0 2024-06-20 07:07:07,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=155481.33333333334, ans=0.0 2024-06-20 07:07:17,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155499.66666666666, ans=0.1 2024-06-20 07:07:23,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=155518.0, ans=0.0 2024-06-20 07:07:27,315 INFO [train.py:1028] (0/2) Epoch 9, batch 3900, loss[loss=0.2472, simple_loss=0.2799, pruned_loss=0.1072, over 13273.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.2727, pruned_loss=0.09878, over 2587806.38 frames. ], batch size: 83, lr: 6.58e-03, grad_scale: 64.0 2024-06-20 07:07:28,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=155536.33333333334, ans=0.0 2024-06-20 07:07:33,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155536.33333333334, ans=0.1 2024-06-20 07:07:34,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=155536.33333333334, ans=10.0 2024-06-20 07:07:38,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.08 vs. limit=15.0 2024-06-20 07:07:45,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=155573.0, ans=0.125 2024-06-20 07:07:47,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=155573.0, ans=0.0 2024-06-20 07:07:49,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=155573.0, ans=0.125 2024-06-20 07:07:59,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155591.33333333334, ans=0.125 2024-06-20 07:07:59,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.44 vs. limit=15.0 2024-06-20 07:08:03,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=155609.66666666666, ans=0.125 2024-06-20 07:08:06,588 INFO [train.py:1028] (0/2) Epoch 9, batch 3950, loss[loss=0.2249, simple_loss=0.2602, pruned_loss=0.09476, over 13098.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.2721, pruned_loss=0.09838, over 2589285.79 frames. ], batch size: 132, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:08:08,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.97 vs. 
limit=22.5 2024-06-20 07:08:15,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.71 vs. limit=15.0 2024-06-20 07:08:17,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=155646.33333333334, ans=0.2 2024-06-20 07:08:19,460 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.759e+02 1.860e+02 2.045e+02 2.627e+02, threshold=3.720e+02, percent-clipped=0.0 2024-06-20 07:08:23,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=155664.66666666666, ans=0.035 2024-06-20 07:08:27,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155683.0, ans=0.1 2024-06-20 07:08:27,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=155683.0, ans=0.2 2024-06-20 07:08:29,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=155683.0, ans=0.125 2024-06-20 07:08:39,209 INFO [train.py:1028] (0/2) Epoch 9, batch 4000, loss[loss=0.2501, simple_loss=0.2877, pruned_loss=0.1063, over 12968.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2712, pruned_loss=0.0981, over 2583738.66 frames. ], batch size: 39, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:08:50,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0 2024-06-20 07:09:11,889 INFO [train.py:1028] (0/2) Epoch 9, batch 4050, loss[loss=0.2429, simple_loss=0.2718, pruned_loss=0.107, over 11011.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.2712, pruned_loss=0.09817, over 2581456.16 frames. ], batch size: 304, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:09:20,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=155829.66666666666, ans=0.125 2024-06-20 07:09:23,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=15.0 2024-06-20 07:09:24,727 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.783e+02 1.918e+02 2.174e+02 3.057e+02, threshold=3.836e+02, percent-clipped=0.0 2024-06-20 07:09:36,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=155866.33333333334, ans=0.0 2024-06-20 07:09:37,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=155866.33333333334, ans=0.025 2024-06-20 07:09:47,708 INFO [train.py:1028] (0/2) Epoch 9, batch 4100, loss[loss=0.2372, simple_loss=0.2746, pruned_loss=0.09986, over 13016.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.271, pruned_loss=0.09822, over 2577124.32 frames. 
], batch size: 102, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:09:55,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=155903.0, ans=0.125 2024-06-20 07:10:10,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155939.66666666666, ans=0.125 2024-06-20 07:10:24,791 INFO [train.py:1028] (0/2) Epoch 9, batch 4150, loss[loss=0.2345, simple_loss=0.2792, pruned_loss=0.09489, over 13091.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.271, pruned_loss=0.09817, over 2575894.12 frames. ], batch size: 55, lr: 6.57e-03, grad_scale: 64.0 2024-06-20 07:10:37,672 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.804e+02 2.073e+02 2.365e+02 3.487e+02, threshold=4.145e+02, percent-clipped=0.0 2024-06-20 07:10:40,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=156031.33333333334, ans=0.0 2024-06-20 07:10:41,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=156031.33333333334, ans=0.2 2024-06-20 07:10:41,431 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.00 vs. limit=15.0 2024-06-20 07:10:42,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.82 vs. limit=22.5 2024-06-20 07:10:51,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.89 vs. limit=15.0 2024-06-20 07:10:52,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=156068.0, ans=0.07 2024-06-20 07:10:54,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=156068.0, ans=0.125 2024-06-20 07:10:54,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-20 07:10:57,517 INFO [train.py:1028] (0/2) Epoch 9, batch 4200, loss[loss=0.2263, simple_loss=0.259, pruned_loss=0.09682, over 13050.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.2709, pruned_loss=0.0983, over 2578814.46 frames. ], batch size: 102, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:10:58,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2024-06-20 07:11:03,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=156104.66666666666, ans=0.125 2024-06-20 07:11:04,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=156104.66666666666, ans=0.125 2024-06-20 07:11:12,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.02 vs. 
limit=22.5 2024-06-20 07:11:17,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=156141.33333333334, ans=0.125 2024-06-20 07:11:27,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156159.66666666666, ans=0.1 2024-06-20 07:11:34,608 INFO [train.py:1028] (0/2) Epoch 9, batch 4250, loss[loss=0.2013, simple_loss=0.2485, pruned_loss=0.07705, over 13280.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2707, pruned_loss=0.09808, over 2580742.90 frames. ], batch size: 46, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:11:34,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=156178.0, ans=0.0 2024-06-20 07:11:41,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=156196.33333333334, ans=0.125 2024-06-20 07:11:48,131 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.798e+02 2.015e+02 2.246e+02 3.303e+02, threshold=4.030e+02, percent-clipped=0.0 2024-06-20 07:12:11,099 INFO [train.py:1028] (0/2) Epoch 9, batch 4300, loss[loss=0.246, simple_loss=0.2741, pruned_loss=0.109, over 13175.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2702, pruned_loss=0.09772, over 2581812.53 frames. ], batch size: 59, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:12:21,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=156288.0, ans=0.035 2024-06-20 07:12:23,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=156306.33333333334, ans=0.125 2024-06-20 07:12:30,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.65 vs. limit=15.0 2024-06-20 07:12:38,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=156343.0, ans=0.2 2024-06-20 07:12:39,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=156343.0, ans=0.125 2024-06-20 07:12:42,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=156343.0, ans=0.2 2024-06-20 07:12:42,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=156343.0, ans=0.04949747468305833 2024-06-20 07:12:43,240 INFO [train.py:1028] (0/2) Epoch 9, batch 4350, loss[loss=0.2726, simple_loss=0.3026, pruned_loss=0.1213, over 13165.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2699, pruned_loss=0.09787, over 2586249.43 frames. ], batch size: 59, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:12:48,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=15.0 2024-06-20 07:12:56,079 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.516e+02 1.714e+02 1.879e+02 2.163e+02 3.055e+02, threshold=3.758e+02, percent-clipped=0.0 2024-06-20 07:13:01,044 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.68 vs. 
limit=15.0 2024-06-20 07:13:04,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=15.0 2024-06-20 07:13:15,725 INFO [train.py:1028] (0/2) Epoch 9, batch 4400, loss[loss=0.2179, simple_loss=0.2526, pruned_loss=0.0916, over 13212.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2695, pruned_loss=0.09774, over 2586191.93 frames. ], batch size: 83, lr: 6.56e-03, grad_scale: 64.0 2024-06-20 07:13:16,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.86 vs. limit=10.0 2024-06-20 07:13:17,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156453.0, ans=0.1 2024-06-20 07:13:28,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=156489.66666666666, ans=0.09899494936611666 2024-06-20 07:13:43,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156508.0, ans=0.1 2024-06-20 07:13:55,404 INFO [train.py:1028] (0/2) Epoch 9, batch 4450, loss[loss=0.2124, simple_loss=0.2467, pruned_loss=0.08902, over 12946.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2691, pruned_loss=0.09764, over 2580568.93 frames. ], batch size: 33, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:13:57,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=156544.66666666666, ans=0.2 2024-06-20 07:14:01,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=156563.0, ans=0.2 2024-06-20 07:14:08,571 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.689e+02 1.812e+02 2.050e+02 2.708e+02, threshold=3.624e+02, percent-clipped=0.0 2024-06-20 07:14:09,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=156581.33333333334, ans=0.5 2024-06-20 07:14:10,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.44 vs. limit=15.0 2024-06-20 07:14:28,527 INFO [train.py:1028] (0/2) Epoch 9, batch 4500, loss[loss=0.2274, simple_loss=0.2613, pruned_loss=0.09674, over 13214.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2682, pruned_loss=0.09687, over 2584803.61 frames. ], batch size: 89, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:14:31,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.43 vs. 
limit=15.0 2024-06-20 07:14:41,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=156673.0, ans=0.07 2024-06-20 07:14:52,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=156691.33333333334, ans=0.0 2024-06-20 07:14:52,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=156691.33333333334, ans=0.125 2024-06-20 07:14:52,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=156691.33333333334, ans=0.0 2024-06-20 07:14:52,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=156691.33333333334, ans=0.125 2024-06-20 07:14:53,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.34 vs. limit=15.0 2024-06-20 07:14:54,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.46 vs. limit=15.0 2024-06-20 07:14:58,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=156709.66666666666, ans=0.025 2024-06-20 07:15:02,017 INFO [train.py:1028] (0/2) Epoch 9, batch 4550, loss[loss=0.2252, simple_loss=0.268, pruned_loss=0.09125, over 13191.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2684, pruned_loss=0.097, over 2588180.98 frames. ], batch size: 52, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:15:12,870 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.73 vs. limit=22.5 2024-06-20 07:15:14,956 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 1.754e+02 1.883e+02 2.084e+02 3.556e+02, threshold=3.766e+02, percent-clipped=0.0 2024-06-20 07:15:16,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=156764.66666666666, ans=0.125 2024-06-20 07:15:19,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.43 vs. limit=6.0 2024-06-20 07:15:19,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=156764.66666666666, ans=0.05 2024-06-20 07:15:23,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=15.0 2024-06-20 07:15:26,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=156783.0, ans=0.035 2024-06-20 07:15:26,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.30 vs. limit=15.0 2024-06-20 07:15:38,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=156819.66666666666, ans=0.125 2024-06-20 07:15:38,608 INFO [train.py:1028] (0/2) Epoch 9, batch 4600, loss[loss=0.2654, simple_loss=0.2986, pruned_loss=0.1161, over 12601.00 frames. 
], tot_loss[loss=0.2321, simple_loss=0.2693, pruned_loss=0.09744, over 2584335.46 frames. ], batch size: 202, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:15:44,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0 2024-06-20 07:15:48,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.72 vs. limit=22.5 2024-06-20 07:15:49,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=156838.0, ans=0.2 2024-06-20 07:15:56,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=156856.33333333334, ans=0.0 2024-06-20 07:15:57,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=156856.33333333334, ans=0.0 2024-06-20 07:16:01,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=156874.66666666666, ans=0.0 2024-06-20 07:16:10,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156893.0, ans=0.1 2024-06-20 07:16:12,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.67 vs. limit=22.5 2024-06-20 07:16:13,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=156893.0, ans=0.125 2024-06-20 07:16:14,925 INFO [train.py:1028] (0/2) Epoch 9, batch 4650, loss[loss=0.2302, simple_loss=0.259, pruned_loss=0.1008, over 13115.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2683, pruned_loss=0.09722, over 2587980.56 frames. ], batch size: 132, lr: 6.55e-03, grad_scale: 64.0 2024-06-20 07:16:27,689 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.447e+02 1.700e+02 1.876e+02 2.101e+02 3.110e+02, threshold=3.751e+02, percent-clipped=0.0 2024-06-20 07:16:30,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=156948.0, ans=0.0 2024-06-20 07:16:33,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0 2024-06-20 07:16:39,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=156966.33333333334, ans=0.09899494936611666 2024-06-20 07:16:39,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=156966.33333333334, ans=0.0 2024-06-20 07:16:48,013 INFO [train.py:1028] (0/2) Epoch 9, batch 4700, loss[loss=0.2254, simple_loss=0.269, pruned_loss=0.09091, over 12746.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2683, pruned_loss=0.09734, over 2582973.86 frames. ], batch size: 26, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:16:48,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=157003.0, ans=0.0 2024-06-20 07:16:54,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. 
limit=10.0 2024-06-20 07:17:08,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.78 vs. limit=22.5 2024-06-20 07:17:16,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=157076.33333333334, ans=0.125 2024-06-20 07:17:17,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=157076.33333333334, ans=0.125 2024-06-20 07:17:21,788 INFO [train.py:1028] (0/2) Epoch 9, batch 4750, loss[loss=0.2411, simple_loss=0.269, pruned_loss=0.1066, over 12568.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2679, pruned_loss=0.09741, over 2580012.07 frames. ], batch size: 202, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:17:22,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.94 vs. limit=15.0 2024-06-20 07:17:23,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0 2024-06-20 07:17:38,329 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.853e+02 2.020e+02 2.304e+02 3.405e+02, threshold=4.040e+02, percent-clipped=0.0 2024-06-20 07:17:46,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157149.66666666666, ans=0.1 2024-06-20 07:17:59,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=157168.0, ans=0.125 2024-06-20 07:18:02,238 INFO [train.py:1028] (0/2) Epoch 9, batch 4800, loss[loss=0.2243, simple_loss=0.2649, pruned_loss=0.09183, over 13259.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2678, pruned_loss=0.0974, over 2576663.74 frames. ], batch size: 63, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:18:05,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157186.33333333334, ans=0.1 2024-06-20 07:18:10,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157204.66666666666, ans=0.125 2024-06-20 07:18:25,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=157241.33333333334, ans=0.2 2024-06-20 07:18:29,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=157259.66666666666, ans=0.0 2024-06-20 07:18:34,569 INFO [train.py:1028] (0/2) Epoch 9, batch 4850, loss[loss=0.2036, simple_loss=0.2385, pruned_loss=0.08438, over 13211.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2671, pruned_loss=0.09665, over 2574394.33 frames. 
], batch size: 89, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:18:34,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=157278.0, ans=0.0 2024-06-20 07:18:42,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=157296.33333333334, ans=0.2 2024-06-20 07:18:48,113 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.477e+02 1.775e+02 1.970e+02 2.198e+02 2.876e+02, threshold=3.940e+02, percent-clipped=0.0 2024-06-20 07:18:57,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=157333.0, ans=0.2 2024-06-20 07:18:58,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.69 vs. limit=15.0 2024-06-20 07:19:01,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=157351.33333333334, ans=0.0 2024-06-20 07:19:08,320 INFO [train.py:1028] (0/2) Epoch 9, batch 4900, loss[loss=0.2339, simple_loss=0.2807, pruned_loss=0.09352, over 13180.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2672, pruned_loss=0.09656, over 2574580.30 frames. ], batch size: 59, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:19:08,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157369.66666666666, ans=0.1 2024-06-20 07:19:16,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157388.0, ans=0.0 2024-06-20 07:19:44,997 INFO [train.py:1028] (0/2) Epoch 9, batch 4950, loss[loss=0.263, simple_loss=0.2827, pruned_loss=0.1217, over 11202.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2675, pruned_loss=0.09712, over 2569840.07 frames. ], batch size: 304, lr: 6.54e-03, grad_scale: 64.0 2024-06-20 07:19:45,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2024-06-20 07:19:51,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.20 vs. 
limit=15.0 2024-06-20 07:19:54,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=157479.66666666666, ans=0.0 2024-06-20 07:19:58,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157479.66666666666, ans=0.1 2024-06-20 07:19:59,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=157479.66666666666, ans=0.0 2024-06-20 07:20:01,098 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.720e+02 1.891e+02 2.183e+02 3.125e+02, threshold=3.783e+02, percent-clipped=0.0 2024-06-20 07:20:01,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=157498.0, ans=0.125 2024-06-20 07:20:03,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=157498.0, ans=0.025 2024-06-20 07:20:06,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=157498.0, ans=0.125 2024-06-20 07:20:06,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=157498.0, ans=0.1 2024-06-20 07:20:14,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.50 vs. limit=15.0 2024-06-20 07:20:20,854 INFO [train.py:1028] (0/2) Epoch 9, batch 5000, loss[loss=0.227, simple_loss=0.2619, pruned_loss=0.09607, over 13139.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2683, pruned_loss=0.09733, over 2574718.21 frames. ], batch size: 95, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:20:21,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157553.0, ans=0.125 2024-06-20 07:20:23,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.15 vs. limit=15.0 2024-06-20 07:20:30,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=157571.33333333334, ans=0.0 2024-06-20 07:20:39,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=14.49 vs. limit=15.0 2024-06-20 07:20:43,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=157608.0, ans=0.0 2024-06-20 07:20:54,339 INFO [train.py:1028] (0/2) Epoch 9, batch 5050, loss[loss=0.2089, simple_loss=0.2493, pruned_loss=0.0842, over 12918.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2681, pruned_loss=0.09694, over 2573554.10 frames. 
], batch size: 36, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:21:00,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=157663.0, ans=0.125 2024-06-20 07:21:02,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=157663.0, ans=0.125 2024-06-20 07:21:07,628 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.937e+02 2.178e+02 2.505e+02 3.328e+02, threshold=4.356e+02, percent-clipped=0.0 2024-06-20 07:21:07,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=157681.33333333334, ans=0.125 2024-06-20 07:21:08,773 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=12.0 2024-06-20 07:21:10,490 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:21:10,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=157681.33333333334, ans=0.125 2024-06-20 07:21:11,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157681.33333333334, ans=0.1 2024-06-20 07:21:19,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.68 vs. limit=22.5 2024-06-20 07:21:20,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=157699.66666666666, ans=0.125 2024-06-20 07:21:23,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.63 vs. limit=22.5 2024-06-20 07:21:31,250 INFO [train.py:1028] (0/2) Epoch 9, batch 5100, loss[loss=0.2604, simple_loss=0.3024, pruned_loss=0.1091, over 12922.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2683, pruned_loss=0.09726, over 2570348.10 frames. ], batch size: 39, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:21:32,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=157736.33333333334, ans=0.125 2024-06-20 07:21:37,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=157754.66666666666, ans=0.125 2024-06-20 07:21:41,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.73 vs. limit=15.0 2024-06-20 07:21:52,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157773.0, ans=0.1 2024-06-20 07:21:59,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=157791.33333333334, ans=0.125 2024-06-20 07:22:03,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=157809.66666666666, ans=0.125 2024-06-20 07:22:07,064 INFO [train.py:1028] (0/2) Epoch 9, batch 5150, loss[loss=0.2245, simple_loss=0.2557, pruned_loss=0.09663, over 13132.00 frames. 
], tot_loss[loss=0.2314, simple_loss=0.2679, pruned_loss=0.09739, over 2571650.51 frames. ], batch size: 132, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:22:07,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.12 vs. limit=12.0 2024-06-20 07:22:09,731 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.08 vs. limit=15.0 2024-06-20 07:22:14,138 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=3.329e+00 2024-06-20 07:22:18,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157846.33333333334, ans=0.1 2024-06-20 07:22:20,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=157864.66666666666, ans=0.0 2024-06-20 07:22:20,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.721e+02 1.915e+02 2.125e+02 2.649e+02, threshold=3.831e+02, percent-clipped=0.0 2024-06-20 07:22:35,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=157901.33333333334, ans=0.125 2024-06-20 07:22:39,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157919.66666666666, ans=0.1 2024-06-20 07:22:39,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0 2024-06-20 07:22:39,938 INFO [train.py:1028] (0/2) Epoch 9, batch 5200, loss[loss=0.213, simple_loss=0.2519, pruned_loss=0.08707, over 13184.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2676, pruned_loss=0.09703, over 2574650.63 frames. ], batch size: 95, lr: 6.53e-03, grad_scale: 64.0 2024-06-20 07:22:40,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=157919.66666666666, ans=0.2 2024-06-20 07:22:40,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=157919.66666666666, ans=0.125 2024-06-20 07:22:40,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2024-06-20 07:22:41,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=157919.66666666666, ans=0.125 2024-06-20 07:22:47,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=157938.0, ans=0.125 2024-06-20 07:22:48,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=157938.0, ans=0.125 2024-06-20 07:22:48,280 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.83 vs. 
limit=15.0 2024-06-20 07:23:02,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=157974.66666666666, ans=0.125 2024-06-20 07:23:03,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.02 vs. limit=15.0 2024-06-20 07:23:09,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=157993.0, ans=0.05 2024-06-20 07:23:13,431 INFO [train.py:1028] (0/2) Epoch 9, batch 5250, loss[loss=0.2262, simple_loss=0.2591, pruned_loss=0.09658, over 13284.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2679, pruned_loss=0.09733, over 2571943.95 frames. ], batch size: 52, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:23:21,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=158029.66666666666, ans=0.125 2024-06-20 07:23:29,928 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.761e+02 1.897e+02 2.172e+02 2.687e+02, threshold=3.794e+02, percent-clipped=0.0 2024-06-20 07:23:31,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=158048.0, ans=0.07 2024-06-20 07:23:35,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=158048.0, ans=0.1 2024-06-20 07:23:36,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=158066.33333333334, ans=0.125 2024-06-20 07:23:41,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=158066.33333333334, ans=0.125 2024-06-20 07:23:44,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=158084.66666666666, ans=0.125 2024-06-20 07:23:49,807 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:23:51,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=158084.66666666666, ans=0.2 2024-06-20 07:23:52,719 INFO [train.py:1028] (0/2) Epoch 9, batch 5300, loss[loss=0.2187, simple_loss=0.2523, pruned_loss=0.09259, over 13033.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2681, pruned_loss=0.09743, over 2569362.10 frames. ], batch size: 144, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:23:53,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=17.28 vs. limit=15.0 2024-06-20 07:23:54,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=158103.0, ans=0.2 2024-06-20 07:23:54,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.59 vs. 
limit=15.0 2024-06-20 07:24:05,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=158139.66666666666, ans=0.035 2024-06-20 07:24:05,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=158139.66666666666, ans=0.125 2024-06-20 07:24:07,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=158139.66666666666, ans=0.125 2024-06-20 07:24:15,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=158158.0, ans=0.2 2024-06-20 07:24:25,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.31 vs. limit=15.0 2024-06-20 07:24:25,866 INFO [train.py:1028] (0/2) Epoch 9, batch 5350, loss[loss=0.2446, simple_loss=0.2884, pruned_loss=0.1004, over 11485.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2674, pruned_loss=0.09695, over 2576229.54 frames. ], batch size: 16, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:24:31,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=158194.66666666666, ans=10.0 2024-06-20 07:24:38,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158231.33333333334, ans=0.125 2024-06-20 07:24:38,755 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.459e+02 1.732e+02 1.907e+02 2.107e+02 2.808e+02, threshold=3.813e+02, percent-clipped=0.0 2024-06-20 07:24:45,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=158249.66666666666, ans=22.5 2024-06-20 07:24:45,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=158249.66666666666, ans=15.0 2024-06-20 07:24:49,403 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:24:56,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=158268.0, ans=0.125 2024-06-20 07:24:58,071 INFO [train.py:1028] (0/2) Epoch 9, batch 5400, loss[loss=0.2625, simple_loss=0.2813, pruned_loss=0.1218, over 12159.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2675, pruned_loss=0.09756, over 2568009.37 frames. ], batch size: 240, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:25:05,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.89 vs. limit=15.0 2024-06-20 07:25:05,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0 2024-06-20 07:25:20,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.85 vs. 
limit=15.0 2024-06-20 07:25:21,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=158341.33333333334, ans=0.2 2024-06-20 07:25:31,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=158359.66666666666, ans=0.125 2024-06-20 07:25:32,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.07 vs. limit=15.0 2024-06-20 07:25:34,368 INFO [train.py:1028] (0/2) Epoch 9, batch 5450, loss[loss=0.25, simple_loss=0.2831, pruned_loss=0.1084, over 12299.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2681, pruned_loss=0.09753, over 2570564.20 frames. ], batch size: 25, lr: 6.52e-03, grad_scale: 64.0 2024-06-20 07:25:41,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=158378.0, ans=0.125 2024-06-20 07:25:48,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=158396.33333333334, ans=0.125 2024-06-20 07:25:49,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=158396.33333333334, ans=0.0 2024-06-20 07:25:49,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=158396.33333333334, ans=0.0 2024-06-20 07:25:50,882 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.747e+02 1.931e+02 2.301e+02 3.615e+02, threshold=3.863e+02, percent-clipped=0.0 2024-06-20 07:25:51,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=158414.66666666666, ans=0.2 2024-06-20 07:25:55,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=22.5 2024-06-20 07:25:56,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=158414.66666666666, ans=0.0 2024-06-20 07:26:03,182 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:26:04,791 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.52 vs. limit=15.0 2024-06-20 07:26:07,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=158451.33333333334, ans=0.0 2024-06-20 07:26:10,830 INFO [train.py:1028] (0/2) Epoch 9, batch 5500, loss[loss=0.2702, simple_loss=0.2851, pruned_loss=0.1276, over 12133.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2676, pruned_loss=0.09733, over 2564142.72 frames. 
], batch size: 240, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:26:16,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158488.0, ans=0.1 2024-06-20 07:26:18,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=158488.0, ans=0.125 2024-06-20 07:26:18,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=158488.0, ans=0.2 2024-06-20 07:26:43,968 INFO [train.py:1028] (0/2) Epoch 9, batch 5550, loss[loss=0.2311, simple_loss=0.2728, pruned_loss=0.0947, over 13226.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2669, pruned_loss=0.09672, over 2567448.18 frames. ], batch size: 43, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:26:56,998 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.724e+02 1.856e+02 2.053e+02 2.773e+02, threshold=3.713e+02, percent-clipped=0.0 2024-06-20 07:26:57,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158598.0, ans=0.125 2024-06-20 07:26:58,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=12.0 2024-06-20 07:26:58,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=158598.0, ans=0.09899494936611666 2024-06-20 07:27:08,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=158616.33333333334, ans=0.0 2024-06-20 07:27:15,954 INFO [train.py:1028] (0/2) Epoch 9, batch 5600, loss[loss=0.2212, simple_loss=0.2565, pruned_loss=0.09297, over 13272.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2659, pruned_loss=0.09617, over 2569386.40 frames. ], batch size: 89, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:27:22,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=158653.0, ans=0.125 2024-06-20 07:27:33,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=158689.66666666666, ans=0.2 2024-06-20 07:27:39,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.74 vs. limit=15.0 2024-06-20 07:27:40,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=158708.0, ans=0.125 2024-06-20 07:27:50,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158726.33333333334, ans=0.125 2024-06-20 07:27:53,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.13 vs. limit=15.0 2024-06-20 07:27:55,284 INFO [train.py:1028] (0/2) Epoch 9, batch 5650, loss[loss=0.2358, simple_loss=0.2669, pruned_loss=0.1023, over 12500.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2653, pruned_loss=0.09569, over 2574206.33 frames. ], batch size: 202, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:28:00,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.74 vs. 
limit=15.0 2024-06-20 07:28:04,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=158763.0, ans=0.0 2024-06-20 07:28:08,852 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.710e+02 1.801e+02 1.989e+02 2.965e+02, threshold=3.601e+02, percent-clipped=0.0 2024-06-20 07:28:24,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158818.0, ans=0.125 2024-06-20 07:28:28,562 INFO [train.py:1028] (0/2) Epoch 9, batch 5700, loss[loss=0.2065, simple_loss=0.2456, pruned_loss=0.08373, over 13308.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2652, pruned_loss=0.09584, over 2578156.49 frames. ], batch size: 63, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:28:37,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158854.66666666666, ans=0.1 2024-06-20 07:28:54,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=158909.66666666666, ans=0.125 2024-06-20 07:28:57,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=158909.66666666666, ans=0.125 2024-06-20 07:28:58,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=158909.66666666666, ans=0.0 2024-06-20 07:29:00,328 INFO [train.py:1028] (0/2) Epoch 9, batch 5750, loss[loss=0.2615, simple_loss=0.2924, pruned_loss=0.1153, over 12796.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2665, pruned_loss=0.09644, over 2579052.86 frames. ], batch size: 176, lr: 6.51e-03, grad_scale: 64.0 2024-06-20 07:29:13,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2024-06-20 07:29:16,745 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.736e+02 1.859e+02 2.024e+02 3.196e+02, threshold=3.719e+02, percent-clipped=0.0 2024-06-20 07:29:17,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2024-06-20 07:29:22,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=158983.0, ans=0.125 2024-06-20 07:29:26,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=158983.0, ans=0.0 2024-06-20 07:29:39,928 INFO [train.py:1028] (0/2) Epoch 9, batch 5800, loss[loss=0.258, simple_loss=0.2897, pruned_loss=0.1132, over 12677.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2683, pruned_loss=0.09747, over 2577657.29 frames. 
], batch size: 176, lr: 6.50e-03, grad_scale: 64.0 2024-06-20 07:29:43,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=159019.66666666666, ans=0.125 2024-06-20 07:29:43,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=159019.66666666666, ans=0.025 2024-06-20 07:29:44,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=159019.66666666666, ans=0.0 2024-06-20 07:29:45,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=159038.0, ans=0.2 2024-06-20 07:29:57,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=159056.33333333334, ans=0.125 2024-06-20 07:30:11,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=159093.0, ans=0.125 2024-06-20 07:30:12,826 INFO [train.py:1028] (0/2) Epoch 9, batch 5850, loss[loss=0.2499, simple_loss=0.2798, pruned_loss=0.11, over 12552.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.27, pruned_loss=0.0982, over 2576487.82 frames. ], batch size: 202, lr: 6.50e-03, grad_scale: 64.0 2024-06-20 07:30:14,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=159111.33333333334, ans=0.125 2024-06-20 07:30:26,813 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.856e+02 1.998e+02 2.219e+02 2.902e+02, threshold=3.995e+02, percent-clipped=0.0 2024-06-20 07:30:47,413 INFO [train.py:1028] (0/2) Epoch 9, batch 5900, loss[loss=0.2187, simple_loss=0.2546, pruned_loss=0.09145, over 13075.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.2727, pruned_loss=0.09936, over 2576919.21 frames. ], batch size: 121, lr: 6.50e-03, grad_scale: 128.0 2024-06-20 07:30:47,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=159203.0, ans=10.0 2024-06-20 07:30:53,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159221.33333333334, ans=0.0 2024-06-20 07:31:19,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159276.33333333334, ans=0.1 2024-06-20 07:31:20,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=159276.33333333334, ans=0.0 2024-06-20 07:31:24,814 INFO [train.py:1028] (0/2) Epoch 9, batch 5950, loss[loss=0.2189, simple_loss=0.2549, pruned_loss=0.09141, over 13122.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2739, pruned_loss=0.09966, over 2580840.93 frames. 
], batch size: 121, lr: 6.50e-03, grad_scale: 128.0 2024-06-20 07:31:38,072 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.771e+02 1.909e+02 2.139e+02 2.823e+02, threshold=3.819e+02, percent-clipped=0.0 2024-06-20 07:31:41,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=159331.33333333334, ans=0.125 2024-06-20 07:31:45,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=159331.33333333334, ans=0.0 2024-06-20 07:31:47,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=159349.66666666666, ans=0.2 2024-06-20 07:31:55,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=159368.0, ans=0.125 2024-06-20 07:31:55,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2024-06-20 07:32:01,002 INFO [train.py:1028] (0/2) Epoch 9, batch 6000, loss[loss=0.3041, simple_loss=0.3191, pruned_loss=0.1445, over 12159.00 frames. ], tot_loss[loss=0.238, simple_loss=0.2753, pruned_loss=0.1003, over 2574343.22 frames. ], batch size: 240, lr: 6.50e-03, grad_scale: 128.0 2024-06-20 07:32:01,003 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 07:32:08,961 INFO [train.py:1060] (0/2) Epoch 9, validation: loss=0.2023, simple_loss=0.2651, pruned_loss=0.06971, over 351949.00 frames. 2024-06-20 07:32:08,961 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17148MB 2024-06-20 07:32:11,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159386.33333333334, ans=0.1 2024-06-20 07:32:14,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=159386.33333333334, ans=0.125 2024-06-20 07:32:17,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=159404.66666666666, ans=0.0 2024-06-20 07:32:22,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=159423.0, ans=0.125 2024-06-20 07:32:31,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.91 vs. limit=15.0 2024-06-20 07:32:35,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159441.33333333334, ans=0.1 2024-06-20 07:32:38,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-20 07:32:41,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=159459.66666666666, ans=0.125 2024-06-20 07:32:42,904 INFO [train.py:1028] (0/2) Epoch 9, batch 6050, loss[loss=0.2272, simple_loss=0.2724, pruned_loss=0.09101, over 13219.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2772, pruned_loss=0.1012, over 2577931.28 frames. 
], batch size: 40, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:32:54,101 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=22.5 2024-06-20 07:32:56,140 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.843e+02 2.047e+02 2.369e+02 3.231e+02, threshold=4.094e+02, percent-clipped=0.0 2024-06-20 07:33:11,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.38 vs. limit=22.5 2024-06-20 07:33:19,155 INFO [train.py:1028] (0/2) Epoch 9, batch 6100, loss[loss=0.2118, simple_loss=0.2486, pruned_loss=0.08745, over 13090.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.2781, pruned_loss=0.1014, over 2579764.85 frames. ], batch size: 121, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:33:40,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=159624.66666666666, ans=0.0 2024-06-20 07:33:56,367 INFO [train.py:1028] (0/2) Epoch 9, batch 6150, loss[loss=0.2688, simple_loss=0.2929, pruned_loss=0.1223, over 10828.00 frames. ], tot_loss[loss=0.242, simple_loss=0.2795, pruned_loss=0.1022, over 2577676.63 frames. ], batch size: 304, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:34:03,096 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.41 vs. limit=15.0 2024-06-20 07:34:06,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=159679.66666666666, ans=0.125 2024-06-20 07:34:08,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=159679.66666666666, ans=15.0 2024-06-20 07:34:09,712 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.831e+02 1.997e+02 2.236e+02 3.156e+02, threshold=3.994e+02, percent-clipped=0.0 2024-06-20 07:34:09,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=159698.0, ans=0.025 2024-06-20 07:34:13,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=15.0 2024-06-20 07:34:17,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=159716.33333333334, ans=0.125 2024-06-20 07:34:29,429 INFO [train.py:1028] (0/2) Epoch 9, batch 6200, loss[loss=0.2689, simple_loss=0.3057, pruned_loss=0.1161, over 13241.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2816, pruned_loss=0.1032, over 2575585.52 frames. 
], batch size: 89, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:34:32,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=159753.0, ans=0.125 2024-06-20 07:34:34,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=159753.0, ans=0.125 2024-06-20 07:34:35,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159771.33333333334, ans=0.125 2024-06-20 07:34:46,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=159789.66666666666, ans=0.0 2024-06-20 07:35:00,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=159826.33333333334, ans=0.0 2024-06-20 07:35:02,958 INFO [train.py:1028] (0/2) Epoch 9, batch 6250, loss[loss=0.2347, simple_loss=0.2776, pruned_loss=0.09585, over 13243.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.2822, pruned_loss=0.1034, over 2567763.97 frames. ], batch size: 83, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:35:06,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.29 vs. limit=15.0 2024-06-20 07:35:19,599 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 1.839e+02 1.992e+02 2.203e+02 2.987e+02, threshold=3.985e+02, percent-clipped=0.0 2024-06-20 07:35:22,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.53 vs. limit=15.0 2024-06-20 07:35:39,205 INFO [train.py:1028] (0/2) Epoch 9, batch 6300, loss[loss=0.2159, simple_loss=0.2633, pruned_loss=0.08429, over 11825.00 frames. ], tot_loss[loss=0.246, simple_loss=0.284, pruned_loss=0.104, over 2563834.10 frames. ], batch size: 17, lr: 6.49e-03, grad_scale: 128.0 2024-06-20 07:35:50,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159954.66666666666, ans=0.1 2024-06-20 07:35:53,674 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.09 vs. limit=15.0 2024-06-20 07:35:59,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=159973.0, ans=0.125 2024-06-20 07:36:03,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.63 vs. limit=15.0 2024-06-20 07:36:12,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=160009.66666666666, ans=0.0 2024-06-20 07:36:15,132 INFO [train.py:1028] (0/2) Epoch 9, batch 6350, loss[loss=0.2659, simple_loss=0.2961, pruned_loss=0.1178, over 12511.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2854, pruned_loss=0.1041, over 2574238.89 frames. 
], batch size: 202, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:36:23,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=160046.33333333334, ans=0.025 2024-06-20 07:36:28,034 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.856e+02 1.999e+02 2.150e+02 3.070e+02, threshold=3.997e+02, percent-clipped=0.0 2024-06-20 07:36:29,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.90 vs. limit=10.0 2024-06-20 07:36:34,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=160083.0, ans=0.0 2024-06-20 07:36:47,966 INFO [train.py:1028] (0/2) Epoch 9, batch 6400, loss[loss=0.252, simple_loss=0.2939, pruned_loss=0.105, over 13242.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2876, pruned_loss=0.1052, over 2575150.06 frames. ], batch size: 67, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:37:22,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=160193.0, ans=0.0 2024-06-20 07:37:23,376 INFO [train.py:1028] (0/2) Epoch 9, batch 6450, loss[loss=0.271, simple_loss=0.3083, pruned_loss=0.1169, over 12557.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.2896, pruned_loss=0.1061, over 2581258.96 frames. ], batch size: 202, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:37:25,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.72 vs. limit=15.0 2024-06-20 07:37:32,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=160229.66666666666, ans=0.025 2024-06-20 07:37:34,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=160229.66666666666, ans=6.0 2024-06-20 07:37:36,762 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.952e+02 2.128e+02 2.406e+02 3.588e+02, threshold=4.256e+02, percent-clipped=0.0 2024-06-20 07:37:40,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=160248.0, ans=0.125 2024-06-20 07:37:52,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=160284.66666666666, ans=0.125 2024-06-20 07:38:00,753 INFO [train.py:1028] (0/2) Epoch 9, batch 6500, loss[loss=0.2755, simple_loss=0.298, pruned_loss=0.1265, over 11048.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.2915, pruned_loss=0.1068, over 2586009.59 frames. ], batch size: 306, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:38:01,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0 2024-06-20 07:38:04,061 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=15.16 vs. limit=15.0 2024-06-20 07:38:14,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.19 vs. 
limit=15.0 2024-06-20 07:38:33,357 INFO [train.py:1028] (0/2) Epoch 9, batch 6550, loss[loss=0.2389, simple_loss=0.29, pruned_loss=0.09389, over 12598.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.2915, pruned_loss=0.1062, over 2589356.90 frames. ], batch size: 22, lr: 6.48e-03, grad_scale: 128.0 2024-06-20 07:38:37,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=160394.66666666666, ans=0.125 2024-06-20 07:38:42,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2024-06-20 07:38:44,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=160413.0, ans=0.2 2024-06-20 07:38:44,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.13 vs. limit=15.0 2024-06-20 07:38:45,731 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.862e+02 1.995e+02 2.141e+02 2.676e+02, threshold=3.989e+02, percent-clipped=0.0 2024-06-20 07:38:48,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160431.33333333334, ans=0.125 2024-06-20 07:38:56,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=160449.66666666666, ans=0.025 2024-06-20 07:39:00,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=160468.0, ans=0.125 2024-06-20 07:39:03,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=160468.0, ans=0.125 2024-06-20 07:39:04,823 INFO [train.py:1028] (0/2) Epoch 9, batch 6600, loss[loss=0.2462, simple_loss=0.2866, pruned_loss=0.1029, over 13214.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.2923, pruned_loss=0.1065, over 2591386.72 frames. 
], batch size: 72, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:39:05,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=160486.33333333334, ans=0.2 2024-06-20 07:39:08,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=160486.33333333334, ans=0.125 2024-06-20 07:39:09,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=160486.33333333334, ans=15.0 2024-06-20 07:39:09,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=160486.33333333334, ans=0.0 2024-06-20 07:39:10,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=160486.33333333334, ans=0.125 2024-06-20 07:39:10,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160486.33333333334, ans=0.1 2024-06-20 07:39:14,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=160504.66666666666, ans=0.125 2024-06-20 07:39:17,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=160523.0, ans=0.125 2024-06-20 07:39:20,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=160523.0, ans=0.0 2024-06-20 07:39:26,184 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.941e+01 2024-06-20 07:39:26,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=160523.0, ans=0.0 2024-06-20 07:39:28,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2024-06-20 07:39:30,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.89 vs. limit=6.0 2024-06-20 07:39:30,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160541.33333333334, ans=0.1 2024-06-20 07:39:34,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160559.66666666666, ans=0.125 2024-06-20 07:39:39,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=160559.66666666666, ans=0.125 2024-06-20 07:39:40,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2024-06-20 07:39:40,780 INFO [train.py:1028] (0/2) Epoch 9, batch 6650, loss[loss=0.2869, simple_loss=0.3211, pruned_loss=0.1263, over 12928.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.2941, pruned_loss=0.1074, over 2584932.98 frames. 
], batch size: 158, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:39:41,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=160578.0, ans=0.04949747468305833 2024-06-20 07:39:41,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=160578.0, ans=0.0 2024-06-20 07:39:51,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2024-06-20 07:39:53,895 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 2.016e+02 2.303e+02 2.686e+02 4.382e+02, threshold=4.607e+02, percent-clipped=3.0 2024-06-20 07:40:11,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.29 vs. limit=22.5 2024-06-20 07:40:14,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=160651.33333333334, ans=0.125 2024-06-20 07:40:17,301 INFO [train.py:1028] (0/2) Epoch 9, batch 6700, loss[loss=0.2645, simple_loss=0.3048, pruned_loss=0.1121, over 12740.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.2955, pruned_loss=0.108, over 2583901.39 frames. ], batch size: 176, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:40:30,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.52 vs. limit=6.0 2024-06-20 07:40:46,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160743.0, ans=0.125 2024-06-20 07:40:46,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=15.0 2024-06-20 07:40:50,214 INFO [train.py:1028] (0/2) Epoch 9, batch 6750, loss[loss=0.3036, simple_loss=0.3252, pruned_loss=0.141, over 12222.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.2961, pruned_loss=0.1087, over 2577110.36 frames. 
], batch size: 240, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:40:53,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=160761.33333333334, ans=0.1 2024-06-20 07:40:56,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160779.66666666666, ans=0.125 2024-06-20 07:40:59,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160779.66666666666, ans=0.125 2024-06-20 07:40:59,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=160779.66666666666, ans=0.0 2024-06-20 07:41:02,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=160798.0, ans=0.125 2024-06-20 07:41:02,976 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.885e+02 2.082e+02 2.311e+02 3.918e+02, threshold=4.165e+02, percent-clipped=0.0 2024-06-20 07:41:16,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160834.66666666666, ans=0.1 2024-06-20 07:41:25,967 INFO [train.py:1028] (0/2) Epoch 9, batch 6800, loss[loss=0.2669, simple_loss=0.3025, pruned_loss=0.1156, over 13223.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.2981, pruned_loss=0.1094, over 2579969.46 frames. ], batch size: 67, lr: 6.47e-03, grad_scale: 128.0 2024-06-20 07:41:33,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=160871.33333333334, ans=0.2 2024-06-20 07:41:33,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=160871.33333333334, ans=0.125 2024-06-20 07:41:37,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=160871.33333333334, ans=0.125 2024-06-20 07:41:49,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=160908.0, ans=0.125 2024-06-20 07:41:58,645 INFO [train.py:1028] (0/2) Epoch 9, batch 6850, loss[loss=0.2785, simple_loss=0.3278, pruned_loss=0.1146, over 13195.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.2985, pruned_loss=0.1092, over 2582961.57 frames. ], batch size: 63, lr: 6.46e-03, grad_scale: 128.0 2024-06-20 07:41:58,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=160944.66666666666, ans=0.0 2024-06-20 07:42:03,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.07 vs. 
limit=15.0 2024-06-20 07:42:06,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=160944.66666666666, ans=0.0 2024-06-20 07:42:06,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=160944.66666666666, ans=10.0 2024-06-20 07:42:14,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=160981.33333333334, ans=0.125 2024-06-20 07:42:15,084 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.845e+02 2.046e+02 2.388e+02 3.410e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 07:42:21,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=160999.66666666666, ans=0.125 2024-06-20 07:42:27,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=161018.0, ans=0.125 2024-06-20 07:42:28,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=161018.0, ans=0.95 2024-06-20 07:42:29,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=161018.0, ans=0.05 2024-06-20 07:42:29,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=161018.0, ans=0.125 2024-06-20 07:42:32,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=161018.0, ans=0.125 2024-06-20 07:42:34,715 INFO [train.py:1028] (0/2) Epoch 9, batch 6900, loss[loss=0.2473, simple_loss=0.291, pruned_loss=0.1018, over 13304.00 frames. ], tot_loss[loss=0.26, simple_loss=0.2998, pruned_loss=0.1101, over 2585557.61 frames. ], batch size: 49, lr: 6.46e-03, grad_scale: 128.0 2024-06-20 07:42:35,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.70 vs. limit=15.0 2024-06-20 07:42:36,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.66 vs. limit=15.0 2024-06-20 07:42:46,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=161054.66666666666, ans=0.025 2024-06-20 07:42:47,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.29 vs. limit=15.0 2024-06-20 07:42:49,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.80 vs. limit=15.0 2024-06-20 07:43:03,824 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 07:43:07,734 INFO [train.py:1028] (0/2) Epoch 9, batch 6950, loss[loss=0.2331, simple_loss=0.2729, pruned_loss=0.09667, over 10900.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3003, pruned_loss=0.11, over 2578367.77 frames. 
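Note: each scaling.py:1023 Whitening line compares a measured statistic of a layer's activations against a limit; the module only pushes a corrective gradient when the metric exceeds the limit, which is why most records sit near their limits (5.x vs 6.0, 1x.x vs 15.0, 2x.x vs 22.5). One standard whiteness statistic is tr(C^2) * d / tr(C)^2 for the d-dimensional channel covariance C: it equals 1 exactly when C is a multiple of the identity and grows as the eigenvalue spread widens. A sketch under that assumption; the recipe's exact statistic may differ:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """x: (num_frames, num_channels). Ratio >= 1; equals 1.0 iff each
        group's channel covariance is a multiple of the identity."""
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        x = x - x.mean(dim=0, keepdim=True)
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        metrics = []
        for g in range(num_groups):
            xg = x[:, g, :]
            cov = xg.t() @ xg / num_frames      # (d, d) channel covariance
            d = cov.shape[0]
            metrics.append((cov @ cov).trace() * d / cov.trace() ** 2)
        return torch.stack(metrics).mean().item()

    # near-white features score close to 1 (finite-sample noise adds a little);
    # strongly correlated channels push the metric up toward d:
    print(whitening_metric(torch.randn(4000, 384)))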
], batch size: 16, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:43:15,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2024-06-20 07:43:21,674 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.896e+02 2.057e+02 2.321e+02 3.049e+02, threshold=4.115e+02, percent-clipped=0.0 2024-06-20 07:43:25,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=161164.66666666666, ans=0.0 2024-06-20 07:43:25,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=161164.66666666666, ans=0.0 2024-06-20 07:43:26,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 07:43:39,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2024-06-20 07:43:41,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.52 vs. limit=10.0 2024-06-20 07:43:42,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=161201.33333333334, ans=0.125 2024-06-20 07:43:42,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=161201.33333333334, ans=0.2 2024-06-20 07:43:43,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=161219.66666666666, ans=0.125 2024-06-20 07:43:44,280 INFO [train.py:1028] (0/2) Epoch 9, batch 7000, loss[loss=0.2885, simple_loss=0.3226, pruned_loss=0.1272, over 12914.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.301, pruned_loss=0.11, over 2574851.14 frames. ], batch size: 158, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:44:03,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=161256.33333333334, ans=0.07 2024-06-20 07:44:13,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.40 vs. limit=10.0 2024-06-20 07:44:15,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=161293.0, ans=0.07 2024-06-20 07:44:20,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.69 vs. limit=15.0 2024-06-20 07:44:21,924 INFO [train.py:1028] (0/2) Epoch 9, batch 7050, loss[loss=0.2922, simple_loss=0.3209, pruned_loss=0.1318, over 12790.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.302, pruned_loss=0.1105, over 2581966.40 frames. 
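Note: grad_scale drops from 128.0 (up through batch 6900) to 64.0 from batch 6950 on. That is the signature of dynamic fp16 loss scaling: when a scaled gradient overflows, the step is skipped and the scale is halved; after a long run of clean steps it is doubled again. icefall wraps its own scaler, but the mechanism is the same as torch.cuda.amp; a minimal sketch, with an illustrative forward:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=128.0, growth_interval=2000)

    def fp16_train_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(batch)                # illustrative: model returns loss
        scaler.scale(loss).backward()          # backprop the scaled loss
        scaler.step(optimizer)                 # silently skipped on inf/nan grads
        scaler.update()                        # halve after overflow; double after
                                               # growth_interval clean steps
        return loss.detach(), scaler.get_scale()  # get_scale() is the logged value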
], batch size: 176, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:44:24,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=161311.33333333334, ans=0.0 2024-06-20 07:44:29,271 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-88000.pt 2024-06-20 07:44:40,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=161348.0, ans=0.125 2024-06-20 07:44:40,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=161348.0, ans=0.125 2024-06-20 07:44:40,630 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.896e+02 2.023e+02 2.166e+02 2.591e+02, threshold=4.045e+02, percent-clipped=0.0 2024-06-20 07:44:43,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=161348.0, ans=0.2 2024-06-20 07:44:46,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0 2024-06-20 07:44:53,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=161384.66666666666, ans=0.035 2024-06-20 07:44:53,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=161384.66666666666, ans=0.1 2024-06-20 07:44:55,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=161384.66666666666, ans=0.125 2024-06-20 07:44:58,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.75 vs. limit=22.5 2024-06-20 07:44:59,455 INFO [train.py:1028] (0/2) Epoch 9, batch 7100, loss[loss=0.293, simple_loss=0.3313, pruned_loss=0.1273, over 13161.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3025, pruned_loss=0.111, over 2574374.50 frames. ], batch size: 112, lr: 6.46e-03, grad_scale: 64.0 2024-06-20 07:45:02,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161403.0, ans=0.1 2024-06-20 07:45:17,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.62 vs. limit=22.5 2024-06-20 07:45:24,035 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2024-06-20 07:45:27,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=161476.33333333334, ans=0.0 2024-06-20 07:45:32,239 INFO [train.py:1028] (0/2) Epoch 9, batch 7150, loss[loss=0.2765, simple_loss=0.3091, pruned_loss=0.1219, over 12567.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3032, pruned_loss=0.1111, over 2573321.45 frames. ], batch size: 202, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:45:35,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.16 vs. 
limit=15.0 2024-06-20 07:45:49,318 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.902e+02 2.034e+02 2.249e+02 2.810e+02, threshold=4.068e+02, percent-clipped=0.0 2024-06-20 07:45:52,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=161531.33333333334, ans=0.0 2024-06-20 07:45:54,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=161531.33333333334, ans=0.0 2024-06-20 07:45:55,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=161549.66666666666, ans=0.125 2024-06-20 07:46:06,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=161568.0, ans=0.95 2024-06-20 07:46:06,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=161568.0, ans=0.0 2024-06-20 07:46:08,417 INFO [train.py:1028] (0/2) Epoch 9, batch 7200, loss[loss=0.2756, simple_loss=0.3193, pruned_loss=0.116, over 13211.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3044, pruned_loss=0.1117, over 2579226.67 frames. ], batch size: 112, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:46:26,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=161623.0, ans=0.2 2024-06-20 07:46:33,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=12.0 2024-06-20 07:46:42,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=161659.66666666666, ans=0.2 2024-06-20 07:46:43,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=161659.66666666666, ans=0.125 2024-06-20 07:46:44,616 INFO [train.py:1028] (0/2) Epoch 9, batch 7250, loss[loss=0.2552, simple_loss=0.3054, pruned_loss=0.1025, over 12974.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3052, pruned_loss=0.1118, over 2579175.97 frames. ], batch size: 36, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:46:47,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=161678.0, ans=0.125 2024-06-20 07:46:54,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.58 vs. limit=22.5 2024-06-20 07:46:58,453 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.912e+02 2.025e+02 2.243e+02 3.186e+02, threshold=4.050e+02, percent-clipped=0.0 2024-06-20 07:47:02,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=161714.66666666666, ans=0.04949747468305833 2024-06-20 07:47:08,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=161733.0, ans=0.0 2024-06-20 07:47:12,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.49 vs. 
limit=15.0 2024-06-20 07:47:13,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=161751.33333333334, ans=0.125 2024-06-20 07:47:13,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=161751.33333333334, ans=0.125 2024-06-20 07:47:14,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=161751.33333333334, ans=0.0 2024-06-20 07:47:15,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=161751.33333333334, ans=0.125 2024-06-20 07:47:18,008 INFO [train.py:1028] (0/2) Epoch 9, batch 7300, loss[loss=0.2525, simple_loss=0.2969, pruned_loss=0.104, over 12954.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3068, pruned_loss=0.1129, over 2578146.13 frames. ], batch size: 36, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:47:18,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161769.66666666666, ans=0.1 2024-06-20 07:47:24,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=161788.0, ans=0.125 2024-06-20 07:47:26,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=161788.0, ans=0.125 2024-06-20 07:47:28,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.03 vs. limit=22.5 2024-06-20 07:47:32,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=161806.33333333334, ans=0.0 2024-06-20 07:47:32,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=161806.33333333334, ans=0.0 2024-06-20 07:47:38,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.45 vs. limit=15.0 2024-06-20 07:47:40,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.03 vs. limit=15.0 2024-06-20 07:47:49,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=161843.0, ans=0.125 2024-06-20 07:47:49,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=161843.0, ans=0.1 2024-06-20 07:47:53,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=161843.0, ans=0.125 2024-06-20 07:47:54,296 INFO [train.py:1028] (0/2) Epoch 9, batch 7350, loss[loss=0.2908, simple_loss=0.3307, pruned_loss=0.1255, over 13284.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3063, pruned_loss=0.1127, over 2579441.70 frames. ], batch size: 46, lr: 6.45e-03, grad_scale: 64.0 2024-06-20 07:47:58,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=161861.33333333334, ans=0.0 2024-06-20 07:47:59,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.42 vs. 
limit=22.5 2024-06-20 07:48:05,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=161879.66666666666, ans=0.0 2024-06-20 07:48:05,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161879.66666666666, ans=0.1 2024-06-20 07:48:08,045 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.922e+02 2.132e+02 2.512e+02 3.948e+02, threshold=4.263e+02, percent-clipped=0.0 2024-06-20 07:48:11,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=161898.0, ans=0.0 2024-06-20 07:48:11,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161898.0, ans=0.1 2024-06-20 07:48:17,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=161916.33333333334, ans=0.0 2024-06-20 07:48:28,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=161934.66666666666, ans=0.0 2024-06-20 07:48:28,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.60 vs. limit=15.0 2024-06-20 07:48:29,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=12.0 2024-06-20 07:48:31,325 INFO [train.py:1028] (0/2) Epoch 9, batch 7400, loss[loss=0.2819, simple_loss=0.3338, pruned_loss=0.115, over 13279.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3067, pruned_loss=0.1125, over 2585729.18 frames. ], batch size: 63, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:48:36,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=161953.0, ans=0.125 2024-06-20 07:48:37,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=161953.0, ans=0.07 2024-06-20 07:48:45,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=161989.66666666666, ans=0.0 2024-06-20 07:48:55,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=162008.0, ans=0.5 2024-06-20 07:48:58,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=162026.33333333334, ans=0.125 2024-06-20 07:49:05,207 INFO [train.py:1028] (0/2) Epoch 9, batch 7450, loss[loss=0.2748, simple_loss=0.3132, pruned_loss=0.1182, over 12811.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3069, pruned_loss=0.1124, over 2579200.22 frames. 
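Note: a few records back (07:44:29), checkpoint.py:75 wrote zipformer/exp/checkpoint-88000.pt. These batch-numbered checkpoints are cut every fixed number of optimizer updates, independently of the per-epoch saves, so training can resume mid-epoch. A sketch of that pattern; the field names and the save_every_n default are illustrative, not icefall's exact schema:

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, scheduler, scaler,
                              batch_idx_train: int, exp_dir: Path,
                              save_every_n: int = 4000) -> None:
        """Write exp_dir/checkpoint-<batch>.pt every save_every_n updates."""
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
                "grad_scaler": scaler.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            path,
        )
        print(f"Saving checkpoint to {path}")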
], batch size: 29, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:49:05,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162044.66666666666, ans=0.1 2024-06-20 07:49:11,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=162063.0, ans=0.0 2024-06-20 07:49:19,406 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.889e+02 2.044e+02 2.294e+02 3.470e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 07:49:24,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=162081.33333333334, ans=0.125 2024-06-20 07:49:30,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=162099.66666666666, ans=0.2 2024-06-20 07:49:38,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2024-06-20 07:49:38,710 INFO [train.py:1028] (0/2) Epoch 9, batch 7500, loss[loss=0.255, simple_loss=0.2918, pruned_loss=0.1091, over 10645.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3077, pruned_loss=0.1128, over 2576322.37 frames. ], batch size: 304, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:49:58,334 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.94 vs. limit=22.5 2024-06-20 07:49:59,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162173.0, ans=0.1 2024-06-20 07:50:00,642 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.973e+01 2024-06-20 07:50:05,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.43 vs. limit=15.0 2024-06-20 07:50:11,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162209.66666666666, ans=0.0 2024-06-20 07:50:14,570 INFO [train.py:1028] (0/2) Epoch 9, batch 7550, loss[loss=0.2458, simple_loss=0.2908, pruned_loss=0.1004, over 12975.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3085, pruned_loss=0.1135, over 2576868.02 frames. ], batch size: 158, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:50:16,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=162228.0, ans=0.0 2024-06-20 07:50:18,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=162228.0, ans=0.125 2024-06-20 07:50:18,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. 
limit=15.0 2024-06-20 07:50:32,128 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.869e+02 2.051e+02 2.235e+02 2.683e+02, threshold=4.103e+02, percent-clipped=0.0 2024-06-20 07:50:38,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=162283.0, ans=0.0 2024-06-20 07:50:47,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=162301.33333333334, ans=0.125 2024-06-20 07:50:47,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=162301.33333333334, ans=0.125 2024-06-20 07:50:51,625 INFO [train.py:1028] (0/2) Epoch 9, batch 7600, loss[loss=0.2579, simple_loss=0.3021, pruned_loss=0.1068, over 13217.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3083, pruned_loss=0.1133, over 2575908.20 frames. ], batch size: 83, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:50:54,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=162319.66666666666, ans=0.125 2024-06-20 07:50:56,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=162319.66666666666, ans=0.0 2024-06-20 07:51:02,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162338.0, ans=0.1 2024-06-20 07:51:14,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0 2024-06-20 07:51:17,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=162374.66666666666, ans=0.0 2024-06-20 07:51:25,173 INFO [train.py:1028] (0/2) Epoch 9, batch 7650, loss[loss=0.2891, simple_loss=0.3309, pruned_loss=0.1237, over 12824.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3086, pruned_loss=0.1133, over 2571494.48 frames. ], batch size: 33, lr: 6.44e-03, grad_scale: 64.0 2024-06-20 07:51:29,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=162411.33333333334, ans=0.125 2024-06-20 07:51:36,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=162429.66666666666, ans=0.2 2024-06-20 07:51:39,649 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.909e+02 2.071e+02 2.328e+02 3.395e+02, threshold=4.142e+02, percent-clipped=0.0 2024-06-20 07:51:52,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.15 vs. limit=15.0 2024-06-20 07:51:53,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=162466.33333333334, ans=0.2 2024-06-20 07:52:02,355 INFO [train.py:1028] (0/2) Epoch 9, batch 7700, loss[loss=0.2991, simple_loss=0.35, pruned_loss=0.1241, over 13222.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3098, pruned_loss=0.1139, over 2569075.03 frames. 
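Note: the scaling.py:1119 WithLoss lines (loss-sum=2.973e+01 for encoder.encoders.2.encoder.layers.1.self_attn_weights just above; 0.000e+00 for layers whose penalty is currently inactive) track auxiliary penalties attached directly to intermediate tensors such as attention weights. The penalty never appears in the printed training loss; it reaches the parameters through the backward pass. One way to build that pattern is an identity op whose backward injects the penalty's gradient; a sketch, not icefall's actual code, with an illustrative L2 penalty:

    import torch

    class WithAuxLoss(torch.autograd.Function):
        """Identity on x; backward adds the gradient of an auxiliary loss on x."""
        @staticmethod
        def forward(ctx, x, scale):
            ctx.save_for_backward(x)
            ctx.scale = scale
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            with torch.enable_grad():
                xd = x.detach().requires_grad_(True)
                aux = ctx.scale * (xd ** 2).sum()     # illustrative penalty
                (aux_grad,) = torch.autograd.grad(aux, xd)
            print(f"WithLoss: loss-sum={aux.item():.3e}")
            return grad_out + aux_grad, None

    attn_weights = torch.randn(4, 16, requires_grad=True)
    out = WithAuxLoss.apply(attn_weights, 1e-4)
    out.sum().backward()    # attn_weights.grad now carries the auxiliary term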
], batch size: 63, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:52:25,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=162558.0, ans=0.04949747468305833 2024-06-20 07:52:31,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=162576.33333333334, ans=0.2 2024-06-20 07:52:34,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.38 vs. limit=15.0 2024-06-20 07:52:36,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=162576.33333333334, ans=0.025 2024-06-20 07:52:38,234 INFO [train.py:1028] (0/2) Epoch 9, batch 7750, loss[loss=0.2743, simple_loss=0.3161, pruned_loss=0.1162, over 13229.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3106, pruned_loss=0.1147, over 2573726.79 frames. ], batch size: 72, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:52:52,089 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.904e+02 2.052e+02 2.223e+02 3.325e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 07:53:07,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=162668.0, ans=0.0 2024-06-20 07:53:11,307 INFO [train.py:1028] (0/2) Epoch 9, batch 7800, loss[loss=0.268, simple_loss=0.3105, pruned_loss=0.1127, over 13127.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3104, pruned_loss=0.1145, over 2578240.35 frames. ], batch size: 95, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:53:21,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=162704.66666666666, ans=0.0 2024-06-20 07:53:47,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=162778.0, ans=0.025 2024-06-20 07:53:48,030 INFO [train.py:1028] (0/2) Epoch 9, batch 7850, loss[loss=0.2636, simple_loss=0.3035, pruned_loss=0.1119, over 11314.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.312, pruned_loss=0.1156, over 2572124.45 frames. ], batch size: 17, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:54:01,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=162814.66666666666, ans=0.2 2024-06-20 07:54:02,180 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.031e+02 2.244e+02 2.640e+02 3.267e+02, threshold=4.488e+02, percent-clipped=0.0 2024-06-20 07:54:04,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.80 vs. 
limit=15.0 2024-06-20 07:54:07,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162833.0, ans=0.0 2024-06-20 07:54:08,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162833.0, ans=0.1 2024-06-20 07:54:18,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=162851.33333333334, ans=0.0 2024-06-20 07:54:18,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=162851.33333333334, ans=0.0 2024-06-20 07:54:23,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=12.0 2024-06-20 07:54:24,876 INFO [train.py:1028] (0/2) Epoch 9, batch 7900, loss[loss=0.2806, simple_loss=0.3219, pruned_loss=0.1196, over 13215.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.313, pruned_loss=0.1163, over 2571512.87 frames. ], batch size: 77, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:54:25,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=162869.66666666666, ans=0.0 2024-06-20 07:54:27,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.90 vs. limit=15.0 2024-06-20 07:54:28,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2024-06-20 07:54:30,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.25 vs. limit=15.0 2024-06-20 07:54:31,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=162888.0, ans=0.125 2024-06-20 07:54:42,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=162906.33333333334, ans=0.0 2024-06-20 07:54:45,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=162924.66666666666, ans=0.0 2024-06-20 07:54:48,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.62 vs. limit=22.5 2024-06-20 07:54:48,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=162924.66666666666, ans=0.07 2024-06-20 07:54:58,202 INFO [train.py:1028] (0/2) Epoch 9, batch 7950, loss[loss=0.2902, simple_loss=0.3239, pruned_loss=0.1283, over 10504.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3136, pruned_loss=0.1163, over 2574462.48 frames. 
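Note: each train.py:1028 record carries two loss groups: loss[...] for the current batch, weighted by its frame count, and tot_loss[...], a frame-weighted running aggregate (the large "over N frames" figure, about 2.57M here). A sketch of frame-weighted aggregation with exponential forgetting, which is what keeps the effective window bounded; the forgetting constant is illustrative:

    class RunningLoss:
        """Frame-weighted loss averages with exponential forgetting (sketch)."""
        def __init__(self, decay: float = 1.0 - 1.0 / 200):
            self.decay = decay
            self.frames = 0.0
            self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

        def update(self, batch_losses: dict, num_frames: float) -> None:
            self.frames = self.frames * self.decay + num_frames
            for name, value in batch_losses.items():
                # accumulate value * frames so the averages stay frame-weighted
                self.sums[name] = self.sums[name] * self.decay + value * num_frames

        def averages(self) -> dict:
            return {name: s / self.frames for name, s in self.sums.items()}

    tot = RunningLoss()
    tot.update({"loss": 0.2869, "simple_loss": 0.3211, "pruned_loss": 0.1263},
               12928.0)   # the batch-6650 figures from the records above
    print(tot.averages(), f"over {tot.frames:.2f} frames")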
], batch size: 303, lr: 6.43e-03, grad_scale: 64.0 2024-06-20 07:55:00,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=162961.33333333334, ans=0.5 2024-06-20 07:55:04,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=162979.66666666666, ans=0.125 2024-06-20 07:55:12,419 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 1.946e+02 2.102e+02 2.334e+02 3.667e+02, threshold=4.204e+02, percent-clipped=0.0 2024-06-20 07:55:16,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162998.0, ans=0.1 2024-06-20 07:55:17,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=162998.0, ans=0.125 2024-06-20 07:55:18,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=163016.33333333334, ans=0.125 2024-06-20 07:55:19,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=163016.33333333334, ans=0.125 2024-06-20 07:55:21,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=163016.33333333334, ans=0.025 2024-06-20 07:55:24,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.05 vs. limit=10.0 2024-06-20 07:55:30,747 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.03 vs. limit=10.0 2024-06-20 07:55:31,689 INFO [train.py:1028] (0/2) Epoch 9, batch 8000, loss[loss=0.2524, simple_loss=0.3087, pruned_loss=0.09803, over 12949.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3149, pruned_loss=0.1167, over 2572101.61 frames. ], batch size: 30, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:55:33,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=163053.0, ans=0.0 2024-06-20 07:55:39,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=163053.0, ans=0.125 2024-06-20 07:55:44,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.08 vs. limit=10.0 2024-06-20 07:55:58,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=163108.0, ans=0.125 2024-06-20 07:56:07,774 INFO [train.py:1028] (0/2) Epoch 9, batch 8050, loss[loss=0.2722, simple_loss=0.3123, pruned_loss=0.116, over 13196.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3144, pruned_loss=0.1165, over 2571746.33 frames. 
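Note: batch sizes in these records swing from 16 (batch 6950, long utterances) up to 303 (batch 7950, short ones) because the sampler packs a variable number of cuts up to a roughly fixed total duration rather than a fixed count. lhotse's DynamicBucketingSampler additionally groups cuts of similar length into buckets to reduce padding waste; the sketch below shows only the core duration constraint, with an illustrative max_duration:

    def pack_by_duration(cut_durations, max_duration: float = 600.0):
        """Yield batches of cut indices whose summed duration fits max_duration."""
        batch, total = [], 0.0
        for idx, dur in enumerate(cut_durations):
            if batch and total + dur > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(idx)
            total += dur
        if batch:
            yield batch

    # short cuts -> large batches, long cuts -> small ones:
    print([len(b) for b in pack_by_duration([2.0] * 900)][0])   # 300 cuts/batch
    print([len(b) for b in pack_by_duration([30.0] * 60)][0])   # 20 cuts/batch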
], batch size: 83, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:56:24,891 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.911e+02 2.109e+02 2.364e+02 3.890e+02, threshold=4.217e+02, percent-clipped=0.0 2024-06-20 07:56:27,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=163181.33333333334, ans=0.07 2024-06-20 07:56:33,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=163199.66666666666, ans=0.09899494936611666 2024-06-20 07:56:38,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=163218.0, ans=0.125 2024-06-20 07:56:43,287 INFO [train.py:1028] (0/2) Epoch 9, batch 8100, loss[loss=0.2652, simple_loss=0.306, pruned_loss=0.1122, over 13153.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3147, pruned_loss=0.1165, over 2576562.24 frames. ], batch size: 112, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:56:47,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.15 vs. limit=15.0 2024-06-20 07:56:48,300 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.28 vs. limit=22.5 2024-06-20 07:56:59,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=163273.0, ans=0.0 2024-06-20 07:57:01,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=163291.33333333334, ans=0.0 2024-06-20 07:57:01,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=163291.33333333334, ans=0.125 2024-06-20 07:57:14,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=163309.66666666666, ans=0.025 2024-06-20 07:57:15,565 INFO [train.py:1028] (0/2) Epoch 9, batch 8150, loss[loss=0.2559, simple_loss=0.2975, pruned_loss=0.1072, over 13112.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3143, pruned_loss=0.1158, over 2579148.25 frames. ], batch size: 121, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:57:15,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=163328.0, ans=0.5 2024-06-20 07:57:16,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=163328.0, ans=0.02 2024-06-20 07:57:28,934 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.922e+02 2.171e+02 2.480e+02 3.860e+02, threshold=4.342e+02, percent-clipped=0.0 2024-06-20 07:57:39,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=163383.0, ans=0.0 2024-06-20 07:57:42,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163383.0, ans=0.1 2024-06-20 07:57:42,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.47 vs. 
limit=15.0 2024-06-20 07:57:50,889 INFO [train.py:1028] (0/2) Epoch 9, batch 8200, loss[loss=0.2928, simple_loss=0.328, pruned_loss=0.1288, over 13142.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3144, pruned_loss=0.1158, over 2582471.29 frames. ], batch size: 112, lr: 6.42e-03, grad_scale: 64.0 2024-06-20 07:58:04,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=163456.33333333334, ans=0.0 2024-06-20 07:58:16,732 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.87 vs. limit=22.5 2024-06-20 07:58:20,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=163493.0, ans=0.025 2024-06-20 07:58:25,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.31 vs. limit=15.0 2024-06-20 07:58:27,189 INFO [train.py:1028] (0/2) Epoch 9, batch 8250, loss[loss=0.2559, simple_loss=0.309, pruned_loss=0.1014, over 13299.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.315, pruned_loss=0.1157, over 2583090.43 frames. ], batch size: 52, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 07:58:37,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=163529.66666666666, ans=0.1 2024-06-20 07:58:41,080 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.938e+02 2.066e+02 2.346e+02 3.240e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-20 07:58:46,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=163566.33333333334, ans=0.2 2024-06-20 07:58:48,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=163566.33333333334, ans=0.125 2024-06-20 07:58:52,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=163566.33333333334, ans=0.0 2024-06-20 07:58:52,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163584.66666666666, ans=0.125 2024-06-20 07:58:56,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.60 vs. limit=15.0 2024-06-20 07:59:00,138 INFO [train.py:1028] (0/2) Epoch 9, batch 8300, loss[loss=0.2753, simple_loss=0.3183, pruned_loss=0.1161, over 12955.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3136, pruned_loss=0.1148, over 2580565.50 frames. 
], batch size: 102, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 07:59:14,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=163639.66666666666, ans=0.125 2024-06-20 07:59:15,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=163639.66666666666, ans=0.0 2024-06-20 07:59:16,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=163639.66666666666, ans=0.125 2024-06-20 07:59:21,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=163658.0, ans=0.0 2024-06-20 07:59:27,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=163676.33333333334, ans=0.125 2024-06-20 07:59:36,416 INFO [train.py:1028] (0/2) Epoch 9, batch 8350, loss[loss=0.2799, simple_loss=0.324, pruned_loss=0.1179, over 13176.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3133, pruned_loss=0.1146, over 2581467.04 frames. ], batch size: 112, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 07:59:41,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=163694.66666666666, ans=0.95 2024-06-20 07:59:46,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.71 vs. limit=15.0 2024-06-20 07:59:50,313 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.869e+02 2.003e+02 2.172e+02 3.219e+02, threshold=4.006e+02, percent-clipped=0.0 2024-06-20 07:59:51,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=163731.33333333334, ans=0.125 2024-06-20 07:59:58,888 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2024-06-20 08:00:07,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163768.0, ans=0.125 2024-06-20 08:00:09,560 INFO [train.py:1028] (0/2) Epoch 9, batch 8400, loss[loss=0.2648, simple_loss=0.306, pruned_loss=0.1118, over 12865.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3132, pruned_loss=0.1148, over 2577749.19 frames. ], batch size: 39, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 08:00:12,167 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=15.0 2024-06-20 08:00:14,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.65 vs. limit=15.0 2024-06-20 08:00:19,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=163804.66666666666, ans=0.0 2024-06-20 08:00:33,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=163841.33333333334, ans=0.125 2024-06-20 08:00:46,217 INFO [train.py:1028] (0/2) Epoch 9, batch 8450, loss[loss=0.2599, simple_loss=0.3104, pruned_loss=0.1047, over 13154.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3137, pruned_loss=0.1146, over 2579929.70 frames. 
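Note: the logged lr decays smoothly from 6.47e-03 at the top of this stretch to 6.41e-03 here; it is recomputed every step as the product of a batch-dependent and an epoch-dependent factor (the Eden schedule used by Zipformer recipes). A sketch of that shape; base_lr and the two reference constants below are illustrative defaults, so the absolute values will not reproduce this log exactly:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        """Eden-style decay, smooth in both the step index and the epoch index."""
        batch_factor = ((batch / lr_batches) ** 2 + 1.0) ** -0.25
        epoch_factor = ((epoch / lr_epochs) ** 2 + 1.0) ** -0.25
        return base_lr * batch_factor * epoch_factor

    for b in (160000, 162000, 164000):   # lr creeps down within the epoch
        print(f"batch {b}: lr={eden_lr(0.035, b, epoch=9.0):.3e}")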
], batch size: 112, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 08:00:46,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=163878.0, ans=0.125 2024-06-20 08:00:46,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=163878.0, ans=0.125 2024-06-20 08:00:52,808 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.20 vs. limit=15.0 2024-06-20 08:01:00,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=163914.66666666666, ans=0.0 2024-06-20 08:01:00,686 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.902e+02 2.013e+02 2.139e+02 2.893e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 08:01:07,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=163933.0, ans=0.07 2024-06-20 08:01:07,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=163933.0, ans=0.025 2024-06-20 08:01:15,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=163951.33333333334, ans=0.125 2024-06-20 08:01:17,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=163951.33333333334, ans=0.0 2024-06-20 08:01:20,625 INFO [train.py:1028] (0/2) Epoch 9, batch 8500, loss[loss=0.2527, simple_loss=0.3005, pruned_loss=0.1025, over 12714.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3144, pruned_loss=0.1151, over 2578359.74 frames. ], batch size: 29, lr: 6.41e-03, grad_scale: 64.0 2024-06-20 08:01:28,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=163988.0, ans=0.125 2024-06-20 08:01:29,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=27.66 vs. limit=22.5 2024-06-20 08:01:31,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=163988.0, ans=0.02 2024-06-20 08:01:42,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=164006.33333333334, ans=0.025 2024-06-20 08:01:56,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=164043.0, ans=0.0 2024-06-20 08:01:59,523 INFO [train.py:1028] (0/2) Epoch 9, batch 8550, loss[loss=0.2754, simple_loss=0.3199, pruned_loss=0.1155, over 12444.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3146, pruned_loss=0.1152, over 2576676.76 frames. ], batch size: 22, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:01:59,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.27 vs. 
limit=15.0 2024-06-20 08:02:13,867 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 2.008e+02 2.155e+02 2.309e+02 3.733e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-20 08:02:36,816 INFO [train.py:1028] (0/2) Epoch 9, batch 8600, loss[loss=0.2688, simple_loss=0.3086, pruned_loss=0.1145, over 13182.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.316, pruned_loss=0.1159, over 2573854.15 frames. ], batch size: 112, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:02:40,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=164153.0, ans=0.0 2024-06-20 08:02:53,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=164189.66666666666, ans=0.125 2024-06-20 08:03:11,296 INFO [train.py:1028] (0/2) Epoch 9, batch 8650, loss[loss=0.2589, simple_loss=0.2964, pruned_loss=0.1107, over 13022.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3156, pruned_loss=0.1155, over 2578040.97 frames. ], batch size: 102, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:03:15,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164244.66666666666, ans=0.1 2024-06-20 08:03:20,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.63 vs. limit=15.0 2024-06-20 08:03:25,107 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 1.961e+02 2.239e+02 2.574e+02 3.794e+02, threshold=4.477e+02, percent-clipped=0.0 2024-06-20 08:03:25,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=164281.33333333334, ans=0.025 2024-06-20 08:03:37,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.62 vs. limit=15.0 2024-06-20 08:03:42,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=164318.0, ans=0.0 2024-06-20 08:03:46,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2024-06-20 08:03:46,126 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.57 vs. limit=22.5 2024-06-20 08:03:47,815 INFO [train.py:1028] (0/2) Epoch 9, batch 8700, loss[loss=0.2872, simple_loss=0.3307, pruned_loss=0.1218, over 13188.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3164, pruned_loss=0.1161, over 2574360.25 frames. ], batch size: 59, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:04:03,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=164373.0, ans=0.2 2024-06-20 08:04:04,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.59 vs. limit=22.5 2024-06-20 08:04:21,930 INFO [train.py:1028] (0/2) Epoch 9, batch 8750, loss[loss=0.2676, simple_loss=0.3094, pruned_loss=0.1129, over 13083.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3163, pruned_loss=0.1164, over 2570857.03 frames. 
], batch size: 121, lr: 6.40e-03, grad_scale: 64.0 2024-06-20 08:04:26,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164428.0, ans=0.1 2024-06-20 08:04:27,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=164428.0, ans=0.125 2024-06-20 08:04:33,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164446.33333333334, ans=0.1 2024-06-20 08:04:34,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=164446.33333333334, ans=0.04949747468305833 2024-06-20 08:04:38,895 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.872e+02 2.037e+02 2.206e+02 3.116e+02, threshold=4.074e+02, percent-clipped=0.0 2024-06-20 08:04:53,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=164501.33333333334, ans=0.125 2024-06-20 08:04:58,078 INFO [train.py:1028] (0/2) Epoch 9, batch 8800, loss[loss=0.2723, simple_loss=0.3192, pruned_loss=0.1128, over 13262.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3165, pruned_loss=0.1164, over 2576203.70 frames. ], batch size: 72, lr: 6.39e-03, grad_scale: 64.0 2024-06-20 08:05:11,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=164556.33333333334, ans=0.0 2024-06-20 08:05:19,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=164574.66666666666, ans=0.0 2024-06-20 08:05:23,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=164574.66666666666, ans=0.0 2024-06-20 08:05:31,637 INFO [train.py:1028] (0/2) Epoch 9, batch 8850, loss[loss=0.3144, simple_loss=0.3432, pruned_loss=0.1428, over 12484.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3169, pruned_loss=0.117, over 2563817.94 frames. 
], batch size: 202, lr: 6.39e-03, grad_scale: 64.0 2024-06-20 08:05:37,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=164611.33333333334, ans=0.95 2024-06-20 08:05:37,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=164611.33333333334, ans=0.125 2024-06-20 08:05:39,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=164611.33333333334, ans=0.1 2024-06-20 08:05:39,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=164611.33333333334, ans=0.0 2024-06-20 08:05:42,190 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:05:47,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=164648.0, ans=0.0 2024-06-20 08:05:49,007 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.883e+02 2.052e+02 2.301e+02 3.173e+02, threshold=4.104e+02, percent-clipped=0.0 2024-06-20 08:05:49,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=12.0 2024-06-20 08:06:02,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=164684.66666666666, ans=0.0 2024-06-20 08:06:08,266 INFO [train.py:1028] (0/2) Epoch 9, batch 8900, loss[loss=0.2307, simple_loss=0.2771, pruned_loss=0.09212, over 13026.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3168, pruned_loss=0.1167, over 2560459.68 frames. ], batch size: 33, lr: 6.39e-03, grad_scale: 64.0 2024-06-20 08:06:16,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.37 vs. limit=22.5 2024-06-20 08:06:20,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=164721.33333333334, ans=0.125 2024-06-20 08:06:44,423 INFO [train.py:1028] (0/2) Epoch 9, batch 8950, loss[loss=0.2873, simple_loss=0.3254, pruned_loss=0.1246, over 12532.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.317, pruned_loss=0.1164, over 2559921.23 frames. 
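Note: at batch 9000, in the records just below, train.py:1051/1060 pauses training for the periodic validation pass: a frame-weighted average over the whole dev set (351949.00 frames) gives loss=0.2006 against a training tot_loss around 0.2745, followed by the peak-memory report (17148MB). A sketch of such a pass; compute_loss is an illustrative stand-in for the recipe's loss function:

    import torch

    def compute_validation_loss(model, valid_loader, device: str = "cuda"):
        """Frame-weighted dev-set loss plus a peak-memory report (sketch)."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = model.compute_loss(batch)  # illustrative API
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        max_mb = torch.cuda.max_memory_allocated(device) // (1024 ** 2)
        print(f"validation: loss={tot_loss / tot_frames:.4f}, "
              f"over {tot_frames:.2f} frames.")
        print(f"Maximum memory allocated so far is {max_mb}MB")
        return tot_loss / tot_frames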
2024-06-20 08:06:47,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=164794.66666666666, ans=0.5 2024-06-20 08:06:58,671 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.919e+02 2.074e+02 2.310e+02 3.014e+02, threshold=4.149e+02, percent-clipped=0.0 2024-06-20 08:06:58,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=164831.33333333334, ans=0.125 2024-06-20 08:07:06,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=164849.66666666666, ans=0.125 2024-06-20 08:07:09,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=164849.66666666666, ans=0.0 2024-06-20 08:07:14,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=164868.0, ans=0.2 2024-06-20 08:07:17,726 INFO [train.py:1028] (0/2) Epoch 9, batch 9000, loss[loss=0.2784, simple_loss=0.3246, pruned_loss=0.1161, over 13345.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3171, pruned_loss=0.1159, over 2566116.23 frames. ], batch size: 46, lr: 6.39e-03, grad_scale: 128.0 2024-06-20 08:07:17,727 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 08:07:25,755 INFO [train.py:1060] (0/2) Epoch 9, validation: loss=0.2006, simple_loss=0.2641, pruned_loss=0.06858, over 351949.00 frames. 2024-06-20 08:07:25,756 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17148MB 2024-06-20 08:07:59,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.46 vs. limit=15.0 2024-06-20 08:08:01,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=15.0 2024-06-20 08:08:01,935 INFO [train.py:1028] (0/2) Epoch 9, batch 9050, loss[loss=0.2802, simple_loss=0.3319, pruned_loss=0.1143, over 11989.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3175, pruned_loss=0.116, over 2566444.62 frames. ], batch size: 18, lr: 6.39e-03, grad_scale: 128.0 2024-06-20 08:08:03,371 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:08:04,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=164978.0, ans=0.2 2024-06-20 08:08:15,514 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 2.008e+02 2.224e+02 2.471e+02 3.519e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-20 08:08:22,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.25 vs. limit=15.0
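
The WARNING [optim.py:487] lines summarize the recent distribution of gradient norms as five quantiles (min, 25%, median, 75%, max). In the entries above the clipping threshold is consistently Clipping_scale times the median: 2.0 * 2.074e+02 = 4.148e+02 against the logged threshold=4.149e+02, and 2.0 * 2.224e+02 = 4.448e+02 exactly; percent-clipped is then the share of recent batches whose norm exceeded that threshold. A sketch under that assumption (the window size and the exact bookkeeping in optim.py are guesses):

from collections import deque
import torch

class ClippingDiagnostic:
    def __init__(self, clipping_scale: float = 2.0, window: int = 1024):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-batch grad norms
        self.seen = 0
        self.clipped = 0

    def update(self, grad_norm: float) -> bool:
        self.norms.append(grad_norm)
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * q[2].item()  # clipping_scale * median
        self.seen += 1
        is_clipped = grad_norm > threshold
        self.clipped += int(is_clipped)       # feeds percent-clipped
        return is_clipped  # the caller would rescale the gradient here
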
2024-06-20 08:08:26,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=165033.0, ans=0.025 2024-06-20 08:08:27,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=165051.33333333334, ans=0.0 2024-06-20 08:08:31,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=165051.33333333334, ans=0.125 2024-06-20 08:08:34,665 INFO [train.py:1028] (0/2) Epoch 9, batch 9100, loss[loss=0.2801, simple_loss=0.3291, pruned_loss=0.1156, over 13291.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3169, pruned_loss=0.1154, over 2566695.09 frames. ], batch size: 72, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:08:34,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=165069.66666666666, ans=0.2 2024-06-20 08:08:36,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=165069.66666666666, ans=0.125 2024-06-20 08:08:36,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.70 vs. limit=15.0 2024-06-20 08:08:37,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=165069.66666666666, ans=0.0 2024-06-20 08:08:46,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=165106.33333333334, ans=0.125 2024-06-20 08:08:50,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=165106.33333333334, ans=0.0 2024-06-20 08:08:56,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2024-06-20 08:08:57,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=165124.66666666666, ans=0.2 2024-06-20 08:09:07,034 INFO [train.py:1028] (0/2) Epoch 9, batch 9150, loss[loss=0.259, simple_loss=0.3004, pruned_loss=0.1088, over 13154.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3178, pruned_loss=0.1163, over 2567656.71 frames. ], batch size: 77, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:09:12,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=165161.33333333334, ans=0.2 2024-06-20 08:09:23,706 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.910e+02 2.089e+02 2.265e+02 2.863e+02, threshold=4.178e+02, percent-clipped=0.0 2024-06-20 08:09:27,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=165198.0, ans=0.125 2024-06-20 08:09:30,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.80 vs. limit=15.0
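
Most of the scaling.py:214 chatter above is ScheduledFloat reporting: hyper-parameters inside the Zipformer (dropout probabilities, skip rates, balancer probabilities, bypass scale floors) are not constants but functions of batch_count, and the log prints the current value as ans=.... By this point (batch_count ~165k) nearly all of them sit at their final values. A sketch of a piecewise-linear schedule in that spirit; the breakpoints below are illustrative, not the recipe's actual ones:

import bisect

class ScheduledFloat:
    """Piecewise-linear in batch_count, flat outside the breakpoints."""
    def __init__(self, *points):  # points: (batch_count, value) pairs
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(165216.0))  # 0.1, like the ans=0.1 entries above
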
2024-06-20 08:09:32,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=165216.33333333334, ans=0.0 2024-06-20 08:09:33,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=165216.33333333334, ans=0.2 2024-06-20 08:09:35,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=165234.66666666666, ans=0.0 2024-06-20 08:09:41,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=165253.0, ans=0.025 2024-06-20 08:09:42,032 INFO [train.py:1028] (0/2) Epoch 9, batch 9200, loss[loss=0.2829, simple_loss=0.3283, pruned_loss=0.1187, over 12933.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3178, pruned_loss=0.116, over 2570498.30 frames. ], batch size: 36, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:09:43,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2024-06-20 08:09:43,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=16.18 vs. limit=15.0 2024-06-20 08:09:51,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.12 vs. limit=15.0 2024-06-20 08:09:59,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2024-06-20 08:10:02,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=165308.0, ans=0.0 2024-06-20 08:10:05,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=165308.0, ans=0.1 2024-06-20 08:10:06,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=165326.33333333334, ans=0.0 2024-06-20 08:10:08,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=165326.33333333334, ans=0.125 2024-06-20 08:10:12,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=165326.33333333334, ans=0.0 2024-06-20 08:10:13,763 INFO [train.py:1028] (0/2) Epoch 9, batch 9250, loss[loss=0.2649, simple_loss=0.3123, pruned_loss=0.1087, over 13298.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3164, pruned_loss=0.1149, over 2572668.67 frames.
], batch size: 67, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:10:17,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165344.66666666666, ans=0.1 2024-06-20 08:10:20,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=165363.0, ans=0.0 2024-06-20 08:10:26,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=165381.33333333334, ans=0.125 2024-06-20 08:10:27,308 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.924e+02 2.031e+02 2.209e+02 3.361e+02, threshold=4.062e+02, percent-clipped=0.0 2024-06-20 08:10:30,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-20 08:10:41,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.00 vs. limit=10.0 2024-06-20 08:10:42,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=165418.0, ans=0.1 2024-06-20 08:10:45,937 INFO [train.py:1028] (0/2) Epoch 9, batch 9300, loss[loss=0.2746, simple_loss=0.3184, pruned_loss=0.1154, over 13288.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3166, pruned_loss=0.1151, over 2570665.33 frames. ], batch size: 40, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:10:48,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=165436.33333333334, ans=0.125 2024-06-20 08:10:50,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=165436.33333333334, ans=0.2 2024-06-20 08:10:59,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=165473.0, ans=0.125 2024-06-20 08:11:00,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=165473.0, ans=0.125 2024-06-20 08:11:13,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.01 vs. limit=22.5 2024-06-20 08:11:17,278 INFO [train.py:1028] (0/2) Epoch 9, batch 9350, loss[loss=0.2768, simple_loss=0.3167, pruned_loss=0.1185, over 12566.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3167, pruned_loss=0.1153, over 2567018.16 frames. ], batch size: 22, lr: 6.38e-03, grad_scale: 128.0 2024-06-20 08:11:25,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.58 vs. limit=15.0 2024-06-20 08:11:32,483 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.928e+02 2.098e+02 2.270e+02 3.271e+02, threshold=4.197e+02, percent-clipped=0.0 2024-06-20 08:11:36,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.78 vs. 
limit=15.0 2024-06-20 08:11:39,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165583.0, ans=0.1 2024-06-20 08:11:39,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2024-06-20 08:11:41,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.61 vs. limit=15.0 2024-06-20 08:11:50,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.74 vs. limit=10.0 2024-06-20 08:11:50,551 INFO [train.py:1028] (0/2) Epoch 9, batch 9400, loss[loss=0.2932, simple_loss=0.3413, pruned_loss=0.1225, over 13255.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3172, pruned_loss=0.1157, over 2566792.47 frames. ], batch size: 52, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:11:55,656 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.26 vs. limit=15.0 2024-06-20 08:11:59,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=165638.0, ans=0.125 2024-06-20 08:11:59,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=165638.0, ans=0.125 2024-06-20 08:12:01,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-20 08:12:09,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=165674.66666666666, ans=0.0 2024-06-20 08:12:21,217 INFO [train.py:1028] (0/2) Epoch 9, batch 9450, loss[loss=0.2966, simple_loss=0.3345, pruned_loss=0.1293, over 12607.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3182, pruned_loss=0.1164, over 2567645.91 frames. ], batch size: 22, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:12:30,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=165729.66666666666, ans=0.125 2024-06-20 08:12:36,134 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.939e+02 2.102e+02 2.280e+02 2.892e+02, threshold=4.203e+02, percent-clipped=0.0 2024-06-20 08:12:38,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=165748.0, ans=0.0 2024-06-20 08:12:45,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=165766.33333333334, ans=0.0 2024-06-20 08:12:53,402 INFO [train.py:1028] (0/2) Epoch 9, batch 9500, loss[loss=0.2918, simple_loss=0.3411, pruned_loss=0.1213, over 13268.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3168, pruned_loss=0.1155, over 2576929.72 frames. 
], batch size: 43, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:12:55,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=165803.0, ans=0.0 2024-06-20 08:12:55,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=165803.0, ans=0.125 2024-06-20 08:13:03,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.34 vs. limit=6.0 2024-06-20 08:13:03,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=165821.33333333334, ans=0.1 2024-06-20 08:13:11,875 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.25 vs. limit=22.5 2024-06-20 08:13:12,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=165858.0, ans=0.07 2024-06-20 08:13:14,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=165858.0, ans=0.125 2024-06-20 08:13:15,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=165858.0, ans=0.025 2024-06-20 08:13:16,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.53 vs. limit=15.0 2024-06-20 08:13:17,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=165876.33333333334, ans=0.0 2024-06-20 08:13:23,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.51 vs. limit=15.0 2024-06-20 08:13:23,832 INFO [train.py:1028] (0/2) Epoch 9, batch 9550, loss[loss=0.2433, simple_loss=0.2953, pruned_loss=0.09566, over 12929.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3161, pruned_loss=0.115, over 2572311.91 frames. ], batch size: 39, lr: 6.37e-03, grad_scale: 128.0 2024-06-20 08:13:31,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=165913.0, ans=0.0 2024-06-20 08:13:34,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=165913.0, ans=0.0 2024-06-20 08:13:36,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=165931.33333333334, ans=0.125 2024-06-20 08:13:37,463 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.982e+02 2.166e+02 2.371e+02 3.673e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-20 08:13:42,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=12.18 vs. limit=12.0 2024-06-20 08:13:54,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=165986.33333333334, ans=0.125 2024-06-20 08:13:54,587 INFO [train.py:1028] (0/2) Epoch 9, batch 9600, loss[loss=0.2833, simple_loss=0.3073, pruned_loss=0.1297, over 10663.00 frames. 
], tot_loss[loss=0.2725, simple_loss=0.3156, pruned_loss=0.1147, over 2569732.25 frames. ], batch size: 303, lr: 6.37e-03, grad_scale: 64.0 2024-06-20 08:14:00,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=166004.66666666666, ans=0.025 2024-06-20 08:14:07,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=166023.0, ans=0.2 2024-06-20 08:14:16,075 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:14:16,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0 2024-06-20 08:14:17,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=166041.33333333334, ans=0.125 2024-06-20 08:14:17,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.63 vs. limit=15.0 2024-06-20 08:14:22,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=166059.66666666666, ans=0.125 2024-06-20 08:14:27,029 INFO [train.py:1028] (0/2) Epoch 9, batch 9650, loss[loss=0.2765, simple_loss=0.3201, pruned_loss=0.1165, over 13133.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.316, pruned_loss=0.1152, over 2558430.30 frames. ], batch size: 132, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:14:27,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=166078.0, ans=0.125 2024-06-20 08:14:40,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=166114.66666666666, ans=0.125 2024-06-20 08:14:41,019 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.067e+02 2.177e+02 2.491e+02 3.298e+02, threshold=4.355e+02, percent-clipped=0.0 2024-06-20 08:14:41,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=166114.66666666666, ans=0.125 2024-06-20 08:14:48,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=166133.0, ans=0.025 2024-06-20 08:14:57,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2024-06-20 08:15:01,144 INFO [train.py:1028] (0/2) Epoch 9, batch 9700, loss[loss=0.2755, simple_loss=0.3208, pruned_loss=0.1151, over 13029.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3153, pruned_loss=0.1152, over 2555331.02 frames. 
], batch size: 144, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:15:01,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=166169.66666666666, ans=0.0 2024-06-20 08:15:07,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166188.0, ans=0.0 2024-06-20 08:15:07,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=166188.0, ans=0.125 2024-06-20 08:15:07,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=166188.0, ans=0.1 2024-06-20 08:15:20,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2024-06-20 08:15:21,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.76 vs. limit=15.0 2024-06-20 08:15:28,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=166243.0, ans=0.125 2024-06-20 08:15:29,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.22 vs. limit=22.5 2024-06-20 08:15:31,390 INFO [train.py:1028] (0/2) Epoch 9, batch 9750, loss[loss=0.255, simple_loss=0.3032, pruned_loss=0.1034, over 13115.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3144, pruned_loss=0.1146, over 2552466.78 frames. ], batch size: 132, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:15:33,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=166261.33333333334, ans=0.125 2024-06-20 08:15:45,207 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.940e+02 2.091e+02 2.358e+02 3.881e+02, threshold=4.181e+02, percent-clipped=0.0 2024-06-20 08:15:56,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=166316.33333333334, ans=0.125 2024-06-20 08:16:03,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.97 vs. limit=10.0 2024-06-20 08:16:04,314 INFO [train.py:1028] (0/2) Epoch 9, batch 9800, loss[loss=0.2491, simple_loss=0.2999, pruned_loss=0.09919, over 12902.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3135, pruned_loss=0.1135, over 2547185.30 frames. ], batch size: 39, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:16:08,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=166353.0, ans=0.1 2024-06-20 08:16:10,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=166371.33333333334, ans=0.0 2024-06-20 08:16:24,924 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:16:35,010 INFO [train.py:1028] (0/2) Epoch 9, batch 9850, loss[loss=0.2768, simple_loss=0.3166, pruned_loss=0.1185, over 13124.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3129, pruned_loss=0.1132, over 2540004.30 frames. 
], batch size: 103, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:16:36,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.76 vs. limit=22.5 2024-06-20 08:16:45,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=16.03 vs. limit=15.0 2024-06-20 08:16:47,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=15.0 2024-06-20 08:16:48,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=166481.33333333334, ans=0.125 2024-06-20 08:16:49,843 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.971e+02 2.232e+02 2.556e+02 3.991e+02, threshold=4.463e+02, percent-clipped=0.0 2024-06-20 08:16:50,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=166481.33333333334, ans=0.125 2024-06-20 08:17:02,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=166518.0, ans=0.2 2024-06-20 08:17:06,347 INFO [train.py:1028] (0/2) Epoch 9, batch 9900, loss[loss=0.2397, simple_loss=0.2937, pruned_loss=0.09282, over 12942.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3128, pruned_loss=0.1135, over 2533208.89 frames. ], batch size: 39, lr: 6.36e-03, grad_scale: 64.0 2024-06-20 08:17:06,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=166536.33333333334, ans=0.125 2024-06-20 08:17:08,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.91 vs. limit=22.5 2024-06-20 08:17:09,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=166536.33333333334, ans=0.05 2024-06-20 08:17:11,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=166536.33333333334, ans=0.125 2024-06-20 08:17:21,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=166573.0, ans=0.025 2024-06-20 08:17:25,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=166573.0, ans=0.125 2024-06-20 08:17:31,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=166591.33333333334, ans=0.0 2024-06-20 08:17:35,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=166609.66666666666, ans=0.2 2024-06-20 08:17:38,374 INFO [train.py:1028] (0/2) Epoch 9, batch 9950, loss[loss=0.2599, simple_loss=0.2988, pruned_loss=0.1105, over 12813.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3115, pruned_loss=0.1133, over 2527420.60 frames. ], batch size: 29, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:17:44,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. 
limit=15.0 2024-06-20 08:17:52,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.911e+02 2.032e+02 2.257e+02 5.473e+02, threshold=4.064e+02, percent-clipped=1.0 2024-06-20 08:17:53,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.83 vs. limit=15.0 2024-06-20 08:17:55,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=166664.66666666666, ans=0.125 2024-06-20 08:17:57,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=166683.0, ans=0.125 2024-06-20 08:18:00,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=166683.0, ans=0.125 2024-06-20 08:18:03,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.20 vs. limit=22.5 2024-06-20 08:18:10,408 INFO [train.py:1028] (0/2) Epoch 9, batch 10000, loss[loss=0.2575, simple_loss=0.3087, pruned_loss=0.1032, over 12603.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3119, pruned_loss=0.1139, over 2488245.61 frames. ], batch size: 22, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:18:24,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=166756.33333333334, ans=0.125 2024-06-20 08:18:33,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=166774.66666666666, ans=0.0 2024-06-20 08:18:33,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=166774.66666666666, ans=0.125 2024-06-20 08:18:35,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=166774.66666666666, ans=0.2 2024-06-20 08:18:38,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=166793.0, ans=0.07 2024-06-20 08:18:41,961 INFO [train.py:1028] (0/2) Epoch 9, batch 10050, loss[loss=0.2507, simple_loss=0.2984, pruned_loss=0.1015, over 12454.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3118, pruned_loss=0.1148, over 2446967.51 frames. ], batch size: 22, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:18:43,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=21.26 vs. limit=15.0 2024-06-20 08:18:46,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.41 vs. 
limit=15.0 2024-06-20 08:18:51,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=166829.66666666666, ans=0.125 2024-06-20 08:18:55,349 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.052e+02 2.246e+02 2.639e+02 5.037e+02, threshold=4.492e+02, percent-clipped=6.0 2024-06-20 08:19:03,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=166866.33333333334, ans=0.2 2024-06-20 08:19:04,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=166866.33333333334, ans=0.0 2024-06-20 08:19:05,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.01 vs. limit=22.5 2024-06-20 08:19:12,369 INFO [train.py:1028] (0/2) Epoch 9, batch 10100, loss[loss=0.2716, simple_loss=0.3227, pruned_loss=0.1103, over 10673.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3117, pruned_loss=0.1146, over 2423810.14 frames. ], batch size: 16, lr: 6.35e-03, grad_scale: 64.0 2024-06-20 08:19:15,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=166903.0, ans=0.125 2024-06-20 08:19:25,271 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-9.pt 2024-06-20 08:21:29,056 INFO [train.py:1028] (0/2) Epoch 10, batch 0, loss[loss=0.235, simple_loss=0.2763, pruned_loss=0.09685, over 12951.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2763, pruned_loss=0.09685, over 12951.00 frames. ], batch size: 36, lr: 6.04e-03, grad_scale: 64.0 2024-06-20 08:21:29,057 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 08:21:36,999 INFO [train.py:1060] (0/2) Epoch 10, validation: loss=0.2026, simple_loss=0.2664, pruned_loss=0.06938, over 351949.00 frames. 2024-06-20 08:21:37,000 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 08:21:47,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=166952.5, ans=0.125 2024-06-20 08:21:55,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166970.83333333334, ans=0.0 2024-06-20 08:22:04,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=167007.5, ans=0.025 2024-06-20 08:22:05,134 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:22:10,860 INFO [train.py:1028] (0/2) Epoch 10, batch 50, loss[loss=0.2576, simple_loss=0.3056, pruned_loss=0.1048, over 12638.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.2946, pruned_loss=0.1069, over 574401.86 frames. ], batch size: 29, lr: 6.04e-03, grad_scale: 64.0
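
Within epoch 9 the lr column decays slowly (6.40e-03 down to 6.35e-03), then steps down to 6.04e-03 as epoch 10 starts above: the schedule has both a batch term and an epoch term. With base_lr=0.035, lr_batches=7500 and lr_epochs=3.5 from the startup config, an Eden-style schedule reproduces the logged values; the exact indexing below (global batch around 91,000 at the epoch boundary, the epoch term fed the number of finished epochs) is an assumption:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Both factors are ~1 early on and decay like x**-0.5 once
    # batch >> lr_batches and epoch >> lr_epochs.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(eden_lr(0.035, batch=91000, epoch=9))  # ~6.04e-03, as logged above
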
2024-06-20 08:22:14,206 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.824e+02 1.970e+02 2.289e+02 3.262e+02, threshold=3.939e+02, percent-clipped=0.0 2024-06-20 08:22:16,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=167025.83333333334, ans=0.2 2024-06-20 08:22:19,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=167044.16666666666, ans=0.04949747468305833 2024-06-20 08:22:20,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=167044.16666666666, ans=0.0 2024-06-20 08:22:21,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=167044.16666666666, ans=0.125 2024-06-20 08:22:21,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=167044.16666666666, ans=0.0 2024-06-20 08:22:29,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=167080.83333333334, ans=0.0 2024-06-20 08:22:30,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=167080.83333333334, ans=0.125 2024-06-20 08:22:31,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=27.16 vs. limit=15.0 2024-06-20 08:22:37,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167099.16666666666, ans=0.125 2024-06-20 08:22:45,558 INFO [train.py:1028] (0/2) Epoch 10, batch 100, loss[loss=0.229, simple_loss=0.2796, pruned_loss=0.0892, over 13355.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.2919, pruned_loss=0.1051, over 1017391.72 frames. ], batch size: 46, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:22:50,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.51 vs. limit=15.0 2024-06-20 08:22:56,408 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.52 vs. limit=22.5 2024-06-20 08:23:05,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=167172.5, ans=0.1 2024-06-20 08:23:06,786 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.14 vs. limit=22.5 2024-06-20 08:23:17,283 INFO [train.py:1028] (0/2) Epoch 10, batch 150, loss[loss=0.2369, simple_loss=0.286, pruned_loss=0.09395, over 12568.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.2906, pruned_loss=0.1035, over 1366264.64 frames. ], batch size: 29, lr: 6.03e-03, grad_scale: 64.0
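
The grad_scale field in every loss line is the fp16 loss-scaling factor ('use_fp16': True in the startup config); earlier in this section it doubled to 128.0 and later dropped back to 64.0, which is the usual dynamic-scaling behavior: grow the scale while steps succeed, halve it when gradients overflow. A generic PyTorch AMP step showing where that number lives; icefall's optimizer wrapper adds its own checks on top, so treat this as a plain-PyTorch analogy rather than the recipe's code:

import torch

model = torch.nn.Linear(80, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=6.0e-3)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

def train_step(features: torch.Tensor, targets: torch.Tensor):
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16,
                        enabled=torch.cuda.is_available()):
        loss = torch.nn.functional.mse_loss(model(features), targets)
    scaler.scale(loss).backward()  # gradients carry scaler.get_scale()
    scaler.step(optimizer)         # unscales; skips the step on inf/nan
    scaler.update()                # grows or halves the logged grad_scale
    return loss.detach()
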
2024-06-20 08:23:20,520 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.834e+02 1.991e+02 2.236e+02 2.869e+02, threshold=3.982e+02, percent-clipped=0.0 2024-06-20 08:23:25,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167209.16666666666, ans=0.125 2024-06-20 08:23:32,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=167245.83333333334, ans=0.2 2024-06-20 08:23:35,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0 2024-06-20 08:23:49,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=167282.5, ans=0.2 2024-06-20 08:23:49,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=167282.5, ans=0.5 2024-06-20 08:23:52,452 INFO [train.py:1028] (0/2) Epoch 10, batch 200, loss[loss=0.2998, simple_loss=0.331, pruned_loss=0.1343, over 12480.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.2916, pruned_loss=0.1044, over 1635265.19 frames. ], batch size: 202, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:23:55,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=167300.83333333334, ans=0.025 2024-06-20 08:23:56,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=167300.83333333334, ans=0.125 2024-06-20 08:23:57,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=167300.83333333334, ans=0.2 2024-06-20 08:24:06,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.42 vs. limit=10.0 2024-06-20 08:24:08,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.83 vs. limit=10.0 2024-06-20 08:24:08,324 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.61 vs. limit=15.0 2024-06-20 08:24:08,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=167337.5, ans=0.0 2024-06-20 08:24:15,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=167355.83333333334, ans=0.125 2024-06-20 08:24:19,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=167374.16666666666, ans=0.125 2024-06-20 08:24:21,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=167374.16666666666, ans=0.125 2024-06-20 08:24:23,467 INFO [train.py:1028] (0/2) Epoch 10, batch 250, loss[loss=0.2342, simple_loss=0.2717, pruned_loss=0.09837, over 13039.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.2906, pruned_loss=0.1038, over 1846425.81 frames.
], batch size: 144, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:24:23,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=167392.5, ans=0.125 2024-06-20 08:24:26,794 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.849e+02 1.948e+02 2.083e+02 2.623e+02, threshold=3.896e+02, percent-clipped=0.0 2024-06-20 08:24:34,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=167410.83333333334, ans=0.025 2024-06-20 08:24:36,514 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:24:45,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=167429.16666666666, ans=0.125 2024-06-20 08:24:50,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=167447.5, ans=0.0 2024-06-20 08:24:53,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.14 vs. limit=12.0 2024-06-20 08:25:00,437 INFO [train.py:1028] (0/2) Epoch 10, batch 300, loss[loss=0.2563, simple_loss=0.2966, pruned_loss=0.108, over 13166.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2898, pruned_loss=0.1029, over 2009946.93 frames. ], batch size: 112, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:25:10,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=167502.5, ans=0.125 2024-06-20 08:25:10,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.91 vs. limit=15.0 2024-06-20 08:25:15,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=167520.83333333334, ans=0.125 2024-06-20 08:25:26,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=167539.16666666666, ans=0.0 2024-06-20 08:25:35,506 INFO [train.py:1028] (0/2) Epoch 10, batch 350, loss[loss=0.2406, simple_loss=0.2909, pruned_loss=0.09514, over 12825.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.29, pruned_loss=0.1027, over 2138930.23 frames. ], batch size: 33, lr: 6.03e-03, grad_scale: 64.0 2024-06-20 08:25:37,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=167575.83333333334, ans=0.0 2024-06-20 08:25:38,695 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.826e+02 2.050e+02 2.217e+02 2.972e+02, threshold=4.100e+02, percent-clipped=0.0 2024-06-20 08:25:39,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=167575.83333333334, ans=0.025 2024-06-20 08:25:48,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=167612.5, ans=0.0 2024-06-20 08:26:07,710 INFO [train.py:1028] (0/2) Epoch 10, batch 400, loss[loss=0.2279, simple_loss=0.2779, pruned_loss=0.08892, over 13212.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.29, pruned_loss=0.1025, over 2239007.00 frames. 
], batch size: 63, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:26:09,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=167667.5, ans=0.0 2024-06-20 08:26:12,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 08:26:13,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=167685.83333333334, ans=0.2 2024-06-20 08:26:16,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=167685.83333333334, ans=0.125 2024-06-20 08:26:25,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167704.16666666666, ans=0.1 2024-06-20 08:26:28,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=167722.5, ans=0.125 2024-06-20 08:26:32,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.92 vs. limit=22.5 2024-06-20 08:26:39,464 INFO [train.py:1028] (0/2) Epoch 10, batch 450, loss[loss=0.2281, simple_loss=0.2796, pruned_loss=0.08827, over 13191.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2901, pruned_loss=0.1023, over 2313811.29 frames. ], batch size: 67, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:26:42,543 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.743e+02 1.889e+02 2.077e+02 2.798e+02, threshold=3.777e+02, percent-clipped=0.0 2024-06-20 08:26:56,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167795.83333333334, ans=0.125 2024-06-20 08:26:56,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=167795.83333333334, ans=0.125 2024-06-20 08:27:01,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=167814.16666666666, ans=0.2 2024-06-20 08:27:07,427 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.93 vs. limit=22.5 2024-06-20 08:27:07,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.03 vs. limit=10.0 2024-06-20 08:27:10,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167832.5, ans=0.125 2024-06-20 08:27:13,731 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-20 08:27:14,477 INFO [train.py:1028] (0/2) Epoch 10, batch 500, loss[loss=0.2121, simple_loss=0.2498, pruned_loss=0.08719, over 13136.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.2907, pruned_loss=0.1025, over 2375684.66 frames. ], batch size: 121, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:27:16,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.56 vs. 
limit=22.5 2024-06-20 08:27:18,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167850.83333333334, ans=0.125 2024-06-20 08:27:18,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=167850.83333333334, ans=0.2 2024-06-20 08:27:29,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.72 vs. limit=15.0 2024-06-20 08:27:34,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167887.5, ans=0.0 2024-06-20 08:27:40,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=15.0 2024-06-20 08:27:49,512 INFO [train.py:1028] (0/2) Epoch 10, batch 550, loss[loss=0.2386, simple_loss=0.2786, pruned_loss=0.09935, over 12921.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2907, pruned_loss=0.1024, over 2420194.17 frames. ], batch size: 158, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:27:51,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=167942.5, ans=0.125 2024-06-20 08:27:52,837 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.858e+02 2.019e+02 2.174e+02 3.635e+02, threshold=4.039e+02, percent-clipped=0.0 2024-06-20 08:27:54,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=167942.5, ans=0.04949747468305833 2024-06-20 08:28:13,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=168015.83333333334, ans=0.0 2024-06-20 08:28:17,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-20 08:28:20,936 INFO [train.py:1028] (0/2) Epoch 10, batch 600, loss[loss=0.2139, simple_loss=0.2468, pruned_loss=0.09047, over 13005.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.2895, pruned_loss=0.102, over 2458230.65 frames. ], batch size: 144, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:28:28,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.37 vs. limit=15.0 2024-06-20 08:28:29,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=168052.5, ans=0.0 2024-06-20 08:28:31,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=168052.5, ans=0.1 2024-06-20 08:28:31,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=168052.5, ans=0.125 2024-06-20 08:28:44,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=168089.16666666666, ans=0.0 2024-06-20 08:28:52,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168107.5, ans=0.125 2024-06-20 08:28:53,167 INFO [train.py:1028] (0/2) Epoch 10, batch 650, loss[loss=0.2359, simple_loss=0.2858, pruned_loss=0.09303, over 13212.00 frames. 
], tot_loss[loss=0.2459, simple_loss=0.2889, pruned_loss=0.1014, over 2489630.39 frames. ], batch size: 59, lr: 6.02e-03, grad_scale: 64.0 2024-06-20 08:28:59,380 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.796e+02 1.891e+02 2.172e+02 2.849e+02, threshold=3.782e+02, percent-clipped=0.0 2024-06-20 08:29:04,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168144.16666666666, ans=0.125 2024-06-20 08:29:26,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168199.16666666666, ans=0.125 2024-06-20 08:29:28,015 INFO [train.py:1028] (0/2) Epoch 10, batch 700, loss[loss=0.2551, simple_loss=0.3006, pruned_loss=0.1048, over 13304.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.2884, pruned_loss=0.1013, over 2512477.21 frames. ], batch size: 46, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:29:37,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168217.5, ans=0.125 2024-06-20 08:29:40,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=168235.83333333334, ans=0.125 2024-06-20 08:29:45,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=168254.16666666666, ans=0.025 2024-06-20 08:29:46,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=12.0 2024-06-20 08:29:47,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2024-06-20 08:29:57,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=168290.83333333334, ans=0.0 2024-06-20 08:30:00,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2024-06-20 08:30:03,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168290.83333333334, ans=0.125 2024-06-20 08:30:04,905 INFO [train.py:1028] (0/2) Epoch 10, batch 750, loss[loss=0.2211, simple_loss=0.2691, pruned_loss=0.08654, over 13221.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.2891, pruned_loss=0.1016, over 2528452.47 frames. ], batch size: 63, lr: 6.01e-03, grad_scale: 64.0 2024-06-20 08:30:06,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=168309.16666666666, ans=0.0 2024-06-20 08:30:08,262 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.803e+02 1.930e+02 2.091e+02 2.746e+02, threshold=3.860e+02, percent-clipped=0.0 2024-06-20 08:30:13,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.84 vs. 
limit=15.0
2024-06-20 08:30:15,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=168327.5, ans=0.2
2024-06-20 08:30:24,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=168345.83333333334, ans=0.2
2024-06-20 08:30:39,098 INFO [train.py:1028] (0/2) Epoch 10, batch 800, loss[loss=0.2211, simple_loss=0.2727, pruned_loss=0.08478, over 12926.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.2894, pruned_loss=0.1018, over 2541309.07 frames. ], batch size: 36, lr: 6.01e-03, grad_scale: 64.0
2024-06-20 08:30:46,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.08 vs. limit=15.0
2024-06-20 08:30:49,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=168419.16666666666, ans=0.125
2024-06-20 08:30:52,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=168437.5, ans=0.025
2024-06-20 08:30:55,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=168437.5, ans=0.2
2024-06-20 08:31:06,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168455.83333333334, ans=0.125
2024-06-20 08:31:12,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=168474.16666666666, ans=0.125
2024-06-20 08:31:15,927 INFO [train.py:1028] (0/2) Epoch 10, batch 850, loss[loss=0.2329, simple_loss=0.2765, pruned_loss=0.09464, over 13167.00 frames. ], tot_loss[loss=0.246, simple_loss=0.2888, pruned_loss=0.1016, over 2551757.73 frames. ], batch size: 95, lr: 6.01e-03, grad_scale: 64.0
2024-06-20 08:31:19,064 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.787e+02 1.922e+02 2.052e+02 2.636e+02, threshold=3.844e+02, percent-clipped=0.0
2024-06-20 08:31:20,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=168492.5, ans=0.0
2024-06-20 08:31:21,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=168510.83333333334, ans=0.07
2024-06-20 08:31:24,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=168510.83333333334, ans=0.125
2024-06-20 08:31:29,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168529.16666666666, ans=0.125
2024-06-20 08:31:32,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=12.0
2024-06-20 08:31:46,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=168565.83333333334, ans=0.025
2024-06-20 08:31:49,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=168565.83333333334, ans=0.125
2024-06-20 08:31:50,917 INFO [train.py:1028] (0/2) Epoch 10, batch 900, loss[loss=0.2418, simple_loss=0.2923, pruned_loss=0.0957, over 12919.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.2886, pruned_loss=0.1017, over 2556335.22 frames. ], batch size: 36, lr: 6.01e-03, grad_scale: 64.0
2024-06-20 08:32:00,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.48 vs. limit=15.0
2024-06-20 08:32:01,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.44 vs. limit=10.0
2024-06-20 08:32:19,575 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-92000.pt
2024-06-20 08:32:28,856 INFO [train.py:1028] (0/2) Epoch 10, batch 950, loss[loss=0.2481, simple_loss=0.2979, pruned_loss=0.09916, over 12903.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.2885, pruned_loss=0.1017, over 2559180.54 frames. ], batch size: 39, lr: 6.01e-03, grad_scale: 64.0
2024-06-20 08:32:32,006 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.875e+02 2.028e+02 2.246e+02 3.316e+02, threshold=4.056e+02, percent-clipped=0.0
2024-06-20 08:32:45,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0
2024-06-20 08:32:46,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=168712.5, ans=0.025
2024-06-20 08:32:55,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=168749.16666666666, ans=15.0
2024-06-20 08:32:55,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=168749.16666666666, ans=0.125
2024-06-20 08:33:03,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=168749.16666666666, ans=0.2
2024-06-20 08:33:04,213 INFO [train.py:1028] (0/2) Epoch 10, batch 1000, loss[loss=0.2723, simple_loss=0.3131, pruned_loss=0.1157, over 13267.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.2878, pruned_loss=0.1017, over 2561596.29 frames. ], batch size: 49, lr: 6.00e-03, grad_scale: 64.0
2024-06-20 08:33:11,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=168785.83333333334, ans=0.2
2024-06-20 08:33:16,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=168804.16666666666, ans=0.125
2024-06-20 08:33:17,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.59 vs. limit=22.5
2024-06-20 08:33:20,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.56 vs. limit=22.5
2024-06-20 08:33:22,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=168804.16666666666, ans=0.2
2024-06-20 08:33:23,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.19 vs. limit=15.0
2024-06-20 08:33:25,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=168822.5, ans=0.2
2024-06-20 08:33:29,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=168822.5, ans=0.1
2024-06-20 08:33:32,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=168840.83333333334, ans=0.125
2024-06-20 08:33:35,869 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.85 vs. limit=12.0
2024-06-20 08:33:37,581 INFO [train.py:1028] (0/2) Epoch 10, batch 1050, loss[loss=0.2257, simple_loss=0.2734, pruned_loss=0.08897, over 13201.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.2888, pruned_loss=0.1023, over 2564355.24 frames. ], batch size: 77, lr: 6.00e-03, grad_scale: 64.0
2024-06-20 08:33:43,907 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.775e+02 1.895e+02 2.104e+02 2.890e+02, threshold=3.790e+02, percent-clipped=0.0
2024-06-20 08:33:45,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0
2024-06-20 08:34:02,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=168914.16666666666, ans=0.025
2024-06-20 08:34:03,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=168914.16666666666, ans=0.2
2024-06-20 08:34:08,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=168932.5, ans=0.0
2024-06-20 08:34:13,604 INFO [train.py:1028] (0/2) Epoch 10, batch 1100, loss[loss=0.2372, simple_loss=0.2885, pruned_loss=0.09296, over 13291.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2895, pruned_loss=0.1021, over 2570680.13 frames. ], batch size: 52, lr: 6.00e-03, grad_scale: 64.0
2024-06-20 08:34:19,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=168969.16666666666, ans=0.025
2024-06-20 08:34:21,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0
2024-06-20 08:34:22,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168969.16666666666, ans=0.1
2024-06-20 08:34:32,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=12.0
2024-06-20 08:34:46,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=169042.5, ans=0.125
2024-06-20 08:34:46,462 INFO [train.py:1028] (0/2) Epoch 10, batch 1150, loss[loss=0.2371, simple_loss=0.2889, pruned_loss=0.0926, over 13297.00 frames. ], tot_loss[loss=0.248, simple_loss=0.2905, pruned_loss=0.1027, over 2571347.89 frames. ], batch size: 52, lr: 6.00e-03, grad_scale: 64.0
2024-06-20 08:34:49,635 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.813e+02 1.939e+02 2.084e+02 2.757e+02, threshold=3.879e+02, percent-clipped=0.0
2024-06-20 08:34:51,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=169042.5, ans=0.0
2024-06-20 08:34:54,918 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5
2024-06-20 08:35:11,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=169097.5, ans=0.0
2024-06-20 08:35:21,004 INFO [train.py:1028] (0/2) Epoch 10, batch 1200, loss[loss=0.2333, simple_loss=0.2856, pruned_loss=0.09051, over 13188.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.29, pruned_loss=0.1025, over 2573510.69 frames. ], batch size: 77, lr: 6.00e-03, grad_scale: 64.0
2024-06-20 08:35:28,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=169152.5, ans=0.0
2024-06-20 08:35:36,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=169170.83333333334, ans=0.125
2024-06-20 08:35:37,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=169170.83333333334, ans=0.1
2024-06-20 08:35:38,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169170.83333333334, ans=0.125
2024-06-20 08:35:38,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.53 vs. limit=15.0
2024-06-20 08:35:39,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.32 vs. limit=22.5
2024-06-20 08:35:46,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=169189.16666666666, ans=0.0
2024-06-20 08:35:51,036 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 08:35:52,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=10.0
2024-06-20 08:35:54,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.71 vs. limit=22.5
2024-06-20 08:35:55,240 INFO [train.py:1028] (0/2) Epoch 10, batch 1250, loss[loss=0.2242, simple_loss=0.2682, pruned_loss=0.0901, over 13194.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.2893, pruned_loss=0.102, over 2582978.46 frames. ], batch size: 112, lr: 6.00e-03, grad_scale: 64.0
2024-06-20 08:35:58,486 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.812e+02 1.934e+02 2.118e+02 3.061e+02, threshold=3.869e+02, percent-clipped=0.0
2024-06-20 08:36:01,343 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.03 vs. limit=10.0
2024-06-20 08:36:05,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=169244.16666666666, ans=0.09899494936611666
2024-06-20 08:36:19,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=169280.83333333334, ans=0.5
2024-06-20 08:36:25,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0
2024-06-20 08:36:27,400 INFO [train.py:1028] (0/2) Epoch 10, batch 1300, loss[loss=0.2708, simple_loss=0.3015, pruned_loss=0.12, over 12773.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2896, pruned_loss=0.1021, over 2583039.26 frames. ], batch size: 176, lr: 5.99e-03, grad_scale: 64.0
2024-06-20 08:36:32,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=169317.5, ans=0.125
2024-06-20 08:36:33,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=169335.83333333334, ans=0.015
2024-06-20 08:36:44,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169354.16666666666, ans=0.125
2024-06-20 08:36:47,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0
2024-06-20 08:36:56,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.26 vs. limit=15.0
2024-06-20 08:36:57,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=169390.83333333334, ans=0.0
2024-06-20 08:36:59,993 INFO [train.py:1028] (0/2) Epoch 10, batch 1350, loss[loss=0.2411, simple_loss=0.2895, pruned_loss=0.09632, over 13175.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.2888, pruned_loss=0.1014, over 2584927.10 frames. ], batch size: 59, lr: 5.99e-03, grad_scale: 64.0
2024-06-20 08:37:02,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=169409.16666666666, ans=0.09899494936611666
2024-06-20 08:37:03,108 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.822e+02 2.004e+02 2.196e+02 3.076e+02, threshold=4.008e+02, percent-clipped=0.0
2024-06-20 08:37:09,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.59 vs. limit=22.5
2024-06-20 08:37:09,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=169427.5, ans=0.0
2024-06-20 08:37:17,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=169445.83333333334, ans=0.125
2024-06-20 08:37:19,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=169445.83333333334, ans=0.125
2024-06-20 08:37:30,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0
2024-06-20 08:37:35,740 INFO [train.py:1028] (0/2) Epoch 10, batch 1400, loss[loss=0.2388, simple_loss=0.2837, pruned_loss=0.09695, over 12538.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.2897, pruned_loss=0.102, over 2585991.28 frames. ], batch size: 25, lr: 5.99e-03, grad_scale: 64.0
2024-06-20 08:37:37,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169500.83333333334, ans=0.1
2024-06-20 08:37:51,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=169537.5, ans=0.125
2024-06-20 08:38:02,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=169555.83333333334, ans=0.2
2024-06-20 08:38:06,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169574.16666666666, ans=0.125
2024-06-20 08:38:06,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=169574.16666666666, ans=0.125
2024-06-20 08:38:11,367 INFO [train.py:1028] (0/2) Epoch 10, batch 1450, loss[loss=0.2357, simple_loss=0.2735, pruned_loss=0.09896, over 13139.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.2897, pruned_loss=0.1023, over 2585814.05 frames. ], batch size: 121, lr: 5.99e-03, grad_scale: 128.0
2024-06-20 08:38:14,518 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.827e+02 1.984e+02 2.162e+02 3.144e+02, threshold=3.967e+02, percent-clipped=0.0
2024-06-20 08:38:20,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=169610.83333333334, ans=0.025
2024-06-20 08:38:23,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=169610.83333333334, ans=0.125
2024-06-20 08:38:24,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=169629.16666666666, ans=0.0
2024-06-20 08:38:34,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=169647.5, ans=0.2
2024-06-20 08:38:35,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=169647.5, ans=0.025
2024-06-20 08:38:39,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=169665.83333333334, ans=0.2
2024-06-20 08:38:40,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=169665.83333333334, ans=0.0
2024-06-20 08:38:40,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.76 vs. limit=15.0
2024-06-20 08:38:42,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. limit=6.0
2024-06-20 08:38:43,925 INFO [train.py:1028] (0/2) Epoch 10, batch 1500, loss[loss=0.2503, simple_loss=0.293, pruned_loss=0.1037, over 13210.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2902, pruned_loss=0.1027, over 2587667.66 frames. ], batch size: 83, lr: 5.99e-03, grad_scale: 128.0
2024-06-20 08:38:45,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=169684.16666666666, ans=0.0
2024-06-20 08:38:48,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=169684.16666666666, ans=0.125
2024-06-20 08:38:53,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=169702.5, ans=0.125
2024-06-20 08:38:55,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=169702.5, ans=0.2
2024-06-20 08:38:55,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.90 vs. limit=15.0
2024-06-20 08:38:56,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=169720.83333333334, ans=0.125
2024-06-20 08:38:59,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=169720.83333333334, ans=0.125
2024-06-20 08:39:01,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=169720.83333333334, ans=0.09899494936611666
2024-06-20 08:39:08,015 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.68 vs. limit=15.0
2024-06-20 08:39:19,232 INFO [train.py:1028] (0/2) Epoch 10, batch 1550, loss[loss=0.2635, simple_loss=0.2999, pruned_loss=0.1135, over 13080.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2906, pruned_loss=0.103, over 2583232.24 frames. ], batch size: 102, lr: 5.99e-03, grad_scale: 128.0
2024-06-20 08:39:20,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=169775.83333333334, ans=0.2
2024-06-20 08:39:22,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.848e+02 1.985e+02 2.158e+02 2.791e+02, threshold=3.970e+02, percent-clipped=0.0
2024-06-20 08:39:26,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.88 vs. limit=10.0
2024-06-20 08:39:27,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=169794.16666666666, ans=0.0
2024-06-20 08:39:42,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=169830.83333333334, ans=0.125
2024-06-20 08:39:48,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.30 vs. limit=12.0
2024-06-20 08:39:50,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=169849.16666666666, ans=0.125
2024-06-20 08:39:51,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.18 vs. limit=22.5
2024-06-20 08:39:53,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=169867.5, ans=0.125
2024-06-20 08:39:54,317 INFO [train.py:1028] (0/2) Epoch 10, batch 1600, loss[loss=0.2469, simple_loss=0.2924, pruned_loss=0.1007, over 13220.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2909, pruned_loss=0.1032, over 2578654.67 frames. ], batch size: 77, lr: 5.99e-03, grad_scale: 128.0
2024-06-20 08:40:02,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=169885.83333333334, ans=0.125
2024-06-20 08:40:03,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=169885.83333333334, ans=0.125
2024-06-20 08:40:17,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.27 vs. limit=15.0
2024-06-20 08:40:26,518 INFO [train.py:1028] (0/2) Epoch 10, batch 1650, loss[loss=0.2989, simple_loss=0.3281, pruned_loss=0.1348, over 13181.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.2913, pruned_loss=0.1038, over 2574693.18 frames. ], batch size: 95, lr: 5.98e-03, grad_scale: 128.0
2024-06-20 08:40:29,822 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.825e+02 1.963e+02 2.252e+02 2.989e+02, threshold=3.926e+02, percent-clipped=0.0
2024-06-20 08:40:35,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=169977.5, ans=0.1
2024-06-20 08:40:36,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=169977.5, ans=0.125
2024-06-20 08:40:40,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169995.83333333334, ans=0.1
2024-06-20 08:40:53,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170032.5, ans=0.125
2024-06-20 08:40:54,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=170032.5, ans=0.125
2024-06-20 08:40:58,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5
2024-06-20 08:40:59,495 INFO [train.py:1028] (0/2) Epoch 10, batch 1700, loss[loss=0.2758, simple_loss=0.318, pruned_loss=0.1169, over 12682.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2911, pruned_loss=0.1034, over 2579431.10 frames. ], batch size: 25, lr: 5.98e-03, grad_scale: 128.0
2024-06-20 08:41:01,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0
2024-06-20 08:41:02,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=170050.83333333334, ans=0.1
2024-06-20 08:41:03,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=170050.83333333334, ans=12.0
2024-06-20 08:41:07,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=170050.83333333334, ans=0.125
2024-06-20 08:41:10,481 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 08:41:11,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=170069.16666666666, ans=0.04949747468305833
2024-06-20 08:41:23,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5
2024-06-20 08:41:34,568 INFO [train.py:1028] (0/2) Epoch 10, batch 1750, loss[loss=0.2404, simple_loss=0.2909, pruned_loss=0.09491, over 12692.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.291, pruned_loss=0.1033, over 2580850.75 frames. ], batch size: 22, lr: 5.98e-03, grad_scale: 128.0
2024-06-20 08:41:35,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170142.5, ans=0.125
2024-06-20 08:41:37,849 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.799e+02 1.879e+02 2.025e+02 2.418e+02, threshold=3.758e+02, percent-clipped=0.0
2024-06-20 08:41:49,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0
2024-06-20 08:41:51,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=170179.16666666666, ans=0.0
2024-06-20 08:41:52,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=14.74 vs. limit=15.0
2024-06-20 08:42:09,643 INFO [train.py:1028] (0/2) Epoch 10, batch 1800, loss[loss=0.2408, simple_loss=0.2924, pruned_loss=0.09461, over 13236.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2913, pruned_loss=0.1033, over 2581540.19 frames. ], batch size: 67, lr: 5.98e-03, grad_scale: 128.0
2024-06-20 08:42:14,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=170234.16666666666, ans=0.0
2024-06-20 08:42:19,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=12.0
2024-06-20 08:42:31,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=170289.16666666666, ans=0.125
2024-06-20 08:42:34,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.50 vs. limit=15.0
2024-06-20 08:42:40,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=170307.5, ans=0.125
2024-06-20 08:42:42,115 INFO [train.py:1028] (0/2) Epoch 10, batch 1850, loss[loss=0.2463, simple_loss=0.2852, pruned_loss=0.1037, over 13217.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2914, pruned_loss=0.1033, over 2583185.93 frames. ], batch size: 83, lr: 5.98e-03, grad_scale: 128.0
2024-06-20 08:42:44,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=170325.83333333334, ans=0.125
2024-06-20 08:42:45,466 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.844e+02 1.955e+02 2.174e+02 3.212e+02, threshold=3.911e+02, percent-clipped=0.0
2024-06-20 08:42:48,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170344.16666666666, ans=0.1
2024-06-20 08:42:49,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=170344.16666666666, ans=0.0
2024-06-20 08:43:02,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=170380.83333333334, ans=0.025
2024-06-20 08:43:03,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=170380.83333333334, ans=0.125
2024-06-20 08:43:10,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=11.77 vs. limit=12.0
2024-06-20 08:43:11,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=170399.16666666666, ans=0.125
2024-06-20 08:43:16,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=170417.5, ans=0.125
2024-06-20 08:43:16,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=170417.5, ans=0.125
2024-06-20 08:43:16,937 INFO [train.py:1028] (0/2) Epoch 10, batch 1900, loss[loss=0.2595, simple_loss=0.2979, pruned_loss=0.1106, over 13141.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2912, pruned_loss=0.1033, over 2586121.51 frames. ], batch size: 95, lr: 5.98e-03, grad_scale: 128.0
2024-06-20 08:43:25,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.90 vs. limit=10.0
2024-06-20 08:43:29,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170454.16666666666, ans=0.1
2024-06-20 08:43:39,696 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 08:43:42,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.28 vs. limit=22.5
2024-06-20 08:43:46,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=170490.83333333334, ans=0.0
2024-06-20 08:43:51,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=170490.83333333334, ans=0.0
2024-06-20 08:43:52,458 INFO [train.py:1028] (0/2) Epoch 10, batch 1950, loss[loss=0.2261, simple_loss=0.2754, pruned_loss=0.08843, over 13258.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2898, pruned_loss=0.1028, over 2592199.73 frames. ], batch size: 52, lr: 5.97e-03, grad_scale: 128.0
2024-06-20 08:43:54,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.53 vs. limit=15.0
2024-06-20 08:43:55,777 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.784e+02 1.907e+02 2.063e+02 2.719e+02, threshold=3.814e+02, percent-clipped=0.0
2024-06-20 08:43:56,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=170509.16666666666, ans=0.2
2024-06-20 08:44:02,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=170527.5, ans=0.125
2024-06-20 08:44:16,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170564.16666666666, ans=0.125
2024-06-20 08:44:19,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=170582.5, ans=0.0
2024-06-20 08:44:25,808 INFO [train.py:1028] (0/2) Epoch 10, batch 2000, loss[loss=0.2604, simple_loss=0.3014, pruned_loss=0.1097, over 12540.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2896, pruned_loss=0.1026, over 2588086.02 frames. ], batch size: 22, lr: 5.97e-03, grad_scale: 128.0
2024-06-20 08:44:28,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=170600.83333333334, ans=15.0
2024-06-20 08:44:30,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=15.0
2024-06-20 08:44:44,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0
2024-06-20 08:44:50,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=170655.83333333334, ans=0.2
2024-06-20 08:44:52,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=170674.16666666666, ans=0.035
2024-06-20 08:44:52,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170674.16666666666, ans=0.1
2024-06-20 08:44:57,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=170674.16666666666, ans=0.125
2024-06-20 08:44:58,867 INFO [train.py:1028] (0/2) Epoch 10, batch 2050, loss[loss=0.2201, simple_loss=0.267, pruned_loss=0.08662, over 12599.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.29, pruned_loss=0.1032, over 2584061.83 frames. ], batch size: 29, lr: 5.97e-03, grad_scale: 128.0
2024-06-20 08:45:01,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=170692.5, ans=0.125
2024-06-20 08:45:02,019 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.832e+02 1.978e+02 2.110e+02 3.070e+02, threshold=3.956e+02, percent-clipped=0.0
2024-06-20 08:45:22,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=170747.5, ans=0.0
2024-06-20 08:45:34,142 INFO [train.py:1028] (0/2) Epoch 10, batch 2100, loss[loss=0.2297, simple_loss=0.2786, pruned_loss=0.09039, over 13175.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2901, pruned_loss=0.1028, over 2586239.84 frames. ], batch size: 59, lr: 5.97e-03, grad_scale: 128.0
2024-06-20 08:45:35,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=170784.16666666666, ans=0.125
2024-06-20 08:45:46,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=170802.5, ans=22.5
2024-06-20 08:45:49,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=170820.83333333334, ans=0.0
2024-06-20 08:45:50,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=170820.83333333334, ans=0.125
2024-06-20 08:45:57,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=170839.16666666666, ans=0.0
2024-06-20 08:45:59,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=170839.16666666666, ans=0.125
2024-06-20 08:46:07,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=170857.5, ans=0.125
2024-06-20 08:46:09,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=170875.83333333334, ans=0.2
2024-06-20 08:46:09,772 INFO [train.py:1028] (0/2) Epoch 10, batch 2150, loss[loss=0.2203, simple_loss=0.2735, pruned_loss=0.08351, over 13205.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2899, pruned_loss=0.1021, over 2588487.23 frames. ], batch size: 52, lr: 5.97e-03, grad_scale: 128.0
2024-06-20 08:46:13,018 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.937e+02 2.163e+02 2.379e+02 3.253e+02, threshold=4.326e+02, percent-clipped=0.0
2024-06-20 08:46:20,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170894.16666666666, ans=0.125
2024-06-20 08:46:26,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=170912.5, ans=0.0
2024-06-20 08:46:27,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=170912.5, ans=0.125
2024-06-20 08:46:34,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=170930.83333333334, ans=0.125
2024-06-20 08:46:36,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.91 vs. limit=10.0
2024-06-20 08:46:39,486 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=12.0
2024-06-20 08:46:40,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=170949.16666666666, ans=0.125
2024-06-20 08:46:42,380 INFO [train.py:1028] (0/2) Epoch 10, batch 2200, loss[loss=0.2453, simple_loss=0.2803, pruned_loss=0.1052, over 13180.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2899, pruned_loss=0.102, over 2589748.65 frames. ], batch size: 83, lr: 5.97e-03, grad_scale: 128.0
2024-06-20 08:46:47,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=170967.5, ans=0.0
2024-06-20 08:46:51,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.41 vs. limit=10.0
2024-06-20 08:46:54,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=15.0
2024-06-20 08:46:54,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=171004.16666666666, ans=0.1
2024-06-20 08:46:54,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171004.16666666666, ans=0.125
2024-06-20 08:46:57,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=171004.16666666666, ans=0.125
2024-06-20 08:47:00,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=171004.16666666666, ans=0.125
2024-06-20 08:47:01,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=171022.5, ans=0.0
2024-06-20 08:47:01,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171022.5, ans=0.1
2024-06-20 08:47:02,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=171022.5, ans=0.125
2024-06-20 08:47:09,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171040.83333333334, ans=0.125
2024-06-20 08:47:17,095 INFO [train.py:1028] (0/2) Epoch 10, batch 2250, loss[loss=0.253, simple_loss=0.2979, pruned_loss=0.1041, over 13252.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.2902, pruned_loss=0.1024, over 2587940.73 frames. ], batch size: 63, lr: 5.96e-03, grad_scale: 128.0
2024-06-20 08:47:20,208 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.790e+02 1.929e+02 2.146e+02 3.033e+02, threshold=3.859e+02, percent-clipped=0.0
2024-06-20 08:47:30,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=171095.83333333334, ans=0.125
2024-06-20 08:47:33,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=171095.83333333334, ans=0.0
2024-06-20 08:47:35,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=171095.83333333334, ans=0.125
2024-06-20 08:47:37,369 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 08:47:39,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171114.16666666666, ans=0.1
2024-06-20 08:47:47,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=171132.5, ans=0.125
2024-06-20 08:47:52,143 INFO [train.py:1028] (0/2) Epoch 10, batch 2300, loss[loss=0.254, simple_loss=0.298, pruned_loss=0.105, over 12801.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.2905, pruned_loss=0.1025, over 2582379.84 frames. ], batch size: 33, lr: 5.96e-03, grad_scale: 64.0
2024-06-20 08:47:55,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.26 vs. limit=15.0
2024-06-20 08:47:58,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=171169.16666666666, ans=0.125
2024-06-20 08:48:01,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=171169.16666666666, ans=0.04949747468305833
2024-06-20 08:48:01,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=171169.16666666666, ans=0.0
2024-06-20 08:48:05,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171187.5, ans=0.1
2024-06-20 08:48:11,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0
2024-06-20 08:48:14,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=171205.83333333334, ans=0.2
2024-06-20 08:48:15,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=171205.83333333334, ans=0.125
2024-06-20 08:48:17,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=171205.83333333334, ans=0.07
2024-06-20 08:48:25,352 INFO [train.py:1028] (0/2) Epoch 10, batch 2350, loss[loss=0.239, simple_loss=0.2809, pruned_loss=0.09853, over 13230.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.2905, pruned_loss=0.1026, over 2585424.43 frames. ], batch size: 67, lr: 5.96e-03, grad_scale: 64.0
2024-06-20 08:48:26,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=171242.5, ans=0.125
2024-06-20 08:48:29,484 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.886e+02 2.084e+02 2.413e+02 3.562e+02, threshold=4.169e+02, percent-clipped=0.0
2024-06-20 08:48:34,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=171260.83333333334, ans=0.0
2024-06-20 08:48:40,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=171279.16666666666, ans=0.0
2024-06-20 08:48:46,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=171297.5, ans=0.035
2024-06-20 08:48:54,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=171315.83333333334, ans=0.0
2024-06-20 08:48:58,565 INFO [train.py:1028] (0/2) Epoch 10, batch 2400, loss[loss=0.2722, simple_loss=0.3155, pruned_loss=0.1144, over 13345.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2893, pruned_loss=0.1022, over 2587697.03 frames. ], batch size: 46, lr: 5.96e-03, grad_scale: 64.0
2024-06-20 08:49:12,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171352.5, ans=0.125
2024-06-20 08:49:12,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=171352.5, ans=0.0
2024-06-20 08:49:13,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=171370.83333333334, ans=0.125
2024-06-20 08:49:16,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0
2024-06-20 08:49:25,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=171389.16666666666, ans=0.125
2024-06-20 08:49:29,540 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.43 vs. limit=10.0
2024-06-20 08:49:35,904 INFO [train.py:1028] (0/2) Epoch 10, batch 2450, loss[loss=0.2558, simple_loss=0.2941, pruned_loss=0.1088, over 13287.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.2887, pruned_loss=0.1023, over 2584231.87 frames. ], batch size: 63, lr: 5.96e-03, grad_scale: 64.0
2024-06-20 08:49:39,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.830e+02 2.043e+02 2.240e+02 3.065e+02, threshold=4.086e+02, percent-clipped=0.0
2024-06-20 08:49:56,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=171480.83333333334, ans=0.125
2024-06-20 08:50:02,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0
2024-06-20 08:50:02,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=171499.16666666666, ans=0.125
2024-06-20 08:50:08,508 INFO [train.py:1028] (0/2) Epoch 10, batch 2500, loss[loss=0.2291, simple_loss=0.2742, pruned_loss=0.092, over 13189.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.2876, pruned_loss=0.1016, over 2587614.36 frames. ], batch size: 83, lr: 5.96e-03, grad_scale: 64.0
2024-06-20 08:50:15,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.32 vs. limit=15.0
2024-06-20 08:50:34,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=171590.83333333334, ans=0.0
2024-06-20 08:50:39,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=171590.83333333334, ans=0.0
2024-06-20 08:50:40,581 INFO [train.py:1028] (0/2) Epoch 10, batch 2550, loss[loss=0.2598, simple_loss=0.3091, pruned_loss=0.1052, over 12785.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.2864, pruned_loss=0.1011, over 2586266.86 frames. ], batch size: 22, lr: 5.95e-03, grad_scale: 64.0
2024-06-20 08:50:42,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171609.16666666666, ans=0.1
2024-06-20 08:50:44,495 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.810e+02 1.946e+02 2.108e+02 2.911e+02, threshold=3.891e+02, percent-clipped=0.0
2024-06-20 08:50:50,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=171627.5, ans=0.125
2024-06-20 08:51:06,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=171664.16666666666, ans=0.0
2024-06-20 08:51:06,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.04 vs. limit=15.0
2024-06-20 08:51:17,481 INFO [train.py:1028] (0/2) Epoch 10, batch 2600, loss[loss=0.2579, simple_loss=0.2982, pruned_loss=0.1088, over 13243.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.2848, pruned_loss=0.1005, over 2585609.99 frames. ], batch size: 52, lr: 5.95e-03, grad_scale: 64.0
2024-06-20 08:51:26,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.90 vs. limit=22.5
2024-06-20 08:51:29,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=171719.16666666666, ans=0.2
2024-06-20 08:51:30,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=171719.16666666666, ans=0.125
2024-06-20 08:51:31,223 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.15 vs. limit=15.0
2024-06-20 08:51:31,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0
2024-06-20 08:51:31,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.27 vs. limit=15.0
2024-06-20 08:51:41,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=171755.83333333334, ans=0.0
2024-06-20 08:51:51,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=171774.16666666666, ans=0.125
2024-06-20 08:51:52,765 INFO [train.py:1028] (0/2) Epoch 10, batch 2650, loss[loss=0.2154, simple_loss=0.2526, pruned_loss=0.08914, over 13012.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.283, pruned_loss=0.09975, over 2587109.98 frames. ], batch size: 144, lr: 5.95e-03, grad_scale: 64.0
2024-06-20 08:51:54,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=171792.5, ans=0.0
2024-06-20 08:51:56,703 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.788e+02 1.930e+02 2.141e+02 3.105e+02, threshold=3.859e+02, percent-clipped=0.0
2024-06-20 08:51:58,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=171810.83333333334, ans=0.2
2024-06-20 08:52:13,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=171847.5, ans=15.0
2024-06-20 08:52:18,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171865.83333333334, ans=0.1
2024-06-20 08:52:22,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.49 vs. limit=15.0
2024-06-20 08:52:25,071 INFO [train.py:1028] (0/2) Epoch 10, batch 2700, loss[loss=0.2331, simple_loss=0.2736, pruned_loss=0.09631, over 13259.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2816, pruned_loss=0.0996, over 2583677.63 frames. ], batch size: 89, lr: 5.95e-03, grad_scale: 64.0
2024-06-20 08:52:34,616 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 08:52:36,042 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 08:52:41,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=171920.83333333334, ans=0.2
2024-06-20 08:52:43,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=171920.83333333334, ans=0.1
2024-06-20 08:52:47,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0
2024-06-20 08:52:57,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0
2024-06-20 08:53:02,292 INFO [train.py:1028] (0/2) Epoch 10, batch 2750, loss[loss=0.2164, simple_loss=0.2592, pruned_loss=0.08679, over 13267.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2812, pruned_loss=0.09928, over 2582149.98 frames. ], batch size: 43, lr: 5.95e-03, grad_scale: 64.0
2024-06-20 08:53:03,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=171975.83333333334, ans=0.0
2024-06-20 08:53:06,272 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.753e+02 1.889e+02 2.145e+02 2.804e+02, threshold=3.779e+02, percent-clipped=0.0
2024-06-20 08:53:18,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=172012.5, ans=0.0
2024-06-20 08:53:28,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172030.83333333334, ans=0.0
2024-06-20 08:53:29,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172030.83333333334, ans=0.125
2024-06-20 08:53:32,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172049.16666666666, ans=0.0
2024-06-20 08:53:32,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=172049.16666666666, ans=0.125
2024-06-20 08:53:38,553 INFO [train.py:1028] (0/2) Epoch 10, batch 2800, loss[loss=0.243, simple_loss=0.2738, pruned_loss=0.106, over 10667.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2808, pruned_loss=0.09939, over 2579512.89 frames. ], batch size: 303, lr: 5.95e-03, grad_scale: 64.0
2024-06-20 08:53:40,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.64 vs. limit=22.5
2024-06-20 08:53:44,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.07 vs. limit=10.0
2024-06-20 08:53:45,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.54 vs. limit=15.0
2024-06-20 08:53:50,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172085.83333333334, ans=0.1
2024-06-20 08:53:51,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=172104.16666666666, ans=0.0
2024-06-20 08:54:00,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=172122.5, ans=0.125
2024-06-20 08:54:02,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0
2024-06-20 08:54:03,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=172122.5, ans=0.125
2024-06-20 08:54:04,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=172140.83333333334, ans=15.0
2024-06-20 08:54:11,115 INFO [train.py:1028] (0/2) Epoch 10, batch 2850, loss[loss=0.2155, simple_loss=0.2615, pruned_loss=0.0848, over 13346.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2799, pruned_loss=0.09923, over 2577579.99 frames. ], batch size: 49, lr: 5.95e-03, grad_scale: 64.0
], batch size: 49, lr: 5.95e-03, grad_scale: 64.0 2024-06-20 08:54:14,972 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.821e+02 1.960e+02 2.143e+02 3.198e+02, threshold=3.920e+02, percent-clipped=0.0 2024-06-20 08:54:16,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=172177.5, ans=0.125 2024-06-20 08:54:19,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=172177.5, ans=0.125 2024-06-20 08:54:37,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=172232.5, ans=0.125 2024-06-20 08:54:45,552 INFO [train.py:1028] (0/2) Epoch 10, batch 2900, loss[loss=0.2259, simple_loss=0.2723, pruned_loss=0.08975, over 13107.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.2779, pruned_loss=0.09813, over 2585731.96 frames. ], batch size: 55, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:54:52,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=172269.16666666666, ans=0.09899494936611666 2024-06-20 08:55:05,420 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=15.0 2024-06-20 08:55:16,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=172324.16666666666, ans=0.0 2024-06-20 08:55:18,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=172324.16666666666, ans=0.0 2024-06-20 08:55:21,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=172324.16666666666, ans=15.0 2024-06-20 08:55:21,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172342.5, ans=0.0 2024-06-20 08:55:22,482 INFO [train.py:1028] (0/2) Epoch 10, batch 2950, loss[loss=0.2325, simple_loss=0.2778, pruned_loss=0.09354, over 13209.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.2778, pruned_loss=0.09819, over 2579714.45 frames. ], batch size: 43, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:55:26,592 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.735e+02 1.860e+02 2.020e+02 2.681e+02, threshold=3.720e+02, percent-clipped=0.0 2024-06-20 08:55:36,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=172379.16666666666, ans=0.0 2024-06-20 08:55:37,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=15.0 2024-06-20 08:55:48,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=172397.5, ans=0.05 2024-06-20 08:55:50,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=172415.83333333334, ans=0.125 2024-06-20 08:55:56,318 INFO [train.py:1028] (0/2) Epoch 10, batch 3000, loss[loss=0.2296, simple_loss=0.2742, pruned_loss=0.09251, over 13228.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2769, pruned_loss=0.09771, over 2579975.78 frames. 
], batch size: 59, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:55:56,319 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 08:56:03,920 INFO [train.py:1060] (0/2) Epoch 10, validation: loss=0.1983, simple_loss=0.2621, pruned_loss=0.06725, over 351949.00 frames. 2024-06-20 08:56:03,921 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 08:56:12,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172452.5, ans=0.1 2024-06-20 08:56:23,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172489.16666666666, ans=0.1 2024-06-20 08:56:25,236 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.57 vs. limit=22.5 2024-06-20 08:56:29,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=172489.16666666666, ans=0.125 2024-06-20 08:56:37,234 INFO [train.py:1028] (0/2) Epoch 10, batch 3050, loss[loss=0.2316, simple_loss=0.2738, pruned_loss=0.0947, over 13284.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.2759, pruned_loss=0.09768, over 2579269.61 frames. ], batch size: 46, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:56:40,990 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.763e+02 1.874e+02 2.072e+02 2.669e+02, threshold=3.747e+02, percent-clipped=0.0 2024-06-20 08:56:53,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=172562.5, ans=0.125 2024-06-20 08:57:14,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=172599.16666666666, ans=0.5 2024-06-20 08:57:15,183 INFO [train.py:1028] (0/2) Epoch 10, batch 3100, loss[loss=0.2323, simple_loss=0.2698, pruned_loss=0.09738, over 13046.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.275, pruned_loss=0.0971, over 2580199.36 frames. ], batch size: 144, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:57:15,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=172617.5, ans=0.125 2024-06-20 08:57:18,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=172617.5, ans=0.125 2024-06-20 08:57:23,481 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 08:57:40,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172672.5, ans=0.125 2024-06-20 08:57:44,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=172690.83333333334, ans=0.05 2024-06-20 08:57:48,291 INFO [train.py:1028] (0/2) Epoch 10, batch 3150, loss[loss=0.2312, simple_loss=0.2682, pruned_loss=0.09707, over 12912.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2735, pruned_loss=0.09628, over 2581071.47 frames. 
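At fixed intervals (batch 3000 of the epoch here) training pauses for a full pass over the dev set; the frame total (351949.00) is identical on every validation line because the dev set is fixed, which makes validation losses directly comparable across the run. A minimal sketch of such a pass; compute_loss is an assumed helper returning a summed loss and a frame count, not icefall's exact signature:

    import torch

    def validate(model, dev_loader, compute_loss):
        """Frame-weighted average loss over the fixed dev set (sketch)."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss_sum, num_frames = compute_loss(model, batch)
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        model.train()
        return tot_loss / tot_frames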
], batch size: 158, lr: 5.94e-03, grad_scale: 64.0 2024-06-20 08:57:48,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=172709.16666666666, ans=0.125 2024-06-20 08:57:48,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=172709.16666666666, ans=0.125 2024-06-20 08:57:52,245 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.754e+02 1.887e+02 2.149e+02 3.244e+02, threshold=3.775e+02, percent-clipped=0.0 2024-06-20 08:58:20,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=172800.83333333334, ans=0.125 2024-06-20 08:58:20,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.46 vs. limit=15.0 2024-06-20 08:58:20,499 INFO [train.py:1028] (0/2) Epoch 10, batch 3200, loss[loss=0.2172, simple_loss=0.2672, pruned_loss=0.08358, over 13170.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2731, pruned_loss=0.09607, over 2580687.08 frames. ], batch size: 55, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 08:58:38,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=172855.83333333334, ans=0.025 2024-06-20 08:58:46,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=172855.83333333334, ans=10.0 2024-06-20 08:58:48,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172874.16666666666, ans=0.0 2024-06-20 08:58:50,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.88 vs. limit=22.5 2024-06-20 08:58:54,774 INFO [train.py:1028] (0/2) Epoch 10, batch 3250, loss[loss=0.2237, simple_loss=0.27, pruned_loss=0.08868, over 13077.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2728, pruned_loss=0.09611, over 2585060.35 frames. ], batch size: 71, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 08:58:58,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.724e+02 1.883e+02 2.092e+02 2.896e+02, threshold=3.765e+02, percent-clipped=0.0 2024-06-20 08:59:00,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.38 vs. limit=15.0 2024-06-20 08:59:14,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=172929.16666666666, ans=0.025 2024-06-20 08:59:16,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=172929.16666666666, ans=0.125 2024-06-20 08:59:20,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=12.0 2024-06-20 08:59:23,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.22 vs. 
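The Whitening lines fire only when a module's output covariance looks too anisotropic: scaling.py compares a whitening metric against a per-module limit and logs the pair when the metric exceeds it, which is why every entry reads "metric=... vs. limit=...". The exact statistic is defined in scaling.py; the toy metric below, an eigenvalue-spread ratio of the feature covariance, is an assumption used only to illustrate the shape of the check:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """Toy anisotropy measure (an assumption, not scaling.py's exact
        formula): mean squared eigenvalue of the feature covariance over
        the squared mean eigenvalue; 1.0 means perfectly white features."""
        x = x.reshape(-1, x.shape[-1])              # (frames, channels)
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return float((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))

    # A gradient penalty would be applied only while the metric stays
    # above the module's whitening limit (7.5 to 22.5 in this log).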
limit=10.0 2024-06-20 08:59:24,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=172965.83333333334, ans=0.125 2024-06-20 08:59:31,278 INFO [train.py:1028] (0/2) Epoch 10, batch 3300, loss[loss=0.2365, simple_loss=0.2789, pruned_loss=0.09708, over 12814.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2723, pruned_loss=0.09575, over 2581315.69 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 08:59:32,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172984.16666666666, ans=0.1 2024-06-20 08:59:34,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=172984.16666666666, ans=0.125 2024-06-20 08:59:43,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=173020.83333333334, ans=0.125 2024-06-20 08:59:44,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=173020.83333333334, ans=0.2 2024-06-20 08:59:46,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=173020.83333333334, ans=0.125 2024-06-20 08:59:49,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173020.83333333334, ans=0.1 2024-06-20 08:59:51,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2024-06-20 08:59:54,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=15.0 2024-06-20 08:59:57,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=173057.5, ans=0.5 2024-06-20 09:00:03,539 INFO [train.py:1028] (0/2) Epoch 10, batch 3350, loss[loss=0.2369, simple_loss=0.2734, pruned_loss=0.1001, over 12927.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2718, pruned_loss=0.09576, over 2576564.96 frames. 
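The lr column decays very slowly (5.95e-03 down to 5.87e-03 across this section) because by epoch 10 the scheduler is deep into its tail. icefall's Eden scheduler documents a rule that decays in both the batch and the epoch counter; the sketch below follows that documented formula with illustrative default constants, and newer Eden variants add a further reference-duration factor, so treat this as an approximation rather than a reproduction of the exact values above:

    def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=3.5):
        """Eden-style learning rate: smooth power-law decay in both the
        global batch count and the (possibly fractional) epoch count."""
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor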
], batch size: 158, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:00:07,459 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.830e+02 2.006e+02 2.365e+02 3.421e+02, threshold=4.013e+02, percent-clipped=0.0 2024-06-20 09:00:13,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=173094.16666666666, ans=0.04949747468305833 2024-06-20 09:00:13,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=173094.16666666666, ans=0.015 2024-06-20 09:00:21,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=173112.5, ans=0.125 2024-06-20 09:00:35,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=173149.16666666666, ans=0.125 2024-06-20 09:00:37,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=173149.16666666666, ans=0.2 2024-06-20 09:00:40,212 INFO [train.py:1028] (0/2) Epoch 10, batch 3400, loss[loss=0.2366, simple_loss=0.2794, pruned_loss=0.09686, over 12536.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2713, pruned_loss=0.09584, over 2574977.08 frames. ], batch size: 22, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:00:41,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=173167.5, ans=0.125 2024-06-20 09:00:41,734 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:00:44,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=173167.5, ans=0.0 2024-06-20 09:00:44,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=173167.5, ans=0.0 2024-06-20 09:01:14,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=173240.83333333334, ans=0.125 2024-06-20 09:01:16,824 INFO [train.py:1028] (0/2) Epoch 10, batch 3450, loss[loss=0.2593, simple_loss=0.2924, pruned_loss=0.1131, over 12793.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2706, pruned_loss=0.09541, over 2575727.66 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:01:20,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.68 vs. limit=15.0 2024-06-20 09:01:20,747 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.755e+02 1.900e+02 2.101e+02 2.856e+02, threshold=3.800e+02, percent-clipped=0.0 2024-06-20 09:01:28,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.66 vs. 
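The grad_scale column is the dynamic loss-scaling factor that keeps fp16 training stable: the loss is multiplied by the scale before backward so small gradients survive the fp16 range, and gradients are unscaled again before the optimizer step. Whether the recipe keeps its own bookkeeping or uses torch's helper, the mechanism is the one PyTorch's GradScaler implements; a minimal sketch with an assumed compute_loss helper:

    import torch

    scaler = torch.cuda.amp.GradScaler()

    def train_step(model, batch, optimizer, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():       # fp16 forward pass
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()         # backward on the scaled loss
        scaler.step(optimizer)                # unscales; skips step on inf/nan
        scaler.update()                       # grow or back off the scale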
limit=15.0 2024-06-20 09:01:32,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=173295.83333333334, ans=0.125 2024-06-20 09:01:37,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=173314.16666666666, ans=15.0 2024-06-20 09:01:45,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=173332.5, ans=0.125 2024-06-20 09:01:46,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=173332.5, ans=0.125 2024-06-20 09:01:47,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=10.13 vs. limit=12.0 2024-06-20 09:01:47,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.96 vs. limit=10.0 2024-06-20 09:01:50,093 INFO [train.py:1028] (0/2) Epoch 10, batch 3500, loss[loss=0.2407, simple_loss=0.2818, pruned_loss=0.09983, over 12820.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2704, pruned_loss=0.09512, over 2574904.32 frames. ], batch size: 33, lr: 5.93e-03, grad_scale: 64.0 2024-06-20 09:01:51,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=173350.83333333334, ans=0.1 2024-06-20 09:02:00,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=173369.16666666666, ans=0.02 2024-06-20 09:02:19,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=173424.16666666666, ans=0.125 2024-06-20 09:02:20,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.97 vs. limit=22.5 2024-06-20 09:02:22,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=173424.16666666666, ans=0.0 2024-06-20 09:02:24,174 INFO [train.py:1028] (0/2) Epoch 10, batch 3550, loss[loss=0.2104, simple_loss=0.2507, pruned_loss=0.085, over 13097.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2693, pruned_loss=0.09436, over 2576746.23 frames. ], batch size: 95, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:02:26,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.90 vs. limit=15.0 2024-06-20 09:02:28,085 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 1.734e+02 1.852e+02 2.000e+02 2.779e+02, threshold=3.705e+02, percent-clipped=0.0 2024-06-20 09:02:29,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.04 vs. 
limit=15.0 2024-06-20 09:02:32,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=173460.83333333334, ans=0.0 2024-06-20 09:02:32,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=173460.83333333334, ans=0.09899494936611666 2024-06-20 09:02:37,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=173460.83333333334, ans=0.5 2024-06-20 09:02:42,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=173479.16666666666, ans=0.07 2024-06-20 09:02:43,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.30 vs. limit=15.0 2024-06-20 09:02:49,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=173497.5, ans=0.0 2024-06-20 09:02:49,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=173497.5, ans=0.0 2024-06-20 09:02:49,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=173497.5, ans=0.2 2024-06-20 09:03:00,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173515.83333333334, ans=0.1 2024-06-20 09:03:02,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=173515.83333333334, ans=0.125 2024-06-20 09:03:03,166 INFO [train.py:1028] (0/2) Epoch 10, batch 3600, loss[loss=0.208, simple_loss=0.2543, pruned_loss=0.08082, over 13295.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2685, pruned_loss=0.09415, over 2580504.90 frames. ], batch size: 49, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:03:04,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=173534.16666666666, ans=0.125 2024-06-20 09:03:05,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=173534.16666666666, ans=0.95 2024-06-20 09:03:15,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=173552.5, ans=0.125 2024-06-20 09:03:19,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=173570.83333333334, ans=0.2 2024-06-20 09:03:20,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.93 vs. limit=15.0 2024-06-20 09:03:33,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=173607.5, ans=0.2 2024-06-20 09:03:34,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.88 vs. limit=22.5 2024-06-20 09:03:36,140 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.84 vs. 
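Names like attention_skip_rate, conv_skip_rate, ff3_skip_rate and bypass.skip_rate describe stochastic-depth-style regularization: with some scheduled (eventually zero) probability, a sublayer is bypassed entirely for a training step. A toy version of the mechanism; the residual form is the usual conformer-style block, and zipformer's learned bypass scales are not reproduced here:

    import torch

    def maybe_skip(sublayer, x, skip_rate: float, training: bool):
        """With probability skip_rate during training, bypass the
        sublayer; by this point in the run most skip rates above have
        been annealed to 0.0, so the full network runs every step."""
        if training and torch.rand(()) < skip_rate:
            return x
        return x + sublayer(x)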
limit=15.0 2024-06-20 09:03:36,372 INFO [train.py:1028] (0/2) Epoch 10, batch 3650, loss[loss=0.241, simple_loss=0.2767, pruned_loss=0.1027, over 13071.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2687, pruned_loss=0.09419, over 2579089.05 frames. ], batch size: 102, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:03:40,242 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.752e+02 1.874e+02 2.055e+02 2.647e+02, threshold=3.749e+02, percent-clipped=0.0 2024-06-20 09:03:44,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173644.16666666666, ans=0.125 2024-06-20 09:03:46,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=173644.16666666666, ans=0.0 2024-06-20 09:03:47,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=173644.16666666666, ans=0.04949747468305833 2024-06-20 09:03:47,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.15 vs. limit=15.0 2024-06-20 09:03:58,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173680.83333333334, ans=0.1 2024-06-20 09:04:08,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2024-06-20 09:04:09,473 INFO [train.py:1028] (0/2) Epoch 10, batch 3700, loss[loss=0.2058, simple_loss=0.2522, pruned_loss=0.07971, over 13297.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2677, pruned_loss=0.09343, over 2584058.01 frames. ], batch size: 72, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:04:17,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=173735.83333333334, ans=0.0 2024-06-20 09:04:23,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=173754.16666666666, ans=0.025 2024-06-20 09:04:27,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.55 vs. limit=15.0 2024-06-20 09:04:39,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=173790.83333333334, ans=0.0 2024-06-20 09:04:45,231 INFO [train.py:1028] (0/2) Epoch 10, batch 3750, loss[loss=0.2159, simple_loss=0.2657, pruned_loss=0.08302, over 12558.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2668, pruned_loss=0.09291, over 2586198.89 frames. 
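The "batch size" field jumps between roughly 16 and 304 cuts because batches are assembled by total audio duration, not by count: a batch of long utterances holds few cuts, a batch of short ones holds many. That is what lhotse's DynamicBucketingSampler does (additionally bucketing cuts of similar length together to limit padding); a toy duration-constrained batcher to illustrate the idea:

    def duration_batches(cuts, max_duration: float):
        """Group (cut_id, duration_seconds) pairs so that each batch's
        total duration stays under max_duration (illustrative only;
        the real sampler also buckets by similar durations)."""
        batch, total = [], 0.0
        for cut_id, dur in cuts:
            if batch and total + dur > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(cut_id)
            total += dur
        if batch:
            yield batch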
], batch size: 22, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:04:48,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=173809.16666666666, ans=0.125 2024-06-20 09:04:49,207 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.515e+02 1.748e+02 1.869e+02 2.059e+02 2.560e+02, threshold=3.738e+02, percent-clipped=0.0 2024-06-20 09:04:50,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=173809.16666666666, ans=0.125 2024-06-20 09:05:15,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=173882.5, ans=0.125 2024-06-20 09:05:18,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=173882.5, ans=0.95 2024-06-20 09:05:20,875 INFO [train.py:1028] (0/2) Epoch 10, batch 3800, loss[loss=0.2032, simple_loss=0.2417, pruned_loss=0.0823, over 13189.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.267, pruned_loss=0.09307, over 2584174.23 frames. ], batch size: 83, lr: 5.92e-03, grad_scale: 64.0 2024-06-20 09:05:52,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2024-06-20 09:05:53,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=173992.5, ans=0.125 2024-06-20 09:05:53,885 INFO [train.py:1028] (0/2) Epoch 10, batch 3850, loss[loss=0.2133, simple_loss=0.2477, pruned_loss=0.08946, over 13033.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2664, pruned_loss=0.09283, over 2582723.40 frames. ], batch size: 144, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:05:55,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173992.5, ans=0.1 2024-06-20 09:05:55,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=173992.5, ans=0.125 2024-06-20 09:05:57,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 1.701e+02 1.841e+02 2.007e+02 2.704e+02, threshold=3.682e+02, percent-clipped=0.0 2024-06-20 09:06:05,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174010.83333333334, ans=0.1 2024-06-20 09:06:19,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=174065.83333333334, ans=0.0 2024-06-20 09:06:21,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=174065.83333333334, ans=0.125 2024-06-20 09:06:23,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=174065.83333333334, ans=0.125 2024-06-20 09:06:24,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=174065.83333333334, ans=0.125 2024-06-20 09:06:26,259 INFO [train.py:1028] (0/2) Epoch 10, batch 3900, loss[loss=0.2292, simple_loss=0.2707, pruned_loss=0.09385, over 13260.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2666, pruned_loss=0.09299, over 2586173.58 frames. 
], batch size: 83, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:06:30,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.32 vs. limit=15.0 2024-06-20 09:06:54,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.34 vs. limit=15.0 2024-06-20 09:07:00,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=174157.5, ans=0.125 2024-06-20 09:07:02,622 INFO [train.py:1028] (0/2) Epoch 10, batch 3950, loss[loss=0.2234, simple_loss=0.2551, pruned_loss=0.09589, over 13069.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2664, pruned_loss=0.09283, over 2586909.61 frames. ], batch size: 132, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:07:06,566 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.805e+02 1.974e+02 2.277e+02 3.132e+02, threshold=3.949e+02, percent-clipped=0.0 2024-06-20 09:07:16,427 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:07:16,805 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.45 vs. limit=22.5 2024-06-20 09:07:22,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=174212.5, ans=0.125 2024-06-20 09:07:27,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=174230.83333333334, ans=0.0 2024-06-20 09:07:32,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174249.16666666666, ans=0.1 2024-06-20 09:07:39,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.99 vs. limit=10.0 2024-06-20 09:07:39,501 INFO [train.py:1028] (0/2) Epoch 10, batch 4000, loss[loss=0.2225, simple_loss=0.2708, pruned_loss=0.08712, over 12958.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2661, pruned_loss=0.09285, over 2582260.64 frames. ], batch size: 39, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:07:55,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.86 vs. limit=15.0 2024-06-20 09:07:57,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=174304.16666666666, ans=0.0 2024-06-20 09:08:06,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=174340.83333333334, ans=0.125 2024-06-20 09:08:12,814 INFO [train.py:1028] (0/2) Epoch 10, batch 4050, loss[loss=0.2433, simple_loss=0.2671, pruned_loss=0.1098, over 10988.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2653, pruned_loss=0.09266, over 2579521.76 frames. 
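Each loss[...] entry decomposes the pruned-transducer objective: simple_loss comes from a simplified additive joiner over the full alignment lattice, and pruned_loss evaluates the full joiner only inside a narrow band of symbols around the alignment the simple loss found. The triples logged above are consistent with a plain weighted sum with a 0.5 weight on the simple term, e.g. 0.5 * 0.2551 + 0.09589 ~= 0.2234; the exact warmup-dependent weighting lives in the recipe's train.py, so treat this as the steady-state form:

    def combined_transducer_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        """Assumed steady-state combination of the two pruned-transducer
        terms; k2.rnnt_loss_smoothed and k2.rnnt_loss_pruned produce
        simple_loss and pruned_loss respectively in icefall recipes."""
        return simple_loss_scale * simple_loss + pruned_loss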
], batch size: 304, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:08:16,616 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.400e+02 1.748e+02 1.860e+02 2.069e+02 2.653e+02, threshold=3.719e+02, percent-clipped=0.0 2024-06-20 09:08:18,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.93 vs. limit=22.5 2024-06-20 09:08:19,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=12.0 2024-06-20 09:08:20,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=174377.5, ans=0.125 2024-06-20 09:08:20,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=174377.5, ans=0.0 2024-06-20 09:08:20,920 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:08:25,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=174395.83333333334, ans=0.125 2024-06-20 09:08:29,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=174395.83333333334, ans=0.04949747468305833 2024-06-20 09:08:39,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=174432.5, ans=0.07 2024-06-20 09:08:39,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.27 vs. limit=15.0 2024-06-20 09:08:45,143 INFO [train.py:1028] (0/2) Epoch 10, batch 4100, loss[loss=0.2626, simple_loss=0.2912, pruned_loss=0.117, over 13043.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2651, pruned_loss=0.0928, over 2576776.81 frames. ], batch size: 102, lr: 5.91e-03, grad_scale: 64.0 2024-06-20 09:08:45,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=174450.83333333334, ans=0.125 2024-06-20 09:08:54,113 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:08:56,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=174469.16666666666, ans=0.125 2024-06-20 09:09:01,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2024-06-20 09:09:24,796 INFO [train.py:1028] (0/2) Epoch 10, batch 4150, loss[loss=0.214, simple_loss=0.2556, pruned_loss=0.08623, over 13231.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2644, pruned_loss=0.09231, over 2576624.28 frames. 
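The many balancer entries (min_positive, max_abs, prob, ...) come from activation balancers: modules that are the identity in the forward pass but, with some scheduled probability, adjust gradients in the backward pass so that activation statistics stay in a healthy range. The toy autograd function below captures only the flavor of that trick; the constraint set and gradient math in scaling.py's ActivationBalancer are more elaborate:

    import torch

    class ToyBalancer(torch.autograd.Function):
        """Identity forward; backward nudges gradients so the fraction
        of positive activations per channel stays inside
        [min_positive, max_positive] (illustrative only)."""

        @staticmethod
        def forward(ctx, x, min_positive, max_positive, strength):
            ctx.save_for_backward(x)
            ctx.cfg = (min_positive, max_positive, strength)
            return x

        @staticmethod
        def backward(ctx, grad):
            (x,) = ctx.saved_tensors
            min_pos, max_pos, strength = ctx.cfg
            pos_frac = (x > 0).float().mean(dim=0, keepdim=True)
            # +1 where too few positives (push activations up),
            # -1 where too many (push them down), 0 otherwise
            push = (pos_frac < min_pos).float() - (pos_frac > max_pos).float()
            return grad - strength * push * grad.abs().mean(), None, None, None

    # usage: y = ToyBalancer.apply(x, 0.05, 0.95, 1e-4)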
], batch size: 55, lr: 5.90e-03, grad_scale: 64.0 2024-06-20 09:09:28,836 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.772e+02 1.903e+02 2.039e+02 2.663e+02, threshold=3.806e+02, percent-clipped=0.0 2024-06-20 09:09:29,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=174542.5, ans=0.0 2024-06-20 09:09:42,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174579.16666666666, ans=0.125 2024-06-20 09:09:47,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=174597.5, ans=0.0 2024-06-20 09:09:56,943 INFO [train.py:1028] (0/2) Epoch 10, batch 4200, loss[loss=0.2251, simple_loss=0.2556, pruned_loss=0.09724, over 13072.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2637, pruned_loss=0.09213, over 2580056.75 frames. ], batch size: 102, lr: 5.90e-03, grad_scale: 64.0 2024-06-20 09:10:01,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=174634.16666666666, ans=0.2 2024-06-20 09:10:09,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=174670.83333333334, ans=10.0 2024-06-20 09:10:14,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.09 vs. limit=15.0 2024-06-20 09:10:16,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=174689.16666666666, ans=0.125 2024-06-20 09:10:17,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=174689.16666666666, ans=0.0 2024-06-20 09:10:19,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.40 vs. limit=15.0 2024-06-20 09:10:23,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=174707.5, ans=0.125 2024-06-20 09:10:29,712 INFO [train.py:1028] (0/2) Epoch 10, batch 4250, loss[loss=0.211, simple_loss=0.2594, pruned_loss=0.08131, over 13361.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2633, pruned_loss=0.0916, over 2582388.51 frames. 
], batch size: 46, lr: 5.90e-03, grad_scale: 64.0 2024-06-20 09:10:29,844 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:10:31,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=174725.83333333334, ans=0.125 2024-06-20 09:10:33,619 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.735e+02 1.895e+02 2.118e+02 3.777e+02, threshold=3.789e+02, percent-clipped=0.0 2024-06-20 09:10:36,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=174744.16666666666, ans=0.125 2024-06-20 09:10:55,618 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:10:58,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=174780.83333333334, ans=0.0 2024-06-20 09:11:09,984 INFO [train.py:1028] (0/2) Epoch 10, batch 4300, loss[loss=0.2343, simple_loss=0.2696, pruned_loss=0.09951, over 13218.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2633, pruned_loss=0.09158, over 2581516.34 frames. ], batch size: 59, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:11:22,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=174854.16666666666, ans=0.0 2024-06-20 09:11:23,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=174854.16666666666, ans=0.05 2024-06-20 09:11:24,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=174854.16666666666, ans=0.125 2024-06-20 09:11:25,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=174854.16666666666, ans=0.95 2024-06-20 09:11:41,695 INFO [train.py:1028] (0/2) Epoch 10, batch 4350, loss[loss=0.2253, simple_loss=0.2741, pruned_loss=0.08826, over 13167.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2631, pruned_loss=0.09176, over 2585987.07 frames. ], batch size: 59, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:11:45,509 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.733e+02 1.892e+02 2.085e+02 2.858e+02, threshold=3.785e+02, percent-clipped=0.0 2024-06-20 09:11:48,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=174927.5, ans=0.0 2024-06-20 09:11:56,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=174945.83333333334, ans=0.125 2024-06-20 09:11:57,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.59 vs. limit=15.0 2024-06-20 09:12:07,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=174982.5, ans=0.0 2024-06-20 09:12:14,102 INFO [train.py:1028] (0/2) Epoch 10, batch 4400, loss[loss=0.1929, simple_loss=0.239, pruned_loss=0.0734, over 13183.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.263, pruned_loss=0.09184, over 2586808.44 frames. 
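Note that grad_scale doubles from 64.0 to 128.0 at batch 4300, then falls back to 64.0 by batch 4600: dynamic loss scaling grows the scale after a sustained run of overflow-free steps and halves it whenever an inf/nan gradient is detected. In torch's GradScaler the cadence is explicit; the values below are illustrative, chosen to mirror the log:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=64.0,       # the grad_scale seen through batch 4250
        growth_factor=2.0,     # doubling step, e.g. 64.0 -> 128.0
        backoff_factor=0.5,    # halve the scale on inf/nan gradients
        growth_interval=2000,  # overflow-free steps required to grow
    )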
], batch size: 83, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:12:17,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.15 vs. limit=15.0 2024-06-20 09:12:35,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175055.83333333334, ans=0.1 2024-06-20 09:12:37,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2024-06-20 09:12:40,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=175074.16666666666, ans=0.025 2024-06-20 09:12:43,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=175074.16666666666, ans=0.125 2024-06-20 09:12:47,610 INFO [train.py:1028] (0/2) Epoch 10, batch 4450, loss[loss=0.2233, simple_loss=0.2603, pruned_loss=0.09313, over 12858.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2635, pruned_loss=0.09229, over 2581473.30 frames. ], batch size: 33, lr: 5.90e-03, grad_scale: 128.0 2024-06-20 09:12:49,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=175092.5, ans=0.125 2024-06-20 09:12:53,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=175092.5, ans=0.025 2024-06-20 09:12:54,732 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.725e+02 1.854e+02 2.027e+02 2.661e+02, threshold=3.709e+02, percent-clipped=0.0 2024-06-20 09:12:54,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=175092.5, ans=0.0 2024-06-20 09:13:15,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=175147.5, ans=0.125 2024-06-20 09:13:19,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.28 vs. limit=15.0 2024-06-20 09:13:23,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=175165.83333333334, ans=0.125 2024-06-20 09:13:27,140 INFO [train.py:1028] (0/2) Epoch 10, batch 4500, loss[loss=0.2111, simple_loss=0.2481, pruned_loss=0.08712, over 13287.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2624, pruned_loss=0.09178, over 2585719.31 frames. ], batch size: 89, lr: 5.89e-03, grad_scale: 128.0 2024-06-20 09:13:32,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.74 vs. limit=15.0 2024-06-20 09:13:47,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=175239.16666666666, ans=0.0 2024-06-20 09:13:47,368 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=12.0 2024-06-20 09:14:00,517 INFO [train.py:1028] (0/2) Epoch 10, batch 4550, loss[loss=0.1952, simple_loss=0.2382, pruned_loss=0.07611, over 13283.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2626, pruned_loss=0.09165, over 2589318.66 frames. 
], batch size: 52, lr: 5.89e-03, grad_scale: 128.0 2024-06-20 09:14:01,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175275.83333333334, ans=0.1 2024-06-20 09:14:04,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=175275.83333333334, ans=0.07 2024-06-20 09:14:04,496 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.742e+02 1.850e+02 2.081e+02 2.683e+02, threshold=3.699e+02, percent-clipped=0.0 2024-06-20 09:14:10,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=175294.16666666666, ans=0.09899494936611666 2024-06-20 09:14:13,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=175312.5, ans=0.95 2024-06-20 09:14:25,960 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=3.339e+01 2024-06-20 09:14:25,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=175330.83333333334, ans=0.04949747468305833 2024-06-20 09:14:33,565 INFO [train.py:1028] (0/2) Epoch 10, batch 4600, loss[loss=0.2558, simple_loss=0.2842, pruned_loss=0.1137, over 12628.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2631, pruned_loss=0.09177, over 2584313.67 frames. ], batch size: 202, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:14:41,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.63 vs. limit=22.5 2024-06-20 09:14:43,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=175385.83333333334, ans=0.2 2024-06-20 09:14:48,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=175404.16666666666, ans=0.125 2024-06-20 09:14:48,869 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.00 vs. limit=10.0 2024-06-20 09:14:53,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=175422.5, ans=0.025 2024-06-20 09:15:14,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=175459.16666666666, ans=0.025 2024-06-20 09:15:14,544 INFO [train.py:1028] (0/2) Epoch 10, batch 4650, loss[loss=0.2099, simple_loss=0.2485, pruned_loss=0.08565, over 13160.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2624, pruned_loss=0.09149, over 2587982.24 frames. 
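tot_loss is a running, frame-weighted summary rather than a per-batch value, and the fractional frame counts (e.g. 2589318.66 frames) suggest an exponentially decayed accumulator rather than a plain sum. A sketch under that assumption:

    class RunningLoss:
        """Frame-weighted running average with exponential forgetting
        (the decay constant is an assumption; only the frame-weighted
        averaging itself is implied by the log format)."""

        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)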
], batch size: 132, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:15:19,061 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.743e+02 1.905e+02 2.063e+02 3.392e+02, threshold=3.811e+02, percent-clipped=0.0 2024-06-20 09:15:20,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=175477.5, ans=0.125 2024-06-20 09:15:22,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=175477.5, ans=0.2 2024-06-20 09:15:25,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2024-06-20 09:15:26,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2024-06-20 09:15:32,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175495.83333333334, ans=0.1 2024-06-20 09:15:47,796 INFO [train.py:1028] (0/2) Epoch 10, batch 4700, loss[loss=0.2328, simple_loss=0.271, pruned_loss=0.09729, over 12508.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2625, pruned_loss=0.09175, over 2584003.17 frames. ], batch size: 25, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:15:48,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=175550.83333333334, ans=0.0 2024-06-20 09:15:52,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=175550.83333333334, ans=0.2 2024-06-20 09:16:03,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175587.5, ans=0.1 2024-06-20 09:16:14,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.80 vs. limit=22.5 2024-06-20 09:16:18,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=175624.16666666666, ans=0.2 2024-06-20 09:16:18,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.94 vs. limit=12.0 2024-06-20 09:16:20,606 INFO [train.py:1028] (0/2) Epoch 10, batch 4750, loss[loss=0.2601, simple_loss=0.2882, pruned_loss=0.116, over 12536.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.262, pruned_loss=0.09141, over 2581809.91 frames. 
], batch size: 202, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:16:22,840 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:16:25,503 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.770e+02 1.885e+02 2.068e+02 2.454e+02, threshold=3.770e+02, percent-clipped=0.0 2024-06-20 09:16:31,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=175660.83333333334, ans=0.0 2024-06-20 09:16:34,538 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:16:39,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=175679.16666666666, ans=0.0 2024-06-20 09:16:43,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=175697.5, ans=0.0 2024-06-20 09:16:50,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=175715.83333333334, ans=0.125 2024-06-20 09:16:57,793 INFO [train.py:1028] (0/2) Epoch 10, batch 4800, loss[loss=0.2132, simple_loss=0.2618, pruned_loss=0.08235, over 13258.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2617, pruned_loss=0.09114, over 2577783.86 frames. ], batch size: 63, lr: 5.89e-03, grad_scale: 64.0 2024-06-20 09:16:58,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=175734.16666666666, ans=0.125 2024-06-20 09:16:59,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.79 vs. limit=15.0 2024-06-20 09:17:13,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=175752.5, ans=0.0 2024-06-20 09:17:17,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=175770.83333333334, ans=0.0 2024-06-20 09:17:18,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175770.83333333334, ans=0.1 2024-06-20 09:17:22,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=175789.16666666666, ans=0.0 2024-06-20 09:17:28,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=175807.5, ans=0.125 2024-06-20 09:17:32,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=175807.5, ans=0.0 2024-06-20 09:17:34,447 INFO [train.py:1028] (0/2) Epoch 10, batch 4850, loss[loss=0.219, simple_loss=0.2565, pruned_loss=0.09076, over 13214.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2614, pruned_loss=0.0908, over 2575332.97 frames. 
], batch size: 89, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:17:39,382 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.695e+02 1.831e+02 1.973e+02 3.963e+02, threshold=3.661e+02, percent-clipped=1.0 2024-06-20 09:17:40,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=175844.16666666666, ans=0.0 2024-06-20 09:17:56,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=175880.83333333334, ans=0.0 2024-06-20 09:17:58,500 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.126e+01 2024-06-20 09:17:58,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=175880.83333333334, ans=0.125 2024-06-20 09:18:08,517 INFO [train.py:1028] (0/2) Epoch 10, batch 4900, loss[loss=0.2178, simple_loss=0.266, pruned_loss=0.08481, over 13139.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2618, pruned_loss=0.09098, over 2575755.88 frames. ], batch size: 59, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:18:12,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=175917.5, ans=0.025 2024-06-20 09:18:27,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=175954.16666666666, ans=0.04949747468305833 2024-06-20 09:18:28,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=175972.5, ans=0.125 2024-06-20 09:18:32,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2024-06-20 09:18:37,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=175990.83333333334, ans=0.125 2024-06-20 09:18:37,725 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-96000.pt 2024-06-20 09:18:43,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=175990.83333333334, ans=0.125 2024-06-20 09:18:47,000 INFO [train.py:1028] (0/2) Epoch 10, batch 4950, loss[loss=0.2363, simple_loss=0.2721, pruned_loss=0.1003, over 10999.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2627, pruned_loss=0.09176, over 2569921.77 frames. ], batch size: 304, lr: 5.88e-03, grad_scale: 64.0 2024-06-20 09:18:50,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.19 vs. limit=15.0 2024-06-20 09:18:51,513 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.750e+02 1.864e+02 2.118e+02 3.091e+02, threshold=3.728e+02, percent-clipped=0.0 2024-06-20 09:19:12,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=12.0 2024-06-20 09:19:12,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.32 vs. 
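The checkpoint.py line above records a batch-level snapshot (checkpoint-96000.pt) independent of the per-epoch ones, so a crashed run can resume mid-epoch. A minimal sketch of the pattern; icefall's checkpoint.py additionally saves the sampler, scheduler and grad-scaler state, which this toy version omits:

    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx, exp_dir,
                              every_n=4000):
        """Periodic batch-level checkpoint (every_n is illustrative;
        the 96000 in the log is a multiple of it)."""
        if batch_idx % every_n == 0:
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "batch_idx_train": batch_idx,
                },
                f"{exp_dir}/checkpoint-{batch_idx}.pt",
            )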
limit=22.5
2024-06-20 09:19:14,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=176064.16666666666, ans=0.0
2024-06-20 09:19:20,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=176082.5, ans=0.025
2024-06-20 09:19:20,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=176082.5, ans=0.125
2024-06-20 09:19:24,210 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 09:19:25,087 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.69 vs. limit=15.0
2024-06-20 09:19:25,939 INFO [train.py:1028] (0/2) Epoch 10, batch 5000, loss[loss=0.2158, simple_loss=0.257, pruned_loss=0.08731, over 13127.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2621, pruned_loss=0.09138, over 2573072.37 frames. ], batch size: 95, lr: 5.88e-03, grad_scale: 64.0
2024-06-20 09:19:28,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=176100.83333333334, ans=0.125
2024-06-20 09:19:36,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=176119.16666666666, ans=0.125
2024-06-20 09:19:46,440 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0
2024-06-20 09:19:48,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=176155.83333333334, ans=0.125
2024-06-20 09:19:50,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=176155.83333333334, ans=0.125
2024-06-20 09:19:50,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=176155.83333333334, ans=0.07
2024-06-20 09:19:51,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.84 vs. limit=15.0
2024-06-20 09:19:51,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=176155.83333333334, ans=0.2
2024-06-20 09:19:59,268 INFO [train.py:1028] (0/2) Epoch 10, batch 5050, loss[loss=0.2177, simple_loss=0.2641, pruned_loss=0.08566, over 12949.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2621, pruned_loss=0.09132, over 2572597.91 frames. ], batch size: 36, lr: 5.88e-03, grad_scale: 64.0
2024-06-20 09:20:00,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=176192.5, ans=0.025
2024-06-20 09:20:03,893 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.693e+02 1.827e+02 1.983e+02 2.942e+02, threshold=3.653e+02, percent-clipped=0.0
2024-06-20 09:20:04,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=176192.5, ans=0.025
2024-06-20 09:20:11,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176210.83333333334, ans=0.1
2024-06-20 09:20:24,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=176247.5, ans=0.95
2024-06-20 09:20:24,610 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.518e+01
2024-06-20 09:20:32,782 INFO [train.py:1028] (0/2) Epoch 10, batch 5100, loss[loss=0.2358, simple_loss=0.2696, pruned_loss=0.101, over 13014.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2619, pruned_loss=0.09167, over 2569156.22 frames. ], batch size: 39, lr: 5.88e-03, grad_scale: 64.0
2024-06-20 09:20:36,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=176284.16666666666, ans=0.04949747468305833
2024-06-20 09:20:40,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=176302.5, ans=0.125
2024-06-20 09:20:43,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=176302.5, ans=0.125
2024-06-20 09:20:45,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=176302.5, ans=0.125
2024-06-20 09:20:55,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=176320.83333333334, ans=0.04949747468305833
2024-06-20 09:21:04,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176339.16666666666, ans=0.1
2024-06-20 09:21:04,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.56 vs. limit=22.5
2024-06-20 09:21:09,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=176357.5, ans=0.2
2024-06-20 09:21:12,884 INFO [train.py:1028] (0/2) Epoch 10, batch 5150, loss[loss=0.2345, simple_loss=0.2676, pruned_loss=0.1008, over 13060.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.262, pruned_loss=0.09208, over 2571213.94 frames. ], batch size: 132, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:21:17,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.752e+02 1.891e+02 2.163e+02 3.075e+02, threshold=3.783e+02, percent-clipped=0.0
2024-06-20 09:21:23,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=176394.16666666666, ans=0.125
2024-06-20 09:21:25,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=176412.5, ans=0.125
2024-06-20 09:21:31,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176412.5, ans=0.1
2024-06-20 09:21:32,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=176430.83333333334, ans=0.125
2024-06-20 09:21:34,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=176430.83333333334, ans=0.125
2024-06-20 09:21:36,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=176430.83333333334, ans=0.125
2024-06-20 09:21:39,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176449.16666666666, ans=0.1
2024-06-20 09:21:45,473 INFO [train.py:1028] (0/2) Epoch 10, batch 5200, loss[loss=0.2236, simple_loss=0.2644, pruned_loss=0.0914, over 13112.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.262, pruned_loss=0.0919, over 2575184.66 frames. ], batch size: 95, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:21:47,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. limit=15.0
2024-06-20 09:21:48,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=176467.5, ans=0.0
2024-06-20 09:21:49,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=176467.5, ans=0.125
2024-06-20 09:21:52,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=176485.83333333334, ans=0.07
2024-06-20 09:22:00,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=176504.16666666666, ans=0.0
2024-06-20 09:22:04,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=15.0
2024-06-20 09:22:12,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=176540.83333333334, ans=0.125
2024-06-20 09:22:13,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. limit=10.0
2024-06-20 09:22:15,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=176540.83333333334, ans=0.125
2024-06-20 09:22:16,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=176540.83333333334, ans=0.125
2024-06-20 09:22:19,020 INFO [train.py:1028] (0/2) Epoch 10, batch 5250, loss[loss=0.2247, simple_loss=0.2678, pruned_loss=0.09079, over 13224.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2621, pruned_loss=0.09183, over 2570386.79 frames. ], batch size: 52, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:22:21,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=176559.16666666666, ans=0.05
2024-06-20 09:22:23,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.727e+02 1.851e+02 2.005e+02 3.149e+02, threshold=3.701e+02, percent-clipped=0.0
2024-06-20 09:22:24,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=176559.16666666666, ans=0.0
2024-06-20 09:22:30,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=176577.5, ans=0.125
2024-06-20 09:22:36,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=176595.83333333334, ans=0.0
2024-06-20 09:22:39,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.37 vs. limit=6.0
2024-06-20 09:22:41,672 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=7.503e+00
2024-06-20 09:22:43,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=176614.16666666666, ans=0.125
2024-06-20 09:22:55,244 INFO [train.py:1028] (0/2) Epoch 10, batch 5300, loss[loss=0.2027, simple_loss=0.2449, pruned_loss=0.08025, over 13004.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2617, pruned_loss=0.09136, over 2568706.12 frames. ], batch size: 144, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:23:18,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=176705.83333333334, ans=0.025
2024-06-20 09:23:18,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=176705.83333333334, ans=0.125
2024-06-20 09:23:22,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=176705.83333333334, ans=0.2
2024-06-20 09:23:32,656 INFO [train.py:1028] (0/2) Epoch 10, batch 5350, loss[loss=0.1797, simple_loss=0.2328, pruned_loss=0.06335, over 11352.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2608, pruned_loss=0.09111, over 2574628.55 frames. ], batch size: 16, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:23:37,500 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.712e+02 1.829e+02 1.978e+02 3.041e+02, threshold=3.658e+02, percent-clipped=0.0
2024-06-20 09:23:39,329 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.13 vs. limit=15.0
2024-06-20 09:23:42,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=176760.83333333334, ans=0.125
2024-06-20 09:23:47,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=176779.16666666666, ans=0.0
2024-06-20 09:23:49,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=176779.16666666666, ans=0.1
2024-06-20 09:23:51,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=176779.16666666666, ans=0.2
2024-06-20 09:24:05,178 INFO [train.py:1028] (0/2) Epoch 10, batch 5400, loss[loss=0.2416, simple_loss=0.2689, pruned_loss=0.1072, over 12230.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2614, pruned_loss=0.09157, over 2568974.65 frames. ], batch size: 240, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:24:10,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=176834.16666666666, ans=0.0
2024-06-20 09:24:18,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=176870.83333333334, ans=0.0
2024-06-20 09:24:18,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176870.83333333334, ans=0.0
2024-06-20 09:24:23,572 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=12.0
2024-06-20 09:24:27,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176889.16666666666, ans=0.1
2024-06-20 09:24:38,128 INFO [train.py:1028] (0/2) Epoch 10, batch 5450, loss[loss=0.2296, simple_loss=0.2715, pruned_loss=0.09384, over 12895.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2613, pruned_loss=0.09134, over 2572047.74 frames. ], batch size: 26, lr: 5.87e-03, grad_scale: 64.0
2024-06-20 09:24:39,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=176925.83333333334, ans=0.0
2024-06-20 09:24:46,175 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.752e+02 1.878e+02 2.110e+02 3.255e+02, threshold=3.755e+02, percent-clipped=0.0
2024-06-20 09:24:47,092 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 09:24:49,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.86 vs. limit=22.5
2024-06-20 09:24:56,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=176944.16666666666, ans=0.125
2024-06-20 09:25:00,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=176962.5, ans=0.125
2024-06-20 09:25:04,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=176980.83333333334, ans=0.125
2024-06-20 09:25:10,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.50 vs. limit=12.0
2024-06-20 09:25:16,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=176999.16666666666, ans=0.0
2024-06-20 09:25:18,111 INFO [train.py:1028] (0/2) Epoch 10, batch 5500, loss[loss=0.2661, simple_loss=0.2875, pruned_loss=0.1224, over 12204.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2615, pruned_loss=0.09143, over 2565385.02 frames. ], batch size: 240, lr: 5.86e-03, grad_scale: 64.0
2024-06-20 09:25:20,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177017.5, ans=0.125
2024-06-20 09:25:23,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=177017.5, ans=0.125
2024-06-20 09:25:29,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.33 vs. limit=10.0
2024-06-20 09:25:33,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=177054.16666666666, ans=0.125
2024-06-20 09:25:45,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=177090.83333333334, ans=10.0
2024-06-20 09:25:51,589 INFO [train.py:1028] (0/2) Epoch 10, batch 5550, loss[loss=0.2461, simple_loss=0.2824, pruned_loss=0.1049, over 13279.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2611, pruned_loss=0.09126, over 2569474.63 frames. ], batch size: 43, lr: 5.86e-03, grad_scale: 64.0
2024-06-20 09:25:55,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=177109.16666666666, ans=0.125
2024-06-20 09:25:56,205 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.758e+02 1.936e+02 2.203e+02 3.256e+02, threshold=3.872e+02, percent-clipped=0.0
2024-06-20 09:25:56,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=177109.16666666666, ans=0.0
2024-06-20 09:26:02,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.86 vs. limit=12.0
2024-06-20 09:26:03,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.43 vs. limit=22.5
2024-06-20 09:26:06,328 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.75 vs. limit=15.0
2024-06-20 09:26:07,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=177145.83333333334, ans=0.09899494936611666
2024-06-20 09:26:10,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.29 vs. limit=15.0
2024-06-20 09:26:12,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0
2024-06-20 09:26:16,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=177164.16666666666, ans=0.125
2024-06-20 09:26:19,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177182.5, ans=0.1
2024-06-20 09:26:23,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=177182.5, ans=0.125
2024-06-20 09:26:24,306 INFO [train.py:1028] (0/2) Epoch 10, batch 5600, loss[loss=0.2205, simple_loss=0.2582, pruned_loss=0.09137, over 13296.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2604, pruned_loss=0.09099, over 2571700.27 frames. ], batch size: 89, lr: 5.86e-03, grad_scale: 64.0
2024-06-20 09:26:40,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=177237.5, ans=0.2
2024-06-20 09:26:44,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177237.5, ans=0.1
2024-06-20 09:26:59,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=177274.16666666666, ans=0.2
2024-06-20 09:26:59,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=177274.16666666666, ans=0.2
2024-06-20 09:27:02,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=177274.16666666666, ans=0.025
2024-06-20 09:27:03,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=177274.16666666666, ans=0.2
2024-06-20 09:27:05,759 INFO [train.py:1028] (0/2) Epoch 10, batch 5650, loss[loss=0.2414, simple_loss=0.2676, pruned_loss=0.1076, over 12662.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2596, pruned_loss=0.09054, over 2576296.95 frames. ], batch size: 202, lr: 5.86e-03, grad_scale: 64.0
2024-06-20 09:27:09,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=177292.5, ans=0.04949747468305833
2024-06-20 09:27:10,397 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.490e+02 1.728e+02 1.873e+02 2.122e+02 3.234e+02, threshold=3.746e+02, percent-clipped=0.0
2024-06-20 09:27:12,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.55 vs. limit=12.0
2024-06-20 09:27:15,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=177310.83333333334, ans=0.0
2024-06-20 09:27:19,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=177329.16666666666, ans=0.125
2024-06-20 09:27:20,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=177329.16666666666, ans=0.125
2024-06-20 09:27:28,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5
2024-06-20 09:27:32,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.48 vs. limit=15.0
2024-06-20 09:27:35,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=177365.83333333334, ans=0.1
2024-06-20 09:27:35,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177365.83333333334, ans=0.125
2024-06-20 09:27:38,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=177384.16666666666, ans=0.2
2024-06-20 09:27:38,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.36 vs. limit=15.0
2024-06-20 09:27:39,100 INFO [train.py:1028] (0/2) Epoch 10, batch 5700, loss[loss=0.2062, simple_loss=0.2493, pruned_loss=0.08156, over 13221.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2593, pruned_loss=0.09049, over 2579365.97 frames. ], batch size: 63, lr: 5.86e-03, grad_scale: 64.0
2024-06-20 09:27:51,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.74 vs. limit=12.0
2024-06-20 09:27:54,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=177420.83333333334, ans=0.0
2024-06-20 09:28:11,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.78 vs. limit=10.0
2024-06-20 09:28:11,560 INFO [train.py:1028] (0/2) Epoch 10, batch 5750, loss[loss=0.2276, simple_loss=0.2614, pruned_loss=0.09685, over 12745.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2602, pruned_loss=0.09087, over 2579729.14 frames. ], batch size: 176, lr: 5.86e-03, grad_scale: 64.0
2024-06-20 09:28:12,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0
2024-06-20 09:28:16,159 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.742e+02 1.909e+02 2.080e+02 3.192e+02, threshold=3.818e+02, percent-clipped=0.0
2024-06-20 09:28:18,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=177494.16666666666, ans=0.04949747468305833
2024-06-20 09:28:20,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=177494.16666666666, ans=0.125
2024-06-20 09:28:22,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=177494.16666666666, ans=0.2
2024-06-20 09:28:27,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=177512.5, ans=0.2
2024-06-20 09:28:43,530 INFO [train.py:1028] (0/2) Epoch 10, batch 5800, loss[loss=0.24, simple_loss=0.2724, pruned_loss=0.1038, over 12736.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2614, pruned_loss=0.09162, over 2578611.25 frames. ], batch size: 176, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:28:43,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=177567.5, ans=0.0
2024-06-20 09:28:57,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=177585.83333333334, ans=0.125
2024-06-20 09:29:01,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177585.83333333334, ans=0.125
2024-06-20 09:29:02,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=177604.16666666666, ans=0.125
2024-06-20 09:29:11,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=177622.5, ans=0.2
2024-06-20 09:29:23,109 INFO [train.py:1028] (0/2) Epoch 10, batch 5850, loss[loss=0.259, simple_loss=0.2938, pruned_loss=0.1121, over 12468.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.264, pruned_loss=0.0927, over 2576730.08 frames. ], batch size: 202, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:29:27,553 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.832e+02 1.973e+02 2.190e+02 2.931e+02, threshold=3.947e+02, percent-clipped=0.0
2024-06-20 09:29:27,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177659.16666666666, ans=0.1
2024-06-20 09:29:31,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=177677.5, ans=0.05
2024-06-20 09:29:33,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=22.5
2024-06-20 09:29:38,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.77 vs. limit=15.0
2024-06-20 09:29:44,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=177714.16666666666, ans=0.1
2024-06-20 09:29:55,881 INFO [train.py:1028] (0/2) Epoch 10, batch 5900, loss[loss=0.2183, simple_loss=0.2505, pruned_loss=0.09301, over 13091.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.266, pruned_loss=0.09356, over 2576122.57 frames. ], batch size: 121, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:29:56,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=177750.83333333334, ans=0.125
2024-06-20 09:29:58,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=177750.83333333334, ans=0.125
2024-06-20 09:30:03,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=177769.16666666666, ans=0.0
2024-06-20 09:30:12,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=177787.5, ans=0.025
2024-06-20 09:30:17,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=177805.83333333334, ans=0.0
2024-06-20 09:30:30,561 INFO [train.py:1028] (0/2) Epoch 10, batch 5950, loss[loss=0.2461, simple_loss=0.278, pruned_loss=0.1071, over 13154.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2674, pruned_loss=0.094, over 2580877.38 frames. ], batch size: 121, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:30:35,372 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.814e+02 1.976e+02 2.492e+02 3.976e+02, threshold=3.953e+02, percent-clipped=1.0
2024-06-20 09:30:58,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=177897.5, ans=0.0
2024-06-20 09:31:03,929 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 09:31:12,729 INFO [train.py:1028] (0/2) Epoch 10, batch 6000, loss[loss=0.296, simple_loss=0.3172, pruned_loss=0.1374, over 12237.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2696, pruned_loss=0.09516, over 2574733.96 frames. ], batch size: 240, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:31:12,729 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 09:31:20,486 INFO [train.py:1060] (0/2) Epoch 10, validation: loss=0.1984, simple_loss=0.262, pruned_loss=0.06739, over 351949.00 frames.
2024-06-20 09:31:20,487 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-20 09:31:26,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177952.5, ans=0.1
2024-06-20 09:31:31,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=177952.5, ans=0.125
2024-06-20 09:31:40,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.68 vs. limit=10.0
2024-06-20 09:31:47,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=178007.5, ans=0.125
2024-06-20 09:31:48,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=178007.5, ans=0.0
2024-06-20 09:31:54,226 INFO [train.py:1028] (0/2) Epoch 10, batch 6050, loss[loss=0.2294, simple_loss=0.2644, pruned_loss=0.09722, over 12939.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2707, pruned_loss=0.09528, over 2577362.23 frames. ], batch size: 39, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:31:59,093 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.821e+02 2.033e+02 2.288e+02 2.953e+02, threshold=4.066e+02, percent-clipped=0.0
2024-06-20 09:32:01,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=178044.16666666666, ans=0.2
2024-06-20 09:32:05,253 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 09:32:07,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=178062.5, ans=0.125
2024-06-20 09:32:10,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0
2024-06-20 09:32:19,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=178080.83333333334, ans=0.0
2024-06-20 09:32:28,529 INFO [train.py:1028] (0/2) Epoch 10, batch 6100, loss[loss=0.2587, simple_loss=0.2906, pruned_loss=0.1134, over 13082.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2721, pruned_loss=0.09583, over 2579245.98 frames. ], batch size: 121, lr: 5.85e-03, grad_scale: 64.0
2024-06-20 09:32:32,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=178117.5, ans=0.0
2024-06-20 09:32:32,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=178117.5, ans=0.2
2024-06-20 09:32:43,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=178154.16666666666, ans=0.125
2024-06-20 09:32:50,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=178172.5, ans=0.0
2024-06-20 09:32:52,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=178172.5, ans=0.125
2024-06-20 09:32:54,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.97 vs. limit=15.0
2024-06-20 09:33:03,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178190.83333333334, ans=0.1
2024-06-20 09:33:03,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=178190.83333333334, ans=0.0
2024-06-20 09:33:10,076 INFO [train.py:1028] (0/2) Epoch 10, batch 6150, loss[loss=0.2577, simple_loss=0.2842, pruned_loss=0.1156, over 10786.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2739, pruned_loss=0.09669, over 2577449.49 frames. ], batch size: 304, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:33:13,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=178209.16666666666, ans=0.125
2024-06-20 09:33:14,887 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.823e+02 1.982e+02 2.110e+02 2.998e+02, threshold=3.964e+02, percent-clipped=0.0
2024-06-20 09:33:17,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=178227.5, ans=10.0
2024-06-20 09:33:22,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=178245.83333333334, ans=0.015
2024-06-20 09:33:29,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0
2024-06-20 09:33:37,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=178282.5, ans=0.125
2024-06-20 09:33:43,784 INFO [train.py:1028] (0/2) Epoch 10, batch 6200, loss[loss=0.2319, simple_loss=0.2803, pruned_loss=0.09173, over 13282.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2756, pruned_loss=0.09763, over 2574697.83 frames. ], batch size: 89, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:33:54,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=12.81 vs. limit=15.0
2024-06-20 09:33:58,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.68 vs. limit=15.0
2024-06-20 09:34:08,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=178355.83333333334, ans=0.1
2024-06-20 09:34:10,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=15.0
2024-06-20 09:34:14,490 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0
2024-06-20 09:34:16,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=178374.16666666666, ans=0.0
2024-06-20 09:34:17,447 INFO [train.py:1028] (0/2) Epoch 10, batch 6250, loss[loss=0.2644, simple_loss=0.2938, pruned_loss=0.1175, over 13241.00 frames. ], tot_loss[loss=0.237, simple_loss=0.2772, pruned_loss=0.0984, over 2569115.57 frames. ], batch size: 83, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:34:22,143 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.816e+02 1.936e+02 2.193e+02 3.296e+02, threshold=3.873e+02, percent-clipped=0.0
2024-06-20 09:34:33,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.38 vs. limit=22.5
2024-06-20 09:34:44,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=178465.83333333334, ans=0.125
2024-06-20 09:34:45,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.90 vs. limit=22.5
2024-06-20 09:34:49,996 INFO [train.py:1028] (0/2) Epoch 10, batch 6300, loss[loss=0.2491, simple_loss=0.2923, pruned_loss=0.103, over 11137.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2791, pruned_loss=0.09943, over 2563783.21 frames. ], batch size: 16, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:35:16,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=178539.16666666666, ans=0.2
2024-06-20 09:35:16,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=178539.16666666666, ans=0.025
2024-06-20 09:35:16,726 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0
2024-06-20 09:35:17,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=178539.16666666666, ans=0.125
2024-06-20 09:35:21,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=178539.16666666666, ans=22.5
2024-06-20 09:35:30,162 INFO [train.py:1028] (0/2) Epoch 10, batch 6350, loss[loss=0.2751, simple_loss=0.3031, pruned_loss=0.1235, over 12511.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2806, pruned_loss=0.09991, over 2574279.89 frames. ], batch size: 202, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:35:30,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.50 vs. limit=10.0
2024-06-20 09:35:35,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.850e+02 2.070e+02 2.243e+02 2.833e+02, threshold=4.139e+02, percent-clipped=0.0
2024-06-20 09:35:38,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=178594.16666666666, ans=0.125
2024-06-20 09:35:46,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=178612.5, ans=0.0
2024-06-20 09:35:48,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=178612.5, ans=0.1
2024-06-20 09:35:55,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=178630.83333333334, ans=0.0
2024-06-20 09:35:56,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=178649.16666666666, ans=0.125
2024-06-20 09:36:03,138 INFO [train.py:1028] (0/2) Epoch 10, batch 6400, loss[loss=0.2095, simple_loss=0.2602, pruned_loss=0.07942, over 13200.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2826, pruned_loss=0.1009, over 2576171.71 frames. ], batch size: 67, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:36:23,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.78 vs. limit=6.0
2024-06-20 09:36:26,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=178722.5, ans=0.125
2024-06-20 09:36:35,531 INFO [train.py:1028] (0/2) Epoch 10, batch 6450, loss[loss=0.2812, simple_loss=0.3101, pruned_loss=0.1261, over 12588.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2842, pruned_loss=0.1014, over 2581572.09 frames. ], batch size: 202, lr: 5.84e-03, grad_scale: 64.0
2024-06-20 09:36:40,416 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.860e+02 2.039e+02 2.300e+02 3.323e+02, threshold=4.077e+02, percent-clipped=0.0
2024-06-20 09:36:44,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.13 vs. limit=10.0
2024-06-20 09:36:51,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0
2024-06-20 09:36:57,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=178814.16666666666, ans=0.0
2024-06-20 09:36:59,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=178814.16666666666, ans=0.05
2024-06-20 09:37:00,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=178814.16666666666, ans=0.125
2024-06-20 09:37:05,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=178832.5, ans=0.125
2024-06-20 09:37:13,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=178832.5, ans=0.125
2024-06-20 09:37:15,195 INFO [train.py:1028] (0/2) Epoch 10, batch 6500, loss[loss=0.2616, simple_loss=0.2857, pruned_loss=0.1188, over 10692.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2857, pruned_loss=0.1015, over 2583899.66 frames. ], batch size: 304, lr: 5.83e-03, grad_scale: 64.0
2024-06-20 09:37:18,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=178850.83333333334, ans=0.2
2024-06-20 09:37:28,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0
2024-06-20 09:37:29,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=178887.5, ans=0.125
2024-06-20 09:37:35,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=178905.83333333334, ans=0.125
2024-06-20 09:37:43,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=178924.16666666666, ans=0.0
2024-06-20 09:37:47,998 INFO [train.py:1028] (0/2) Epoch 10, batch 6550, loss[loss=0.2557, simple_loss=0.2989, pruned_loss=0.1062, over 12558.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.2868, pruned_loss=0.1015, over 2587743.90 frames. ], batch size: 22, lr: 5.83e-03, grad_scale: 64.0
2024-06-20 09:37:51,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=178942.5, ans=0.125
2024-06-20 09:37:52,483 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.821e+02 2.005e+02 2.181e+02 3.127e+02, threshold=4.010e+02, percent-clipped=0.0
2024-06-20 09:37:56,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.50 vs. limit=12.0
2024-06-20 09:37:56,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=178960.83333333334, ans=0.125
2024-06-20 09:38:09,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.91 vs. limit=22.5
2024-06-20 09:38:13,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=179015.83333333334, ans=0.2
2024-06-20 09:38:20,652 INFO [train.py:1028] (0/2) Epoch 10, batch 6600, loss[loss=0.2345, simple_loss=0.284, pruned_loss=0.09254, over 13247.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.2874, pruned_loss=0.1017, over 2590885.60 frames. ], batch size: 72, lr: 5.83e-03, grad_scale: 128.0
2024-06-20 09:38:20,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=179034.16666666666, ans=0.125
2024-06-20 09:38:27,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=179052.5, ans=0.2
2024-06-20 09:38:29,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=12.0
2024-06-20 09:38:38,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=179070.83333333334, ans=0.125
2024-06-20 09:38:40,044 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.22 vs. limit=15.0
2024-06-20 09:38:50,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=179107.5, ans=0.0
2024-06-20 09:38:54,007 INFO [train.py:1028] (0/2) Epoch 10, batch 6650, loss[loss=0.2817, simple_loss=0.3186, pruned_loss=0.1224, over 12974.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2892, pruned_loss=0.1024, over 2584858.42 frames. ], batch size: 158, lr: 5.83e-03, grad_scale: 128.0
2024-06-20 09:38:58,802 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.901e+02 2.122e+02 2.447e+02 4.029e+02, threshold=4.243e+02, percent-clipped=1.0
2024-06-20 09:39:07,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=179162.5, ans=0.5
2024-06-20 09:39:12,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=179162.5, ans=0.125
2024-06-20 09:39:30,162 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.63 vs. limit=15.0
2024-06-20 09:39:34,178 INFO [train.py:1028] (0/2) Epoch 10, batch 6700, loss[loss=0.2523, simple_loss=0.295, pruned_loss=0.1049, over 12707.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.2896, pruned_loss=0.1025, over 2585130.21 frames. ], batch size: 176, lr: 5.83e-03, grad_scale: 128.0
2024-06-20 09:39:35,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=179217.5, ans=0.07
2024-06-20 09:39:37,580 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.923e+01
2024-06-20 09:39:54,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=179272.5, ans=0.125
2024-06-20 09:40:00,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=179290.83333333334, ans=0.0
2024-06-20 09:40:02,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179290.83333333334, ans=0.1
2024-06-20 09:40:07,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=179309.16666666666, ans=0.025
2024-06-20 09:40:07,508 INFO [train.py:1028] (0/2) Epoch 10, batch 6750, loss[loss=0.3177, simple_loss=0.3412, pruned_loss=0.1471, over 12158.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.2898, pruned_loss=0.1027, over 2577764.13 frames. ], batch size: 240, lr: 5.83e-03, grad_scale: 128.0
2024-06-20 09:40:07,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=179309.16666666666, ans=0.125
2024-06-20 09:40:10,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=179309.16666666666, ans=0.0
2024-06-20 09:40:11,804 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.807e+02 1.958e+02 2.132e+02 2.856e+02, threshold=3.916e+02, percent-clipped=0.0
2024-06-20 09:40:11,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=179309.16666666666, ans=0.0
2024-06-20 09:40:18,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=179327.5, ans=0.035
2024-06-20 09:40:26,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=179364.16666666666, ans=0.125
2024-06-20 09:40:29,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=179364.16666666666, ans=0.025
2024-06-20 09:40:29,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0
2024-06-20 09:40:40,382 INFO [train.py:1028] (0/2) Epoch 10, batch 6800, loss[loss=0.2582, simple_loss=0.2967, pruned_loss=0.1098, over 13251.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.2921, pruned_loss=0.1036, over 2580199.59 frames. ], batch size: 67, lr: 5.82e-03, grad_scale: 128.0
2024-06-20 09:40:42,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=179400.83333333334, ans=0.2
2024-06-20 09:41:02,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=179455.83333333334, ans=0.0
2024-06-20 09:41:21,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=179492.5, ans=0.0
2024-06-20 09:41:21,624 INFO [train.py:1028] (0/2) Epoch 10, batch 6850, loss[loss=0.2824, simple_loss=0.3311, pruned_loss=0.1168, over 13268.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.2928, pruned_loss=0.1035, over 2583306.99 frames. ], batch size: 63, lr: 5.82e-03, grad_scale: 128.0
2024-06-20 09:41:24,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=179492.5, ans=10.0
2024-06-20 09:41:26,281 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.898e+02 2.017e+02 2.185e+02 3.488e+02, threshold=4.034e+02, percent-clipped=0.0
2024-06-20 09:41:28,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=179510.83333333334, ans=0.2
2024-06-20 09:41:32,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=179510.83333333334, ans=0.125
2024-06-20 09:41:40,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=179529.16666666666, ans=0.125
2024-06-20 09:41:48,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=179565.83333333334, ans=0.125
2024-06-20 09:41:54,484 INFO [train.py:1028] (0/2) Epoch 10, batch 6900, loss[loss=0.2455, simple_loss=0.2947, pruned_loss=0.09818, over 13268.00 frames. ], tot_loss[loss=0.25, simple_loss=0.2928, pruned_loss=0.1036, over 2586763.19 frames. ], batch size: 49, lr: 5.82e-03, grad_scale: 128.0
2024-06-20 09:41:54,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=179584.16666666666, ans=0.125
2024-06-20 09:41:56,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.73 vs. limit=15.0
2024-06-20 09:42:27,712 INFO [train.py:1028] (0/2) Epoch 10, batch 6950, loss[loss=0.2137, simple_loss=0.2689, pruned_loss=0.07923, over 11719.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.2933, pruned_loss=0.1036, over 2580327.73 frames. ], batch size: 17, lr: 5.82e-03, grad_scale: 128.0
2024-06-20 09:42:27,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179675.83333333334, ans=0.1
2024-06-20 09:42:32,250 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.859e+02 2.083e+02 2.369e+02 3.741e+02, threshold=4.165e+02, percent-clipped=0.0
2024-06-20 09:42:44,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=179712.5, ans=0.0
2024-06-20 09:42:53,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.47 vs. limit=22.5
2024-06-20 09:42:57,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179749.16666666666, ans=0.1
2024-06-20 09:42:59,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.86 vs. limit=22.5
2024-06-20 09:43:00,386 INFO [train.py:1028] (0/2) Epoch 10, batch 7000, loss[loss=0.2702, simple_loss=0.308, pruned_loss=0.1161, over 12904.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.294, pruned_loss=0.1038, over 2577626.48 frames. ], batch size: 158, lr: 5.82e-03, grad_scale: 128.0
2024-06-20 09:43:01,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=179767.5, ans=0.125
2024-06-20 09:43:06,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=179785.83333333334, ans=0.125
2024-06-20 09:43:14,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=179804.16666666666, ans=0.125
2024-06-20 09:43:37,418 INFO [train.py:1028] (0/2) Epoch 10, batch 7050, loss[loss=0.2924, simple_loss=0.3239, pruned_loss=0.1305, over 12740.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.2958, pruned_loss=0.1043, over 2584906.94 frames. ], batch size: 176, lr: 5.82e-03, grad_scale: 128.0
2024-06-20 09:43:41,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.60 vs. limit=15.0
2024-06-20 09:43:42,141 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.936e+02 2.184e+02 2.541e+02 3.673e+02, threshold=4.367e+02, percent-clipped=0.0
2024-06-20 09:43:53,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0
2024-06-20 09:43:53,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.86 vs. limit=22.5
2024-06-20 09:43:57,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=179914.16666666666, ans=0.1
2024-06-20 09:44:02,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=179914.16666666666, ans=0.0
2024-06-20 09:44:02,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=179914.16666666666, ans=0.0
2024-06-20 09:44:10,340 INFO [train.py:1028] (0/2) Epoch 10, batch 7100, loss[loss=0.2869, simple_loss=0.3257, pruned_loss=0.1241, over 13137.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.2964, pruned_loss=0.105, over 2576398.99 frames. ], batch size: 112, lr: 5.82e-03, grad_scale: 128.0
], batch size: 112, lr: 5.82e-03, grad_scale: 128.0 2024-06-20 09:44:10,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=179950.83333333334, ans=0.015 2024-06-20 09:44:11,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=179950.83333333334, ans=0.025 2024-06-20 09:44:13,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=179950.83333333334, ans=0.125 2024-06-20 09:44:15,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=179950.83333333334, ans=0.125 2024-06-20 09:44:24,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=12.0 2024-06-20 09:44:40,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=180024.16666666666, ans=0.125 2024-06-20 09:44:40,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=180024.16666666666, ans=0.0 2024-06-20 09:44:43,776 INFO [train.py:1028] (0/2) Epoch 10, batch 7150, loss[loss=0.277, simple_loss=0.3163, pruned_loss=0.1188, over 12535.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.2968, pruned_loss=0.1051, over 2573656.17 frames. ], batch size: 202, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:44:48,357 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.819e+02 1.982e+02 2.257e+02 3.628e+02, threshold=3.965e+02, percent-clipped=0.0 2024-06-20 09:45:01,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=180079.16666666666, ans=0.125 2024-06-20 09:45:09,545 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. limit=10.0 2024-06-20 09:45:14,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=180115.83333333334, ans=0.125 2024-06-20 09:45:16,062 INFO [train.py:1028] (0/2) Epoch 10, batch 7200, loss[loss=0.2771, simple_loss=0.3178, pruned_loss=0.1182, over 13102.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.2975, pruned_loss=0.1051, over 2578261.85 frames. ], batch size: 112, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:45:16,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. 
limit=22.5 2024-06-20 09:45:18,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=180134.16666666666, ans=0.0 2024-06-20 09:45:30,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=180152.5, ans=0.95 2024-06-20 09:45:34,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=180152.5, ans=0.125 2024-06-20 09:45:38,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=180170.83333333334, ans=0.125 2024-06-20 09:45:51,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=180207.5, ans=0.2 2024-06-20 09:45:55,998 INFO [train.py:1028] (0/2) Epoch 10, batch 7250, loss[loss=0.2142, simple_loss=0.2656, pruned_loss=0.08145, over 13163.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.2982, pruned_loss=0.1052, over 2579628.45 frames. ], batch size: 37, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:45:58,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=180225.83333333334, ans=0.125 2024-06-20 09:46:00,483 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.885e+02 2.035e+02 2.226e+02 3.312e+02, threshold=4.069e+02, percent-clipped=0.0 2024-06-20 09:46:03,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=180244.16666666666, ans=0.2 2024-06-20 09:46:12,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.18 vs. limit=15.0 2024-06-20 09:46:21,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=180280.83333333334, ans=0.1 2024-06-20 09:46:25,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=180299.16666666666, ans=0.1 2024-06-20 09:46:29,014 INFO [train.py:1028] (0/2) Epoch 10, batch 7300, loss[loss=0.2398, simple_loss=0.2901, pruned_loss=0.09474, over 12973.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3, pruned_loss=0.1061, over 2580026.94 frames. ], batch size: 36, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:46:33,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=180317.5, ans=0.1 2024-06-20 09:46:35,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.30 vs. limit=22.5 2024-06-20 09:46:39,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=180335.83333333334, ans=0.125 2024-06-20 09:46:50,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=180372.5, ans=0.1 2024-06-20 09:47:01,493 INFO [train.py:1028] (0/2) Epoch 10, batch 7350, loss[loss=0.2616, simple_loss=0.3048, pruned_loss=0.1092, over 13327.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3008, pruned_loss=0.1065, over 2582199.44 frames. 
], batch size: 46, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:47:02,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=180409.16666666666, ans=0.2 2024-06-20 09:47:04,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=180409.16666666666, ans=0.2 2024-06-20 09:47:05,880 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.896e+02 2.114e+02 2.329e+02 3.154e+02, threshold=4.228e+02, percent-clipped=0.0 2024-06-20 09:47:21,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=180464.16666666666, ans=0.2 2024-06-20 09:47:24,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=180464.16666666666, ans=0.125 2024-06-20 09:47:25,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=180464.16666666666, ans=0.125 2024-06-20 09:47:38,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=180482.5, ans=0.2 2024-06-20 09:47:41,129 INFO [train.py:1028] (0/2) Epoch 10, batch 7400, loss[loss=0.2535, simple_loss=0.3074, pruned_loss=0.09981, over 13274.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3006, pruned_loss=0.1065, over 2587493.14 frames. ], batch size: 63, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:47:54,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=180537.5, ans=0.0 2024-06-20 09:47:58,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=180537.5, ans=0.125 2024-06-20 09:48:14,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=180592.5, ans=0.125 2024-06-20 09:48:14,459 INFO [train.py:1028] (0/2) Epoch 10, batch 7450, loss[loss=0.2462, simple_loss=0.2993, pruned_loss=0.0965, over 12657.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3006, pruned_loss=0.1061, over 2581173.76 frames. ], batch size: 29, lr: 5.81e-03, grad_scale: 128.0 2024-06-20 09:48:15,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=180592.5, ans=0.0 2024-06-20 09:48:18,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.32 vs. limit=22.5
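
The ScheduledFloat entries that dominate this log are periodic samples of scalar hyperparameters (skip rates, balancer probabilities, bypass scale minima and so on) that follow piecewise-linear schedules in the batch count; ans is the value currently in effect. A rough sketch of the mechanism with hypothetical breakpoints; the real class lives in icefall's scaling.py and every module defines its own schedule:

    # A float hyperparameter interpolated piecewise-linearly in batch count.
    class PiecewiseScheduledFloat:
        def __init__(self, *points):
            self.points = sorted(points)  # (batch_count, value) pairs
            self.batch_count = 0.0

        def value(self) -> float:
            p = self.points
            if self.batch_count <= p[0][0]:
                return p[0][1]
            for (x0, y0), (x1, y1) in zip(p, p[1:]):
                if self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            return p[-1][1]  # past the last breakpoint: hold the final value

    # e.g. a skip rate decaying 0.5 -> 0.025 over the first 4000 batches, then flat:
    rate = PiecewiseScheduledFloat((0.0, 0.5), (4000.0, 0.025))
    rate.batch_count = 180409.2
    print(rate.value())  # -> 0.025, the kind of ans=... value logged above

Scheduling these constants is what lets regularizers start strong early in training and decay to the small values (ans=0.0, ans=0.025) seen here, roughly 180k batches in.
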
2024-06-20 09:48:19,882 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.904e+02 2.062e+02 2.264e+02 2.863e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-20 09:48:25,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=180610.83333333334, ans=0.0 2024-06-20 09:48:40,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=180647.5, ans=0.125 2024-06-20 09:48:40,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=180647.5, ans=0.1 2024-06-20 09:48:41,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=180665.83333333334, ans=0.0 2024-06-20 09:48:47,773 INFO [train.py:1028] (0/2) Epoch 10, batch 7500, loss[loss=0.2772, simple_loss=0.304, pruned_loss=0.1252, over 10759.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3013, pruned_loss=0.1066, over 2577728.00 frames. ], batch size: 304, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:48:48,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=180684.16666666666, ans=0.125 2024-06-20 09:48:53,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=180684.16666666666, ans=0.125 2024-06-20 09:48:57,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.66 vs. limit=15.0 2024-06-20 09:49:07,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=180739.16666666666, ans=15.0 2024-06-20 09:49:20,626 INFO [train.py:1028] (0/2) Epoch 10, batch 7550, loss[loss=0.2646, simple_loss=0.3037, pruned_loss=0.1128, over 12890.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3019, pruned_loss=0.1071, over 2578090.90 frames. ], batch size: 158, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:49:29,597 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.923e+02 2.156e+02 2.444e+02 3.380e+02, threshold=4.312e+02, percent-clipped=0.0 2024-06-20 09:49:29,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.28 vs. limit=15.0 2024-06-20 09:49:33,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=180794.16666666666, ans=0.0 2024-06-20 09:49:34,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=180794.16666666666, ans=0.0 2024-06-20 09:49:41,327 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 09:49:43,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.52 vs.
limit=22.5 2024-06-20 09:49:46,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=180830.83333333334, ans=0.0 2024-06-20 09:49:46,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2024-06-20 09:49:47,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=180830.83333333334, ans=0.125 2024-06-20 09:49:51,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=180849.16666666666, ans=0.0 2024-06-20 09:49:52,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.46 vs. limit=6.0 2024-06-20 09:49:57,387 INFO [train.py:1028] (0/2) Epoch 10, batch 7600, loss[loss=0.2844, simple_loss=0.3265, pruned_loss=0.1211, over 13221.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3034, pruned_loss=0.1079, over 2577606.06 frames. ], batch size: 83, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:50:15,469 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.95 vs. limit=22.5 2024-06-20 09:50:22,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=180922.5, ans=0.125 2024-06-20 09:50:31,359 INFO [train.py:1028] (0/2) Epoch 10, batch 7650, loss[loss=0.2404, simple_loss=0.2853, pruned_loss=0.09781, over 12977.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3038, pruned_loss=0.1082, over 2574780.18 frames. ], batch size: 33, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:50:36,868 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.894e+02 2.012e+02 2.184e+02 3.015e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 09:50:38,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.35 vs. limit=22.5 2024-06-20 09:51:05,915 INFO [train.py:1028] (0/2) Epoch 10, batch 7700, loss[loss=0.2622, simple_loss=0.3174, pruned_loss=0.1035, over 13227.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3039, pruned_loss=0.108, over 2571454.06 frames. ], batch size: 63, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:51:09,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=181050.83333333334, ans=0.125 2024-06-20 09:51:15,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181069.16666666666, ans=0.1 2024-06-20 09:51:16,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.93 vs. limit=6.0 2024-06-20 09:51:41,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=181124.16666666666, ans=0.125 2024-06-20 09:51:45,676 INFO [train.py:1028] (0/2) Epoch 10, batch 7750, loss[loss=0.2299, simple_loss=0.2873, pruned_loss=0.08625, over 13265.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3039, pruned_loss=0.1082, over 2575684.67 frames. ], batch size: 72, lr: 5.80e-03, grad_scale: 64.0
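
In the recurring optim.py WARNING lines, the five numbers after "grad-norm quartiles" are the 0/25/50/75/100th percentiles of recently observed gradient norms, and throughout this log the reported threshold equals, up to print rounding, Clipping_scale times the median: for the warning just below, 2.0 * 2.043e+02 = 4.086e+02 against threshold=4.087e+02. percent-clipped then reports how often that threshold was exceeded (0.0 here, consistent with the maxima staying below it). A small sketch of that bookkeeping under those assumptions; icefall's ScaledAdam tracks these statistics inside the optimizer rather than in a standalone helper like this:

    import torch

    def clipping_report(grad_norms, clipping_scale=2.0):
        # grad_norms: recent per-batch gradient norms (floats).
        g = torch.tensor(grad_norms, dtype=torch.float32)
        q = torch.quantile(g, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]             # scale times the median
        pct = 100.0 * (g > threshold).float().mean()
        print(f"Clipping_scale={clipping_scale}, grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q.tolist())
              + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}")
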
2024-06-20 09:51:51,150 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.877e+02 2.043e+02 2.298e+02 3.130e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 09:51:52,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=181160.83333333334, ans=0.0 2024-06-20 09:51:52,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.23 vs. limit=22.5 2024-06-20 09:51:55,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=181160.83333333334, ans=0.0 2024-06-20 09:52:18,573 INFO [train.py:1028] (0/2) Epoch 10, batch 7800, loss[loss=0.2866, simple_loss=0.3279, pruned_loss=0.1227, over 13173.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3042, pruned_loss=0.1082, over 2579860.91 frames. ], batch size: 95, lr: 5.80e-03, grad_scale: 64.0 2024-06-20 09:52:21,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=181234.16666666666, ans=0.125 2024-06-20 09:52:26,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=181252.5, ans=0.025 2024-06-20 09:52:42,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=181289.16666666666, ans=0.04949747468305833 2024-06-20 09:52:52,125 INFO [train.py:1028] (0/2) Epoch 10, batch 7850, loss[loss=0.2677, simple_loss=0.306, pruned_loss=0.1147, over 11485.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3055, pruned_loss=0.109, over 2572868.86 frames. ], batch size: 16, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:52:54,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=8.0 2024-06-20 09:52:57,651 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 1.908e+02 2.046e+02 2.194e+02 3.225e+02, threshold=4.093e+02, percent-clipped=0.0 2024-06-20 09:52:59,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=181344.16666666666, ans=0.0 2024-06-20 09:53:05,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=181362.5, ans=0.1 2024-06-20 09:53:17,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181380.83333333334, ans=0.125 2024-06-20 09:53:26,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=181399.16666666666, ans=0.125 2024-06-20 09:53:31,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=181417.5, ans=0.07 2024-06-20 09:53:31,799 INFO [train.py:1028] (0/2) Epoch 10, batch 7900, loss[loss=0.2506, simple_loss=0.2948, pruned_loss=0.1032, over 13145.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3051, pruned_loss=0.1088, over 2571979.42 frames.
], batch size: 77, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:53:38,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181435.83333333334, ans=0.1 2024-06-20 09:53:51,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=181472.5, ans=0.04949747468305833 2024-06-20 09:53:54,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.29 vs. limit=10.0 2024-06-20 09:53:56,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181472.5, ans=0.1 2024-06-20 09:53:59,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=181490.83333333334, ans=0.0 2024-06-20 09:54:00,404 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.215e+01 2024-06-20 09:54:03,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=181490.83333333334, ans=0.0 2024-06-20 09:54:04,708 INFO [train.py:1028] (0/2) Epoch 10, batch 7950, loss[loss=0.2865, simple_loss=0.3161, pruned_loss=0.1285, over 10718.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3052, pruned_loss=0.1083, over 2575513.93 frames. ], batch size: 303, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:54:07,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2024-06-20 09:54:09,744 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 1.896e+02 2.024e+02 2.307e+02 3.595e+02, threshold=4.048e+02, percent-clipped=0.0 2024-06-20 09:54:16,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181527.5, ans=0.125 2024-06-20 09:54:27,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181564.16666666666, ans=0.1 2024-06-20 09:54:31,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=181582.5, ans=0.125 2024-06-20 09:54:31,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=181582.5, ans=0.0 2024-06-20 09:54:32,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.59 vs. limit=15.0 2024-06-20 09:54:35,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=181582.5, ans=0.025 2024-06-20 09:54:37,688 INFO [train.py:1028] (0/2) Epoch 10, batch 8000, loss[loss=0.2244, simple_loss=0.2779, pruned_loss=0.08548, over 12652.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3065, pruned_loss=0.1087, over 2571736.26 frames. 
], batch size: 29, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:54:39,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181600.83333333334, ans=0.1 2024-06-20 09:54:56,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.52 vs. limit=15.0 2024-06-20 09:54:59,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=181655.83333333334, ans=0.95 2024-06-20 09:55:01,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=181655.83333333334, ans=0.2 2024-06-20 09:55:14,270 INFO [train.py:1028] (0/2) Epoch 10, batch 8050, loss[loss=0.2781, simple_loss=0.3311, pruned_loss=0.1126, over 13230.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3066, pruned_loss=0.1085, over 2571289.86 frames. ], batch size: 83, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:55:15,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=181692.5, ans=0.125 2024-06-20 09:55:17,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=181692.5, ans=0.07 2024-06-20 09:55:22,931 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.959e+02 2.156e+02 2.331e+02 3.503e+02, threshold=4.312e+02, percent-clipped=0.0 2024-06-20 09:55:29,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=181710.83333333334, ans=0.2 2024-06-20 09:55:31,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=181729.16666666666, ans=0.125 2024-06-20 09:55:38,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.22 vs. limit=15.0 2024-06-20 09:55:47,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=15.0 2024-06-20 09:55:49,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=181784.16666666666, ans=0.125 2024-06-20 09:55:49,884 INFO [train.py:1028] (0/2) Epoch 10, batch 8100, loss[loss=0.2668, simple_loss=0.3175, pruned_loss=0.1081, over 13160.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3069, pruned_loss=0.1086, over 2575149.43 frames. ], batch size: 112, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:55:50,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=181784.16666666666, ans=0.0 2024-06-20 09:56:01,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. 
limit=10.0 2024-06-20 09:56:03,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=181820.83333333334, ans=0.1 2024-06-20 09:56:09,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=181839.16666666666, ans=0.05 2024-06-20 09:56:21,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.49 vs. limit=15.0 2024-06-20 09:56:22,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=181857.5, ans=0.125 2024-06-20 09:56:23,666 INFO [train.py:1028] (0/2) Epoch 10, batch 8150, loss[loss=0.2845, simple_loss=0.3227, pruned_loss=0.1232, over 13178.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3075, pruned_loss=0.1086, over 2579925.99 frames. ], batch size: 121, lr: 5.79e-03, grad_scale: 64.0 2024-06-20 09:56:26,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=181875.83333333334, ans=0.125 2024-06-20 09:56:29,222 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.891e+02 1.996e+02 2.154e+02 2.544e+02, threshold=3.991e+02, percent-clipped=0.0 2024-06-20 09:56:32,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=181894.16666666666, ans=0.025 2024-06-20 09:56:34,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=181894.16666666666, ans=0.025 2024-06-20 09:56:36,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.37 vs. limit=6.0 2024-06-20 09:56:42,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181912.5, ans=0.125 2024-06-20 09:56:56,693 INFO [train.py:1028] (0/2) Epoch 10, batch 8200, loss[loss=0.2705, simple_loss=0.3149, pruned_loss=0.113, over 13126.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3077, pruned_loss=0.1086, over 2583304.72 frames. ], batch size: 112, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:57:02,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2024-06-20 09:57:02,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=181985.83333333334, ans=0.025 2024-06-20 09:57:08,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.98 vs. 
limit=22.5 2024-06-20 09:57:11,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=182004.16666666666, ans=0.2 2024-06-20 09:57:20,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=182022.5, ans=0.2 2024-06-20 09:57:29,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=182040.83333333334, ans=0.125 2024-06-20 09:57:33,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=182040.83333333334, ans=0.125 2024-06-20 09:57:36,088 INFO [train.py:1028] (0/2) Epoch 10, batch 8250, loss[loss=0.2663, simple_loss=0.3208, pruned_loss=0.1059, over 13235.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3083, pruned_loss=0.1092, over 2583165.77 frames. ], batch size: 52, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:57:39,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=182059.16666666666, ans=0.5 2024-06-20 09:57:41,428 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.873e+02 2.016e+02 2.203e+02 3.042e+02, threshold=4.032e+02, percent-clipped=0.0 2024-06-20 09:57:48,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=182095.83333333334, ans=0.5 2024-06-20 09:57:54,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2024-06-20 09:57:57,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=182114.16666666666, ans=0.0 2024-06-20 09:58:04,122 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.82 vs. limit=22.5 2024-06-20 09:58:08,417 INFO [train.py:1028] (0/2) Epoch 10, batch 8300, loss[loss=0.2776, simple_loss=0.314, pruned_loss=0.1206, over 13148.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3075, pruned_loss=0.109, over 2580268.30 frames. ], batch size: 103, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:58:19,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=182169.16666666666, ans=0.125 2024-06-20 09:58:22,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=182187.5, ans=0.125 2024-06-20 09:58:28,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=182205.83333333334, ans=0.2 2024-06-20 09:58:29,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=182205.83333333334, ans=0.0 2024-06-20 09:58:41,228 INFO [train.py:1028] (0/2) Epoch 10, batch 8350, loss[loss=0.2555, simple_loss=0.3056, pruned_loss=0.1027, over 13164.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3068, pruned_loss=0.1081, over 2579803.19 frames. 
], batch size: 112, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:58:41,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=182242.5, ans=0.125 2024-06-20 09:58:43,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=182242.5, ans=0.125 2024-06-20 09:58:43,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=182242.5, ans=0.125 2024-06-20 09:58:45,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=182242.5, ans=0.5 2024-06-20 09:58:46,289 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.908e+02 2.012e+02 2.176e+02 3.228e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 09:58:46,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=182242.5, ans=0.125 2024-06-20 09:58:54,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=182279.16666666666, ans=0.125 2024-06-20 09:59:03,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=182297.5, ans=0.125 2024-06-20 09:59:12,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=182315.83333333334, ans=0.0 2024-06-20 09:59:16,779 INFO [train.py:1028] (0/2) Epoch 10, batch 8400, loss[loss=0.257, simple_loss=0.302, pruned_loss=0.106, over 12931.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3069, pruned_loss=0.1083, over 2576927.54 frames. ], batch size: 39, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:59:20,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=182334.16666666666, ans=0.0 2024-06-20 09:59:27,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=182352.5, ans=0.07 2024-06-20 09:59:41,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=182389.16666666666, ans=0.2 2024-06-20 09:59:49,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=182407.5, ans=0.0 2024-06-20 09:59:52,077 INFO [train.py:1028] (0/2) Epoch 10, batch 8450, loss[loss=0.2588, simple_loss=0.3072, pruned_loss=0.1052, over 13221.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3074, pruned_loss=0.1085, over 2578828.95 frames. 
], batch size: 112, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 09:59:56,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=182425.83333333334, ans=0.0 2024-06-20 09:59:56,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=182425.83333333334, ans=0.125 2024-06-20 09:59:57,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.954e+02 2.082e+02 2.368e+02 3.764e+02, threshold=4.163e+02, percent-clipped=0.0 2024-06-20 09:59:58,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=182444.16666666666, ans=0.025 2024-06-20 10:00:10,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=182462.5, ans=0.125 2024-06-20 10:00:10,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=182462.5, ans=0.125 2024-06-20 10:00:22,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2024-06-20 10:00:25,525 INFO [train.py:1028] (0/2) Epoch 10, batch 8500, loss[loss=0.247, simple_loss=0.3, pruned_loss=0.09699, over 12627.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3088, pruned_loss=0.1092, over 2576869.71 frames. ], batch size: 29, lr: 5.78e-03, grad_scale: 64.0 2024-06-20 10:00:31,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.72 vs. limit=10.0 2024-06-20 10:00:41,418 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-20 10:00:59,658 INFO [train.py:1028] (0/2) Epoch 10, batch 8550, loss[loss=0.2833, simple_loss=0.3241, pruned_loss=0.1212, over 12644.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3082, pruned_loss=0.1091, over 2575032.53 frames. ], batch size: 22, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:01:05,341 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.924e+02 2.058e+02 2.229e+02 2.913e+02, threshold=4.116e+02, percent-clipped=0.0 2024-06-20 10:01:11,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=182627.5, ans=0.0 2024-06-20 10:01:31,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=182664.16666666666, ans=0.025 2024-06-20 10:01:39,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=182682.5, ans=0.5 2024-06-20 10:01:40,302 INFO [train.py:1028] (0/2) Epoch 10, batch 8600, loss[loss=0.2527, simple_loss=0.2962, pruned_loss=0.1046, over 13123.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3093, pruned_loss=0.1095, over 2572261.69 frames. ], batch size: 112, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:01:49,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.37 vs. 
limit=12.0 2024-06-20 10:01:57,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=182737.5, ans=0.025 2024-06-20 10:02:04,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=182755.83333333334, ans=0.125 2024-06-20 10:02:05,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=182755.83333333334, ans=0.2 2024-06-20 10:02:12,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=182774.16666666666, ans=0.125 2024-06-20 10:02:12,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.56 vs. limit=15.0 2024-06-20 10:02:14,456 INFO [train.py:1028] (0/2) Epoch 10, batch 8650, loss[loss=0.263, simple_loss=0.3087, pruned_loss=0.1087, over 13085.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3092, pruned_loss=0.1091, over 2576230.94 frames. ], batch size: 102, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:02:19,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.89 vs. limit=5.0 2024-06-20 10:02:19,515 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.919e+02 2.057e+02 2.186e+02 2.789e+02, threshold=4.114e+02, percent-clipped=0.0 2024-06-20 10:02:24,073 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:02:26,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=182829.16666666666, ans=0.125 2024-06-20 10:02:28,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.74 vs. limit=22.5 2024-06-20 10:02:30,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=182829.16666666666, ans=0.2 2024-06-20 10:02:33,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.84 vs. limit=22.5 2024-06-20 10:02:43,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=182865.83333333334, ans=0.2 2024-06-20 10:02:46,977 INFO [train.py:1028] (0/2) Epoch 10, batch 8700, loss[loss=0.2764, simple_loss=0.3283, pruned_loss=0.1122, over 13174.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3095, pruned_loss=0.1097, over 2573584.30 frames. ], batch size: 59, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:02:51,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.86 vs. limit=10.0 2024-06-20 10:02:53,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.65 vs. limit=15.0 2024-06-20 10:02:54,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.36 vs. 
limit=15.0 2024-06-20 10:03:00,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=182920.83333333334, ans=0.125 2024-06-20 10:03:04,310 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:03:13,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=182939.16666666666, ans=0.1 2024-06-20 10:03:16,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=182957.5, ans=0.1 2024-06-20 10:03:18,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=182957.5, ans=0.0 2024-06-20 10:03:18,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=182957.5, ans=0.0 2024-06-20 10:03:19,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=182957.5, ans=0.0 2024-06-20 10:03:23,551 INFO [train.py:1028] (0/2) Epoch 10, batch 8750, loss[loss=0.278, simple_loss=0.3161, pruned_loss=0.12, over 13093.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3098, pruned_loss=0.1102, over 2568686.58 frames. ], batch size: 121, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:03:32,460 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.933e+02 2.061e+02 2.236e+02 3.961e+02, threshold=4.122e+02, percent-clipped=0.0 2024-06-20 10:03:37,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=182994.16666666666, ans=0.125 2024-06-20 10:03:43,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=183012.5, ans=0.125 2024-06-20 10:03:51,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.07 vs. limit=22.5 2024-06-20 10:03:54,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=183049.16666666666, ans=0.125 2024-06-20 10:04:00,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.74 vs. limit=10.0 2024-06-20 10:04:01,177 INFO [train.py:1028] (0/2) Epoch 10, batch 8800, loss[loss=0.2862, simple_loss=0.3367, pruned_loss=0.1179, over 13290.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3098, pruned_loss=0.1099, over 2573899.36 frames. 
], batch size: 72, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:04:09,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=183085.83333333334, ans=0.125 2024-06-20 10:04:15,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=183104.16666666666, ans=0.05 2024-06-20 10:04:15,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=183104.16666666666, ans=0.125 2024-06-20 10:04:21,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=183122.5, ans=0.0 2024-06-20 10:04:22,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=183122.5, ans=0.125 2024-06-20 10:04:35,072 INFO [train.py:1028] (0/2) Epoch 10, batch 8850, loss[loss=0.2983, simple_loss=0.3382, pruned_loss=0.1292, over 12580.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3098, pruned_loss=0.1101, over 2562637.55 frames. ], batch size: 202, lr: 5.77e-03, grad_scale: 64.0 2024-06-20 10:04:35,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=183159.16666666666, ans=0.0 2024-06-20 10:04:40,598 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.906e+02 2.033e+02 2.167e+02 2.777e+02, threshold=4.066e+02, percent-clipped=0.0 2024-06-20 10:05:04,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=183232.5, ans=0.2 2024-06-20 10:05:10,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183232.5, ans=0.1 2024-06-20 10:05:11,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=183250.83333333334, ans=0.0 2024-06-20 10:05:11,872 INFO [train.py:1028] (0/2) Epoch 10, batch 8900, loss[loss=0.2743, simple_loss=0.3244, pruned_loss=0.1121, over 12928.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3107, pruned_loss=0.1104, over 2560984.17 frames. ], batch size: 33, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:05:38,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=183305.83333333334, ans=0.2 2024-06-20 10:05:42,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=15.0 2024-06-20 10:05:44,501 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-100000.pt 2024-06-20 10:05:52,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.86 vs. limit=15.0 2024-06-20 10:05:52,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.79 vs. limit=22.5 2024-06-20 10:05:53,809 INFO [train.py:1028] (0/2) Epoch 10, batch 8950, loss[loss=0.2797, simple_loss=0.3183, pruned_loss=0.1205, over 12449.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3106, pruned_loss=0.1099, over 2561428.71 frames. ], batch size: 202, lr: 5.76e-03, grad_scale: 64.0
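
The checkpoint saved just above (checkpoint-100000.pt, apparently named for the global batch index) is followed at batch 9000 below by a validation pass reporting loss=0.1973, simple_loss=0.261, pruned_loss=0.06683. These triples, like the per-batch tot_loss values, are consistent with the usual pruned-transducer combination loss = 0.5 * simple_loss + pruned_loss once warm-up is over; the 0.5 weight is inferred from the printed numbers here, not read from the recipe code. A quick check against values taken from this log:

    # (loss, simple_loss, pruned_loss) triples copied from nearby log lines.
    records = [
        (0.2652, 0.3106, 0.1099),   # Epoch 10, batch 8950 tot_loss (just above)
        (0.1973, 0.2610, 0.06683),  # Epoch 10 validation (just below)
    ]
    for loss, simple, pruned in records:
        # Combined loss matches 0.5 * simple + pruned within print rounding.
        assert abs(0.5 * simple + pruned - loss) < 2e-3

The fractional frame counts that tot_loss is averaged over (about 2.56M frames by this point) also suggest a frames-weighted running average, which is why tot_loss moves far more smoothly than the per-batch loss values.
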
2024-06-20 10:05:59,225 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 1.935e+02 2.117e+02 2.442e+02 3.061e+02, threshold=4.234e+02, percent-clipped=0.0 2024-06-20 10:06:08,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=183379.16666666666, ans=0.0 2024-06-20 10:06:09,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.08 vs. limit=15.0 2024-06-20 10:06:27,681 INFO [train.py:1028] (0/2) Epoch 10, batch 9000, loss[loss=0.2857, simple_loss=0.3327, pruned_loss=0.1194, over 13278.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3113, pruned_loss=0.1102, over 2567231.46 frames. ], batch size: 46, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:06:27,682 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 10:06:35,714 INFO [train.py:1060] (0/2) Epoch 10, validation: loss=0.1973, simple_loss=0.261, pruned_loss=0.06683, over 351949.00 frames. 2024-06-20 10:06:35,714 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 10:06:35,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=183434.16666666666, ans=0.125 2024-06-20 10:06:44,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.46 vs. limit=10.0 2024-06-20 10:06:48,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.15 vs. limit=15.0 2024-06-20 10:06:52,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.10 vs. limit=15.0 2024-06-20 10:06:53,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=183470.83333333334, ans=0.0 2024-06-20 10:06:55,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=183489.16666666666, ans=0.07 2024-06-20 10:07:00,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.07 vs. limit=22.5 2024-06-20 10:07:08,489 INFO [train.py:1028] (0/2) Epoch 10, batch 9050, loss[loss=0.2475, simple_loss=0.298, pruned_loss=0.09848, over 10608.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3122, pruned_loss=0.1108, over 2565806.73 frames. ], batch size: 16, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:07:13,626 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 1.975e+02 2.103e+02 2.317e+02 3.069e+02, threshold=4.207e+02, percent-clipped=0.0 2024-06-20 10:07:23,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.07 vs.
limit=15.0 2024-06-20 10:07:26,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=183562.5, ans=0.0 2024-06-20 10:07:27,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=183580.83333333334, ans=0.125 2024-06-20 10:07:33,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=183580.83333333334, ans=0.125 2024-06-20 10:07:41,859 INFO [train.py:1028] (0/2) Epoch 10, batch 9100, loss[loss=0.254, simple_loss=0.307, pruned_loss=0.1005, over 13257.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.311, pruned_loss=0.1101, over 2566906.28 frames. ], batch size: 72, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:07:53,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=183635.83333333334, ans=0.0 2024-06-20 10:07:58,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.28 vs. limit=22.5 2024-06-20 10:08:10,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.98 vs. limit=22.5 2024-06-20 10:08:15,170 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:08:18,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=183690.83333333334, ans=0.05 2024-06-20 10:08:19,880 INFO [train.py:1028] (0/2) Epoch 10, batch 9150, loss[loss=0.2648, simple_loss=0.3146, pruned_loss=0.1075, over 13135.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3114, pruned_loss=0.1103, over 2568499.29 frames. ], batch size: 77, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:08:20,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.34 vs. limit=22.5 2024-06-20 10:08:24,843 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.906e+02 2.110e+02 2.376e+02 3.577e+02, threshold=4.219e+02, percent-clipped=0.0 2024-06-20 10:08:37,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=183745.83333333334, ans=0.125 2024-06-20 10:08:51,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.44 vs. limit=6.0 2024-06-20 10:08:51,582 INFO [train.py:1028] (0/2) Epoch 10, batch 9200, loss[loss=0.2533, simple_loss=0.2974, pruned_loss=0.1046, over 12920.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3115, pruned_loss=0.1101, over 2571744.34 frames. 
], batch size: 36, lr: 5.76e-03, grad_scale: 64.0 2024-06-20 10:08:53,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=183800.83333333334, ans=0.0 2024-06-20 10:09:00,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183819.16666666666, ans=0.1 2024-06-20 10:09:03,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=183837.5, ans=0.0 2024-06-20 10:09:19,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=183874.16666666666, ans=0.025 2024-06-20 10:09:20,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183874.16666666666, ans=0.1 2024-06-20 10:09:22,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=183874.16666666666, ans=0.125 2024-06-20 10:09:23,266 INFO [train.py:1028] (0/2) Epoch 10, batch 9250, loss[loss=0.2778, simple_loss=0.3294, pruned_loss=0.1131, over 13243.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3111, pruned_loss=0.1098, over 2573244.99 frames. ], batch size: 67, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:09:28,468 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=15.0 2024-06-20 10:09:28,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.937e+02 2.120e+02 2.256e+02 2.902e+02, threshold=4.239e+02, percent-clipped=0.0 2024-06-20 10:09:29,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=183910.83333333334, ans=0.2 2024-06-20 10:09:39,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=183929.16666666666, ans=0.05 2024-06-20 10:09:45,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=183947.5, ans=0.2 2024-06-20 10:09:46,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=183947.5, ans=0.0 2024-06-20 10:09:47,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=183947.5, ans=0.125 2024-06-20 10:09:47,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=183947.5, ans=0.025 2024-06-20 10:09:55,664 INFO [train.py:1028] (0/2) Epoch 10, batch 9300, loss[loss=0.2408, simple_loss=0.2934, pruned_loss=0.09411, over 12896.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3103, pruned_loss=0.1092, over 2570635.52 frames. ], batch size: 39, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:10:00,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=183984.16666666666, ans=0.0 2024-06-20 10:10:11,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=184020.83333333334, ans=0.125 2024-06-20 10:10:27,099 INFO [train.py:1028] (0/2) Epoch 10, batch 9350, loss[loss=0.2641, simple_loss=0.3129, pruned_loss=0.1076, over 12561.00 frames. 
], tot_loss[loss=0.2641, simple_loss=0.3099, pruned_loss=0.1091, over 2567680.20 frames. ], batch size: 22, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:10:27,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=184075.83333333334, ans=0.2 2024-06-20 10:10:32,015 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.938e+02 2.096e+02 2.300e+02 2.887e+02, threshold=4.192e+02, percent-clipped=0.0 2024-06-20 10:10:33,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=184094.16666666666, ans=0.125 2024-06-20 10:10:37,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=184094.16666666666, ans=0.125 2024-06-20 10:10:39,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=184112.5, ans=0.125 2024-06-20 10:10:44,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=184112.5, ans=0.125 2024-06-20 10:10:48,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=23.59 vs. limit=22.5 2024-06-20 10:10:58,284 INFO [train.py:1028] (0/2) Epoch 10, batch 9400, loss[loss=0.2814, simple_loss=0.3282, pruned_loss=0.1173, over 13235.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3107, pruned_loss=0.1096, over 2568055.59 frames. ], batch size: 52, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:10:59,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.09 vs. limit=22.5 2024-06-20 10:11:04,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=184185.83333333334, ans=15.0 2024-06-20 10:11:04,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=184185.83333333334, ans=15.0 2024-06-20 10:11:05,593 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:11:07,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=184185.83333333334, ans=0.0 2024-06-20 10:11:09,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=184204.16666666666, ans=0.125 2024-06-20 10:11:10,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=184204.16666666666, ans=0.0 2024-06-20 10:11:11,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=184204.16666666666, ans=0.0 2024-06-20 10:11:12,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184204.16666666666, ans=0.1 2024-06-20 10:11:16,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.55 vs. 
limit=15.0 2024-06-20 10:11:18,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.29 vs. limit=22.5 2024-06-20 10:11:24,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=184240.83333333334, ans=0.0 2024-06-20 10:11:28,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=184240.83333333334, ans=0.125 2024-06-20 10:11:31,266 INFO [train.py:1028] (0/2) Epoch 10, batch 9450, loss[loss=0.2763, simple_loss=0.3157, pruned_loss=0.1185, over 12638.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3116, pruned_loss=0.1102, over 2568109.57 frames. ], batch size: 22, lr: 5.75e-03, grad_scale: 64.0 2024-06-20 10:11:35,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=184259.16666666666, ans=0.125 2024-06-20 10:11:36,176 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 1.947e+02 2.091e+02 2.298e+02 3.082e+02, threshold=4.182e+02, percent-clipped=0.0 2024-06-20 10:11:41,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184277.5, ans=0.1 2024-06-20 10:11:43,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.43 vs. limit=22.5 2024-06-20 10:11:54,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=8.08 vs. limit=8.0 2024-06-20 10:12:04,300 INFO [train.py:1028] (0/2) Epoch 10, batch 9500, loss[loss=0.2752, simple_loss=0.3242, pruned_loss=0.1131, over 13243.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3115, pruned_loss=0.1099, over 2576560.06 frames. ], batch size: 43, lr: 5.75e-03, grad_scale: 128.0 2024-06-20 10:12:06,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=184350.83333333334, ans=0.0 2024-06-20 10:12:06,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=12.0 2024-06-20 10:12:14,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=184369.16666666666, ans=0.2 2024-06-20 10:12:35,096 INFO [train.py:1028] (0/2) Epoch 10, batch 9550, loss[loss=0.2587, simple_loss=0.3082, pruned_loss=0.1046, over 12882.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3111, pruned_loss=0.1097, over 2571709.78 frames. ], batch size: 39, lr: 5.75e-03, grad_scale: 128.0 2024-06-20 10:12:35,301 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:12:40,227 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.919e+02 2.058e+02 2.238e+02 3.785e+02, threshold=4.115e+02, percent-clipped=0.0 2024-06-20 10:12:44,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.78 vs. limit=10.0 2024-06-20 10:12:46,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.67 vs. limit=15.0
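
The grad_scale printed with every batch is the automatic mixed-precision loss scale, and its movement in this log matches PyTorch GradScaler defaults: it halves when scaled gradients overflow (128.0 down to 64.0 at batch 7500 earlier) and doubles again after 2000 consecutive overflow-free steps (back to 128.0 by batch 9500 just above, 2000 batches later). A minimal sketch of that pattern, assuming the standard torch.cuda.amp usage rather than icefall's exact training loop:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # growth_interval=2000 by default

    def train_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(batch))
        scaler.scale(loss).backward()  # backward through the scaled loss
        scaler.step(optimizer)         # silently skips the step on overflow
        scaler.update()                # halves on overflow, doubles after 2000 clean steps
        return loss.detach(), scaler.get_scale()  # get_scale() is the logged grad_scale
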
limit=15.0 2024-06-20 10:12:47,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=184479.16666666666, ans=0.125 2024-06-20 10:12:47,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=184479.16666666666, ans=0.1 2024-06-20 10:12:51,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=184479.16666666666, ans=0.0 2024-06-20 10:12:54,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184497.5, ans=0.1 2024-06-20 10:13:02,906 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.84 vs. limit=15.0 2024-06-20 10:13:06,172 INFO [train.py:1028] (0/2) Epoch 10, batch 9600, loss[loss=0.2954, simple_loss=0.3251, pruned_loss=0.1329, over 10704.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.311, pruned_loss=0.1098, over 2571153.72 frames. ], batch size: 304, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:13:15,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=184552.5, ans=0.025 2024-06-20 10:13:19,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=184570.83333333334, ans=0.0 2024-06-20 10:13:26,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184589.16666666666, ans=0.1 2024-06-20 10:13:30,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=184607.5, ans=0.04949747468305833 2024-06-20 10:13:31,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=184607.5, ans=10.0 2024-06-20 10:13:31,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184607.5, ans=0.1 2024-06-20 10:13:35,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=184607.5, ans=0.0 2024-06-20 10:13:36,838 INFO [train.py:1028] (0/2) Epoch 10, batch 9650, loss[loss=0.2598, simple_loss=0.3008, pruned_loss=0.1094, over 13120.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3112, pruned_loss=0.1104, over 2560832.91 frames. 
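
A note on the loss fields in these records: in both the per-batch loss[...] and the running tot_loss[...] entries, the headline loss equals half the simple (linear additive) transducer loss plus the pruned RNN-T loss. The 0.5 weight is inferred from the arithmetic of the logged numbers, not stated by the log itself; a minimal check in Python against the "Epoch 10, batch 9650" record just above:

    # Inferred relationship (assumption: a fixed 0.5 weight on the simple loss):
    def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        return simple_loss_scale * simple_loss + pruned_loss

    # Per-batch: 0.5 * 0.3008 + 0.1094 = 0.2598
    assert abs(combined_loss(0.3008, 0.1094) - 0.2598) < 5e-4
    # Running average: 0.5 * 0.3112 + 0.1104 = 0.2660
    assert abs(combined_loss(0.3112, 0.1104) - 0.2660) < 5e-4
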
], batch size: 132, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:13:40,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=184625.83333333334, ans=0.0 2024-06-20 10:13:40,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184625.83333333334, ans=0.1 2024-06-20 10:13:41,815 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 2.022e+02 2.213e+02 2.513e+02 3.683e+02, threshold=4.426e+02, percent-clipped=0.0 2024-06-20 10:13:41,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=184625.83333333334, ans=0.0 2024-06-20 10:13:45,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=184644.16666666666, ans=0.2 2024-06-20 10:13:47,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=184644.16666666666, ans=0.2 2024-06-20 10:13:59,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=184680.83333333334, ans=0.0 2024-06-20 10:14:00,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=184680.83333333334, ans=0.0 2024-06-20 10:14:04,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184680.83333333334, ans=0.1 2024-06-20 10:14:11,882 INFO [train.py:1028] (0/2) Epoch 10, batch 9700, loss[loss=0.2482, simple_loss=0.2883, pruned_loss=0.104, over 13048.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3105, pruned_loss=0.1101, over 2554684.05 frames. ], batch size: 144, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:14:14,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2024-06-20 10:14:16,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=184717.5, ans=0.0 2024-06-20 10:14:25,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=184754.16666666666, ans=0.0 2024-06-20 10:14:39,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=184790.83333333334, ans=0.05 2024-06-20 10:14:40,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.35 vs. limit=15.0 2024-06-20 10:14:43,003 INFO [train.py:1028] (0/2) Epoch 10, batch 9750, loss[loss=0.2718, simple_loss=0.3072, pruned_loss=0.1182, over 13064.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3092, pruned_loss=0.1093, over 2552474.01 frames. ], batch size: 132, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:14:47,864 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.920e+02 2.127e+02 2.396e+02 3.268e+02, threshold=4.254e+02, percent-clipped=0.0 2024-06-20 10:14:49,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.55 vs. 
limit=12.0 2024-06-20 10:14:50,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=15.0 2024-06-20 10:14:54,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=184827.5, ans=0.0 2024-06-20 10:14:55,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184845.83333333334, ans=0.1 2024-06-20 10:14:57,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=184845.83333333334, ans=0.125 2024-06-20 10:15:06,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=184864.16666666666, ans=15.0 2024-06-20 10:15:06,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2024-06-20 10:15:13,914 INFO [train.py:1028] (0/2) Epoch 10, batch 9800, loss[loss=0.26, simple_loss=0.3121, pruned_loss=0.104, over 12853.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3093, pruned_loss=0.1092, over 2545414.90 frames. ], batch size: 39, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:15:19,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=184919.16666666666, ans=0.125 2024-06-20 10:15:23,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=184919.16666666666, ans=0.0 2024-06-20 10:15:28,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=184937.5, ans=0.025 2024-06-20 10:15:31,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.10 vs. limit=22.5 2024-06-20 10:15:39,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184974.16666666666, ans=0.1 2024-06-20 10:15:45,958 INFO [train.py:1028] (0/2) Epoch 10, batch 9850, loss[loss=0.2802, simple_loss=0.3256, pruned_loss=0.1174, over 13003.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3089, pruned_loss=0.1088, over 2537979.52 frames. 
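
The recurring Whitening lines come from modules that monitor how far a group of channels is from having a white (identity-proportional) covariance: the metric is at least 1.0, equals 1.0 exactly for a white covariance, and a corrective gradient is applied when it exceeds the scheduled limit (compare metric=17.10 vs. limit=22.5 and metric=9.19 vs. limit=15.0 above). The sketch below is one plausible form of the single-group metric, a reading of the idea rather than the exact scaling.py code; the multi-group variants (num_groups=4 in the whiten_keys records) would split the channels into blocks first.

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels); num_groups=1 case.
        # Returns d * tr(C^2) / tr(C)^2 for the covariance C, which is >= 1
        # and equals 1 exactly when C is a multiple of the identity.
        x = x - x.mean(dim=0)
        c = (x.t() @ x) / x.shape[0]
        d = c.shape[0]
        return d * (c * c).sum() / (c.trace() ** 2)
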
], batch size: 102, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:15:46,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184992.5, ans=0.1 2024-06-20 10:15:48,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184992.5, ans=0.1 2024-06-20 10:15:50,773 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 1.935e+02 2.098e+02 2.260e+02 3.496e+02, threshold=4.195e+02, percent-clipped=0.0 2024-06-20 10:16:01,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=185029.16666666666, ans=0.0 2024-06-20 10:16:06,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=185047.5, ans=0.125 2024-06-20 10:16:08,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=185047.5, ans=0.125 2024-06-20 10:16:11,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=185065.83333333334, ans=0.125 2024-06-20 10:16:12,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=185065.83333333334, ans=0.125 2024-06-20 10:16:17,465 INFO [train.py:1028] (0/2) Epoch 10, batch 9900, loss[loss=0.2557, simple_loss=0.3005, pruned_loss=0.1054, over 12982.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3077, pruned_loss=0.1087, over 2529911.31 frames. ], batch size: 39, lr: 5.74e-03, grad_scale: 128.0 2024-06-20 10:16:29,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=185120.83333333334, ans=0.125 2024-06-20 10:16:36,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=185139.16666666666, ans=0.125 2024-06-20 10:16:48,647 INFO [train.py:1028] (0/2) Epoch 10, batch 9950, loss[loss=0.2408, simple_loss=0.2873, pruned_loss=0.09717, over 12631.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3063, pruned_loss=0.1086, over 2524822.83 frames. 
], batch size: 29, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:16:53,328 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.899e+02 2.047e+02 2.266e+02 2.909e+02, threshold=4.093e+02, percent-clipped=0.0 2024-06-20 10:16:53,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=185175.83333333334, ans=0.125 2024-06-20 10:16:58,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=185194.16666666666, ans=0.125 2024-06-20 10:17:00,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=185194.16666666666, ans=0.125 2024-06-20 10:17:04,463 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:17:07,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=185230.83333333334, ans=10.0 2024-06-20 10:17:09,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=185230.83333333334, ans=0.0 2024-06-20 10:17:10,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=185230.83333333334, ans=0.2 2024-06-20 10:17:20,227 INFO [train.py:1028] (0/2) Epoch 10, batch 10000, loss[loss=0.2659, simple_loss=0.3184, pruned_loss=0.1067, over 12449.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3071, pruned_loss=0.1093, over 2487632.36 frames. ], batch size: 22, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:17:26,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2024-06-20 10:17:37,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=185304.16666666666, ans=0.0 2024-06-20 10:17:42,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.31 vs. limit=15.0 2024-06-20 10:17:44,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.76 vs. limit=22.5 2024-06-20 10:17:49,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2024-06-20 10:17:52,367 INFO [train.py:1028] (0/2) Epoch 10, batch 10050, loss[loss=0.276, simple_loss=0.3243, pruned_loss=0.1139, over 12656.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3071, pruned_loss=0.1102, over 2446123.73 frames. ], batch size: 22, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:17:53,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.52 vs. 
limit=22.5 2024-06-20 10:17:53,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=185359.16666666666, ans=0.125 2024-06-20 10:17:56,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=185359.16666666666, ans=0.025 2024-06-20 10:17:56,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.050e+02 2.263e+02 2.609e+02 3.503e+02, threshold=4.525e+02, percent-clipped=0.0 2024-06-20 10:18:09,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=185414.16666666666, ans=0.0 2024-06-20 10:18:10,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=185414.16666666666, ans=0.0 2024-06-20 10:18:22,249 INFO [train.py:1028] (0/2) Epoch 10, batch 10100, loss[loss=0.2477, simple_loss=0.2974, pruned_loss=0.09899, over 12021.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3061, pruned_loss=0.1093, over 2428927.62 frames. ], batch size: 18, lr: 5.73e-03, grad_scale: 128.0 2024-06-20 10:18:23,816 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=11.04 vs. limit=10.0 2024-06-20 10:18:36,258 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-10.pt 2024-06-20 10:20:35,865 INFO [train.py:1028] (0/2) Epoch 11, batch 0, loss[loss=0.24, simple_loss=0.2898, pruned_loss=0.09515, over 12922.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2898, pruned_loss=0.09515, over 12922.00 frames. ], batch size: 36, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:20:35,866 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 10:20:42,752 INFO [train.py:1060] (0/2) Epoch 11, validation: loss=0.199, simple_loss=0.2631, pruned_loss=0.06746, over 351949.00 frames. 2024-06-20 10:20:42,753 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 10:20:46,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=185483.83333333334, ans=0.95 2024-06-20 10:20:47,293 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2024-06-20 10:20:56,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=185520.5, ans=0.0 2024-06-20 10:20:57,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=185520.5, ans=0.125 2024-06-20 10:21:05,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=185520.5, ans=0.0 2024-06-20 10:21:05,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=185538.83333333334, ans=0.025 2024-06-20 10:21:06,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. 
limit=10.0 2024-06-20 10:21:09,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=185538.83333333334, ans=0.125 2024-06-20 10:21:12,870 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 1.810e+02 1.985e+02 2.238e+02 3.284e+02, threshold=3.969e+02, percent-clipped=0.0 2024-06-20 10:21:19,352 INFO [train.py:1028] (0/2) Epoch 11, batch 50, loss[loss=0.2578, simple_loss=0.3098, pruned_loss=0.1029, over 12753.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.2873, pruned_loss=0.1009, over 574433.06 frames. ], batch size: 29, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:21:22,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185575.5, ans=0.1 2024-06-20 10:21:24,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2024-06-20 10:21:43,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=185630.5, ans=0.0 2024-06-20 10:21:43,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=185630.5, ans=0.125 2024-06-20 10:21:51,134 INFO [train.py:1028] (0/2) Epoch 11, batch 100, loss[loss=0.2202, simple_loss=0.2755, pruned_loss=0.0824, over 13330.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.2864, pruned_loss=0.1001, over 1017962.98 frames. ], batch size: 46, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:22:10,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. limit=6.0 2024-06-20 10:22:13,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=185722.16666666666, ans=0.125 2024-06-20 10:22:15,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=185722.16666666666, ans=0.0 2024-06-20 10:22:19,395 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.879e+02 2.057e+02 2.246e+02 2.900e+02, threshold=4.113e+02, percent-clipped=0.0 2024-06-20 10:22:24,952 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0 2024-06-20 10:22:25,863 INFO [train.py:1028] (0/2) Epoch 11, batch 150, loss[loss=0.2277, simple_loss=0.2731, pruned_loss=0.09112, over 12748.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2852, pruned_loss=0.09872, over 1365035.50 frames. 
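
The ScheduledFloat lines report hyperparameters (skip and dropout probabilities, balancer and whitening limits) that are not constants but piecewise-linear functions of a batch counter; ans= is the schedule's current value. The batch_count itself (about 1.856e5 here) is roughly 1.8x the raw batch index implied by the epoch boundaries, so it is presumably a duration-adjusted count; that reading is an inference from the numbers, not something the log states. A minimal sketch of such a schedule, with illustrative breakpoints only:

    def scheduled_float(batch_count, points):
        # points: sorted (batch_count, value) pairs; linear interpolation
        # in between, clamped at both ends.
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return points[-1][1]

    # A rate decaying from 0.25 to 0.125 has long since bottomed out here,
    # consistent with the many ans=0.125 values in these records:
    assert scheduled_float(185795.5, [(0.0, 0.25), (20000.0, 0.125)]) == 0.125
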
], batch size: 29, lr: 5.47e-03, grad_scale: 128.0 2024-06-20 10:22:39,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=185795.5, ans=0.125 2024-06-20 10:22:42,587 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:22:42,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=185795.5, ans=0.025 2024-06-20 10:22:45,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=185813.83333333334, ans=0.125 2024-06-20 10:22:49,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185813.83333333334, ans=0.125 2024-06-20 10:22:51,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2024-06-20 10:22:55,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=185832.16666666666, ans=0.0 2024-06-20 10:22:55,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185832.16666666666, ans=0.125 2024-06-20 10:23:00,938 INFO [train.py:1028] (0/2) Epoch 11, batch 200, loss[loss=0.2682, simple_loss=0.2979, pruned_loss=0.1192, over 12577.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2851, pruned_loss=0.09875, over 1635377.44 frames. ], batch size: 202, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:23:02,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.52 vs. limit=22.5 2024-06-20 10:23:17,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=185887.16666666666, ans=0.125 2024-06-20 10:23:18,888 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.20 vs. limit=15.0 2024-06-20 10:23:22,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=185905.5, ans=0.125 2024-06-20 10:23:26,995 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 1.759e+02 1.866e+02 2.012e+02 2.445e+02, threshold=3.731e+02, percent-clipped=0.0 2024-06-20 10:23:33,577 INFO [train.py:1028] (0/2) Epoch 11, batch 250, loss[loss=0.2331, simple_loss=0.268, pruned_loss=0.09916, over 12992.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2848, pruned_loss=0.09854, over 1846769.80 frames. ], batch size: 144, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:23:41,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=185960.5, ans=0.04949747468305833 2024-06-20 10:23:46,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.66 vs. 
limit=12.0 2024-06-20 10:23:46,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=185978.83333333334, ans=0.125 2024-06-20 10:23:49,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185978.83333333334, ans=0.1 2024-06-20 10:23:51,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.75 vs. limit=22.5 2024-06-20 10:24:04,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2024-06-20 10:24:08,574 INFO [train.py:1028] (0/2) Epoch 11, batch 300, loss[loss=0.2337, simple_loss=0.2699, pruned_loss=0.09874, over 13178.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2862, pruned_loss=0.09912, over 2009909.09 frames. ], batch size: 112, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:24:15,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=186052.16666666666, ans=0.0 2024-06-20 10:24:17,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=186052.16666666666, ans=0.125 2024-06-20 10:24:34,184 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.448e+02 1.791e+02 1.930e+02 2.071e+02 2.838e+02, threshold=3.859e+02, percent-clipped=0.0 2024-06-20 10:24:40,441 INFO [train.py:1028] (0/2) Epoch 11, batch 350, loss[loss=0.2426, simple_loss=0.2916, pruned_loss=0.09681, over 13077.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2862, pruned_loss=0.09912, over 2139527.13 frames. ], batch size: 33, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:24:42,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2024-06-20 10:24:43,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=186125.5, ans=0.125 2024-06-20 10:24:48,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=186143.83333333334, ans=0.0 2024-06-20 10:24:52,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=186143.83333333334, ans=0.125 2024-06-20 10:25:01,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0 2024-06-20 10:25:04,980 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.70 vs. 
limit=15.0 2024-06-20 10:25:09,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=186198.83333333334, ans=0.125 2024-06-20 10:25:12,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=186198.83333333334, ans=0.1 2024-06-20 10:25:12,960 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.267e+01 2024-06-20 10:25:15,884 INFO [train.py:1028] (0/2) Epoch 11, batch 400, loss[loss=0.2272, simple_loss=0.2736, pruned_loss=0.09043, over 13256.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.2858, pruned_loss=0.09873, over 2239862.84 frames. ], batch size: 63, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:25:16,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=186217.16666666666, ans=0.125 2024-06-20 10:25:30,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186253.83333333334, ans=0.1 2024-06-20 10:25:33,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=186253.83333333334, ans=0.2 2024-06-20 10:25:41,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=22.5 2024-06-20 10:25:41,758 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.798e+02 1.897e+02 2.080e+02 3.041e+02, threshold=3.795e+02, percent-clipped=0.0 2024-06-20 10:25:47,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=186290.5, ans=0.05 2024-06-20 10:25:48,418 INFO [train.py:1028] (0/2) Epoch 11, batch 450, loss[loss=0.2199, simple_loss=0.272, pruned_loss=0.08394, over 13179.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.2852, pruned_loss=0.09822, over 2314415.75 frames. ], batch size: 67, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:25:48,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=186308.83333333334, ans=0.1 2024-06-20 10:25:54,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=186327.16666666666, ans=0.0 2024-06-20 10:26:10,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.09 vs. limit=22.5 2024-06-20 10:26:17,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=186382.16666666666, ans=0.0 2024-06-20 10:26:24,188 INFO [train.py:1028] (0/2) Epoch 11, batch 500, loss[loss=0.2271, simple_loss=0.2691, pruned_loss=0.09253, over 13113.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2856, pruned_loss=0.09824, over 2376724.09 frames. 
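
The grad_scale field is the dynamic fp16 loss-scaling factor: it doubled from 64.0 to 128.0 near the top of this excerpt after a stretch of overflow-free batches, and it falls back to 64.0 a little further below. A hedged sketch of that kind of rule, in the style of torch.cuda.amp.GradScaler and using its default constants; the training code's own update logic may differ:

    def update_grad_scale(scale, found_inf, steps_since_growth,
                          growth_factor=2.0, backoff_factor=0.5,
                          growth_interval=2000):
        # Halve the scale on overflow; double it after growth_interval
        # consecutive clean steps.
        if found_inf:
            return scale * backoff_factor, 0
        steps_since_growth += 1
        if steps_since_growth >= growth_interval:
            return scale * growth_factor, 0
        return scale, steps_since_growth
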
], batch size: 121, lr: 5.46e-03, grad_scale: 128.0 2024-06-20 10:26:25,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=186400.5, ans=0.125 2024-06-20 10:26:45,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=186455.5, ans=0.2 2024-06-20 10:26:49,921 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.839e+02 2.032e+02 2.323e+02 3.466e+02, threshold=4.064e+02, percent-clipped=0.0 2024-06-20 10:26:50,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=186473.83333333334, ans=0.2 2024-06-20 10:26:58,817 INFO [train.py:1028] (0/2) Epoch 11, batch 550, loss[loss=0.2416, simple_loss=0.2767, pruned_loss=0.1032, over 12951.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.2852, pruned_loss=0.09813, over 2421618.99 frames. ], batch size: 158, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:27:08,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=186510.5, ans=0.125 2024-06-20 10:27:13,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=186528.83333333334, ans=0.0 2024-06-20 10:27:30,202 INFO [train.py:1028] (0/2) Epoch 11, batch 600, loss[loss=0.2537, simple_loss=0.2912, pruned_loss=0.1081, over 13088.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2848, pruned_loss=0.09786, over 2459236.43 frames. ], batch size: 144, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:27:39,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=186602.16666666666, ans=0.0 2024-06-20 10:27:42,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=186620.5, ans=0.2 2024-06-20 10:27:55,706 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.861e+02 2.031e+02 2.323e+02 3.478e+02, threshold=4.063e+02, percent-clipped=0.0 2024-06-20 10:28:02,309 INFO [train.py:1028] (0/2) Epoch 11, batch 650, loss[loss=0.24, simple_loss=0.2921, pruned_loss=0.09392, over 13226.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2841, pruned_loss=0.09719, over 2490421.95 frames. ], batch size: 59, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:28:03,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=186675.5, ans=0.125 2024-06-20 10:28:13,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0 2024-06-20 10:28:22,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.91 vs. limit=15.0 2024-06-20 10:28:22,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.87 vs. limit=22.5 2024-06-20 10:28:30,469 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=12.0 2024-06-20 10:28:38,118 INFO [train.py:1028] (0/2) Epoch 11, batch 700, loss[loss=0.2523, simple_loss=0.2964, pruned_loss=0.1041, over 13235.00 frames. 
], tot_loss[loss=0.2405, simple_loss=0.285, pruned_loss=0.09802, over 2512865.90 frames. ], batch size: 46, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:28:41,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186767.16666666666, ans=0.1 2024-06-20 10:28:45,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=186785.5, ans=0.0 2024-06-20 10:28:45,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186785.5, ans=0.1 2024-06-20 10:28:52,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=186803.83333333334, ans=0.0 2024-06-20 10:29:06,595 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.834e+02 1.996e+02 2.197e+02 3.613e+02, threshold=3.993e+02, percent-clipped=0.0 2024-06-20 10:29:12,918 INFO [train.py:1028] (0/2) Epoch 11, batch 750, loss[loss=0.2297, simple_loss=0.2825, pruned_loss=0.0884, over 13274.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2851, pruned_loss=0.09783, over 2527319.28 frames. ], batch size: 63, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:29:15,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=186858.83333333334, ans=0.125 2024-06-20 10:29:22,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=186877.16666666666, ans=0.0 2024-06-20 10:29:26,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=186895.5, ans=0.0 2024-06-20 10:29:37,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.40 vs. limit=22.5 2024-06-20 10:29:38,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186932.16666666666, ans=0.1 2024-06-20 10:29:39,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.45 vs. limit=15.0 2024-06-20 10:29:45,076 INFO [train.py:1028] (0/2) Epoch 11, batch 800, loss[loss=0.227, simple_loss=0.2722, pruned_loss=0.0909, over 12970.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.2849, pruned_loss=0.09768, over 2540201.65 frames. ], batch size: 36, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:29:46,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.27 vs. limit=22.5 2024-06-20 10:29:53,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.23 vs. limit=22.5 2024-06-20 10:29:54,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.59 vs. limit=15.0 2024-06-20 10:30:11,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.66 vs. 
limit=15.0 2024-06-20 10:30:13,972 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.759e+02 1.880e+02 2.028e+02 2.505e+02, threshold=3.760e+02, percent-clipped=0.0 2024-06-20 10:30:21,054 INFO [train.py:1028] (0/2) Epoch 11, batch 850, loss[loss=0.2322, simple_loss=0.2739, pruned_loss=0.09521, over 13131.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.2843, pruned_loss=0.09732, over 2550566.25 frames. ], batch size: 95, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:30:24,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=187042.16666666666, ans=0.0 2024-06-20 10:30:25,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=187042.16666666666, ans=10.0 2024-06-20 10:30:26,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.45 vs. limit=15.0 2024-06-20 10:30:33,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2024-06-20 10:30:35,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187078.83333333334, ans=0.1 2024-06-20 10:30:39,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=187078.83333333334, ans=0.95 2024-06-20 10:30:43,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.28 vs. limit=22.5 2024-06-20 10:30:43,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=187097.16666666666, ans=10.0 2024-06-20 10:30:51,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=187115.5, ans=0.05 2024-06-20 10:30:54,393 INFO [train.py:1028] (0/2) Epoch 11, batch 900, loss[loss=0.2388, simple_loss=0.2937, pruned_loss=0.09196, over 12988.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.2841, pruned_loss=0.09741, over 2555656.89 frames. ], batch size: 36, lr: 5.45e-03, grad_scale: 128.0 2024-06-20 10:31:01,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=12.0 2024-06-20 10:31:12,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=187170.5, ans=22.5 2024-06-20 10:31:16,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=187170.5, ans=0.5 2024-06-20 10:31:16,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=187170.5, ans=0.0 2024-06-20 10:31:24,685 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.811e+02 1.950e+02 2.150e+02 2.737e+02, threshold=3.899e+02, percent-clipped=0.0 2024-06-20 10:31:25,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.79 vs. 
limit=10.0 2024-06-20 10:31:31,516 INFO [train.py:1028] (0/2) Epoch 11, batch 950, loss[loss=0.23, simple_loss=0.2832, pruned_loss=0.08843, over 12929.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2844, pruned_loss=0.09755, over 2558566.30 frames. ], batch size: 39, lr: 5.44e-03, grad_scale: 128.0 2024-06-20 10:31:34,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=187225.5, ans=0.0 2024-06-20 10:31:48,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=187262.16666666666, ans=0.0 2024-06-20 10:31:56,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=187280.5, ans=0.05 2024-06-20 10:31:57,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=15.0 2024-06-20 10:32:03,882 INFO [train.py:1028] (0/2) Epoch 11, batch 1000, loss[loss=0.2387, simple_loss=0.2788, pruned_loss=0.09932, over 13107.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2841, pruned_loss=0.09766, over 2561480.87 frames. ], batch size: 48, lr: 5.44e-03, grad_scale: 128.0 2024-06-20 10:32:05,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=187317.16666666666, ans=0.1 2024-06-20 10:32:06,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=187317.16666666666, ans=0.2 2024-06-20 10:32:20,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=187353.83333333334, ans=0.125 2024-06-20 10:32:23,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=187353.83333333334, ans=0.5 2024-06-20 10:32:23,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=187353.83333333334, ans=0.0 2024-06-20 10:32:32,506 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.857e+02 2.015e+02 2.276e+02 3.043e+02, threshold=4.029e+02, percent-clipped=0.0 2024-06-20 10:32:38,415 INFO [train.py:1028] (0/2) Epoch 11, batch 1050, loss[loss=0.2086, simple_loss=0.2619, pruned_loss=0.07768, over 13131.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2841, pruned_loss=0.09751, over 2564951.02 frames. ], batch size: 77, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:32:38,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=187408.83333333334, ans=0.0 2024-06-20 10:32:39,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.30 vs. 
limit=22.5 2024-06-20 10:32:41,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187408.83333333334, ans=0.125 2024-06-20 10:32:41,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=187408.83333333334, ans=10.0 2024-06-20 10:32:43,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=187408.83333333334, ans=0.0 2024-06-20 10:32:44,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=187427.16666666666, ans=0.125 2024-06-20 10:32:45,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=187427.16666666666, ans=0.125 2024-06-20 10:32:47,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=187427.16666666666, ans=0.125 2024-06-20 10:32:50,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=15.0 2024-06-20 10:32:54,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=187445.5, ans=0.0 2024-06-20 10:32:57,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=15.0 2024-06-20 10:33:10,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.10 vs. limit=10.0 2024-06-20 10:33:14,481 INFO [train.py:1028] (0/2) Epoch 11, batch 1100, loss[loss=0.2366, simple_loss=0.2835, pruned_loss=0.09487, over 13309.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2849, pruned_loss=0.09797, over 2570732.04 frames. ], batch size: 52, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:33:18,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=187500.5, ans=0.2 2024-06-20 10:33:23,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.83 vs. limit=22.5 2024-06-20 10:33:25,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=187518.83333333334, ans=0.125 2024-06-20 10:33:27,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=187537.16666666666, ans=0.1 2024-06-20 10:33:29,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=187537.16666666666, ans=0.125 2024-06-20 10:33:29,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=187537.16666666666, ans=0.125 2024-06-20 10:33:30,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.60 vs. limit=15.0 2024-06-20 10:33:32,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.68 vs. 
limit=15.0 2024-06-20 10:33:33,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=187555.5, ans=0.125 2024-06-20 10:33:40,733 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.772e+02 1.919e+02 2.075e+02 2.696e+02, threshold=3.838e+02, percent-clipped=0.0 2024-06-20 10:33:42,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=187573.83333333334, ans=0.125 2024-06-20 10:33:46,358 INFO [train.py:1028] (0/2) Epoch 11, batch 1150, loss[loss=0.2717, simple_loss=0.3163, pruned_loss=0.1135, over 13259.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.285, pruned_loss=0.09831, over 2572585.58 frames. ], batch size: 52, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:33:50,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.64 vs. limit=22.5 2024-06-20 10:34:08,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=187647.16666666666, ans=0.2 2024-06-20 10:34:21,239 INFO [train.py:1028] (0/2) Epoch 11, batch 1200, loss[loss=0.2191, simple_loss=0.2681, pruned_loss=0.08501, over 13132.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2847, pruned_loss=0.09831, over 2574418.49 frames. ], batch size: 77, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:34:27,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=187702.16666666666, ans=0.125 2024-06-20 10:34:28,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=187702.16666666666, ans=0.125 2024-06-20 10:34:32,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187702.16666666666, ans=0.125 2024-06-20 10:34:34,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=187720.5, ans=0.025 2024-06-20 10:34:45,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=187738.83333333334, ans=0.2 2024-06-20 10:34:47,780 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.801e+02 1.947e+02 2.075e+02 2.851e+02, threshold=3.895e+02, percent-clipped=0.0 2024-06-20 10:34:53,326 INFO [train.py:1028] (0/2) Epoch 11, batch 1250, loss[loss=0.2407, simple_loss=0.2762, pruned_loss=0.1026, over 13173.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2841, pruned_loss=0.09789, over 2584161.73 frames. 
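
The tot_loss[... over N frames] entries are decaying averages rather than plain epoch totals, which is why the frame counts are fractional and plateau: downweighting the accumulated statistics by a factor like 1 - 1/200 per batch makes the frame total settle near 200x the per-batch frame count (about 13,000 x 200 = 2.6e6, matching the ~2.58e6 seen here), and it also fits the ramp after the epoch-11 reset (574,433 frames at batch 50, ~1.02e6 at batch 100). The 1/200 window is back-solved from those numbers, not stated in the log; a sketch:

    def update_tot_stats(tot_frames, tot_weighted_loss, batch_frames,
                         batch_weighted_loss, window=200):
        # Decay the old statistics, add the new batch; the reported loss is
        # the weighted-loss / frames ratio. window=200 is inferred, see above.
        decay = 1.0 - 1.0 / window
        tot_frames = tot_frames * decay + batch_frames
        tot_weighted_loss = tot_weighted_loss * decay + batch_weighted_loss
        return tot_frames, tot_weighted_loss, tot_weighted_loss / tot_frames
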
], batch size: 112, lr: 5.44e-03, grad_scale: 64.0 2024-06-20 10:34:58,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=187775.5, ans=0.025 2024-06-20 10:35:14,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=187830.5, ans=0.0 2024-06-20 10:35:19,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=187830.5, ans=0.125 2024-06-20 10:35:21,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=187848.83333333334, ans=0.2 2024-06-20 10:35:28,013 INFO [train.py:1028] (0/2) Epoch 11, batch 1300, loss[loss=0.2649, simple_loss=0.3014, pruned_loss=0.1142, over 12742.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2849, pruned_loss=0.09815, over 2583766.43 frames. ], batch size: 176, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:35:29,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=187867.16666666666, ans=0.1 2024-06-20 10:35:36,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=187885.5, ans=0.125 2024-06-20 10:35:36,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.92 vs. limit=12.0 2024-06-20 10:35:37,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=187885.5, ans=0.125 2024-06-20 10:35:43,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=187903.83333333334, ans=0.2 2024-06-20 10:35:54,580 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.806e+02 1.912e+02 2.070e+02 2.814e+02, threshold=3.824e+02, percent-clipped=0.0 2024-06-20 10:35:56,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=187940.5, ans=0.125 2024-06-20 10:36:00,766 INFO [train.py:1028] (0/2) Epoch 11, batch 1350, loss[loss=0.2185, simple_loss=0.2698, pruned_loss=0.08356, over 13241.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.2849, pruned_loss=0.09802, over 2585736.09 frames. ], batch size: 59, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:36:07,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=12.0 2024-06-20 10:36:09,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=12.0 2024-06-20 10:36:24,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.97 vs. 
limit=6.0 2024-06-20 10:36:27,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=188013.83333333334, ans=0.1 2024-06-20 10:36:29,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=188013.83333333334, ans=0.1 2024-06-20 10:36:31,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.18 vs. limit=10.0 2024-06-20 10:36:35,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=188032.16666666666, ans=0.025 2024-06-20 10:36:37,678 INFO [train.py:1028] (0/2) Epoch 11, batch 1400, loss[loss=0.2593, simple_loss=0.3052, pruned_loss=0.1068, over 12222.00 frames. ], tot_loss[loss=0.24, simple_loss=0.2845, pruned_loss=0.09773, over 2586704.11 frames. ], batch size: 25, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:36:38,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188050.5, ans=0.1 2024-06-20 10:36:43,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=188050.5, ans=0.125 2024-06-20 10:36:45,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=188068.83333333334, ans=0.025 2024-06-20 10:36:54,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=188087.16666666666, ans=0.125 2024-06-20 10:37:00,659 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:37:04,414 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.835e+02 1.959e+02 2.201e+02 3.148e+02, threshold=3.917e+02, percent-clipped=0.0 2024-06-20 10:37:05,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.31 vs. limit=15.0 2024-06-20 10:37:11,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=188123.83333333334, ans=0.95 2024-06-20 10:37:15,178 INFO [train.py:1028] (0/2) Epoch 11, batch 1450, loss[loss=0.236, simple_loss=0.2741, pruned_loss=0.09894, over 13098.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2844, pruned_loss=0.0977, over 2586574.67 frames. ], batch size: 121, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:37:26,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188160.5, ans=0.1 2024-06-20 10:37:36,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=188197.16666666666, ans=0.0 2024-06-20 10:37:40,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=188197.16666666666, ans=0.0 2024-06-20 10:37:42,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=188215.5, ans=0.2 2024-06-20 10:37:48,543 INFO [train.py:1028] (0/2) Epoch 11, batch 1500, loss[loss=0.2622, simple_loss=0.3056, pruned_loss=0.1094, over 13225.00 frames. 
], tot_loss[loss=0.24, simple_loss=0.2844, pruned_loss=0.09783, over 2589078.30 frames. ], batch size: 83, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:37:54,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=188252.16666666666, ans=0.125 2024-06-20 10:37:54,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=188252.16666666666, ans=0.1 2024-06-20 10:38:04,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2024-06-20 10:38:06,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=188270.5, ans=0.125 2024-06-20 10:38:09,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=188288.83333333334, ans=0.0 2024-06-20 10:38:17,896 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.799e+02 1.888e+02 2.058e+02 2.888e+02, threshold=3.776e+02, percent-clipped=0.0 2024-06-20 10:38:23,769 INFO [train.py:1028] (0/2) Epoch 11, batch 1550, loss[loss=0.2472, simple_loss=0.2845, pruned_loss=0.105, over 13029.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2845, pruned_loss=0.09792, over 2584431.04 frames. ], batch size: 102, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:38:24,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=188325.5, ans=10.0 2024-06-20 10:38:27,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=188325.5, ans=0.125 2024-06-20 10:38:44,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=188380.5, ans=0.125 2024-06-20 10:38:50,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188398.83333333334, ans=0.1 2024-06-20 10:38:55,959 INFO [train.py:1028] (0/2) Epoch 11, batch 1600, loss[loss=0.2284, simple_loss=0.2766, pruned_loss=0.0901, over 13161.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2846, pruned_loss=0.098, over 2579495.75 frames. ], batch size: 77, lr: 5.43e-03, grad_scale: 64.0 2024-06-20 10:39:11,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188453.83333333334, ans=0.1 2024-06-20 10:39:21,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2024-06-20 10:39:24,744 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.840e+02 1.996e+02 2.166e+02 3.343e+02, threshold=3.991e+02, percent-clipped=0.0 2024-06-20 10:39:28,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=188490.5, ans=0.125 2024-06-20 10:39:30,671 INFO [train.py:1028] (0/2) Epoch 11, batch 1650, loss[loss=0.237, simple_loss=0.2799, pruned_loss=0.097, over 13187.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.285, pruned_loss=0.09845, over 2575227.81 frames. 
2024-06-20 10:39:33,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188508.83333333334, ans=0.1
2024-06-20 10:39:36,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=188508.83333333334, ans=0.0
2024-06-20 10:39:39,572 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.23 vs. limit=15.0
2024-06-20 10:39:45,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.52 vs. limit=15.0
2024-06-20 10:40:03,405 INFO [train.py:1028] (0/2) Epoch 11, batch 1700, loss[loss=0.2489, simple_loss=0.2971, pruned_loss=0.1003, over 12380.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2855, pruned_loss=0.0985, over 2580149.93 frames. ], batch size: 25, lr: 5.42e-03, grad_scale: 64.0
2024-06-20 10:40:03,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=188600.5, ans=0.0
2024-06-20 10:40:08,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.52 vs. limit=10.0
2024-06-20 10:40:09,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=188618.83333333334, ans=0.125
2024-06-20 10:40:14,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.01 vs. limit=12.0
2024-06-20 10:40:28,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. limit=6.0
2024-06-20 10:40:29,906 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.95 vs. limit=22.5
2024-06-20 10:40:34,640 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.768e+02 1.843e+02 2.059e+02 3.009e+02, threshold=3.686e+02, percent-clipped=0.0
2024-06-20 10:40:39,861 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.26 vs. limit=15.0
2024-06-20 10:40:40,769 INFO [train.py:1028] (0/2) Epoch 11, batch 1750, loss[loss=0.2389, simple_loss=0.2902, pruned_loss=0.09383, over 12516.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2856, pruned_loss=0.09818, over 2581087.77 frames. ], batch size: 22, lr: 5.42e-03, grad_scale: 64.0
2024-06-20 10:40:41,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=188692.16666666666, ans=0.125
2024-06-20 10:41:01,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=188747.16666666666, ans=0.0
2024-06-20 10:41:07,965 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 10:41:15,742 INFO [train.py:1028] (0/2) Epoch 11, batch 1800, loss[loss=0.2582, simple_loss=0.3054, pruned_loss=0.1055, over 13232.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2852, pruned_loss=0.09813, over 2582034.76 frames. ], batch size: 67, lr: 5.42e-03, grad_scale: 64.0
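The ScheduledFloat entries above record hyperparameters (dropout probabilities, skip rates, balancer limits, bypass scales) that are functions of the global batch count rather than constants: each line prints the parameter's dotted name, the batch_count at which it was evaluated, and its current value (ans). A minimal sketch of one way to implement such a schedule, as a piecewise-linear function of batch_count, follows; the class and the example breakpoints are illustrative assumptions, far simpler than the real scaling.py machinery.

```python
import bisect


class ScheduledFloat:
    """Sketch of a float hyperparameter scheduled on the global batch
    count: piecewise-linear between (batch_count, value) breakpoints,
    held constant outside their range. Illustrative only."""

    def __init__(self, *points, name: str = "scheduled_float"):
        # breakpoints must be sorted by batch count
        self.name = name
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)


# e.g. a dropout probability annealed from 0.3 to 0.1 over the first
# 20k batches (hypothetical breakpoints) has long since flattened out
# at the batch counts in this log:
dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1), name="out_proj.dropout_p")
print(dropout_p.name, dropout_p.value(188508.8))  # -> 0.1, as in ans=0.1
```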
], batch size: 67, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:41:21,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=188783.83333333334, ans=0.125 2024-06-20 10:41:40,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=15.0 2024-06-20 10:41:42,665 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.882e+02 2.024e+02 2.262e+02 3.124e+02, threshold=4.047e+02, percent-clipped=0.0 2024-06-20 10:41:48,465 INFO [train.py:1028] (0/2) Epoch 11, batch 1850, loss[loss=0.2314, simple_loss=0.2709, pruned_loss=0.09596, over 13224.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.2854, pruned_loss=0.09811, over 2583166.21 frames. ], batch size: 83, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:41:51,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.47 vs. limit=12.0 2024-06-20 10:41:53,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.90 vs. limit=12.0 2024-06-20 10:41:59,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=188893.83333333334, ans=0.0 2024-06-20 10:42:15,587 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:42:24,403 INFO [train.py:1028] (0/2) Epoch 11, batch 1900, loss[loss=0.2424, simple_loss=0.2815, pruned_loss=0.1016, over 13211.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2849, pruned_loss=0.09788, over 2587051.08 frames. ], batch size: 95, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:42:32,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=188985.5, ans=0.125 2024-06-20 10:42:36,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=188985.5, ans=0.09899494936611666 2024-06-20 10:42:39,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2024-06-20 10:42:41,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=189003.83333333334, ans=0.2 2024-06-20 10:42:42,739 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2024-06-20 10:42:51,051 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:42:51,470 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.805e+02 1.918e+02 2.088e+02 2.984e+02, threshold=3.837e+02, percent-clipped=0.0 2024-06-20 10:42:56,194 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=7.828e+00 2024-06-20 10:42:57,365 INFO [train.py:1028] (0/2) Epoch 11, batch 1950, loss[loss=0.2533, simple_loss=0.3009, pruned_loss=0.1029, over 13251.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2848, pruned_loss=0.098, over 2592855.72 frames. 
], batch size: 52, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:42:58,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=189058.83333333334, ans=0.125 2024-06-20 10:43:08,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189077.16666666666, ans=0.125 2024-06-20 10:43:14,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=189095.5, ans=0.125 2024-06-20 10:43:17,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.14 vs. limit=22.5 2024-06-20 10:43:18,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=189095.5, ans=0.0 2024-06-20 10:43:22,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.36 vs. limit=22.5 2024-06-20 10:43:23,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=189113.83333333334, ans=0.125 2024-06-20 10:43:32,164 INFO [train.py:1028] (0/2) Epoch 11, batch 2000, loss[loss=0.2373, simple_loss=0.2847, pruned_loss=0.09498, over 12517.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2849, pruned_loss=0.09818, over 2588224.31 frames. ], batch size: 22, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:43:34,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2024-06-20 10:43:41,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=189168.83333333334, ans=0.0 2024-06-20 10:43:43,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=189168.83333333334, ans=0.125 2024-06-20 10:43:44,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=189187.16666666666, ans=0.0 2024-06-20 10:43:47,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=189187.16666666666, ans=0.125 2024-06-20 10:43:50,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=189187.16666666666, ans=0.125 2024-06-20 10:43:54,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.97 vs. limit=10.0 2024-06-20 10:43:59,062 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.867e+02 1.988e+02 2.158e+02 2.974e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 10:44:04,772 INFO [train.py:1028] (0/2) Epoch 11, batch 2050, loss[loss=0.2301, simple_loss=0.2776, pruned_loss=0.09128, over 12573.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2844, pruned_loss=0.09795, over 2583807.54 frames. 
], batch size: 29, lr: 5.42e-03, grad_scale: 64.0 2024-06-20 10:44:05,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=189242.16666666666, ans=0.125 2024-06-20 10:44:06,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189242.16666666666, ans=0.1 2024-06-20 10:44:06,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=189242.16666666666, ans=0.2 2024-06-20 10:44:13,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=189260.5, ans=0.0 2024-06-20 10:44:15,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=189260.5, ans=0.125 2024-06-20 10:44:22,966 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:44:25,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=189278.83333333334, ans=0.125 2024-06-20 10:44:26,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=189297.16666666666, ans=0.125 2024-06-20 10:44:30,483 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:44:37,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=189315.5, ans=0.125 2024-06-20 10:44:38,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=189333.83333333334, ans=0.125 2024-06-20 10:44:39,115 INFO [train.py:1028] (0/2) Epoch 11, batch 2100, loss[loss=0.238, simple_loss=0.2875, pruned_loss=0.09427, over 13171.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2846, pruned_loss=0.09758, over 2586995.97 frames. ], batch size: 59, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:44:44,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=189333.83333333334, ans=0.07 2024-06-20 10:44:45,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=189352.16666666666, ans=0.125 2024-06-20 10:44:50,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189352.16666666666, ans=0.1 2024-06-20 10:44:53,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=189370.5, ans=0.0 2024-06-20 10:44:55,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=189370.5, ans=0.125 2024-06-20 10:44:56,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.87 vs. 
limit=22.5 2024-06-20 10:45:00,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=189388.83333333334, ans=0.125 2024-06-20 10:45:08,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.802e+02 1.959e+02 2.149e+02 2.935e+02, threshold=3.918e+02, percent-clipped=0.0 2024-06-20 10:45:12,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2024-06-20 10:45:12,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.07 vs. limit=15.0 2024-06-20 10:45:14,706 INFO [train.py:1028] (0/2) Epoch 11, batch 2150, loss[loss=0.2008, simple_loss=0.2579, pruned_loss=0.07184, over 13235.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2846, pruned_loss=0.09727, over 2588942.79 frames. ], batch size: 52, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:45:41,313 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:45:43,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=189498.83333333334, ans=0.1 2024-06-20 10:45:47,663 INFO [train.py:1028] (0/2) Epoch 11, batch 2200, loss[loss=0.232, simple_loss=0.2725, pruned_loss=0.09576, over 13229.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2845, pruned_loss=0.09734, over 2589160.66 frames. ], batch size: 83, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:45:57,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=189535.5, ans=0.2 2024-06-20 10:45:58,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=189535.5, ans=0.2 2024-06-20 10:46:14,554 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.823e+02 1.935e+02 2.111e+02 2.773e+02, threshold=3.870e+02, percent-clipped=0.0 2024-06-20 10:46:20,334 INFO [train.py:1028] (0/2) Epoch 11, batch 2250, loss[loss=0.2559, simple_loss=0.3038, pruned_loss=0.104, over 13304.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.2838, pruned_loss=0.09701, over 2587931.02 frames. ], batch size: 63, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:46:38,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.42 vs. limit=22.5 2024-06-20 10:46:38,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=189645.5, ans=0.125 2024-06-20 10:46:49,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=189682.16666666666, ans=0.125 2024-06-20 10:46:49,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0 2024-06-20 10:46:53,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=189682.16666666666, ans=0.1 2024-06-20 10:46:56,468 INFO [train.py:1028] (0/2) Epoch 11, batch 2300, loss[loss=0.2464, simple_loss=0.2984, pruned_loss=0.0972, over 12931.00 frames. 
], tot_loss[loss=0.2384, simple_loss=0.2836, pruned_loss=0.0966, over 2582673.47 frames. ], batch size: 33, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:47:25,820 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.835e+02 1.966e+02 2.134e+02 3.036e+02, threshold=3.933e+02, percent-clipped=0.0 2024-06-20 10:47:30,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=189792.16666666666, ans=0.05 2024-06-20 10:47:31,460 INFO [train.py:1028] (0/2) Epoch 11, batch 2350, loss[loss=0.2336, simple_loss=0.2808, pruned_loss=0.09319, over 13245.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.2835, pruned_loss=0.09658, over 2585971.57 frames. ], batch size: 67, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:47:32,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=189792.16666666666, ans=0.025 2024-06-20 10:47:38,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=189810.5, ans=0.125 2024-06-20 10:47:40,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.97 vs. limit=22.5 2024-06-20 10:47:47,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.21 vs. limit=15.0 2024-06-20 10:47:47,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=15.0 2024-06-20 10:47:54,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=189847.16666666666, ans=0.2 2024-06-20 10:47:59,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=189865.5, ans=0.0 2024-06-20 10:48:01,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.73 vs. limit=12.0 2024-06-20 10:48:03,716 INFO [train.py:1028] (0/2) Epoch 11, batch 2400, loss[loss=0.2332, simple_loss=0.2833, pruned_loss=0.09149, over 13308.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2827, pruned_loss=0.09639, over 2589628.80 frames. ], batch size: 46, lr: 5.41e-03, grad_scale: 64.0 2024-06-20 10:48:03,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189883.83333333334, ans=0.125 2024-06-20 10:48:05,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=189883.83333333334, ans=0.5 2024-06-20 10:48:21,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=189920.5, ans=0.0 2024-06-20 10:48:22,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=189920.5, ans=0.125 2024-06-20 10:48:23,734 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.01 vs. 
limit=12.0 2024-06-20 10:48:32,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=189957.16666666666, ans=0.025 2024-06-20 10:48:33,118 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 1.825e+02 1.933e+02 2.100e+02 2.976e+02, threshold=3.866e+02, percent-clipped=0.0 2024-06-20 10:48:35,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=189957.16666666666, ans=0.125 2024-06-20 10:48:39,021 INFO [train.py:1028] (0/2) Epoch 11, batch 2450, loss[loss=0.2185, simple_loss=0.261, pruned_loss=0.08801, over 13246.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2822, pruned_loss=0.09669, over 2584834.51 frames. ], batch size: 63, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:48:39,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=189975.5, ans=0.0 2024-06-20 10:48:41,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=189975.5, ans=0.2 2024-06-20 10:48:42,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=189975.5, ans=0.0 2024-06-20 10:49:04,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=190030.5, ans=0.125 2024-06-20 10:49:13,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=190048.83333333334, ans=0.07 2024-06-20 10:49:14,489 INFO [train.py:1028] (0/2) Epoch 11, batch 2500, loss[loss=0.2224, simple_loss=0.2681, pruned_loss=0.08832, over 13205.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.2817, pruned_loss=0.09654, over 2587311.92 frames. ], batch size: 83, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:49:19,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.68 vs. limit=15.0 2024-06-20 10:49:20,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=190085.5, ans=0.125 2024-06-20 10:49:26,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=190085.5, ans=0.025 2024-06-20 10:49:37,169 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:49:39,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.91 vs. limit=15.0 2024-06-20 10:49:41,063 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.835e+02 2.008e+02 2.321e+02 3.089e+02, threshold=4.016e+02, percent-clipped=0.0 2024-06-20 10:49:45,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=190140.5, ans=0.07 2024-06-20 10:49:46,945 INFO [train.py:1028] (0/2) Epoch 11, batch 2550, loss[loss=0.2281, simple_loss=0.2919, pruned_loss=0.08214, over 12429.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.2814, pruned_loss=0.0966, over 2587066.64 frames. 
], batch size: 22, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:50:01,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=190195.5, ans=0.2 2024-06-20 10:50:08,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.54 vs. limit=15.0 2024-06-20 10:50:14,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190213.83333333334, ans=0.1 2024-06-20 10:50:22,001 INFO [train.py:1028] (0/2) Epoch 11, batch 2600, loss[loss=0.2258, simple_loss=0.2746, pruned_loss=0.0885, over 13264.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.28, pruned_loss=0.0961, over 2587598.42 frames. ], batch size: 52, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:50:25,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=190250.5, ans=0.0 2024-06-20 10:50:31,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.39 vs. limit=15.0 2024-06-20 10:50:41,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=190305.5, ans=0.0 2024-06-20 10:50:47,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=190323.83333333334, ans=0.125 2024-06-20 10:50:51,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.769e+02 1.984e+02 2.157e+02 2.753e+02, threshold=3.968e+02, percent-clipped=0.0 2024-06-20 10:50:52,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190323.83333333334, ans=0.1 2024-06-20 10:50:57,007 INFO [train.py:1028] (0/2) Epoch 11, batch 2650, loss[loss=0.2402, simple_loss=0.2767, pruned_loss=0.1019, over 13012.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2785, pruned_loss=0.09571, over 2587781.12 frames. ], batch size: 144, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:51:04,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.82 vs. limit=15.0 2024-06-20 10:51:06,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=190360.5, ans=10.0 2024-06-20 10:51:13,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=15.0 2024-06-20 10:51:21,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=13.05 vs. limit=12.0 2024-06-20 10:51:29,417 INFO [train.py:1028] (0/2) Epoch 11, batch 2700, loss[loss=0.2421, simple_loss=0.2793, pruned_loss=0.1025, over 13218.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2774, pruned_loss=0.09536, over 2584556.93 frames. 
], batch size: 89, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:51:30,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=190433.83333333334, ans=0.0 2024-06-20 10:51:32,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.43 vs. limit=15.0 2024-06-20 10:51:35,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=190452.16666666666, ans=0.025 2024-06-20 10:51:36,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=190452.16666666666, ans=0.125 2024-06-20 10:51:39,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=190452.16666666666, ans=0.0 2024-06-20 10:51:41,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=190452.16666666666, ans=0.025 2024-06-20 10:51:48,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=190488.83333333334, ans=0.125 2024-06-20 10:51:55,790 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.881e+02 2.051e+02 2.264e+02 3.364e+02, threshold=4.103e+02, percent-clipped=0.0 2024-06-20 10:51:59,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=190507.16666666666, ans=0.125 2024-06-20 10:52:01,917 INFO [train.py:1028] (0/2) Epoch 11, batch 2750, loss[loss=0.2414, simple_loss=0.2788, pruned_loss=0.102, over 13254.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2762, pruned_loss=0.09465, over 2580713.60 frames. ], batch size: 43, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:52:06,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=190525.5, ans=0.125 2024-06-20 10:52:07,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=190525.5, ans=0.025 2024-06-20 10:52:13,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=190543.83333333334, ans=0.0 2024-06-20 10:52:16,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=190543.83333333334, ans=0.0 2024-06-20 10:52:21,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=190562.16666666666, ans=0.0 2024-06-20 10:52:21,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=190562.16666666666, ans=0.0 2024-06-20 10:52:21,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=190562.16666666666, ans=0.125 2024-06-20 10:52:29,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=190580.5, ans=0.125 2024-06-20 10:52:33,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.07 vs. 
limit=15.0 2024-06-20 10:52:43,020 INFO [train.py:1028] (0/2) Epoch 11, batch 2800, loss[loss=0.2554, simple_loss=0.2841, pruned_loss=0.1134, over 10945.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2756, pruned_loss=0.09475, over 2578799.87 frames. ], batch size: 304, lr: 5.40e-03, grad_scale: 64.0 2024-06-20 10:52:43,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=190617.16666666666, ans=0.025 2024-06-20 10:52:47,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=190617.16666666666, ans=0.0 2024-06-20 10:52:48,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=190617.16666666666, ans=0.0 2024-06-20 10:52:48,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=190617.16666666666, ans=0.125 2024-06-20 10:52:51,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=190635.5, ans=0.2 2024-06-20 10:52:54,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=190635.5, ans=0.0 2024-06-20 10:52:58,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=190653.83333333334, ans=0.0 2024-06-20 10:53:00,027 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-104000.pt 2024-06-20 10:53:05,139 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.811e+00 2024-06-20 10:53:05,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=190653.83333333334, ans=0.125 2024-06-20 10:53:12,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=190672.16666666666, ans=0.04949747468305833 2024-06-20 10:53:14,449 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.802e+02 1.931e+02 2.136e+02 2.816e+02, threshold=3.861e+02, percent-clipped=0.0 2024-06-20 10:53:15,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=190690.5, ans=0.125 2024-06-20 10:53:16,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.99 vs. limit=12.0 2024-06-20 10:53:19,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=190708.83333333334, ans=0.0 2024-06-20 10:53:20,503 INFO [train.py:1028] (0/2) Epoch 11, batch 2850, loss[loss=0.226, simple_loss=0.2738, pruned_loss=0.08908, over 13337.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2751, pruned_loss=0.09482, over 2577702.96 frames. ], batch size: 49, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:53:23,375 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.41 vs. 
limit=12.0 2024-06-20 10:53:28,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=190727.16666666666, ans=0.025 2024-06-20 10:53:30,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=190727.16666666666, ans=0.125 2024-06-20 10:53:34,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=190745.5, ans=0.0 2024-06-20 10:53:41,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=190763.83333333334, ans=0.2 2024-06-20 10:53:52,333 INFO [train.py:1028] (0/2) Epoch 11, batch 2900, loss[loss=0.2206, simple_loss=0.268, pruned_loss=0.08662, over 13120.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2725, pruned_loss=0.09349, over 2585998.64 frames. ], batch size: 55, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:53:55,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=190800.5, ans=0.0 2024-06-20 10:54:03,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=190818.83333333334, ans=0.125 2024-06-20 10:54:06,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=190818.83333333334, ans=0.125 2024-06-20 10:54:21,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=190873.83333333334, ans=0.025 2024-06-20 10:54:22,517 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.741e+02 1.853e+02 2.002e+02 2.630e+02, threshold=3.707e+02, percent-clipped=0.0 2024-06-20 10:54:26,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=190873.83333333334, ans=0.125 2024-06-20 10:54:28,379 INFO [train.py:1028] (0/2) Epoch 11, batch 2950, loss[loss=0.2285, simple_loss=0.2651, pruned_loss=0.09595, over 13289.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2726, pruned_loss=0.09361, over 2579805.25 frames. ], batch size: 43, lr: 5.39e-03, grad_scale: 64.0 2024-06-20 10:54:30,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.93 vs. limit=15.0 2024-06-20 10:54:32,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=190892.16666666666, ans=0.04949747468305833 2024-06-20 10:54:35,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.54 vs. limit=6.0 2024-06-20 10:54:42,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=190910.5, ans=0.125 2024-06-20 10:54:50,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=190928.83333333334, ans=0.125 2024-06-20 10:54:55,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=190947.16666666666, ans=0.1 2024-06-20 10:55:05,016 INFO [train.py:1028] (0/2) Epoch 11, batch 3000, loss[loss=0.2149, simple_loss=0.2639, pruned_loss=0.08299, over 13117.00 frames. 
], tot_loss[loss=0.2279, simple_loss=0.2708, pruned_loss=0.09252, over 2578196.74 frames. ], batch size: 59, lr: 5.39e-03, grad_scale: 64.0
2024-06-20 10:55:05,017 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 10:55:12,925 INFO [train.py:1060] (0/2) Epoch 11, validation: loss=0.1958, simple_loss=0.2602, pruned_loss=0.06568, over 351949.00 frames.
2024-06-20 10:55:12,925 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-20 10:55:13,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0
2024-06-20 10:55:18,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=190983.83333333334, ans=0.2
2024-06-20 10:55:19,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191002.16666666666, ans=0.125
2024-06-20 10:55:28,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.63 vs. limit=15.0
2024-06-20 10:55:36,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=191038.83333333334, ans=0.125
2024-06-20 10:55:39,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191038.83333333334, ans=0.1
2024-06-20 10:55:39,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0
2024-06-20 10:55:40,818 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.809e+02 1.927e+02 2.146e+02 2.811e+02, threshold=3.855e+02, percent-clipped=0.0
2024-06-20 10:55:41,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=191057.16666666666, ans=0.025
2024-06-20 10:55:41,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=191057.16666666666, ans=10.0
2024-06-20 10:55:42,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191057.16666666666, ans=0.125
2024-06-20 10:55:46,899 INFO [train.py:1028] (0/2) Epoch 11, batch 3050, loss[loss=0.2364, simple_loss=0.2715, pruned_loss=0.1006, over 13281.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2699, pruned_loss=0.09231, over 2578143.72 frames. ], batch size: 46, lr: 5.39e-03, grad_scale: 128.0
2024-06-20 10:56:04,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=191112.16666666666, ans=0.025
2024-06-20 10:56:06,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.45 vs. limit=15.0
2024-06-20 10:56:07,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191112.16666666666, ans=0.1
2024-06-20 10:56:23,236 INFO [train.py:1028] (0/2) Epoch 11, batch 3100, loss[loss=0.2057, simple_loss=0.2433, pruned_loss=0.0841, over 13021.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2691, pruned_loss=0.09196, over 2579632.11 frames. ], batch size: 144, lr: 5.39e-03, grad_scale: 128.0
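Two things worth decoding in these train/validation lines. First, every loss[...] is a frame-weighted average over a single batch, while tot_loss[...] pools many recent batches, which is why it is reported "over 2579632.11 frames"; a fractional frame count suggests the pooled sums are decayed or reweighted rather than simply accumulated. Second, throughout this section the headline loss is consistent with 0.5 * simple_loss + pruned_loss; the validation line above checks out as 0.5 * 0.2602 + 0.06568 = 0.19578, matching loss=0.1958. The sketch below illustrates a tracker of that shape; the class name and decay constant are assumptions, not the actual icefall MetricsTracker.

```python
class FrameWeightedTracker:
    """Sketch of a running, frame-weighted loss average in the spirit of
    the tot_loss[...] entries: decayed sums of (loss * frames) and of
    frames, reported as their ratio. The decay value is an assumption."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0  # decayed sum of per-frame losses
        self.frames = 0.0    # decayed frame count (hence fractional)

    def update(self, batch_loss: float, batch_frames: int) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


# the displayed loss decomposes as 0.5 * simple_loss + pruned_loss,
# e.g. for the validation line above:
simple, pruned = 0.2602, 0.06568
assert abs(0.5 * simple + pruned - 0.1958) < 5e-4
```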
2024-06-20 10:56:29,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=191185.5, ans=0.125
2024-06-20 10:56:35,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=191203.83333333334, ans=0.2
2024-06-20 10:56:35,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.15 vs. limit=22.5
2024-06-20 10:56:42,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=191203.83333333334, ans=0.1
2024-06-20 10:56:53,345 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.796e+02 1.917e+02 2.100e+02 2.638e+02, threshold=3.834e+02, percent-clipped=0.0
2024-06-20 10:56:53,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=191240.5, ans=0.125
2024-06-20 10:56:59,210 INFO [train.py:1028] (0/2) Epoch 11, batch 3150, loss[loss=0.2229, simple_loss=0.2556, pruned_loss=0.09506, over 12937.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2677, pruned_loss=0.09134, over 2581292.86 frames. ], batch size: 158, lr: 5.39e-03, grad_scale: 128.0
2024-06-20 10:57:09,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=191277.16666666666, ans=0.2
2024-06-20 10:57:24,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=191313.83333333334, ans=0.125
2024-06-20 10:57:32,406 INFO [train.py:1028] (0/2) Epoch 11, batch 3200, loss[loss=0.2171, simple_loss=0.2656, pruned_loss=0.08431, over 13129.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2673, pruned_loss=0.091, over 2580834.24 frames. ], batch size: 55, lr: 5.39e-03, grad_scale: 128.0
2024-06-20 10:57:35,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=191350.5, ans=0.1
2024-06-20 10:57:38,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=191368.83333333334, ans=0.125
2024-06-20 10:57:48,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=191387.16666666666, ans=0.125
2024-06-20 10:57:59,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=191405.5, ans=0.95
2024-06-20 10:58:01,496 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.791e+02 1.907e+02 2.153e+02 2.812e+02, threshold=3.815e+02, percent-clipped=0.0
2024-06-20 10:58:02,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5
2024-06-20 10:58:06,876 INFO [train.py:1028] (0/2) Epoch 11, batch 3250, loss[loss=0.2024, simple_loss=0.2531, pruned_loss=0.07579, over 13230.00 frames.
], batch size: 72, lr: 5.38e-03, grad_scale: 128.0 2024-06-20 10:58:17,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191460.5, ans=0.125 2024-06-20 10:58:25,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=191478.83333333334, ans=0.125 2024-06-20 10:58:32,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191515.5, ans=0.0 2024-06-20 10:58:44,307 INFO [train.py:1028] (0/2) Epoch 11, batch 3300, loss[loss=0.2409, simple_loss=0.2743, pruned_loss=0.1037, over 12783.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2668, pruned_loss=0.09086, over 2582301.16 frames. ], batch size: 176, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 10:58:44,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=191533.83333333334, ans=0.0 2024-06-20 10:58:55,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191552.16666666666, ans=0.1 2024-06-20 10:59:05,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=191588.83333333334, ans=0.1 2024-06-20 10:59:10,976 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.800e+02 1.946e+02 2.154e+02 3.177e+02, threshold=3.891e+02, percent-clipped=0.0 2024-06-20 10:59:11,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2024-06-20 10:59:16,112 INFO [train.py:1028] (0/2) Epoch 11, batch 3350, loss[loss=0.2447, simple_loss=0.2747, pruned_loss=0.1074, over 12981.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2671, pruned_loss=0.09144, over 2578356.23 frames. ], batch size: 159, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 10:59:26,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.57 vs. limit=22.5 2024-06-20 10:59:27,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=191643.83333333334, ans=0.2 2024-06-20 10:59:29,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.77 vs. limit=15.0 2024-06-20 10:59:34,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=191662.16666666666, ans=0.0 2024-06-20 10:59:37,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=191680.5, ans=0.125 2024-06-20 10:59:40,196 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 10:59:50,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=191698.83333333334, ans=0.0 2024-06-20 10:59:51,608 INFO [train.py:1028] (0/2) Epoch 11, batch 3400, loss[loss=0.2053, simple_loss=0.2541, pruned_loss=0.07824, over 12485.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2663, pruned_loss=0.09142, over 2576491.62 frames. 
], batch size: 22, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:00:04,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191753.83333333334, ans=0.0 2024-06-20 11:00:04,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=191753.83333333334, ans=0.0 2024-06-20 11:00:11,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2024-06-20 11:00:16,326 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:00:19,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=191790.5, ans=0.125 2024-06-20 11:00:19,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.751e+02 1.900e+02 2.077e+02 2.634e+02, threshold=3.801e+02, percent-clipped=0.0 2024-06-20 11:00:22,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=191790.5, ans=0.5 2024-06-20 11:00:28,435 INFO [train.py:1028] (0/2) Epoch 11, batch 3450, loss[loss=0.2227, simple_loss=0.2589, pruned_loss=0.0933, over 12802.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2659, pruned_loss=0.0911, over 2577411.11 frames. ], batch size: 176, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:00:29,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=191808.83333333334, ans=10.0 2024-06-20 11:00:33,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=191808.83333333334, ans=0.0 2024-06-20 11:00:45,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=191845.5, ans=0.125 2024-06-20 11:00:47,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.22 vs. limit=22.5 2024-06-20 11:00:50,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=15.0 2024-06-20 11:00:55,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=191882.16666666666, ans=0.125 2024-06-20 11:01:02,168 INFO [train.py:1028] (0/2) Epoch 11, batch 3500, loss[loss=0.2079, simple_loss=0.2495, pruned_loss=0.08313, over 12929.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2649, pruned_loss=0.0905, over 2575769.70 frames. 
], batch size: 33, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:01:04,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=191900.5, ans=0.125 2024-06-20 11:01:16,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191937.16666666666, ans=0.125 2024-06-20 11:01:17,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=191937.16666666666, ans=0.09899494936611666 2024-06-20 11:01:29,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=191973.83333333334, ans=0.125 2024-06-20 11:01:30,663 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.566e+02 1.714e+02 1.844e+02 1.989e+02 2.720e+02, threshold=3.687e+02, percent-clipped=0.0 2024-06-20 11:01:30,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=191973.83333333334, ans=0.0 2024-06-20 11:01:30,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=191973.83333333334, ans=0.125 2024-06-20 11:01:32,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191973.83333333334, ans=0.125 2024-06-20 11:01:33,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=191973.83333333334, ans=0.0 2024-06-20 11:01:36,086 INFO [train.py:1028] (0/2) Epoch 11, batch 3550, loss[loss=0.2123, simple_loss=0.2519, pruned_loss=0.08634, over 13175.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2647, pruned_loss=0.09007, over 2578116.93 frames. ], batch size: 95, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:01:37,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0 2024-06-20 11:01:42,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=192010.5, ans=0.125 2024-06-20 11:01:43,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=192010.5, ans=0.125 2024-06-20 11:01:43,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.16 vs. limit=22.5 2024-06-20 11:02:04,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=192047.16666666666, ans=0.0 2024-06-20 11:02:12,393 INFO [train.py:1028] (0/2) Epoch 11, batch 3600, loss[loss=0.2252, simple_loss=0.2697, pruned_loss=0.0904, over 13287.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2645, pruned_loss=0.09042, over 2582009.98 frames. ], batch size: 49, lr: 5.38e-03, grad_scale: 64.0 2024-06-20 11:02:14,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=192083.83333333334, ans=0.125 2024-06-20 11:02:27,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.84 vs. 
limit=22.5 2024-06-20 11:02:35,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=192138.83333333334, ans=0.125 2024-06-20 11:02:37,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=192138.83333333334, ans=0.125 2024-06-20 11:02:38,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=192138.83333333334, ans=0.125 2024-06-20 11:02:43,226 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.576e+02 1.766e+02 1.959e+02 2.210e+02 3.050e+02, threshold=3.917e+02, percent-clipped=0.0 2024-06-20 11:02:48,505 INFO [train.py:1028] (0/2) Epoch 11, batch 3650, loss[loss=0.2041, simple_loss=0.2441, pruned_loss=0.08203, over 13014.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2642, pruned_loss=0.08999, over 2580149.46 frames. ], batch size: 102, lr: 5.37e-03, grad_scale: 64.0 2024-06-20 11:03:02,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=192212.16666666666, ans=0.0 2024-06-20 11:03:04,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=192212.16666666666, ans=0.0 2024-06-20 11:03:05,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192212.16666666666, ans=0.1 2024-06-20 11:03:10,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192230.5, ans=0.1 2024-06-20 11:03:11,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=192230.5, ans=0.025 2024-06-20 11:03:19,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192248.83333333334, ans=0.1 2024-06-20 11:03:21,542 INFO [train.py:1028] (0/2) Epoch 11, batch 3700, loss[loss=0.2506, simple_loss=0.2923, pruned_loss=0.1045, over 13271.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2632, pruned_loss=0.08948, over 2584740.65 frames. ], batch size: 72, lr: 5.37e-03, grad_scale: 64.0 2024-06-20 11:03:22,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=192267.16666666666, ans=0.2 2024-06-20 11:03:46,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.07 vs. limit=15.0 2024-06-20 11:03:49,148 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.513e+02 1.686e+02 1.812e+02 1.906e+02 2.838e+02, threshold=3.625e+02, percent-clipped=0.0 2024-06-20 11:03:49,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=192340.5, ans=0.125 2024-06-20 11:03:54,473 INFO [train.py:1028] (0/2) Epoch 11, batch 3750, loss[loss=0.2196, simple_loss=0.2638, pruned_loss=0.08771, over 12625.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.263, pruned_loss=0.08933, over 2587205.34 frames. 
2024-06-20 11:03:54,473 INFO [train.py:1028] (0/2) Epoch 11, batch 3750, loss[loss=0.2196, simple_loss=0.2638, pruned_loss=0.08771, over 12625.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.263, pruned_loss=0.08933, over 2587205.34 frames. ], batch size: 22, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:03:54,609 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:03:54,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0
2024-06-20 11:04:04,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192358.83333333334, ans=0.1
2024-06-20 11:04:28,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192432.16666666666, ans=0.1
2024-06-20 11:04:30,977 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.62 vs. limit=15.0
2024-06-20 11:04:31,832 INFO [train.py:1028] (0/2) Epoch 11, batch 3800, loss[loss=0.2146, simple_loss=0.2505, pruned_loss=0.08933, over 13207.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2627, pruned_loss=0.08925, over 2585018.03 frames. ], batch size: 83, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:04:39,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.86 vs. limit=15.0
2024-06-20 11:04:45,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=192468.83333333334, ans=0.125
2024-06-20 11:04:46,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=192468.83333333334, ans=0.0
2024-06-20 11:04:47,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=192468.83333333334, ans=0.125
2024-06-20 11:04:48,543 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.77 vs. limit=15.0
2024-06-20 11:04:54,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=192487.16666666666, ans=0.125
2024-06-20 11:05:03,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.74 vs. limit=15.0
2024-06-20 11:05:04,073 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.774e+02 1.905e+02 2.091e+02 2.703e+02, threshold=3.810e+02, percent-clipped=0.0
2024-06-20 11:05:05,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=192523.83333333334, ans=0.0
2024-06-20 11:05:06,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=192523.83333333334, ans=0.125
2024-06-20 11:05:09,711 INFO [train.py:1028] (0/2) Epoch 11, batch 3850, loss[loss=0.2298, simple_loss=0.2673, pruned_loss=0.09615, over 13036.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2623, pruned_loss=0.08885, over 2585137.54 frames. ], batch size: 144, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:05:19,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0
2024-06-20 11:05:20,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=192560.5, ans=0.2
2024-06-20 11:05:37,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=192615.5, ans=0.0
2024-06-20 11:05:42,647 INFO [train.py:1028] (0/2) Epoch 11, batch 3900, loss[loss=0.2178, simple_loss=0.2533, pruned_loss=0.09113, over 13185.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2619, pruned_loss=0.08888, over 2589334.98 frames. ], batch size: 83, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:05:49,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=192652.16666666666, ans=0.125
2024-06-20 11:05:53,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=192652.16666666666, ans=0.0
2024-06-20 11:05:55,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.75 vs. limit=15.0
2024-06-20 11:05:57,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=192670.5, ans=0.0
2024-06-20 11:05:59,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=192670.5, ans=0.0
2024-06-20 11:06:09,879 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.796e+02 1.976e+02 2.264e+02 3.821e+02, threshold=3.951e+02, percent-clipped=1.0
2024-06-20 11:06:18,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=192725.5, ans=0.125
2024-06-20 11:06:18,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192725.5, ans=0.125
2024-06-20 11:06:18,600 INFO [train.py:1028] (0/2) Epoch 11, batch 3950, loss[loss=0.2145, simple_loss=0.2494, pruned_loss=0.08983, over 13065.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2613, pruned_loss=0.08824, over 2590722.87 frames. ], batch size: 132, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:06:19,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192725.5, ans=0.1
2024-06-20 11:06:21,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.02 vs. limit=15.0
2024-06-20 11:06:21,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=192725.5, ans=0.125
2024-06-20 11:06:32,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192762.16666666666, ans=0.1
2024-06-20 11:06:35,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=192762.16666666666, ans=0.1
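The scaling.py:1023 Whitening lines compare an anisotropy metric of some activation's channel covariance against a limit; the module only applies its whitening penalty when the metric exceeds the limit, which is why most lines report a metric below it. One plausible form of such a metric, equal to 1.0 for a perfectly white (identity-proportional) covariance and growing as variance concentrates in a few directions; the exact normalisation used by scaling.py is an assumption here.

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: activations of shape (..., num_channels), one whitening group.
    x = x.reshape(-1, x.shape[-1]).to(torch.float32)
    num_channels = x.shape[-1]
    cov = (x.t() @ x) / x.shape[0]      # (C, C) channel covariance
    mean_diag = cov.diagonal().mean()   # mean eigenvalue (= trace / C)
    # Mean squared eigenvalue over squared mean eigenvalue: equals 1.0
    # iff all eigenvalues are equal, i.e. cov is a multiple of I.
    return (cov ** 2).sum() / (mean_diag ** 2 * num_channels)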
2024-06-20 11:06:37,300 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0
2024-06-20 11:06:41,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=192780.5, ans=0.025
2024-06-20 11:06:49,445 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=6.343e+00
2024-06-20 11:06:54,442 INFO [train.py:1028] (0/2) Epoch 11, batch 4000, loss[loss=0.2677, simple_loss=0.3043, pruned_loss=0.1155, over 12992.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2612, pruned_loss=0.08856, over 2584424.37 frames. ], batch size: 39, lr: 5.37e-03, grad_scale: 64.0
2024-06-20 11:06:55,637 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0
2024-06-20 11:06:59,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=192817.16666666666, ans=0.125
2024-06-20 11:07:05,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=192835.5, ans=0.125
2024-06-20 11:07:08,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=192853.83333333334, ans=0.0
2024-06-20 11:07:16,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=192872.16666666666, ans=0.0
2024-06-20 11:07:22,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=192890.5, ans=0.0
2024-06-20 11:07:22,677 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.781e+02 1.904e+02 2.079e+02 2.447e+02, threshold=3.808e+02, percent-clipped=0.0
2024-06-20 11:07:27,919 INFO [train.py:1028] (0/2) Epoch 11, batch 4050, loss[loss=0.2294, simple_loss=0.2618, pruned_loss=0.09854, over 10947.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2611, pruned_loss=0.08865, over 2581263.38 frames. ], batch size: 304, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:07:37,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=192927.16666666666, ans=0.07
2024-06-20 11:07:40,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=192945.5, ans=0.125
2024-06-20 11:07:41,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192945.5, ans=0.1
2024-06-20 11:07:48,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=192963.83333333334, ans=0.125
2024-06-20 11:07:51,649 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.21 vs. limit=15.0
2024-06-20 11:08:00,947 INFO [train.py:1028] (0/2) Epoch 11, batch 4100, loss[loss=0.2284, simple_loss=0.2561, pruned_loss=0.1003, over 13159.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2616, pruned_loss=0.08918, over 2577054.35 frames. ], batch size: 103, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:08:01,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.98 vs. limit=6.0
2024-06-20 11:08:14,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=193018.83333333334, ans=0.035
2024-06-20 11:08:17,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.85 vs. limit=15.0
2024-06-20 11:08:23,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=193055.5, ans=0.125
2024-06-20 11:08:27,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=193055.5, ans=0.125
2024-06-20 11:08:28,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5
2024-06-20 11:08:31,796 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.776e+02 1.880e+02 2.082e+02 2.955e+02, threshold=3.761e+02, percent-clipped=0.0
2024-06-20 11:08:33,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=193073.83333333334, ans=0.5
2024-06-20 11:08:34,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=193073.83333333334, ans=0.0
2024-06-20 11:08:37,260 INFO [train.py:1028] (0/2) Epoch 11, batch 4150, loss[loss=0.2321, simple_loss=0.2709, pruned_loss=0.09662, over 13166.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2614, pruned_loss=0.08896, over 2576023.51 frames. ], batch size: 55, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:08:47,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=193110.5, ans=0.0
2024-06-20 11:08:57,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=193128.83333333334, ans=0.5
2024-06-20 11:08:58,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=193128.83333333334, ans=0.125
2024-06-20 11:09:01,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=15.0
2024-06-20 11:09:13,165 INFO [train.py:1028] (0/2) Epoch 11, batch 4200, loss[loss=0.2172, simple_loss=0.2556, pruned_loss=0.08938, over 13054.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2603, pruned_loss=0.08872, over 2578102.53 frames. ], batch size: 102, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:09:32,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.06 vs. limit=22.5
2024-06-20 11:09:35,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=193238.83333333334, ans=0.0
2024-06-20 11:09:38,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=193238.83333333334, ans=0.0
2024-06-20 11:09:39,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=193257.16666666666, ans=0.0
2024-06-20 11:09:41,023 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.714e+02 1.886e+02 2.127e+02 3.060e+02, threshold=3.772e+02, percent-clipped=0.0
2024-06-20 11:09:46,457 INFO [train.py:1028] (0/2) Epoch 11, batch 4250, loss[loss=0.2051, simple_loss=0.2487, pruned_loss=0.08074, over 13354.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2596, pruned_loss=0.08804, over 2581600.30 frames. ], batch size: 46, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:09:50,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=15.0
2024-06-20 11:09:50,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=193275.5, ans=0.125
2024-06-20 11:09:56,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=193293.83333333334, ans=0.0
2024-06-20 11:10:04,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=193312.16666666666, ans=0.0
2024-06-20 11:10:06,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.61 vs. limit=22.5
2024-06-20 11:10:18,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193348.83333333334, ans=0.1
2024-06-20 11:10:22,214 INFO [train.py:1028] (0/2) Epoch 11, batch 4300, loss[loss=0.2003, simple_loss=0.2441, pruned_loss=0.07823, over 13203.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2592, pruned_loss=0.08772, over 2581253.09 frames. ], batch size: 59, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:10:34,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=193403.83333333334, ans=0.2
2024-06-20 11:10:48,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0
2024-06-20 11:10:51,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.726e+02 1.861e+02 2.085e+02 2.586e+02, threshold=3.721e+02, percent-clipped=0.0
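Each train.py:1028 line reports two groups: loss[...] for the single batch being logged and tot_loss[...] for a running summary over recent batches. The fractional frame totals (e.g. 2581600.30 just above) suggest the running sums are decayed rather than accumulated exactly; a sketch under that assumption, with the decay constant invented (icefall's actual tracker may do this differently):

class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed frame count (hence fractional)

    def update(self, batch_loss: float, batch_frames: int) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        # Frame-weighted average, like the 'tot_loss[..., over N frames]' field.
        return self.loss_sum / max(self.frames, 1.0)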
2024-06-20 11:10:56,977 INFO [train.py:1028] (0/2) Epoch 11, batch 4350, loss[loss=0.228, simple_loss=0.2714, pruned_loss=0.09231, over 13216.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2588, pruned_loss=0.0876, over 2585930.85 frames. ], batch size: 59, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:10:57,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=193458.83333333334, ans=0.125
2024-06-20 11:11:01,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=193458.83333333334, ans=0.5
2024-06-20 11:11:08,032 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=3.445e+00
2024-06-20 11:11:29,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=193550.5, ans=0.125
2024-06-20 11:11:29,746 INFO [train.py:1028] (0/2) Epoch 11, batch 4400, loss[loss=0.1977, simple_loss=0.2437, pruned_loss=0.07583, over 13236.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2584, pruned_loss=0.08741, over 2586830.59 frames. ], batch size: 83, lr: 5.36e-03, grad_scale: 64.0
2024-06-20 11:11:40,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0
2024-06-20 11:11:57,194 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.735e+02 1.867e+02 2.033e+02 2.727e+02, threshold=3.735e+02, percent-clipped=0.0
2024-06-20 11:12:02,717 INFO [train.py:1028] (0/2) Epoch 11, batch 4450, loss[loss=0.2413, simple_loss=0.2789, pruned_loss=0.1018, over 12910.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2592, pruned_loss=0.08798, over 2580691.66 frames. ], batch size: 33, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:12:33,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0
2024-06-20 11:12:37,562 INFO [train.py:1028] (0/2) Epoch 11, batch 4500, loss[loss=0.1993, simple_loss=0.2434, pruned_loss=0.07762, over 13248.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2586, pruned_loss=0.08748, over 2584850.67 frames. ], batch size: 89, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:12:37,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=193733.83333333334, ans=0.125
2024-06-20 11:12:54,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=193752.16666666666, ans=0.125
2024-06-20 11:12:57,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=193770.5, ans=0.125
2024-06-20 11:13:09,473 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.427e+02 1.709e+02 1.848e+02 1.984e+02 3.055e+02, threshold=3.695e+02, percent-clipped=0.0
2024-06-20 11:13:14,757 INFO [train.py:1028] (0/2) Epoch 11, batch 4550, loss[loss=0.1824, simple_loss=0.2349, pruned_loss=0.06497, over 13290.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2585, pruned_loss=0.08763, over 2588358.95 frames. ], batch size: 52, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:13:19,900 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:13:21,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=193843.83333333334, ans=0.2
2024-06-20 11:13:25,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=193843.83333333334, ans=0.1
2024-06-20 11:13:36,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0
2024-06-20 11:13:44,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=193898.83333333334, ans=0.2
2024-06-20 11:13:47,333 INFO [train.py:1028] (0/2) Epoch 11, batch 4600, loss[loss=0.2212, simple_loss=0.2606, pruned_loss=0.09086, over 12592.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2582, pruned_loss=0.08767, over 2584037.63 frames. ], batch size: 202, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:14:09,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193972.16666666666, ans=0.1
2024-06-20 11:14:19,650 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.727e+02 1.828e+02 1.972e+02 2.584e+02, threshold=3.656e+02, percent-clipped=0.0
2024-06-20 11:14:20,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=193990.5, ans=0.0
2024-06-20 11:14:24,717 INFO [train.py:1028] (0/2) Epoch 11, batch 4650, loss[loss=0.1763, simple_loss=0.217, pruned_loss=0.06784, over 13084.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2573, pruned_loss=0.08742, over 2587020.67 frames. ], batch size: 132, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:14:41,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=194045.5, ans=0.025
2024-06-20 11:14:49,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=194063.83333333334, ans=0.0
2024-06-20 11:14:50,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=194063.83333333334, ans=0.125
2024-06-20 11:14:51,521 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=12.0
2024-06-20 11:14:51,929 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:14:58,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=194082.16666666666, ans=0.0
2024-06-20 11:15:00,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.35 vs. limit=15.0
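The scaling.py:1119 WithLoss lines sample an auxiliary loss attached to attention-weight tensors (often 0.000e+00 when the penalty is inactive). A sketch of one pattern that would produce such logs, assuming a custom autograd function that is the identity on x but feeds a unit gradient into the loss tensor y, so that y.sum() is effectively added to the training objective; the logging and bookkeeping details are guesses:

import logging
import random
import torch

class WithLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, y: torch.Tensor, name: str):
        ctx.y_shape = y.shape
        if random.random() < 0.002:  # log a small sample, like the lines here
            logging.info(f"WithLoss: name={name}, loss-sum={y.sum().item():.3e}")
        return x

    @staticmethod
    def backward(ctx, x_grad: torch.Tensor):
        # Identity gradient for x; gradient 1 everywhere for y, which is
        # what makes y.sum() act as an extra loss term.
        return x_grad, torch.ones(ctx.y_shape, dtype=x_grad.dtype,
                                  device=x_grad.device), None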
2024-06-20 11:15:00,803 INFO [train.py:1028] (0/2) Epoch 11, batch 4700, loss[loss=0.2272, simple_loss=0.2622, pruned_loss=0.09603, over 12525.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2572, pruned_loss=0.08725, over 2582180.17 frames. ], batch size: 25, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:15:06,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=194100.5, ans=0.125
2024-06-20 11:15:09,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.63 vs. limit=10.0
2024-06-20 11:15:09,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.92 vs. limit=15.0
2024-06-20 11:15:11,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=194118.83333333334, ans=0.2
2024-06-20 11:15:16,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0
2024-06-20 11:15:19,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.20 vs. limit=15.0
2024-06-20 11:15:28,314 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.745e+02 1.870e+02 2.132e+02 2.626e+02, threshold=3.741e+02, percent-clipped=0.0
2024-06-20 11:15:33,804 INFO [train.py:1028] (0/2) Epoch 11, batch 4750, loss[loss=0.2236, simple_loss=0.2615, pruned_loss=0.09285, over 12629.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2569, pruned_loss=0.08728, over 2579545.29 frames. ], batch size: 202, lr: 5.35e-03, grad_scale: 64.0
2024-06-20 11:15:37,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.72 vs. limit=15.0
2024-06-20 11:15:54,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=12.0
2024-06-20 11:15:57,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=194247.16666666666, ans=0.2
2024-06-20 11:15:58,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=194247.16666666666, ans=0.0
2024-06-20 11:15:58,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=194247.16666666666, ans=0.125
2024-06-20 11:16:04,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=194265.5, ans=0.125
2024-06-20 11:16:05,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=194265.5, ans=0.2
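The "batch size" field swings widely above (25, then 202, within a few hundred batches) because batches are packed by total audio duration rather than by a fixed utterance count: a batch of short cuts holds many of them, a batch of long ones only a few. A sketch using lhotse's DynamicBucketingSampler; the path and the numeric settings below are placeholders, not this run's values.

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/train_cuts.jsonl.gz")  # hypothetical path
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=500.0,  # total seconds of audio per batch (placeholder)
    num_buckets=10,      # cuts are bucketed by duration before packing
    shuffle=True,
    drop_last=True,
)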
2024-06-20 11:16:07,226 INFO [train.py:1028] (0/2) Epoch 11, batch 4800, loss[loss=0.212, simple_loss=0.2579, pruned_loss=0.08302, over 13322.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2568, pruned_loss=0.08728, over 2576095.33 frames. ], batch size: 63, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:16:07,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=194283.83333333334, ans=0.95
2024-06-20 11:16:19,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=194302.16666666666, ans=0.1
2024-06-20 11:16:21,369 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0
2024-06-20 11:16:25,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=194320.5, ans=0.125
2024-06-20 11:16:34,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=15.0
2024-06-20 11:16:38,823 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.753e+02 1.925e+02 2.184e+02 3.304e+02, threshold=3.850e+02, percent-clipped=0.0
2024-06-20 11:16:47,538 INFO [train.py:1028] (0/2) Epoch 11, batch 4850, loss[loss=0.2352, simple_loss=0.2761, pruned_loss=0.09719, over 13210.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2569, pruned_loss=0.08728, over 2573784.27 frames. ], batch size: 89, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:16:49,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0
2024-06-20 11:16:58,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=194393.83333333334, ans=0.125
2024-06-20 11:17:05,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=194412.16666666666, ans=0.125
2024-06-20 11:17:15,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194448.83333333334, ans=0.1
2024-06-20 11:17:21,267 INFO [train.py:1028] (0/2) Epoch 11, batch 4900, loss[loss=0.2053, simple_loss=0.2528, pruned_loss=0.0789, over 13194.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2569, pruned_loss=0.08735, over 2574827.61 frames. ], batch size: 59, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:17:21,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=194467.16666666666, ans=0.125
2024-06-20 11:17:24,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0
2024-06-20 11:17:27,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=194485.5, ans=0.125
2024-06-20 11:17:31,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=194485.5, ans=0.125
2024-06-20 11:17:32,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=194485.5, ans=0.0
2024-06-20 11:17:38,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194503.83333333334, ans=0.1
2024-06-20 11:17:43,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=194522.16666666666, ans=0.0
2024-06-20 11:17:48,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.93 vs. limit=15.0
2024-06-20 11:17:49,000 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.737e+02 1.830e+02 2.030e+02 3.131e+02, threshold=3.660e+02, percent-clipped=0.0
2024-06-20 11:17:53,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=194558.83333333334, ans=0.125
2024-06-20 11:17:54,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=194558.83333333334, ans=0.125
2024-06-20 11:17:54,462 INFO [train.py:1028] (0/2) Epoch 11, batch 4950, loss[loss=0.2392, simple_loss=0.2597, pruned_loss=0.1094, over 10927.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2565, pruned_loss=0.08724, over 2568508.60 frames. ], batch size: 304, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:18:01,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=194577.16666666666, ans=0.125
2024-06-20 11:18:11,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=194595.5, ans=0.125
2024-06-20 11:18:24,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=194632.16666666666, ans=0.0
2024-06-20 11:18:30,315 INFO [train.py:1028] (0/2) Epoch 11, batch 5000, loss[loss=0.2237, simple_loss=0.2583, pruned_loss=0.09458, over 13174.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2569, pruned_loss=0.0871, over 2573519.22 frames. ], batch size: 95, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:18:35,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=194650.5, ans=0.125
2024-06-20 11:18:46,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.67 vs. limit=15.0
2024-06-20 11:18:48,791 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:18:52,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.01 vs. limit=6.0
2024-06-20 11:18:57,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=194705.5, ans=0.07
2024-06-20 11:18:59,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=194723.83333333334, ans=0.125
2024-06-20 11:19:01,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=194723.83333333334, ans=0.0
2024-06-20 11:19:01,722 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.710e+02 1.858e+02 1.986e+02 2.517e+02, threshold=3.716e+02, percent-clipped=0.0
2024-06-20 11:19:06,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=194723.83333333334, ans=0.125
2024-06-20 11:19:07,268 INFO [train.py:1028] (0/2) Epoch 11, batch 5050, loss[loss=0.1911, simple_loss=0.2368, pruned_loss=0.07269, over 12929.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2566, pruned_loss=0.0865, over 2571522.90 frames. ], batch size: 36, lr: 5.34e-03, grad_scale: 64.0
2024-06-20 11:19:22,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=194778.83333333334, ans=0.0
2024-06-20 11:19:25,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=194778.83333333334, ans=0.0
2024-06-20 11:19:40,495 INFO [train.py:1028] (0/2) Epoch 11, batch 5100, loss[loss=0.2485, simple_loss=0.2887, pruned_loss=0.1041, over 12929.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2573, pruned_loss=0.08734, over 2568120.96 frames. ], batch size: 39, lr: 5.34e-03, grad_scale: 32.0
2024-06-20 11:19:46,870 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.22 vs. limit=15.0
2024-06-20 11:19:48,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=194852.16666666666, ans=0.125
2024-06-20 11:19:49,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=194852.16666666666, ans=0.1
2024-06-20 11:19:50,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=194852.16666666666, ans=0.0
2024-06-20 11:19:58,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=194870.5, ans=0.0
2024-06-20 11:20:01,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=12.0
2024-06-20 11:20:03,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=194888.83333333334, ans=0.0
2024-06-20 11:20:08,175 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 1.791e+02 1.947e+02 2.161e+02 2.735e+02, threshold=3.895e+02, percent-clipped=0.0
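grad_scale, the fp16 loss-scaling factor, drops from 64.0 to 32.0 at batch 5100 above, which is consistent with an AMP-style scaler halving its scale after a batch whose half-precision gradients overflowed. A minimal sketch of that loop; the helper names (model, optimizer, compute_loss, batch) are placeholders rather than this recipe's actual functions:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0,
                                   growth_factor=2.0,
                                   backoff_factor=0.5)

def train_step(model, optimizer, compute_loss, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # internally skipped if grads contained inf/nan
    scaler.update()         # halves the scale on overflow, may grow it later
    return loss.detach(), scaler.get_scale()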
2024-06-20 11:20:15,956 INFO [train.py:1028] (0/2) Epoch 11, batch 5150, loss[loss=0.2206, simple_loss=0.2559, pruned_loss=0.09269, over 13103.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2579, pruned_loss=0.08785, over 2570304.09 frames. ], batch size: 132, lr: 5.34e-03, grad_scale: 32.0
2024-06-20 11:20:19,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=194925.5, ans=0.0
2024-06-20 11:20:20,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194925.5, ans=0.1
2024-06-20 11:20:26,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194943.83333333334, ans=0.1
2024-06-20 11:20:26,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.57 vs. limit=15.0
2024-06-20 11:20:33,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=194962.16666666666, ans=0.2
2024-06-20 11:20:36,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0
2024-06-20 11:20:40,001 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.663e+01
2024-06-20 11:20:40,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=15.0
2024-06-20 11:20:42,869 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=6.544e+00
2024-06-20 11:20:47,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0
2024-06-20 11:20:48,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=194998.83333333334, ans=0.0
2024-06-20 11:20:48,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194998.83333333334, ans=0.1
2024-06-20 11:20:51,429 INFO [train.py:1028] (0/2) Epoch 11, batch 5200, loss[loss=0.2259, simple_loss=0.2615, pruned_loss=0.09516, over 13199.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2574, pruned_loss=0.08752, over 2573195.79 frames. ], batch size: 95, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:20:58,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195035.5, ans=0.125
2024-06-20 11:21:00,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=195035.5, ans=0.0
2024-06-20 11:21:20,015 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.481e+02 1.726e+02 1.910e+02 2.059e+02 2.832e+02, threshold=3.820e+02, percent-clipped=0.0
2024-06-20 11:21:24,647 INFO [train.py:1028] (0/2) Epoch 11, batch 5250, loss[loss=0.2061, simple_loss=0.2576, pruned_loss=0.07735, over 13226.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2575, pruned_loss=0.0875, over 2569199.82 frames. ], batch size: 52, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:21:27,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=195108.83333333334, ans=0.2
2024-06-20 11:21:36,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=195127.16666666666, ans=0.125
2024-06-20 11:21:39,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5
2024-06-20 11:21:40,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195145.5, ans=0.1
2024-06-20 11:21:55,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=195182.16666666666, ans=0.125
2024-06-20 11:21:56,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195182.16666666666, ans=0.125
2024-06-20 11:21:57,620 INFO [train.py:1028] (0/2) Epoch 11, batch 5300, loss[loss=0.2041, simple_loss=0.2431, pruned_loss=0.08257, over 13055.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2574, pruned_loss=0.08747, over 2565189.52 frames. ], batch size: 144, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:22:00,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=195200.5, ans=0.07
2024-06-20 11:22:18,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=195237.16666666666, ans=0.0
2024-06-20 11:22:25,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=195255.5, ans=0.125
2024-06-20 11:22:30,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=195273.83333333334, ans=0.125
2024-06-20 11:22:30,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=195273.83333333334, ans=0.125
2024-06-20 11:22:32,047 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.536e+02 1.753e+02 1.866e+02 2.076e+02 2.891e+02, threshold=3.732e+02, percent-clipped=0.0
2024-06-20 11:22:32,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=195273.83333333334, ans=0.04949747468305833
2024-06-20 11:22:36,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=195292.16666666666, ans=0.2
2024-06-20 11:22:37,182 INFO [train.py:1028] (0/2) Epoch 11, batch 5350, loss[loss=0.1997, simple_loss=0.2526, pruned_loss=0.07338, over 11233.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2574, pruned_loss=0.08739, over 2572029.76 frames. ], batch size: 16, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:22:38,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=195292.16666666666, ans=0.2
2024-06-20 11:22:50,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=195328.83333333334, ans=0.0
2024-06-20 11:22:53,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.31 vs. limit=12.0
2024-06-20 11:23:01,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=195347.16666666666, ans=0.0
2024-06-20 11:23:02,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=195365.5, ans=0.2
2024-06-20 11:23:04,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.91 vs. limit=15.0
2024-06-20 11:23:07,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=195365.5, ans=0.0
2024-06-20 11:23:09,546 INFO [train.py:1028] (0/2) Epoch 11, batch 5400, loss[loss=0.2437, simple_loss=0.2719, pruned_loss=0.1077, over 12324.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2572, pruned_loss=0.08755, over 2564881.14 frames. ], batch size: 241, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:23:10,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=195383.83333333334, ans=0.04949747468305833
2024-06-20 11:23:13,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=195383.83333333334, ans=0.125
2024-06-20 11:23:27,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=195420.5, ans=0.2
2024-06-20 11:23:29,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.39 vs. limit=15.0
2024-06-20 11:23:32,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=195438.83333333334, ans=0.0
2024-06-20 11:23:35,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=195438.83333333334, ans=0.0
2024-06-20 11:23:36,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=195457.16666666666, ans=0.025
2024-06-20 11:23:37,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=195457.16666666666, ans=0.0
2024-06-20 11:23:38,565 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.795e+02 1.962e+02 2.264e+02 3.593e+02, threshold=3.924e+02, percent-clipped=0.0
2024-06-20 11:23:39,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=195457.16666666666, ans=0.0
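Throughout these lines the three loss figures obey loss = 0.5 * simple_loss + pruned_loss; e.g. for batch 5400 just above, 0.5 * 0.2719 + 0.1077 = 0.2437. That is the usual pruned-RNN-T recipe of mixing the cheap "simple" transducer loss with the pruned full loss; the 0.5 weight is read off the logged numbers, and any warmup ramp on the weights is omitted here.

def combine_transducer_losses(simple_loss: float,
                              pruned_loss: float,
                              simple_loss_scale: float = 0.5) -> float:
    # Matches the logged fields: loss = scale * simple_loss + pruned_loss.
    return simple_loss_scale * simple_loss + pruned_loss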
2024-06-20 11:23:43,057 INFO [train.py:1028] (0/2) Epoch 11, batch 5450, loss[loss=0.2219, simple_loss=0.2635, pruned_loss=0.09012, over 12432.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2571, pruned_loss=0.08723, over 2568438.76 frames. ], batch size: 25, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:23:52,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=195493.83333333334, ans=0.125
2024-06-20 11:23:57,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.96 vs. limit=10.0
2024-06-20 11:24:04,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.84 vs. limit=22.5
2024-06-20 11:24:07,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=195530.5, ans=0.125
2024-06-20 11:24:15,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=195548.83333333334, ans=0.0
2024-06-20 11:24:16,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195548.83333333334, ans=0.1
2024-06-20 11:24:18,844 INFO [train.py:1028] (0/2) Epoch 11, batch 5500, loss[loss=0.2445, simple_loss=0.2729, pruned_loss=0.1081, over 12250.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2567, pruned_loss=0.08711, over 2562147.57 frames. ], batch size: 240, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:24:31,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=195585.5, ans=0.0
2024-06-20 11:24:34,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=195603.83333333334, ans=0.1
2024-06-20 11:24:37,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=195603.83333333334, ans=0.125
2024-06-20 11:24:40,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195603.83333333334, ans=0.1
2024-06-20 11:24:50,703 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.512e+02 1.743e+02 1.905e+02 2.292e+02 3.353e+02, threshold=3.809e+02, percent-clipped=0.0
2024-06-20 11:24:53,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=195640.5, ans=0.2
2024-06-20 11:24:54,432 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.02 vs. limit=10.0
2024-06-20 11:24:55,250 INFO [train.py:1028] (0/2) Epoch 11, batch 5550, loss[loss=0.2078, simple_loss=0.2497, pruned_loss=0.0829, over 13287.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2558, pruned_loss=0.0865, over 2566127.21 frames. ], batch size: 43, lr: 5.33e-03, grad_scale: 32.0
2024-06-20 11:24:55,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195658.83333333334, ans=0.1
2024-06-20 11:25:06,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=12.0
2024-06-20 11:25:06,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=195677.16666666666, ans=0.0
2024-06-20 11:25:09,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0
2024-06-20 11:25:18,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=195713.83333333334, ans=0.125
2024-06-20 11:25:19,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195713.83333333334, ans=0.125
2024-06-20 11:25:25,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=195732.16666666666, ans=0.125
2024-06-20 11:25:27,388 INFO [train.py:1028] (0/2) Epoch 11, batch 5600, loss[loss=0.2326, simple_loss=0.27, pruned_loss=0.09754, over 13258.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2555, pruned_loss=0.08641, over 2568842.05 frames. ], batch size: 89, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:25:28,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=195750.5, ans=0.0
2024-06-20 11:25:31,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195750.5, ans=0.0
2024-06-20 11:25:32,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=195750.5, ans=0.125
2024-06-20 11:25:34,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=195768.83333333334, ans=0.125
2024-06-20 11:25:40,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0
2024-06-20 11:25:45,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=195787.16666666666, ans=0.125
2024-06-20 11:25:56,183 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.437e+02 1.836e+02 1.971e+02 2.207e+02 3.418e+02, threshold=3.942e+02, percent-clipped=0.0
2024-06-20 11:25:57,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=195823.83333333334, ans=0.2
2024-06-20 11:26:00,605 INFO [train.py:1028] (0/2) Epoch 11, batch 5650, loss[loss=0.2342, simple_loss=0.2708, pruned_loss=0.09881, over 12557.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2559, pruned_loss=0.08634, over 2573972.72 frames. ], batch size: 202, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:26:09,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=195860.5, ans=0.125
2024-06-20 11:26:10,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.05 vs. limit=22.5
2024-06-20 11:26:10,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.50 vs. limit=22.5
2024-06-20 11:26:21,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=195878.83333333334, ans=0.2
2024-06-20 11:26:21,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=195878.83333333334, ans=0.2
2024-06-20 11:26:27,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.57 vs. limit=15.0
2024-06-20 11:26:32,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195915.5, ans=0.1
2024-06-20 11:26:35,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=195915.5, ans=0.125
2024-06-20 11:26:39,570 INFO [train.py:1028] (0/2) Epoch 11, batch 5700, loss[loss=0.2113, simple_loss=0.2599, pruned_loss=0.08136, over 13266.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2557, pruned_loss=0.08619, over 2577932.48 frames. ], batch size: 63, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:26:42,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=195933.83333333334, ans=0.125
2024-06-20 11:26:45,620 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0
2024-06-20 11:26:46,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=195952.16666666666, ans=0.125
2024-06-20 11:26:52,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=195970.5, ans=0.2
2024-06-20 11:27:03,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=195988.83333333334, ans=0.0
2024-06-20 11:27:07,235 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.747e+02 1.894e+02 2.154e+02 3.380e+02, threshold=3.789e+02, percent-clipped=0.0
2024-06-20 11:27:10,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=196007.16666666666, ans=0.0
2024-06-20 11:27:10,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.34 vs. limit=15.0
2024-06-20 11:27:11,587 INFO [train.py:1028] (0/2) Epoch 11, batch 5750, loss[loss=0.2403, simple_loss=0.2737, pruned_loss=0.1035, over 12746.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2564, pruned_loss=0.08657, over 2578349.21 frames. ], batch size: 176, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:27:15,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=196025.5, ans=0.2
2024-06-20 11:27:23,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196043.83333333334, ans=0.1
2024-06-20 11:27:24,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=196062.16666666666, ans=0.125
2024-06-20 11:27:24,399 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 11:27:37,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=196098.83333333334, ans=0.07
2024-06-20 11:27:44,499 INFO [train.py:1028] (0/2) Epoch 11, batch 5800, loss[loss=0.2395, simple_loss=0.28, pruned_loss=0.09951, over 12740.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2581, pruned_loss=0.08797, over 2577183.49 frames. ], batch size: 176, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:27:49,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.94 vs. limit=15.0
2024-06-20 11:27:49,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=196117.16666666666, ans=0.0
2024-06-20 11:28:11,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=196172.16666666666, ans=0.125
2024-06-20 11:28:19,611 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.768e+02 1.870e+02 1.994e+02 2.604e+02, threshold=3.740e+02, percent-clipped=0.0
2024-06-20 11:28:24,330 INFO [train.py:1028] (0/2) Epoch 11, batch 5850, loss[loss=0.2369, simple_loss=0.2733, pruned_loss=0.1002, over 12559.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2603, pruned_loss=0.08894, over 2576021.13 frames. ], batch size: 202, lr: 5.32e-03, grad_scale: 32.0
2024-06-20 11:28:32,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=196227.16666666666, ans=0.0
2024-06-20 11:28:35,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=15.0
2024-06-20 11:28:36,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0
2024-06-20 11:28:55,525 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
], batch size: 121, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:28:58,713 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:29:00,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=196300.5, ans=0.035 2024-06-20 11:29:13,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=196337.16666666666, ans=0.1 2024-06-20 11:29:24,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=16.47 vs. limit=15.0 2024-06-20 11:29:25,568 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.771e+02 1.959e+02 2.095e+02 3.057e+02, threshold=3.919e+02, percent-clipped=0.0 2024-06-20 11:29:28,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=196373.83333333334, ans=0.1 2024-06-20 11:29:30,214 INFO [train.py:1028] (0/2) Epoch 11, batch 5950, loss[loss=0.2016, simple_loss=0.2423, pruned_loss=0.08043, over 13094.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2643, pruned_loss=0.09062, over 2580149.66 frames. ], batch size: 121, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:29:31,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196392.16666666666, ans=0.1 2024-06-20 11:29:32,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=196392.16666666666, ans=0.125 2024-06-20 11:29:41,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=196410.5, ans=0.025 2024-06-20 11:29:44,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196428.83333333334, ans=0.0 2024-06-20 11:29:53,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196447.16666666666, ans=0.1 2024-06-20 11:29:53,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=196447.16666666666, ans=0.0 2024-06-20 11:30:01,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=196465.5, ans=0.2 2024-06-20 11:30:05,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=196483.83333333334, ans=0.0 2024-06-20 11:30:05,970 INFO [train.py:1028] (0/2) Epoch 11, batch 6000, loss[loss=0.3025, simple_loss=0.3173, pruned_loss=0.1438, over 12233.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2661, pruned_loss=0.09155, over 2573519.77 frames. ], batch size: 241, lr: 5.32e-03, grad_scale: 32.0 2024-06-20 11:30:05,971 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 11:30:11,026 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8481, 1.4722, 1.6768, 1.4467], device='cuda:0') 2024-06-20 11:30:14,270 INFO [train.py:1060] (0/2) Epoch 11, validation: loss=0.1963, simple_loss=0.2602, pruned_loss=0.06621, over 351949.00 frames. 
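
A note on the recurring "ScheduledFloat: name=..., batch_count=..., ans=..." records: each reports the current value ("ans") of a float hyperparameter (dropout probabilities, skip rates, balancer probabilities, bypass scales) that is scheduled against the global batch count. The sketch below is a minimal, hypothetical model of such a schedule; the class name, the breakpoints, and the piecewise-linear rule are illustrative assumptions, not the actual scaling.py implementation.

from bisect import bisect_right

class ScheduledFloatSketch:
    """Hypothetical piecewise-linear schedule over batch_count.

    schedule: (batch_count, value) breakpoints; the endpoint values are
    held constant outside the covered range.
    """

    def __init__(self, *schedule):
        self.points = sorted(schedule)

    def value_at(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]        # before the first breakpoint
        if i == len(self.points):
            return self.points[-1][1]       # after the last breakpoint
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)           # linear interpolation

# Example with assumed breakpoints: a dropout_p that decays from 0.3 to 0.1
# and then stays at 0.1, consistent in shape with records like
# "...feed_forward1.out_proj.dropout_p, batch_count=196392.17, ans=0.1".
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
assert abs(dropout_p.value_at(196392.17) - 0.1) < 1e-9

Keying the schedule to batch_count rather than to the epoch has the practical benefit that resuming from a checkpoint at a given batch index reproduces the same scheduled values, which is why these records quote fractional batch counts rather than epochs.
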
2024-06-20 11:30:14,271 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 11:30:17,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.58 vs. limit=10.0 2024-06-20 11:30:25,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=196502.16666666666, ans=0.125 2024-06-20 11:30:29,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=196502.16666666666, ans=0.09899494936611666 2024-06-20 11:30:40,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=196538.83333333334, ans=0.0 2024-06-20 11:30:46,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.82 vs. limit=15.0 2024-06-20 11:30:48,458 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.815e+02 1.948e+02 2.204e+02 3.333e+02, threshold=3.897e+02, percent-clipped=0.0 2024-06-20 11:30:51,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=196557.16666666666, ans=0.125 2024-06-20 11:30:53,438 INFO [train.py:1028] (0/2) Epoch 11, batch 6050, loss[loss=0.2335, simple_loss=0.2727, pruned_loss=0.09712, over 13007.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2684, pruned_loss=0.09255, over 2576138.37 frames. ], batch size: 39, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:30:56,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=196575.5, ans=0.125 2024-06-20 11:30:59,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.78 vs. limit=12.0 2024-06-20 11:31:02,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.46 vs. limit=22.5 2024-06-20 11:31:06,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=196593.83333333334, ans=0.125 2024-06-20 11:31:09,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196612.16666666666, ans=0.1 2024-06-20 11:31:11,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=196612.16666666666, ans=0.025 2024-06-20 11:31:12,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2024-06-20 11:31:24,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2024-06-20 11:31:28,051 INFO [train.py:1028] (0/2) Epoch 11, batch 6100, loss[loss=0.2213, simple_loss=0.2608, pruned_loss=0.09092, over 13124.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2686, pruned_loss=0.09244, over 2578494.13 frames. 
], batch size: 121, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:31:33,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=196667.16666666666, ans=0.2 2024-06-20 11:31:36,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=196685.5, ans=0.0 2024-06-20 11:31:42,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.72 vs. limit=15.0 2024-06-20 11:31:45,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=196703.83333333334, ans=0.125 2024-06-20 11:31:48,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196722.16666666666, ans=0.0 2024-06-20 11:31:51,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=196722.16666666666, ans=0.0 2024-06-20 11:31:52,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=196722.16666666666, ans=0.025 2024-06-20 11:31:54,340 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=15.0 2024-06-20 11:31:57,686 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.778e+02 1.877e+02 2.160e+02 3.097e+02, threshold=3.754e+02, percent-clipped=0.0 2024-06-20 11:32:00,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.09 vs. limit=15.0 2024-06-20 11:32:02,550 INFO [train.py:1028] (0/2) Epoch 11, batch 6150, loss[loss=0.2284, simple_loss=0.2595, pruned_loss=0.09866, over 10793.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2697, pruned_loss=0.09287, over 2577259.36 frames. ], batch size: 304, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:32:03,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.92 vs. limit=15.0 2024-06-20 11:32:03,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.64 vs. limit=10.0 2024-06-20 11:32:13,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.20 vs. limit=15.0 2024-06-20 11:32:40,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=196832.16666666666, ans=0.0 2024-06-20 11:32:42,689 INFO [train.py:1028] (0/2) Epoch 11, batch 6200, loss[loss=0.2395, simple_loss=0.2796, pruned_loss=0.09976, over 13209.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2708, pruned_loss=0.09325, over 2574318.11 frames. 
], batch size: 89, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:32:48,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=196868.83333333334, ans=0.125 2024-06-20 11:32:50,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=196868.83333333334, ans=0.0 2024-06-20 11:32:56,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=196887.16666666666, ans=0.125 2024-06-20 11:32:57,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=196887.16666666666, ans=10.0 2024-06-20 11:33:09,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.71 vs. limit=15.0 2024-06-20 11:33:11,636 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.777e+02 1.930e+02 2.112e+02 2.896e+02, threshold=3.860e+02, percent-clipped=0.0 2024-06-20 11:33:16,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.42 vs. limit=15.0 2024-06-20 11:33:16,529 INFO [train.py:1028] (0/2) Epoch 11, batch 6250, loss[loss=0.2294, simple_loss=0.2702, pruned_loss=0.09429, over 13215.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2729, pruned_loss=0.09412, over 2566777.31 frames. ], batch size: 83, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:33:19,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196942.16666666666, ans=0.1 2024-06-20 11:33:24,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=196960.5, ans=0.125 2024-06-20 11:33:31,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=8.0 2024-06-20 11:33:32,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=196978.83333333334, ans=0.125 2024-06-20 11:33:33,462 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:33:38,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=196997.16666666666, ans=0.125 2024-06-20 11:33:39,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=15.0 2024-06-20 11:33:42,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. limit=6.0 2024-06-20 11:33:44,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=197015.5, ans=0.125 2024-06-20 11:33:45,630 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.76 vs. limit=22.5 2024-06-20 11:33:49,921 INFO [train.py:1028] (0/2) Epoch 11, batch 6300, loss[loss=0.2402, simple_loss=0.29, pruned_loss=0.0952, over 10775.00 frames. 
], tot_loss[loss=0.231, simple_loss=0.2737, pruned_loss=0.0942, over 2562018.26 frames. ], batch size: 16, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:33:52,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=197033.83333333334, ans=0.035 2024-06-20 11:33:53,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=197033.83333333334, ans=0.125 2024-06-20 11:33:56,029 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.22 vs. limit=15.0 2024-06-20 11:34:01,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0 2024-06-20 11:34:08,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=197070.5, ans=15.0 2024-06-20 11:34:20,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=197107.16666666666, ans=0.125 2024-06-20 11:34:21,984 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.793e+02 1.925e+02 2.118e+02 3.266e+02, threshold=3.850e+02, percent-clipped=0.0 2024-06-20 11:34:22,637 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.91 vs. limit=10.0 2024-06-20 11:34:26,613 INFO [train.py:1028] (0/2) Epoch 11, batch 6350, loss[loss=0.2738, simple_loss=0.3058, pruned_loss=0.1209, over 12608.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2752, pruned_loss=0.09451, over 2571385.91 frames. ], batch size: 202, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:34:41,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=197143.83333333334, ans=0.95 2024-06-20 11:34:45,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=197162.16666666666, ans=0.025 2024-06-20 11:34:46,747 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:34:51,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=197180.5, ans=0.04949747468305833 2024-06-20 11:35:03,005 INFO [train.py:1028] (0/2) Epoch 11, batch 6400, loss[loss=0.2038, simple_loss=0.2512, pruned_loss=0.07816, over 13237.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2767, pruned_loss=0.09511, over 2573042.12 frames. ], batch size: 67, lr: 5.31e-03, grad_scale: 32.0 2024-06-20 11:35:09,778 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.80 vs. 
limit=15.0 2024-06-20 11:35:19,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=197253.83333333334, ans=0.0 2024-06-20 11:35:25,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=197272.16666666666, ans=0.125 2024-06-20 11:35:30,737 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.869e+02 2.017e+02 2.148e+02 2.915e+02, threshold=4.035e+02, percent-clipped=0.0 2024-06-20 11:35:35,128 INFO [train.py:1028] (0/2) Epoch 11, batch 6450, loss[loss=0.2806, simple_loss=0.3161, pruned_loss=0.1225, over 12551.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2784, pruned_loss=0.096, over 2578186.00 frames. ], batch size: 202, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:35:36,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=197308.83333333334, ans=0.025 2024-06-20 11:35:37,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197308.83333333334, ans=0.1 2024-06-20 11:36:07,327 INFO [train.py:1028] (0/2) Epoch 11, batch 6500, loss[loss=0.2616, simple_loss=0.2926, pruned_loss=0.1153, over 10614.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.2801, pruned_loss=0.09652, over 2581733.60 frames. ], batch size: 303, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:36:08,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.22 vs. limit=22.5 2024-06-20 11:36:13,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=22.5 2024-06-20 11:36:31,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=197455.5, ans=0.125 2024-06-20 11:36:39,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=197473.83333333334, ans=0.0 2024-06-20 11:36:41,247 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.470e+02 1.890e+02 2.094e+02 2.274e+02 3.406e+02, threshold=4.189e+02, percent-clipped=0.0 2024-06-20 11:36:44,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=197473.83333333334, ans=0.025 2024-06-20 11:36:45,929 INFO [train.py:1028] (0/2) Epoch 11, batch 6550, loss[loss=0.2332, simple_loss=0.2786, pruned_loss=0.09394, over 12748.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.2819, pruned_loss=0.09722, over 2586974.71 frames. 
], batch size: 22, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:36:52,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=197510.5, ans=0.125 2024-06-20 11:36:53,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=197510.5, ans=0.2 2024-06-20 11:36:55,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=197510.5, ans=0.0 2024-06-20 11:37:00,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=197528.83333333334, ans=0.0 2024-06-20 11:37:04,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2024-06-20 11:37:09,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=197547.16666666666, ans=0.0 2024-06-20 11:37:09,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.12 vs. limit=22.5 2024-06-20 11:37:11,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.50 vs. limit=10.0 2024-06-20 11:37:15,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197565.5, ans=0.1 2024-06-20 11:37:15,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=197565.5, ans=0.0 2024-06-20 11:37:18,827 INFO [train.py:1028] (0/2) Epoch 11, batch 6600, loss[loss=0.2387, simple_loss=0.2884, pruned_loss=0.09455, over 13282.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.282, pruned_loss=0.09719, over 2589221.47 frames. ], batch size: 72, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:37:24,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=197583.83333333334, ans=0.0 2024-06-20 11:37:36,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=197620.5, ans=0.125 2024-06-20 11:37:37,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.12 vs. limit=22.5 2024-06-20 11:37:45,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=197657.16666666666, ans=0.125 2024-06-20 11:37:47,189 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.877e+02 1.977e+02 2.147e+02 2.728e+02, threshold=3.953e+02, percent-clipped=0.0 2024-06-20 11:37:51,595 INFO [train.py:1028] (0/2) Epoch 11, batch 6650, loss[loss=0.2845, simple_loss=0.3195, pruned_loss=0.1247, over 12974.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.284, pruned_loss=0.09776, over 2585614.16 frames. ], batch size: 158, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:38:04,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.18 vs. 
limit=15.0 2024-06-20 11:38:08,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=197712.16666666666, ans=0.07 2024-06-20 11:38:08,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197712.16666666666, ans=0.1 2024-06-20 11:38:10,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=197712.16666666666, ans=0.125 2024-06-20 11:38:28,332 INFO [train.py:1028] (0/2) Epoch 11, batch 6700, loss[loss=0.2565, simple_loss=0.2996, pruned_loss=0.1067, over 12827.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.2851, pruned_loss=0.09824, over 2584725.19 frames. ], batch size: 176, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:38:33,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=197767.16666666666, ans=0.125 2024-06-20 11:38:39,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=197785.5, ans=0.125 2024-06-20 11:38:40,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=197785.5, ans=0.1 2024-06-20 11:38:46,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=197803.83333333334, ans=0.125 2024-06-20 11:38:50,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.04 vs. limit=15.0 2024-06-20 11:38:52,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.26 vs. limit=15.0 2024-06-20 11:39:00,031 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.924e+02 2.089e+02 2.464e+02 3.621e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-20 11:39:04,679 INFO [train.py:1028] (0/2) Epoch 11, batch 6750, loss[loss=0.3139, simple_loss=0.3441, pruned_loss=0.1419, over 12130.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.2864, pruned_loss=0.09915, over 2579157.83 frames. ], batch size: 240, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:39:07,789 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2024-06-20 11:39:14,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.22 vs. limit=22.5 2024-06-20 11:39:14,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=197877.16666666666, ans=0.2 2024-06-20 11:39:26,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.05 vs. limit=22.5 2024-06-20 11:39:34,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2024-06-20 11:39:37,139 INFO [train.py:1028] (0/2) Epoch 11, batch 6800, loss[loss=0.2166, simple_loss=0.2635, pruned_loss=0.08484, over 13275.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2878, pruned_loss=0.09957, over 2580968.43 frames. 
], batch size: 67, lr: 5.30e-03, grad_scale: 32.0 2024-06-20 11:39:54,247 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-108000.pt 2024-06-20 11:40:00,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=197987.16666666666, ans=0.0 2024-06-20 11:40:03,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=198005.5, ans=0.125 2024-06-20 11:40:06,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2024-06-20 11:40:10,876 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.969e+02 2.155e+02 2.386e+02 4.167e+02, threshold=4.309e+02, percent-clipped=0.0 2024-06-20 11:40:14,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.42 vs. limit=10.0 2024-06-20 11:40:15,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2024-06-20 11:40:15,591 INFO [train.py:1028] (0/2) Epoch 11, batch 6850, loss[loss=0.2841, simple_loss=0.3288, pruned_loss=0.1197, over 13269.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2879, pruned_loss=0.0992, over 2583675.50 frames. ], batch size: 63, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:40:17,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=198042.16666666666, ans=15.0 2024-06-20 11:40:20,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=198042.16666666666, ans=0.0 2024-06-20 11:40:21,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.38 vs. limit=22.5 2024-06-20 11:40:21,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.12 vs. limit=12.0 2024-06-20 11:40:38,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=198097.16666666666, ans=0.125 2024-06-20 11:40:50,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=198115.5, ans=0.2 2024-06-20 11:40:52,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=198115.5, ans=0.125 2024-06-20 11:40:55,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198115.5, ans=0.1 2024-06-20 11:40:56,117 INFO [train.py:1028] (0/2) Epoch 11, batch 6900, loss[loss=0.2492, simple_loss=0.2887, pruned_loss=0.1048, over 13296.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.289, pruned_loss=0.09974, over 2586659.36 frames. ], batch size: 49, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:40:57,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.93 vs. 
limit=15.0 2024-06-20 11:41:05,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=198152.16666666666, ans=0.125 2024-06-20 11:41:12,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=198170.5, ans=0.125 2024-06-20 11:41:24,025 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.838e+02 1.970e+02 2.138e+02 3.106e+02, threshold=3.939e+02, percent-clipped=0.0 2024-06-20 11:41:24,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=198207.16666666666, ans=0.125 2024-06-20 11:41:28,674 INFO [train.py:1028] (0/2) Epoch 11, batch 6950, loss[loss=0.2023, simple_loss=0.2517, pruned_loss=0.07644, over 12119.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.2893, pruned_loss=0.09948, over 2580763.60 frames. ], batch size: 17, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:41:34,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=198243.83333333334, ans=10.0 2024-06-20 11:41:42,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=198262.16666666666, ans=0.125 2024-06-20 11:41:43,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=198262.16666666666, ans=6.0 2024-06-20 11:41:49,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=198280.5, ans=0.2 2024-06-20 11:41:55,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=198298.83333333334, ans=0.1 2024-06-20 11:41:58,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198298.83333333334, ans=0.1 2024-06-20 11:41:58,396 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:42:01,426 INFO [train.py:1028] (0/2) Epoch 11, batch 7000, loss[loss=0.2617, simple_loss=0.3003, pruned_loss=0.1115, over 12924.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.2896, pruned_loss=0.09935, over 2575680.54 frames. ], batch size: 158, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:42:20,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=198353.83333333334, ans=0.125 2024-06-20 11:42:22,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=198372.16666666666, ans=0.2 2024-06-20 11:42:24,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=198372.16666666666, ans=0.125 2024-06-20 11:42:30,773 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. 
limit=6.0 2024-06-20 11:42:31,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.888e+02 2.046e+02 2.225e+02 3.105e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 11:42:39,218 INFO [train.py:1028] (0/2) Epoch 11, batch 7050, loss[loss=0.2561, simple_loss=0.2926, pruned_loss=0.1098, over 12787.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.2908, pruned_loss=0.1, over 2582697.25 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2024-06-20 11:42:44,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=198408.83333333334, ans=0.0 2024-06-20 11:42:45,513 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:42:59,031 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.42 vs. limit=22.5 2024-06-20 11:43:01,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=198445.5, ans=0.0 2024-06-20 11:43:03,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198463.83333333334, ans=0.1 2024-06-20 11:43:03,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=198463.83333333334, ans=0.2 2024-06-20 11:43:16,006 INFO [train.py:1028] (0/2) Epoch 11, batch 7100, loss[loss=0.2775, simple_loss=0.3198, pruned_loss=0.1176, over 13138.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2921, pruned_loss=0.1009, over 2575448.96 frames. ], batch size: 112, lr: 5.29e-03, grad_scale: 64.0 2024-06-20 11:43:16,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=198500.5, ans=0.0 2024-06-20 11:43:28,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=198518.83333333334, ans=0.025 2024-06-20 11:43:39,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.94 vs. limit=15.0 2024-06-20 11:43:41,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.18 vs. limit=22.5 2024-06-20 11:43:43,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2024-06-20 11:43:45,295 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.901e+02 2.035e+02 2.226e+02 2.939e+02, threshold=4.070e+02, percent-clipped=0.0 2024-06-20 11:43:46,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=198573.83333333334, ans=0.125 2024-06-20 11:43:49,998 INFO [train.py:1028] (0/2) Epoch 11, batch 7150, loss[loss=0.3164, simple_loss=0.3456, pruned_loss=0.1436, over 12509.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.2923, pruned_loss=0.1008, over 2574941.80 frames. 
], batch size: 202, lr: 5.29e-03, grad_scale: 64.0 2024-06-20 11:43:50,872 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:43:52,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=198592.16666666666, ans=0.125 2024-06-20 11:43:54,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=198592.16666666666, ans=0.125 2024-06-20 11:43:57,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=198610.5, ans=0.125 2024-06-20 11:43:59,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=198610.5, ans=0.025 2024-06-20 11:43:59,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=198610.5, ans=0.0 2024-06-20 11:44:06,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=198628.83333333334, ans=0.125 2024-06-20 11:44:12,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.53 vs. limit=15.0 2024-06-20 11:44:17,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198665.5, ans=0.1 2024-06-20 11:44:19,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=198665.5, ans=0.125 2024-06-20 11:44:22,959 INFO [train.py:1028] (0/2) Epoch 11, batch 7200, loss[loss=0.2708, simple_loss=0.3102, pruned_loss=0.1157, over 13168.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2944, pruned_loss=0.1018, over 2580024.20 frames. ], batch size: 112, lr: 5.29e-03, grad_scale: 64.0 2024-06-20 11:44:23,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=198683.83333333334, ans=0.125 2024-06-20 11:44:28,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=198683.83333333334, ans=0.125 2024-06-20 11:44:29,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=198702.16666666666, ans=0.025 2024-06-20 11:44:36,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=198720.5, ans=0.125 2024-06-20 11:44:39,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=198720.5, ans=0.1 2024-06-20 11:44:41,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.25 vs. 
limit=10.0 2024-06-20 11:44:42,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=198738.83333333334, ans=0.025 2024-06-20 11:44:54,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198757.16666666666, ans=0.1 2024-06-20 11:44:55,324 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.906e+02 2.094e+02 2.287e+02 3.088e+02, threshold=4.189e+02, percent-clipped=0.0 2024-06-20 11:45:04,183 INFO [train.py:1028] (0/2) Epoch 11, batch 7250, loss[loss=0.2318, simple_loss=0.2864, pruned_loss=0.08861, over 12929.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.2942, pruned_loss=0.1014, over 2580930.42 frames. ], batch size: 36, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:45:06,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=198775.5, ans=0.0 2024-06-20 11:45:06,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=198775.5, ans=0.0 2024-06-20 11:45:21,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198812.16666666666, ans=0.1 2024-06-20 11:45:27,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=198830.5, ans=0.2 2024-06-20 11:45:28,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198830.5, ans=0.1 2024-06-20 11:45:31,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=198848.83333333334, ans=22.5 2024-06-20 11:45:35,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=12.0 2024-06-20 11:45:35,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198848.83333333334, ans=0.1 2024-06-20 11:45:37,367 INFO [train.py:1028] (0/2) Epoch 11, batch 7300, loss[loss=0.2665, simple_loss=0.3107, pruned_loss=0.1111, over 12937.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.2947, pruned_loss=0.1018, over 2580817.34 frames. ], batch size: 36, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:45:42,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=198867.16666666666, ans=0.0 2024-06-20 11:45:46,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=198885.5, ans=0.0 2024-06-20 11:45:54,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=198903.83333333334, ans=0.125 2024-06-20 11:46:05,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 1.864e+02 2.012e+02 2.157e+02 3.018e+02, threshold=4.024e+02, percent-clipped=0.0 2024-06-20 11:46:10,522 INFO [train.py:1028] (0/2) Epoch 11, batch 7350, loss[loss=0.2834, simple_loss=0.3238, pruned_loss=0.1216, over 13313.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.295, pruned_loss=0.1019, over 2581564.71 frames. 
], batch size: 46, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:46:10,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.15 vs. limit=15.0 2024-06-20 11:46:43,849 INFO [train.py:1028] (0/2) Epoch 11, batch 7400, loss[loss=0.2632, simple_loss=0.3074, pruned_loss=0.1095, over 13251.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.2953, pruned_loss=0.1019, over 2586863.49 frames. ], batch size: 63, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:46:46,544 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.625e-01 2024-06-20 11:47:02,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=199087.16666666666, ans=0.2 2024-06-20 11:47:10,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199105.5, ans=0.1 2024-06-20 11:47:15,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.37 vs. limit=12.0 2024-06-20 11:47:19,877 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.904e+02 1.986e+02 2.156e+02 2.713e+02, threshold=3.972e+02, percent-clipped=0.0 2024-06-20 11:47:21,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.35 vs. limit=12.0 2024-06-20 11:47:24,716 INFO [train.py:1028] (0/2) Epoch 11, batch 7450, loss[loss=0.2018, simple_loss=0.257, pruned_loss=0.07335, over 12666.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.2951, pruned_loss=0.1017, over 2579433.03 frames. ], batch size: 29, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:47:26,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=199142.16666666666, ans=0.2 2024-06-20 11:47:33,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=199160.5, ans=0.0 2024-06-20 11:47:41,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=199178.83333333334, ans=0.0 2024-06-20 11:47:51,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=199215.5, ans=0.1 2024-06-20 11:47:57,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=199215.5, ans=0.0 2024-06-20 11:47:58,992 INFO [train.py:1028] (0/2) Epoch 11, batch 7500, loss[loss=0.2683, simple_loss=0.3053, pruned_loss=0.1156, over 10526.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.2966, pruned_loss=0.1026, over 2576961.25 frames. ], batch size: 303, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:48:09,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.06 vs. limit=15.0 2024-06-20 11:48:11,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.29 vs. 
limit=12.0 2024-06-20 11:48:27,629 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.992e+02 2.217e+02 2.632e+02 3.692e+02, threshold=4.433e+02, percent-clipped=0.0 2024-06-20 11:48:31,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.96 vs. limit=15.0 2024-06-20 11:48:32,325 INFO [train.py:1028] (0/2) Epoch 11, batch 7550, loss[loss=0.2429, simple_loss=0.2866, pruned_loss=0.09961, over 12878.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.2976, pruned_loss=0.1034, over 2575953.71 frames. ], batch size: 158, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:48:48,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=199362.16666666666, ans=0.1 2024-06-20 11:48:58,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=199380.5, ans=0.125 2024-06-20 11:49:08,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=199398.83333333334, ans=0.125 2024-06-20 11:49:11,813 INFO [train.py:1028] (0/2) Epoch 11, batch 7600, loss[loss=0.2546, simple_loss=0.3013, pruned_loss=0.1039, over 13245.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.2988, pruned_loss=0.1039, over 2575646.24 frames. ], batch size: 83, lr: 5.28e-03, grad_scale: 64.0 2024-06-20 11:49:20,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=199435.5, ans=0.0 2024-06-20 11:49:22,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=199435.5, ans=0.125 2024-06-20 11:49:22,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=12.0 2024-06-20 11:49:26,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=199453.83333333334, ans=0.0 2024-06-20 11:49:28,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=199453.83333333334, ans=0.0 2024-06-20 11:49:34,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=199472.16666666666, ans=0.125 2024-06-20 11:49:36,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2024-06-20 11:49:40,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.886e+02 2.069e+02 2.331e+02 3.178e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-20 11:49:41,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2024-06-20 11:49:45,806 INFO [train.py:1028] (0/2) Epoch 11, batch 7650, loss[loss=0.2446, simple_loss=0.2931, pruned_loss=0.09803, over 12932.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.2986, pruned_loss=0.1038, over 2572945.64 frames. 
], batch size: 33, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:49:49,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=199508.83333333334, ans=0.1 2024-06-20 11:50:08,951 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.458e+00 2024-06-20 11:50:10,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=199563.83333333334, ans=0.125 2024-06-20 11:50:13,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=199582.16666666666, ans=0.0 2024-06-20 11:50:16,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=199582.16666666666, ans=0.125 2024-06-20 11:50:18,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=199582.16666666666, ans=0.125 2024-06-20 11:50:19,913 INFO [train.py:1028] (0/2) Epoch 11, batch 7700, loss[loss=0.2832, simple_loss=0.3387, pruned_loss=0.1138, over 13220.00 frames. ], tot_loss[loss=0.253, simple_loss=0.2988, pruned_loss=0.1037, over 2569582.97 frames. ], batch size: 63, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:50:24,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=199600.5, ans=12.0 2024-06-20 11:50:29,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=199618.83333333334, ans=0.125 2024-06-20 11:50:29,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=199618.83333333334, ans=0.0 2024-06-20 11:50:32,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199637.16666666666, ans=0.125 2024-06-20 11:50:46,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199673.83333333334, ans=0.1 2024-06-20 11:50:47,407 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.944e+02 2.183e+02 2.511e+02 4.277e+02, threshold=4.366e+02, percent-clipped=1.0 2024-06-20 11:50:55,390 INFO [train.py:1028] (0/2) Epoch 11, batch 7750, loss[loss=0.2311, simple_loss=0.2939, pruned_loss=0.08421, over 13260.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.299, pruned_loss=0.1037, over 2574768.30 frames. ], batch size: 72, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:50:57,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.01 vs. 
limit=15.0 2024-06-20 11:51:07,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=199710.5, ans=0.025 2024-06-20 11:51:18,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=199747.16666666666, ans=0.025 2024-06-20 11:51:22,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=199747.16666666666, ans=0.125 2024-06-20 11:51:24,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.79 vs. limit=15.0 2024-06-20 11:51:27,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199765.5, ans=0.1 2024-06-20 11:51:29,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.73 vs. limit=22.5 2024-06-20 11:51:30,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2024-06-20 11:51:32,143 INFO [train.py:1028] (0/2) Epoch 11, batch 7800, loss[loss=0.2664, simple_loss=0.307, pruned_loss=0.1129, over 13208.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.299, pruned_loss=0.1037, over 2579463.28 frames. ], batch size: 95, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:51:34,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=199783.83333333334, ans=0.1 2024-06-20 11:51:34,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=199783.83333333334, ans=0.125 2024-06-20 11:51:37,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=199783.83333333334, ans=0.125 2024-06-20 11:51:50,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=199820.5, ans=0.2 2024-06-20 11:51:51,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=199820.5, ans=0.2 2024-06-20 11:52:01,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.49 vs. limit=15.0 2024-06-20 11:52:01,385 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.911e+02 2.073e+02 2.295e+02 3.294e+02, threshold=4.146e+02, percent-clipped=0.0 2024-06-20 11:52:02,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=199857.16666666666, ans=0.0 2024-06-20 11:52:06,068 INFO [train.py:1028] (0/2) Epoch 11, batch 7850, loss[loss=0.2373, simple_loss=0.2869, pruned_loss=0.09387, over 10915.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3004, pruned_loss=0.1045, over 2572892.29 frames. ], batch size: 16, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:52:07,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. 
limit=15.0 2024-06-20 11:52:08,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=199875.5, ans=0.1 2024-06-20 11:52:11,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2024-06-20 11:52:37,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=199948.83333333334, ans=0.0 2024-06-20 11:52:38,859 INFO [train.py:1028] (0/2) Epoch 11, batch 7900, loss[loss=0.2672, simple_loss=0.3175, pruned_loss=0.1085, over 13148.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3001, pruned_loss=0.1042, over 2572298.78 frames. ], batch size: 77, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:52:45,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=199967.16666666666, ans=0.0 2024-06-20 11:53:11,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=200040.5, ans=0.125 2024-06-20 11:53:14,130 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.938e+02 2.170e+02 2.552e+02 4.164e+02, threshold=4.340e+02, percent-clipped=1.0 2024-06-20 11:53:16,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=200040.5, ans=0.125 2024-06-20 11:53:17,031 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2024-06-20 11:53:18,679 INFO [train.py:1028] (0/2) Epoch 11, batch 7950, loss[loss=0.2545, simple_loss=0.2891, pruned_loss=0.11, over 10583.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3007, pruned_loss=0.1044, over 2575423.44 frames. ], batch size: 303, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:53:21,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=200058.83333333334, ans=0.125 2024-06-20 11:53:21,164 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=9.025e-01 2024-06-20 11:53:22,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.14 vs. limit=15.0 2024-06-20 11:53:23,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=200058.83333333334, ans=0.125 2024-06-20 11:53:26,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=200077.16666666666, ans=0.125 2024-06-20 11:53:29,424 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 11:53:38,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=200113.83333333334, ans=0.07 2024-06-20 11:53:45,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=200132.16666666666, ans=0.125 2024-06-20 11:53:51,213 INFO [train.py:1028] (0/2) Epoch 11, batch 8000, loss[loss=0.2176, simple_loss=0.2772, pruned_loss=0.07902, over 12546.00 frames. 
], tot_loss[loss=0.2559, simple_loss=0.3018, pruned_loss=0.105, over 2572073.78 frames. ], batch size: 29, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:53:53,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.80 vs. limit=22.5 2024-06-20 11:54:08,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=200187.16666666666, ans=0.125 2024-06-20 11:54:16,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=200205.5, ans=0.035 2024-06-20 11:54:19,762 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.934e+02 2.202e+02 2.647e+02 3.478e+02, threshold=4.405e+02, percent-clipped=0.0 2024-06-20 11:54:22,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=200223.83333333334, ans=0.0 2024-06-20 11:54:24,672 INFO [train.py:1028] (0/2) Epoch 11, batch 8050, loss[loss=0.2588, simple_loss=0.3012, pruned_loss=0.1082, over 13190.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3015, pruned_loss=0.1047, over 2572177.84 frames. ], batch size: 83, lr: 5.27e-03, grad_scale: 64.0 2024-06-20 11:54:24,769 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=3.924e-01 2024-06-20 11:54:25,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.72 vs. limit=15.0 2024-06-20 11:54:26,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=200242.16666666666, ans=0.125 2024-06-20 11:54:35,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200260.5, ans=0.1 2024-06-20 11:54:38,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=200278.83333333334, ans=0.125 2024-06-20 11:54:43,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.85 vs. limit=12.0 2024-06-20 11:54:48,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=200297.16666666666, ans=0.05 2024-06-20 11:55:03,433 INFO [train.py:1028] (0/2) Epoch 11, batch 8100, loss[loss=0.2354, simple_loss=0.287, pruned_loss=0.09188, over 13141.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3022, pruned_loss=0.105, over 2576898.36 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:55:05,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=200333.83333333334, ans=0.0 2024-06-20 11:55:05,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=200333.83333333334, ans=0.125 2024-06-20 11:55:06,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=200333.83333333334, ans=0.125 2024-06-20 11:55:17,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. 
limit=15.0 2024-06-20 11:55:20,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=200370.5, ans=0.125 2024-06-20 11:55:32,546 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.886e+02 2.007e+02 2.208e+02 2.846e+02, threshold=4.015e+02, percent-clipped=0.0 2024-06-20 11:55:34,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=200407.16666666666, ans=0.0 2024-06-20 11:55:37,442 INFO [train.py:1028] (0/2) Epoch 11, batch 8150, loss[loss=0.2486, simple_loss=0.2908, pruned_loss=0.1032, over 13132.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3028, pruned_loss=0.1046, over 2580662.81 frames. ], batch size: 121, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:55:50,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=200462.16666666666, ans=0.95 2024-06-20 11:55:50,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=200462.16666666666, ans=0.125 2024-06-20 11:55:58,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=200480.5, ans=0.125 2024-06-20 11:56:03,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200498.83333333334, ans=0.1 2024-06-20 11:56:04,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.64 vs. limit=15.0 2024-06-20 11:56:07,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.63 vs. limit=22.5 2024-06-20 11:56:10,051 INFO [train.py:1028] (0/2) Epoch 11, batch 8200, loss[loss=0.2845, simple_loss=0.3228, pruned_loss=0.1231, over 13151.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3027, pruned_loss=0.1043, over 2583890.23 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:56:22,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=200535.5, ans=0.125 2024-06-20 11:56:33,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=200572.16666666666, ans=0.0 2024-06-20 11:56:39,265 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+02 1.939e+02 2.107e+02 2.313e+02 2.950e+02, threshold=4.214e+02, percent-clipped=0.0 2024-06-20 11:56:43,871 INFO [train.py:1028] (0/2) Epoch 11, batch 8250, loss[loss=0.2716, simple_loss=0.3241, pruned_loss=0.1096, over 13235.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3031, pruned_loss=0.1045, over 2584658.96 frames. ], batch size: 52, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:56:48,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=200608.83333333334, ans=0.125 2024-06-20 11:57:01,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=200627.16666666666, ans=0.125 2024-06-20 11:57:03,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.07 vs. 
limit=15.0 2024-06-20 11:57:11,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=200663.83333333334, ans=0.125 2024-06-20 11:57:13,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200663.83333333334, ans=0.1 2024-06-20 11:57:21,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0 2024-06-20 11:57:23,972 INFO [train.py:1028] (0/2) Epoch 11, batch 8300, loss[loss=0.2602, simple_loss=0.3113, pruned_loss=0.1045, over 13000.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3026, pruned_loss=0.1043, over 2580754.57 frames. ], batch size: 102, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:57:28,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.69 vs. limit=10.0 2024-06-20 11:57:31,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=200718.83333333334, ans=0.125 2024-06-20 11:57:32,458 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.71 vs. limit=12.0 2024-06-20 11:57:32,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=200718.83333333334, ans=0.125 2024-06-20 11:57:36,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=200737.16666666666, ans=0.0 2024-06-20 11:57:45,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200755.5, ans=0.1 2024-06-20 11:57:51,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=200773.83333333334, ans=0.125 2024-06-20 11:57:52,847 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.902e+02 2.062e+02 2.207e+02 2.964e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-20 11:57:55,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.18 vs. limit=15.0 2024-06-20 11:57:57,329 INFO [train.py:1028] (0/2) Epoch 11, batch 8350, loss[loss=0.248, simple_loss=0.2942, pruned_loss=0.1009, over 13176.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3029, pruned_loss=0.1041, over 2582196.33 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 64.0 2024-06-20 11:58:05,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=200810.5, ans=0.2 2024-06-20 11:58:06,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=200810.5, ans=0.0 2024-06-20 11:58:10,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=200828.83333333334, ans=0.125 2024-06-20 11:58:12,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
limit=15.0 2024-06-20 11:58:19,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=200847.16666666666, ans=0.035 2024-06-20 11:58:21,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=200847.16666666666, ans=0.125 2024-06-20 11:58:25,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=200865.5, ans=0.0 2024-06-20 11:58:31,075 INFO [train.py:1028] (0/2) Epoch 11, batch 8400, loss[loss=0.2483, simple_loss=0.2937, pruned_loss=0.1014, over 12899.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3031, pruned_loss=0.1043, over 2577055.87 frames. ], batch size: 39, lr: 5.26e-03, grad_scale: 32.0 2024-06-20 11:58:34,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2024-06-20 11:58:41,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=200902.16666666666, ans=0.1 2024-06-20 11:58:47,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=200920.5, ans=0.2 2024-06-20 11:58:54,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=200938.83333333334, ans=0.125 2024-06-20 11:59:02,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=200938.83333333334, ans=0.125 2024-06-20 11:59:07,467 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 2.076e+02 2.218e+02 2.468e+02 3.872e+02, threshold=4.437e+02, percent-clipped=0.0 2024-06-20 11:59:11,307 INFO [train.py:1028] (0/2) Epoch 11, batch 8450, loss[loss=0.2517, simple_loss=0.3008, pruned_loss=0.1013, over 13203.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3044, pruned_loss=0.1047, over 2579003.10 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 32.0 2024-06-20 11:59:19,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=200993.83333333334, ans=0.2 2024-06-20 11:59:26,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=201012.16666666666, ans=0.125 2024-06-20 11:59:43,652 INFO [train.py:1028] (0/2) Epoch 11, batch 8500, loss[loss=0.2675, simple_loss=0.3167, pruned_loss=0.1092, over 12631.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3052, pruned_loss=0.1051, over 2576999.80 frames. ], batch size: 29, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 11:59:48,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=201067.16666666666, ans=0.125 2024-06-20 11:59:54,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.35 vs. 
limit=10.0 2024-06-20 12:00:12,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=201140.5, ans=0.125 2024-06-20 12:00:13,489 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 2.056e+02 2.251e+02 2.543e+02 3.167e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-20 12:00:14,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=201140.5, ans=0.125 2024-06-20 12:00:16,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=201140.5, ans=0.125 2024-06-20 12:00:17,409 INFO [train.py:1028] (0/2) Epoch 11, batch 8550, loss[loss=0.2602, simple_loss=0.3099, pruned_loss=0.1053, over 12521.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3046, pruned_loss=0.1049, over 2575785.80 frames. ], batch size: 22, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:00:23,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0 2024-06-20 12:00:34,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=201195.5, ans=0.09899494936611666 2024-06-20 12:00:35,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=201195.5, ans=0.5 2024-06-20 12:00:51,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.71 vs. limit=10.0 2024-06-20 12:00:56,463 INFO [train.py:1028] (0/2) Epoch 11, batch 8600, loss[loss=0.2543, simple_loss=0.2993, pruned_loss=0.1047, over 13110.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3046, pruned_loss=0.105, over 2573117.48 frames. ], batch size: 112, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:01:18,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=201287.16666666666, ans=0.125 2024-06-20 12:01:30,521 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.904e+02 2.019e+02 2.207e+02 2.818e+02, threshold=4.037e+02, percent-clipped=0.0 2024-06-20 12:01:33,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2024-06-20 12:01:34,889 INFO [train.py:1028] (0/2) Epoch 11, batch 8650, loss[loss=0.277, simple_loss=0.3195, pruned_loss=0.1173, over 13174.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3052, pruned_loss=0.105, over 2577131.99 frames. ], batch size: 103, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:01:35,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=10.45 vs. limit=12.0 2024-06-20 12:01:39,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201342.16666666666, ans=0.0 2024-06-20 12:02:01,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.87 vs. 
limit=15.0 2024-06-20 12:02:04,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=201415.5, ans=0.0 2024-06-20 12:02:08,323 INFO [train.py:1028] (0/2) Epoch 11, batch 8700, loss[loss=0.2502, simple_loss=0.3071, pruned_loss=0.09668, over 13230.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3045, pruned_loss=0.1048, over 2573933.57 frames. ], batch size: 59, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:02:17,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=201452.16666666666, ans=0.125 2024-06-20 12:02:24,830 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:02:27,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=201470.5, ans=0.2 2024-06-20 12:02:38,584 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.951e+02 2.031e+02 2.219e+02 3.073e+02, threshold=4.062e+02, percent-clipped=0.0 2024-06-20 12:02:42,677 INFO [train.py:1028] (0/2) Epoch 11, batch 8750, loss[loss=0.2778, simple_loss=0.3134, pruned_loss=0.1211, over 13110.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3045, pruned_loss=0.1051, over 2570105.87 frames. ], batch size: 121, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:02:48,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=201525.5, ans=0.07 2024-06-20 12:02:55,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=201543.83333333334, ans=0.025 2024-06-20 12:02:57,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=201543.83333333334, ans=0.0 2024-06-20 12:03:01,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=201562.16666666666, ans=0.025 2024-06-20 12:03:06,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201562.16666666666, ans=0.1 2024-06-20 12:03:09,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=201580.5, ans=0.125 2024-06-20 12:03:10,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.09 vs. limit=10.0 2024-06-20 12:03:23,386 INFO [train.py:1028] (0/2) Epoch 11, batch 8800, loss[loss=0.2614, simple_loss=0.319, pruned_loss=0.1019, over 13266.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3043, pruned_loss=0.105, over 2574575.96 frames. 
], batch size: 72, lr: 5.25e-03, grad_scale: 32.0 2024-06-20 12:03:33,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=201635.5, ans=0.125 2024-06-20 12:03:35,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=201635.5, ans=0.125 2024-06-20 12:03:44,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=201672.16666666666, ans=0.125 2024-06-20 12:03:50,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=201690.5, ans=10.0 2024-06-20 12:03:53,796 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.958e+02 2.115e+02 2.324e+02 2.938e+02, threshold=4.230e+02, percent-clipped=0.0 2024-06-20 12:03:57,512 INFO [train.py:1028] (0/2) Epoch 11, batch 8850, loss[loss=0.2742, simple_loss=0.3154, pruned_loss=0.1165, over 12514.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3046, pruned_loss=0.1057, over 2564554.22 frames. ], batch size: 202, lr: 5.25e-03, grad_scale: 16.0 2024-06-20 12:04:05,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=201727.16666666666, ans=0.125 2024-06-20 12:04:15,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=201745.5, ans=0.125 2024-06-20 12:04:27,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=201782.16666666666, ans=0.0 2024-06-20 12:04:30,791 INFO [train.py:1028] (0/2) Epoch 11, batch 8900, loss[loss=0.2611, simple_loss=0.3122, pruned_loss=0.105, over 12868.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3058, pruned_loss=0.1062, over 2562122.35 frames. ], batch size: 33, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:04:39,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=201818.83333333334, ans=0.125 2024-06-20 12:04:44,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=201837.16666666666, ans=0.0 2024-06-20 12:04:44,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=201837.16666666666, ans=0.125 2024-06-20 12:05:07,257 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.045e+02 2.227e+02 2.477e+02 3.224e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-20 12:05:09,225 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.55 vs. limit=15.0 2024-06-20 12:05:10,791 INFO [train.py:1028] (0/2) Epoch 11, batch 8950, loss[loss=0.2858, simple_loss=0.3286, pruned_loss=0.1215, over 12614.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.306, pruned_loss=0.1061, over 2562422.65 frames. 
], batch size: 202, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:05:13,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=201892.16666666666, ans=0.2 2024-06-20 12:05:23,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=201928.83333333334, ans=0.0 2024-06-20 12:05:27,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=201928.83333333334, ans=0.2 2024-06-20 12:05:30,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=201947.16666666666, ans=0.0 2024-06-20 12:05:30,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=201947.16666666666, ans=0.0 2024-06-20 12:05:32,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=201947.16666666666, ans=0.0 2024-06-20 12:05:33,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=201947.16666666666, ans=0.0 2024-06-20 12:05:34,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=201947.16666666666, ans=0.125 2024-06-20 12:05:42,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.75 vs. limit=15.0 2024-06-20 12:05:44,144 INFO [train.py:1028] (0/2) Epoch 11, batch 9000, loss[loss=0.2634, simple_loss=0.3128, pruned_loss=0.107, over 13263.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3058, pruned_loss=0.1055, over 2567785.10 frames. ], batch size: 46, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:05:44,145 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 12:05:52,044 INFO [train.py:1060] (0/2) Epoch 11, validation: loss=0.1959, simple_loss=0.2595, pruned_loss=0.06618, over 351949.00 frames. 2024-06-20 12:05:52,045 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 12:05:57,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2024-06-20 12:06:01,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.81 vs. 
limit=22.5 2024-06-20 12:06:05,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=202020.5, ans=0.025 2024-06-20 12:06:05,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=202020.5, ans=0.2 2024-06-20 12:06:15,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202038.83333333334, ans=0.1 2024-06-20 12:06:16,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=202038.83333333334, ans=0.0 2024-06-20 12:06:21,413 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.948e+02 2.091e+02 2.280e+02 3.462e+02, threshold=4.181e+02, percent-clipped=0.0 2024-06-20 12:06:24,563 INFO [train.py:1028] (0/2) Epoch 11, batch 9050, loss[loss=0.2117, simple_loss=0.2622, pruned_loss=0.08066, over 11376.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3066, pruned_loss=0.1059, over 2567823.01 frames. ], batch size: 17, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:06:27,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.94 vs. limit=15.0 2024-06-20 12:06:29,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=202075.5, ans=0.125 2024-06-20 12:06:35,405 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.98 vs. limit=15.0 2024-06-20 12:06:40,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=202112.16666666666, ans=0.0 2024-06-20 12:06:42,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=202112.16666666666, ans=0.0 2024-06-20 12:06:46,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.94 vs. limit=22.5 2024-06-20 12:06:47,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=202130.5, ans=0.2 2024-06-20 12:06:48,087 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2024-06-20 12:06:48,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=202130.5, ans=0.125 2024-06-20 12:06:50,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=202148.83333333334, ans=0.0 2024-06-20 12:06:51,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=202148.83333333334, ans=0.125 2024-06-20 12:06:52,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2024-06-20 12:06:57,332 INFO [train.py:1028] (0/2) Epoch 11, batch 9100, loss[loss=0.2447, simple_loss=0.2978, pruned_loss=0.09581, over 13278.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3057, pruned_loss=0.1048, over 2568534.93 frames. 
], batch size: 72, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:07:03,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=202185.5, ans=0.0 2024-06-20 12:07:05,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=202185.5, ans=0.0 2024-06-20 12:07:06,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.94 vs. limit=15.0 2024-06-20 12:07:16,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=202222.16666666666, ans=0.0 2024-06-20 12:07:21,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=202222.16666666666, ans=0.125 2024-06-20 12:07:23,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=202240.5, ans=0.125 2024-06-20 12:07:24,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0 2024-06-20 12:07:26,319 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.875e+02 2.034e+02 2.248e+02 2.897e+02, threshold=4.068e+02, percent-clipped=0.0 2024-06-20 12:07:29,522 INFO [train.py:1028] (0/2) Epoch 11, batch 9150, loss[loss=0.2382, simple_loss=0.2957, pruned_loss=0.09034, over 13198.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3058, pruned_loss=0.1052, over 2569054.88 frames. ], batch size: 77, lr: 5.24e-03, grad_scale: 16.0 2024-06-20 12:07:30,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=202258.83333333334, ans=0.125 2024-06-20 12:07:48,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202313.83333333334, ans=0.1 2024-06-20 12:07:49,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=202313.83333333334, ans=0.125 2024-06-20 12:07:57,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2024-06-20 12:07:57,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=202332.16666666666, ans=0.125 2024-06-20 12:08:04,272 INFO [train.py:1028] (0/2) Epoch 11, batch 9200, loss[loss=0.2554, simple_loss=0.3055, pruned_loss=0.1026, over 12923.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3057, pruned_loss=0.1047, over 2571800.66 frames. ], batch size: 36, lr: 5.24e-03, grad_scale: 32.0 2024-06-20 12:08:07,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=202350.5, ans=0.2 2024-06-20 12:08:25,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=202387.16666666666, ans=0.125 2024-06-20 12:08:26,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. 
limit=6.0 2024-06-20 12:08:28,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=202405.5, ans=0.0 2024-06-20 12:08:32,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.03 vs. limit=15.0 2024-06-20 12:08:33,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.65 vs. limit=15.0 2024-06-20 12:08:35,620 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.874e+02 2.043e+02 2.265e+02 3.021e+02, threshold=4.086e+02, percent-clipped=0.0 2024-06-20 12:08:38,966 INFO [train.py:1028] (0/2) Epoch 11, batch 9250, loss[loss=0.2551, simple_loss=0.31, pruned_loss=0.1001, over 13151.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3057, pruned_loss=0.1045, over 2575030.50 frames. ], batch size: 67, lr: 5.24e-03, grad_scale: 32.0 2024-06-20 12:08:48,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=15.0 2024-06-20 12:08:54,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0 2024-06-20 12:08:59,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2024-06-20 12:09:10,976 INFO [train.py:1028] (0/2) Epoch 11, batch 9300, loss[loss=0.2678, simple_loss=0.3029, pruned_loss=0.1164, over 12886.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3056, pruned_loss=0.1044, over 2572105.31 frames. ], batch size: 39, lr: 5.24e-03, grad_scale: 32.0 2024-06-20 12:09:15,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202533.83333333334, ans=0.1 2024-06-20 12:09:16,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0 2024-06-20 12:09:24,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=202570.5, ans=0.025 2024-06-20 12:09:28,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=202570.5, ans=0.2 2024-06-20 12:09:31,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2024-06-20 12:09:39,005 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.926e+02 2.088e+02 2.288e+02 3.539e+02, threshold=4.176e+02, percent-clipped=0.0 2024-06-20 12:09:42,177 INFO [train.py:1028] (0/2) Epoch 11, batch 9350, loss[loss=0.3082, simple_loss=0.3452, pruned_loss=0.1357, over 12540.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3055, pruned_loss=0.1045, over 2568955.90 frames. 
], batch size: 22, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:09:44,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202625.5, ans=0.1 2024-06-20 12:09:49,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=202643.83333333334, ans=0.125 2024-06-20 12:09:51,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=202643.83333333334, ans=0.125 2024-06-20 12:10:09,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=202698.83333333334, ans=0.125 2024-06-20 12:10:11,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.07 vs. limit=15.0 2024-06-20 12:10:13,464 INFO [train.py:1028] (0/2) Epoch 11, batch 9400, loss[loss=0.2743, simple_loss=0.3335, pruned_loss=0.1076, over 13270.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3058, pruned_loss=0.1046, over 2568840.15 frames. ], batch size: 52, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:10:21,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=202735.5, ans=0.025 2024-06-20 12:10:24,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=202735.5, ans=0.2 2024-06-20 12:10:33,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=12.0 2024-06-20 12:10:41,525 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.909e+02 2.019e+02 2.187e+02 3.143e+02, threshold=4.037e+02, percent-clipped=0.0 2024-06-20 12:10:44,742 INFO [train.py:1028] (0/2) Epoch 11, batch 9450, loss[loss=0.266, simple_loss=0.3154, pruned_loss=0.1083, over 12520.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3069, pruned_loss=0.1055, over 2568216.05 frames. ], batch size: 22, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:10:44,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=202808.83333333334, ans=0.0 2024-06-20 12:11:00,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=202845.5, ans=0.125 2024-06-20 12:11:01,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=202845.5, ans=0.0 2024-06-20 12:11:03,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=202863.83333333334, ans=0.0 2024-06-20 12:11:10,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=202882.16666666666, ans=0.2 2024-06-20 12:11:10,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=202882.16666666666, ans=0.125 2024-06-20 12:11:12,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.89 vs. limit=22.5 2024-06-20 12:11:15,181 INFO [train.py:1028] (0/2) Epoch 11, batch 9500, loss[loss=0.2465, simple_loss=0.2991, pruned_loss=0.09697, over 13258.00 frames. 
], tot_loss[loss=0.2583, simple_loss=0.3066, pruned_loss=0.105, over 2576686.61 frames. ], batch size: 43, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:11:17,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=202900.5, ans=0.125 2024-06-20 12:11:18,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=202900.5, ans=0.0 2024-06-20 12:11:33,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=202937.16666666666, ans=0.1 2024-06-20 12:11:42,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=202955.5, ans=0.0 2024-06-20 12:11:47,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=202973.83333333334, ans=0.95 2024-06-20 12:11:48,059 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.030e+02 2.240e+02 2.625e+02 3.872e+02, threshold=4.481e+02, percent-clipped=0.0 2024-06-20 12:11:48,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=202973.83333333334, ans=0.125 2024-06-20 12:11:50,947 INFO [train.py:1028] (0/2) Epoch 11, batch 9550, loss[loss=0.238, simple_loss=0.2881, pruned_loss=0.09399, over 12911.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3062, pruned_loss=0.1051, over 2573962.47 frames. ], batch size: 39, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:11:56,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=203010.5, ans=0.125 2024-06-20 12:11:58,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=203010.5, ans=0.0 2024-06-20 12:12:02,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=203010.5, ans=0.0 2024-06-20 12:12:05,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=203028.83333333334, ans=0.2 2024-06-20 12:12:05,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=203028.83333333334, ans=0.125 2024-06-20 12:12:15,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0 2024-06-20 12:12:22,018 INFO [train.py:1028] (0/2) Epoch 11, batch 9600, loss[loss=0.2769, simple_loss=0.3103, pruned_loss=0.1217, over 10660.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.306, pruned_loss=0.1049, over 2572536.47 frames. ], batch size: 303, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:12:24,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203083.83333333334, ans=0.1 2024-06-20 12:12:24,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. 
limit=22.5 2024-06-20 12:12:32,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=203102.16666666666, ans=0.2 2024-06-20 12:12:35,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=203120.5, ans=0.125 2024-06-20 12:12:36,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203120.5, ans=0.1 2024-06-20 12:12:38,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=203120.5, ans=0.125 2024-06-20 12:12:45,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203138.83333333334, ans=0.0 2024-06-20 12:12:49,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=203157.16666666666, ans=0.125 2024-06-20 12:12:49,517 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.944e+02 2.089e+02 2.222e+02 3.055e+02, threshold=4.178e+02, percent-clipped=0.0 2024-06-20 12:12:49,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=203157.16666666666, ans=0.0 2024-06-20 12:12:50,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=203157.16666666666, ans=0.025 2024-06-20 12:12:50,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.52 vs. limit=10.0 2024-06-20 12:12:51,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=203157.16666666666, ans=0.125 2024-06-20 12:12:52,599 INFO [train.py:1028] (0/2) Epoch 11, batch 9650, loss[loss=0.2342, simple_loss=0.2773, pruned_loss=0.09558, over 13082.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3059, pruned_loss=0.1053, over 2563900.35 frames. ], batch size: 132, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:12:55,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=203175.5, ans=0.025 2024-06-20 12:12:57,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=203175.5, ans=0.0 2024-06-20 12:13:03,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=203193.83333333334, ans=0.5 2024-06-20 12:13:05,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=203212.16666666666, ans=0.2 2024-06-20 12:13:11,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=203230.5, ans=0.2 2024-06-20 12:13:15,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.69 vs. limit=15.0 2024-06-20 12:13:23,070 INFO [train.py:1028] (0/2) Epoch 11, batch 9700, loss[loss=0.2367, simple_loss=0.283, pruned_loss=0.09521, over 13017.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3047, pruned_loss=0.1049, over 2558523.04 frames. 
], batch size: 144, lr: 5.23e-03, grad_scale: 32.0 2024-06-20 12:13:25,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=203267.16666666666, ans=0.125 2024-06-20 12:13:38,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.58 vs. limit=22.5 2024-06-20 12:13:46,773 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2024-06-20 12:13:50,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203340.5, ans=0.0 2024-06-20 12:13:53,419 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.895e+02 2.037e+02 2.308e+02 3.134e+02, threshold=4.075e+02, percent-clipped=0.0 2024-06-20 12:13:56,604 INFO [train.py:1028] (0/2) Epoch 11, batch 9750, loss[loss=0.2372, simple_loss=0.2899, pruned_loss=0.09221, over 13104.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3032, pruned_loss=0.1039, over 2555441.31 frames. ], batch size: 132, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:13:56,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0 2024-06-20 12:13:57,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.85 vs. limit=15.0 2024-06-20 12:13:58,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=203358.83333333334, ans=0.0 2024-06-20 12:14:00,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=203358.83333333334, ans=0.0 2024-06-20 12:14:04,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=203377.16666666666, ans=0.2 2024-06-20 12:14:07,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.50 vs. limit=22.5 2024-06-20 12:14:07,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=203377.16666666666, ans=0.0 2024-06-20 12:14:15,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=203413.83333333334, ans=0.07 2024-06-20 12:14:27,357 INFO [train.py:1028] (0/2) Epoch 11, batch 9800, loss[loss=0.2185, simple_loss=0.2733, pruned_loss=0.08179, over 13295.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3018, pruned_loss=0.1031, over 2548915.07 frames. 
], batch size: 40, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:14:31,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203450.5, ans=0.1 2024-06-20 12:14:38,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=203487.16666666666, ans=0.2 2024-06-20 12:14:39,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=203487.16666666666, ans=0.07 2024-06-20 12:14:41,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203487.16666666666, ans=0.1 2024-06-20 12:14:45,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=203505.5, ans=0.125 2024-06-20 12:14:53,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=203523.83333333334, ans=0.125 2024-06-20 12:14:53,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.09 vs. limit=6.0 2024-06-20 12:14:54,687 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.981e+02 2.148e+02 2.391e+02 3.744e+02, threshold=4.296e+02, percent-clipped=0.0 2024-06-20 12:14:57,734 INFO [train.py:1028] (0/2) Epoch 11, batch 9850, loss[loss=0.278, simple_loss=0.3235, pruned_loss=0.1162, over 13036.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3015, pruned_loss=0.1029, over 2541005.12 frames. ], batch size: 102, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:14:57,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=203542.16666666666, ans=0.2 2024-06-20 12:15:04,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=203560.5, ans=0.0 2024-06-20 12:15:10,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=203560.5, ans=0.125 2024-06-20 12:15:13,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=203578.83333333334, ans=0.125 2024-06-20 12:15:17,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=203597.16666666666, ans=0.125 2024-06-20 12:15:19,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=203597.16666666666, ans=0.125 2024-06-20 12:15:25,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=203615.5, ans=0.125 2024-06-20 12:15:30,257 INFO [train.py:1028] (0/2) Epoch 11, batch 9900, loss[loss=0.2347, simple_loss=0.2854, pruned_loss=0.09199, over 12985.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3013, pruned_loss=0.1031, over 2533739.22 frames. ], batch size: 39, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:15:33,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.69 vs. 
limit=12.0 2024-06-20 12:15:33,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=203633.83333333334, ans=0.125 2024-06-20 12:15:39,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203652.16666666666, ans=0.1 2024-06-20 12:15:40,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=203652.16666666666, ans=0.2 2024-06-20 12:15:42,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.02 vs. limit=22.5 2024-06-20 12:15:56,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203707.16666666666, ans=0.1 2024-06-20 12:15:56,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=203707.16666666666, ans=0.2 2024-06-20 12:15:57,785 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 1.910e+02 2.039e+02 2.182e+02 2.743e+02, threshold=4.078e+02, percent-clipped=0.0 2024-06-20 12:15:58,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.78 vs. limit=22.5 2024-06-20 12:16:00,954 INFO [train.py:1028] (0/2) Epoch 11, batch 9950, loss[loss=0.2666, simple_loss=0.3093, pruned_loss=0.1119, over 12733.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3011, pruned_loss=0.1039, over 2527165.72 frames. ], batch size: 29, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:16:05,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=203725.5, ans=0.0 2024-06-20 12:16:12,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203743.83333333334, ans=0.125 2024-06-20 12:16:15,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=203762.16666666666, ans=0.1 2024-06-20 12:16:25,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203780.5, ans=0.1 2024-06-20 12:16:34,055 INFO [train.py:1028] (0/2) Epoch 11, batch 10000, loss[loss=0.2361, simple_loss=0.2858, pruned_loss=0.09319, over 12475.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3015, pruned_loss=0.1045, over 2487499.80 frames. ], batch size: 22, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:16:34,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=203817.16666666666, ans=0.125 2024-06-20 12:16:57,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=10.38 vs. 
limit=12.0 2024-06-20 12:17:01,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=203890.5, ans=0.2 2024-06-20 12:17:02,295 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.988e+02 2.166e+02 2.593e+02 4.039e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-20 12:17:03,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=203890.5, ans=0.125 2024-06-20 12:17:05,157 INFO [train.py:1028] (0/2) Epoch 11, batch 10050, loss[loss=0.2832, simple_loss=0.3303, pruned_loss=0.118, over 12421.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3013, pruned_loss=0.1049, over 2444671.22 frames. ], batch size: 22, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:17:08,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=203908.83333333334, ans=0.125 2024-06-20 12:17:16,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=203945.5, ans=0.05 2024-06-20 12:17:25,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2024-06-20 12:17:34,775 INFO [train.py:1028] (0/2) Epoch 11, batch 10100, loss[loss=0.2597, simple_loss=0.3035, pruned_loss=0.1079, over 11838.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3005, pruned_loss=0.1042, over 2426027.98 frames. ], batch size: 17, lr: 5.22e-03, grad_scale: 32.0 2024-06-20 12:17:34,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=204000.5, ans=0.125 2024-06-20 12:17:35,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=204000.5, ans=0.0 2024-06-20 12:17:35,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.07 vs. limit=22.5 2024-06-20 12:17:47,949 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-11.pt 2024-06-20 12:19:51,035 INFO [train.py:1028] (0/2) Epoch 12, batch 0, loss[loss=0.2119, simple_loss=0.2601, pruned_loss=0.08185, over 12869.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2601, pruned_loss=0.08185, over 12869.00 frames. ], batch size: 36, lr: 5.00e-03, grad_scale: 32.0 2024-06-20 12:19:51,037 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 12:19:56,011 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.7238, 0.6732, 2.8620, 3.2218, 0.4284, 3.0169, 3.5119, 3.0980], device='cuda:0') 2024-06-20 12:19:58,167 INFO [train.py:1060] (0/2) Epoch 12, validation: loss=0.1957, simple_loss=0.2607, pruned_loss=0.06538, over 351949.00 frames. 2024-06-20 12:19:58,167 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 12:20:03,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=15.0 2024-06-20 12:20:03,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=11.20 vs. 
limit=12.0
2024-06-20 12:20:06,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=204050.0, ans=0.2
2024-06-20 12:20:14,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=204068.33333333334, ans=0.0
2024-06-20 12:20:17,172 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.937e+02 2.210e+02 2.561e+02 3.967e+02, threshold=4.421e+02, percent-clipped=0.0
2024-06-20 12:20:23,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=204086.66666666666, ans=0.125
2024-06-20 12:20:25,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.37 vs. limit=15.0
2024-06-20 12:20:34,434 INFO [train.py:1028] (0/2) Epoch 12, batch 50, loss[loss=0.2291, simple_loss=0.2828, pruned_loss=0.08767, over 12678.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.2829, pruned_loss=0.09608, over 574957.64 frames. ], batch size: 29, lr: 5.00e-03, grad_scale: 32.0
2024-06-20 12:20:38,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=204123.33333333334, ans=0.2
2024-06-20 12:20:39,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=204123.33333333334, ans=0.2
2024-06-20 12:20:43,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=204141.66666666666, ans=0.0
2024-06-20 12:20:56,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=204160.0, ans=0.025
2024-06-20 12:21:00,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.46 vs. limit=15.0
2024-06-20 12:21:05,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204196.66666666666, ans=0.1
2024-06-20 12:21:08,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=204196.66666666666, ans=0.125
2024-06-20 12:21:08,608 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0
2024-06-20 12:21:10,187 INFO [train.py:1028] (0/2) Epoch 12, batch 100, loss[loss=0.2304, simple_loss=0.2874, pruned_loss=0.08673, over 13312.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.2818, pruned_loss=0.09541, over 1017187.45 frames. ], batch size: 46, lr: 5.00e-03, grad_scale: 32.0
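The `ScheduledFloat` entries that dominate this log trace scalar hyperparameters (dropout probabilities, balancer limits, skip rates, bypass scales) that are not fixed but scheduled as piecewise-linear functions of the module's `batch_count`; `ans` is the value in effect at that count, and the fractional counts indicate the counter is a rescaled batch index rather than a raw one. A minimal sketch of piecewise-linear scheduling follows, with invented breakpoints (icefall's ScheduledFloat in scaling.py also supports defaults and bounds):

```python
# Sketch of a piecewise-linear schedule over batch_count, in the spirit of
# the "ScheduledFloat: name=..., batch_count=..., ans=..." entries above.
def scheduled_float(batch_count: float, schedule: list) -> float:
    """schedule: [(batch_count, value), ...] sorted by batch_count."""
    (x0, y0) = schedule[0]
    if batch_count <= x0:
        return y0
    for (x1, y1) in schedule[1:]:
        if batch_count <= x1:
            # linear interpolation between the surrounding breakpoints
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
        (x0, y0) = (x1, y1)
    return y0  # flat after the last breakpoint

# e.g. a dropout rate decaying from 0.3 to 0.1 over the first 20k batches
# (hypothetical numbers):
p = scheduled_float(204050.0, [(0.0, 0.3), (20000.0, 0.1)])  # -> 0.1
```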
2024-06-20 12:21:10,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=204215.0, ans=0.5
2024-06-20 12:21:27,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=204251.66666666666, ans=0.0
2024-06-20 12:21:28,183 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.895e+02 2.134e+02 2.386e+02 3.029e+02, threshold=4.267e+02, percent-clipped=0.0
2024-06-20 12:21:35,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=204288.33333333334, ans=0.125
2024-06-20 12:21:35,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=204288.33333333334, ans=0.95
2024-06-20 12:21:42,049 INFO [train.py:1028] (0/2) Epoch 12, batch 150, loss[loss=0.2432, simple_loss=0.2928, pruned_loss=0.0968, over 12570.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2828, pruned_loss=0.09553, over 1365155.00 frames. ], batch size: 29, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:21:42,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0
2024-06-20 12:21:47,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.77 vs. limit=15.0
2024-06-20 12:21:55,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=204343.33333333334, ans=0.5
2024-06-20 12:21:59,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=204343.33333333334, ans=0.125
2024-06-20 12:22:00,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=204361.66666666666, ans=0.0
2024-06-20 12:22:02,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=204361.66666666666, ans=0.0
2024-06-20 12:22:03,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5
2024-06-20 12:22:03,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=204361.66666666666, ans=0.0
2024-06-20 12:22:13,350 INFO [train.py:1028] (0/2) Epoch 12, batch 200, loss[loss=0.2609, simple_loss=0.2965, pruned_loss=0.1127, over 12484.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.2821, pruned_loss=0.09511, over 1635126.21 frames. ], batch size: 202, lr: 4.99e-03, grad_scale: 32.0
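In each `train.py:1028` entry, `loss[...]` describes the current batch and `tot_loss[...]` a running aggregate; the `over N frames` counts grow batch by batch, so `tot_loss` behaves like a frame-weighted average of the batches seen so far (the exact windowing is internal to the recipe). A sketch of frame-weighted aggregation consistent with that bookkeeping; the class is a simplified stand-in for the recipe's MetricsTracker:

```python
# Sketch of frame-weighted loss aggregation matching the
# "tot_loss[..., over N frames]" fields above.
class LossTracker(dict):
    def accumulate(self, frames: float, **losses: float) -> None:
        self["frames"] = self.get("frames", 0.0) + frames
        for key, value in losses.items():
            # store frame-weighted sums so the average is sum / frames
            self[key] = self.get(key, 0.0) + value * frames

    def average(self, key: str) -> float:
        return self[key] / self["frames"]

tracker = LossTracker()
tracker.accumulate(12678.0, loss=0.2291)   # batch 50 above
tracker.accumulate(13312.0, loss=0.2304)   # batch 100 above
print(round(tracker.average("loss"), 4))   # ~0.2298
```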
2024-06-20 12:22:15,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204398.33333333334, ans=0.1
2024-06-20 12:22:33,919 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.811e+02 1.934e+02 2.061e+02 2.880e+02, threshold=3.868e+02, percent-clipped=0.0
2024-06-20 12:22:37,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=204453.33333333334, ans=0.0
2024-06-20 12:22:45,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=204471.66666666666, ans=0.025
2024-06-20 12:22:46,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=204471.66666666666, ans=0.125
2024-06-20 12:22:48,042 INFO [train.py:1028] (0/2) Epoch 12, batch 250, loss[loss=0.2338, simple_loss=0.2715, pruned_loss=0.09807, over 13003.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2808, pruned_loss=0.09426, over 1846073.18 frames. ], batch size: 144, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:23:13,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=204545.0, ans=0.0
2024-06-20 12:23:13,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=204545.0, ans=0.125
2024-06-20 12:23:20,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=204563.33333333334, ans=0.2
2024-06-20 12:23:20,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=204563.33333333334, ans=0.125
2024-06-20 12:23:24,504 INFO [train.py:1028] (0/2) Epoch 12, batch 300, loss[loss=0.2413, simple_loss=0.2794, pruned_loss=0.1016, over 13215.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2811, pruned_loss=0.09491, over 2009073.68 frames. ], batch size: 112, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:23:25,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.69 vs. limit=22.5
2024-06-20 12:23:27,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=204581.66666666666, ans=0.025
2024-06-20 12:23:29,344 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 12:23:30,862 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 12:23:40,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=15.0
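The `Whitening: ... metric=X vs. limit=Y` lines report how far a module's output covariance is from being isotropic: the metric equals 1.0 when all eigenvalues within a group are equal and grows as the spectrum becomes lopsided, and the Whiten module in scaling.py adjusts gradients when the metric exceeds its (scheduled) limit. The `whiten_keys` variants run the same check on attention keys split into per-head groups (e.g. num_groups=8 over 256 channels). A sketch of one way to compute such a metric via trace identities; the exact normalization in icefall may differ:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """x: (num_frames, num_channels).  Returns a scalar >= 1 that equals 1
    exactly when each group's covariance is a multiple of the identity."""
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups               # channels per group
    x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)  # (g, n, c)
    x = x - x.mean(dim=1, keepdim=True)
    covar = torch.matmul(x.transpose(1, 2), x) / num_frames     # (g, c, c)
    # mean eigenvalue via the trace, mean squared eigenvalue via ||C||_F^2
    mean_eig = covar.diagonal(dim1=1, dim2=2).mean()
    mean_sq_eig = (covar ** 2).sum(dim=(1, 2)).mean() / cpg
    return mean_sq_eig / (mean_eig ** 2 + 1.0e-20)
```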
2024-06-20 12:23:42,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.844e+02 2.036e+02 2.184e+02 2.983e+02, threshold=4.072e+02, percent-clipped=0.0
2024-06-20 12:23:46,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=204636.66666666666, ans=0.1
2024-06-20 12:23:47,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=204636.66666666666, ans=0.0
2024-06-20 12:23:56,637 INFO [train.py:1028] (0/2) Epoch 12, batch 350, loss[loss=0.211, simple_loss=0.255, pruned_loss=0.08343, over 12855.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2798, pruned_loss=0.09416, over 2137970.49 frames. ], batch size: 33, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:24:00,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=204673.33333333334, ans=0.125
2024-06-20 12:24:22,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=204746.66666666666, ans=0.125
2024-06-20 12:24:31,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5
2024-06-20 12:24:32,896 INFO [train.py:1028] (0/2) Epoch 12, batch 400, loss[loss=0.2344, simple_loss=0.285, pruned_loss=0.09187, over 13287.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2801, pruned_loss=0.09407, over 2239455.40 frames. ], batch size: 63, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:24:51,137 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.822e+02 1.928e+02 2.173e+02 3.193e+02, threshold=3.856e+02, percent-clipped=0.0
2024-06-20 12:25:07,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=204856.66666666666, ans=0.0
2024-06-20 12:25:08,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=22.5
2024-06-20 12:25:08,432 INFO [train.py:1028] (0/2) Epoch 12, batch 450, loss[loss=0.2167, simple_loss=0.264, pruned_loss=0.08472, over 13200.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2796, pruned_loss=0.09348, over 2313917.23 frames. ], batch size: 67, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:25:16,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=204875.0, ans=0.0
2024-06-20 12:25:31,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.59 vs. limit=22.5
2024-06-20 12:25:31,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.62 vs. limit=10.0
2024-06-20 12:25:35,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=204930.0, ans=0.125
2024-06-20 12:25:38,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=204930.0, ans=0.025
2024-06-20 12:25:40,482 INFO [train.py:1028] (0/2) Epoch 12, batch 500, loss[loss=0.2269, simple_loss=0.2716, pruned_loss=0.09115, over 13103.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2802, pruned_loss=0.0936, over 2376145.04 frames. ], batch size: 121, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:25:41,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=204948.33333333334, ans=0.09899494936611666
2024-06-20 12:25:41,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=204948.33333333334, ans=0.0
2024-06-20 12:25:45,250 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 12:25:58,669 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.876e+02 2.023e+02 2.255e+02 2.809e+02, threshold=4.047e+02, percent-clipped=0.0
2024-06-20 12:25:59,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.49 vs. limit=22.5
2024-06-20 12:26:04,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=205003.33333333334, ans=0.0
2024-06-20 12:26:12,804 INFO [train.py:1028] (0/2) Epoch 12, batch 550, loss[loss=0.2273, simple_loss=0.2751, pruned_loss=0.08972, over 12893.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2806, pruned_loss=0.09376, over 2421515.49 frames. ], batch size: 158, lr: 4.99e-03, grad_scale: 32.0
2024-06-20 12:26:14,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=205040.0, ans=0.0
2024-06-20 12:26:16,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0
2024-06-20 12:26:18,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=205058.33333333334, ans=0.125
2024-06-20 12:26:28,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=205076.66666666666, ans=0.0
2024-06-20 12:26:44,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.58 vs. limit=15.0
2024-06-20 12:26:46,890 INFO [train.py:1028] (0/2) Epoch 12, batch 600, loss[loss=0.2236, simple_loss=0.2645, pruned_loss=0.09139, over 13026.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.28, pruned_loss=0.09351, over 2458728.06 frames. ], batch size: 144, lr: 4.98e-03, grad_scale: 32.0
2024-06-20 12:26:47,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=205131.66666666666, ans=0.2
2024-06-20 12:26:48,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=205131.66666666666, ans=0.05
2024-06-20 12:26:50,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=205131.66666666666, ans=0.125
2024-06-20 12:26:53,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0
2024-06-20 12:27:02,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=205168.33333333334, ans=0.125
2024-06-20 12:27:04,561 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.808e+02 1.886e+02 2.012e+02 2.572e+02, threshold=3.772e+02, percent-clipped=0.0
2024-06-20 12:27:08,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=15.0
2024-06-20 12:27:11,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=205186.66666666666, ans=0.2
2024-06-20 12:27:18,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=205205.0, ans=0.0
2024-06-20 12:27:21,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=205223.33333333334, ans=0.0
2024-06-20 12:27:21,536 INFO [train.py:1028] (0/2) Epoch 12, batch 650, loss[loss=0.2657, simple_loss=0.3146, pruned_loss=0.1084, over 13194.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2806, pruned_loss=0.09371, over 2490617.20 frames. ], batch size: 59, lr: 4.98e-03, grad_scale: 32.0
2024-06-20 12:27:24,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=205223.33333333334, ans=0.05
2024-06-20 12:27:32,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=205241.66666666666, ans=0.125
2024-06-20 12:27:38,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=205260.0, ans=0.125
2024-06-20 12:27:43,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0
2024-06-20 12:27:43,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=15.0
2024-06-20 12:27:46,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=205296.66666666666, ans=0.125
2024-06-20 12:27:46,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=205296.66666666666, ans=0.125
2024-06-20 12:27:50,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=205296.66666666666, ans=0.0
2024-06-20 12:27:52,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=205315.0, ans=0.0
2024-06-20 12:27:52,714 INFO [train.py:1028] (0/2) Epoch 12, batch 700, loss[loss=0.2213, simple_loss=0.2744, pruned_loss=0.08408, over 13260.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2801, pruned_loss=0.09362, over 2512493.50 frames. ], batch size: 46, lr: 4.98e-03, grad_scale: 32.0
2024-06-20 12:27:58,431 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-112000.pt
2024-06-20 12:28:05,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=205333.33333333334, ans=0.025
2024-06-20 12:28:07,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=205333.33333333334, ans=0.125
2024-06-20 12:28:10,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=205351.66666666666, ans=0.125
2024-06-20 12:28:12,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.85 vs. limit=10.0
2024-06-20 12:28:15,562 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.824e+02 1.943e+02 2.079e+02 2.883e+02, threshold=3.886e+02, percent-clipped=0.0
2024-06-20 12:28:19,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0
2024-06-20 12:28:27,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=205388.33333333334, ans=0.1
2024-06-20 12:28:29,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=205406.66666666666, ans=0.125
2024-06-20 12:28:30,147 INFO [train.py:1028] (0/2) Epoch 12, batch 750, loss[loss=0.2174, simple_loss=0.2698, pruned_loss=0.08248, over 13225.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2799, pruned_loss=0.09331, over 2527150.41 frames. ], batch size: 63, lr: 4.98e-03, grad_scale: 64.0
2024-06-20 12:28:33,473 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.218e+01
2024-06-20 12:28:34,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=205406.66666666666, ans=0.125
2024-06-20 12:28:38,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=205425.0, ans=0.125
2024-06-20 12:28:45,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=205443.33333333334, ans=0.1
2024-06-20 12:28:56,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=205461.66666666666, ans=0.125
2024-06-20 12:28:57,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0
2024-06-20 12:29:05,235 INFO [train.py:1028] (0/2) Epoch 12, batch 800, loss[loss=0.2349, simple_loss=0.2846, pruned_loss=0.09256, over 12944.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2802, pruned_loss=0.09349, over 2539921.51 frames. ], batch size: 36, lr: 4.98e-03, grad_scale: 64.0
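Besides the end-of-epoch `epoch-11.pt` earlier, `checkpoint.py` also drops batch-level checkpoints such as `checkpoint-112000.pt` at fixed intervals of training batches. A minimal sketch of what such a save involves; icefall's actual checkpoints carry more state (sampler position, best-loss bookkeeping) than shown here:

```python
import torch

def save_checkpoint(filename, model, optimizer, scheduler, scaler,
                    batch_idx_train):
    # unwrap DDP so the checkpoint loads without a process group
    module = model.module if hasattr(model, "module") else model
    torch.save(
        {
            "model": module.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "grad_scaler": scaler.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )

# e.g. save_checkpoint("zipformer/exp/checkpoint-112000.pt", ...)
```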
2024-06-20 12:29:09,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=205498.33333333334, ans=0.125
2024-06-20 12:29:10,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=205498.33333333334, ans=0.125
2024-06-20 12:29:10,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=205516.66666666666, ans=0.0
2024-06-20 12:29:12,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=205516.66666666666, ans=0.0
2024-06-20 12:29:18,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=205535.0, ans=0.125
2024-06-20 12:29:23,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.42 vs. limit=10.0
2024-06-20 12:29:26,129 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.823e+02 1.984e+02 2.199e+02 2.988e+02, threshold=3.968e+02, percent-clipped=0.0
2024-06-20 12:29:38,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=205571.66666666666, ans=0.125
2024-06-20 12:29:40,470 INFO [train.py:1028] (0/2) Epoch 12, batch 850, loss[loss=0.225, simple_loss=0.2648, pruned_loss=0.09263, over 13124.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2795, pruned_loss=0.09302, over 2551615.10 frames. ], batch size: 95, lr: 4.98e-03, grad_scale: 64.0
2024-06-20 12:29:43,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=205590.0, ans=0.125
2024-06-20 12:29:46,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=205608.33333333334, ans=0.125
2024-06-20 12:29:52,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.18 vs. limit=15.0
2024-06-20 12:29:54,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=205626.66666666666, ans=0.125
2024-06-20 12:30:00,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=205645.0, ans=0.125
2024-06-20 12:30:12,525 INFO [train.py:1028] (0/2) Epoch 12, batch 900, loss[loss=0.2353, simple_loss=0.2879, pruned_loss=0.09132, over 12916.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2793, pruned_loss=0.09335, over 2556145.01 frames. ], batch size: 36, lr: 4.98e-03, grad_scale: 64.0
2024-06-20 12:30:29,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.77 vs. limit=22.5
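The `grad_scale` field in the per-batch entries is the loss scale of fp16 mixed-precision training; note that it doubles from 32.0 to 64.0 around batch 750 above, the signature behavior of `torch.cuda.amp.GradScaler`, which grows the scale after a run of overflow-free steps. The standard pattern is sketched below; the `init_scale` and `growth_interval` values are illustrative, not this run's settings:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def train_step(model, batch, optimizer, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.step(optimizer)          # unscales grads, skips step on inf/nan
    scaler.update()                 # grows or shrinks the scale
    return loss.detach(), scaler.get_scale()
```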
2024-06-20 12:30:31,717 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.837e+02 1.957e+02 2.166e+02 3.035e+02, threshold=3.913e+02, percent-clipped=0.0
2024-06-20 12:30:31,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=205718.33333333334, ans=0.0
2024-06-20 12:30:42,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=205755.0, ans=0.0
2024-06-20 12:30:49,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=205755.0, ans=0.2
2024-06-20 12:30:49,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.52 vs. limit=6.0
2024-06-20 12:30:50,248 INFO [train.py:1028] (0/2) Epoch 12, batch 950, loss[loss=0.212, simple_loss=0.2706, pruned_loss=0.07666, over 13272.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2794, pruned_loss=0.09332, over 2559310.56 frames. ], batch size: 40, lr: 4.98e-03, grad_scale: 64.0
2024-06-20 12:30:52,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=205773.33333333334, ans=0.2
2024-06-20 12:30:53,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=205773.33333333334, ans=0.125
2024-06-20 12:30:55,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=205773.33333333334, ans=0.025
2024-06-20 12:31:12,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=205828.33333333334, ans=0.1
2024-06-20 12:31:19,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=205846.66666666666, ans=0.0
2024-06-20 12:31:24,964 INFO [train.py:1028] (0/2) Epoch 12, batch 1000, loss[loss=0.2494, simple_loss=0.3006, pruned_loss=0.09908, over 13281.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2793, pruned_loss=0.09346, over 2561538.45 frames. ], batch size: 49, lr: 4.98e-03, grad_scale: 64.0
2024-06-20 12:31:33,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0
2024-06-20 12:31:36,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=205883.33333333334, ans=0.1
2024-06-20 12:31:37,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205901.66666666666, ans=0.1
2024-06-20 12:31:42,895 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.794e+02 1.910e+02 2.131e+02 2.869e+02, threshold=3.820e+02, percent-clipped=0.0
2024-06-20 12:31:57,182 INFO [train.py:1028] (0/2) Epoch 12, batch 1050, loss[loss=0.2543, simple_loss=0.303, pruned_loss=0.1028, over 13252.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2799, pruned_loss=0.09366, over 2565759.99 frames. ], batch size: 77, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:32:09,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=205993.33333333334, ans=0.0
2024-06-20 12:32:13,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205993.33333333334, ans=0.1
2024-06-20 12:32:13,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=205993.33333333334, ans=0.2
2024-06-20 12:32:24,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=206030.0, ans=0.125
2024-06-20 12:32:28,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=206030.0, ans=0.0
2024-06-20 12:32:29,890 INFO [train.py:1028] (0/2) Epoch 12, batch 1100, loss[loss=0.2508, simple_loss=0.298, pruned_loss=0.1018, over 13289.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2805, pruned_loss=0.09384, over 2570890.45 frames. ], batch size: 52, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:32:35,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=206066.66666666666, ans=0.0
2024-06-20 12:32:50,890 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.907e+02 2.053e+02 2.397e+02 3.531e+02, threshold=4.107e+02, percent-clipped=0.0
2024-06-20 12:32:57,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=206103.33333333334, ans=0.0
2024-06-20 12:33:00,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=206121.66666666666, ans=0.125
2024-06-20 12:33:00,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=206121.66666666666, ans=0.125
2024-06-20 12:33:04,845 INFO [train.py:1028] (0/2) Epoch 12, batch 1150, loss[loss=0.2442, simple_loss=0.2911, pruned_loss=0.09862, over 13257.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2804, pruned_loss=0.09388, over 2571732.58 frames. ], batch size: 52, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:33:08,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=206140.0, ans=0.125
2024-06-20 12:33:24,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=206176.66666666666, ans=0.2
2024-06-20 12:33:25,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=206176.66666666666, ans=0.125
2024-06-20 12:33:38,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.23 vs. limit=15.0
2024-06-20 12:33:42,185 INFO [train.py:1028] (0/2) Epoch 12, batch 1200, loss[loss=0.2269, simple_loss=0.275, pruned_loss=0.08943, over 13179.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2796, pruned_loss=0.09352, over 2573768.94 frames. ], batch size: 77, lr: 4.97e-03, grad_scale: 64.0
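The learning rate in these entries decays smoothly across batches and epochs (5.22e-03 late in epoch 11, 5.00e-03 at the start of epoch 12, 4.97e-03 by batch 1100) rather than in discrete steps. This is consistent with icefall's Eden scheduler, which multiplies a base LR by two power-law factors, one in the batch count and one in the (possibly fractional) epoch. A sketch of the Eden formula; the constants are illustrative, and the recipe applies further adjustments (e.g. warmup), so this is not expected to reproduce the logged values exactly:

```python
def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # lr = base_lr * f(batch) * g(epoch), each factor a -0.25 power decay
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor
```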
2024-06-20 12:33:45,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=206231.66666666666, ans=0.2
2024-06-20 12:33:57,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206268.33333333334, ans=0.125
2024-06-20 12:34:00,184 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.802e+02 1.956e+02 2.126e+02 3.629e+02, threshold=3.913e+02, percent-clipped=0.0
2024-06-20 12:34:02,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=206286.66666666666, ans=0.125
2024-06-20 12:34:14,178 INFO [train.py:1028] (0/2) Epoch 12, batch 1250, loss[loss=0.2338, simple_loss=0.2831, pruned_loss=0.09223, over 13183.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2796, pruned_loss=0.09341, over 2583882.15 frames. ], batch size: 112, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:34:18,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=206323.33333333334, ans=0.0
2024-06-20 12:34:50,931 INFO [train.py:1028] (0/2) Epoch 12, batch 1300, loss[loss=0.249, simple_loss=0.289, pruned_loss=0.1045, over 12734.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2798, pruned_loss=0.09363, over 2584236.23 frames. ], batch size: 176, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:35:05,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.66 vs. limit=6.0
2024-06-20 12:35:09,010 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 1.818e+02 1.968e+02 2.140e+02 3.108e+02, threshold=3.937e+02, percent-clipped=0.0
2024-06-20 12:35:11,382 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.41 vs. limit=15.0
2024-06-20 12:35:14,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=206470.0, ans=0.125
2024-06-20 12:35:16,551 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 12:35:23,245 INFO [train.py:1028] (0/2) Epoch 12, batch 1350, loss[loss=0.2212, simple_loss=0.2709, pruned_loss=0.08574, over 13231.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2798, pruned_loss=0.0935, over 2585907.56 frames. ], batch size: 59, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:35:24,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.49 vs. limit=15.0
2024-06-20 12:35:29,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=206506.66666666666, ans=0.0
2024-06-20 12:35:36,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=206525.0, ans=0.0
2024-06-20 12:35:40,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=206543.33333333334, ans=15.0
2024-06-20 12:35:40,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=206543.33333333334, ans=0.1
2024-06-20 12:35:44,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=206543.33333333334, ans=0.2
2024-06-20 12:35:52,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=206580.0, ans=0.125
2024-06-20 12:35:56,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.08 vs. limit=22.5
2024-06-20 12:35:57,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.17 vs. limit=22.5
2024-06-20 12:35:59,112 INFO [train.py:1028] (0/2) Epoch 12, batch 1400, loss[loss=0.2417, simple_loss=0.2966, pruned_loss=0.09339, over 12309.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.28, pruned_loss=0.09369, over 2587067.87 frames. ], batch size: 25, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:36:00,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=206598.33333333334, ans=0.125
2024-06-20 12:36:06,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206616.66666666666, ans=0.125
2024-06-20 12:36:17,263 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.791e+02 1.931e+02 2.109e+02 3.186e+02, threshold=3.861e+02, percent-clipped=0.0
2024-06-20 12:36:31,650 INFO [train.py:1028] (0/2) Epoch 12, batch 1450, loss[loss=0.2144, simple_loss=0.2582, pruned_loss=0.08526, over 13073.00 frames. ], tot_loss[loss=0.234, simple_loss=0.28, pruned_loss=0.09398, over 2586688.47 frames. ], batch size: 121, lr: 4.97e-03, grad_scale: 64.0
2024-06-20 12:36:39,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.28 vs. limit=15.0
2024-06-20 12:36:50,038 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0
2024-06-20 12:36:58,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=206745.0, ans=0.5
2024-06-20 12:37:01,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=206763.33333333334, ans=0.0
2024-06-20 12:37:04,025 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.50 vs. limit=10.0
2024-06-20 12:37:07,041 INFO [train.py:1028] (0/2) Epoch 12, batch 1500, loss[loss=0.2249, simple_loss=0.276, pruned_loss=0.08686, over 13293.00 frames. ], tot_loss[loss=0.234, simple_loss=0.28, pruned_loss=0.09403, over 2588379.30 frames. ], batch size: 83, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:37:12,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=206781.66666666666, ans=0.07
2024-06-20 12:37:15,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206800.0, ans=0.1
2024-06-20 12:37:17,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.75 vs. limit=5.0
2024-06-20 12:37:29,490 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.841e+02 1.951e+02 2.228e+02 3.004e+02, threshold=3.901e+02, percent-clipped=0.0
2024-06-20 12:37:32,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.00 vs. limit=22.5
2024-06-20 12:37:34,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=206836.66666666666, ans=0.0
2024-06-20 12:37:42,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=206855.0, ans=0.125
2024-06-20 12:37:42,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0
2024-06-20 12:37:42,942 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.842e-01
2024-06-20 12:37:44,065 INFO [train.py:1028] (0/2) Epoch 12, batch 1550, loss[loss=0.2373, simple_loss=0.2802, pruned_loss=0.09723, over 12999.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.281, pruned_loss=0.09457, over 2584040.84 frames. ], batch size: 102, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:37:44,581 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=17.12 vs. limit=15.0
2024-06-20 12:37:53,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.17 vs. limit=15.0
2024-06-20 12:37:59,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=206910.0, ans=0.125
2024-06-20 12:38:04,569 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 12:38:08,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=206928.33333333334, ans=0.125
2024-06-20 12:38:08,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=206928.33333333334, ans=0.125
2024-06-20 12:38:16,809 INFO [train.py:1028] (0/2) Epoch 12, batch 1600, loss[loss=0.2425, simple_loss=0.2915, pruned_loss=0.09673, over 13099.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2816, pruned_loss=0.09462, over 2579261.26 frames. ], batch size: 77, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:38:21,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=206965.0, ans=0.0
2024-06-20 12:38:29,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=207001.66666666666, ans=0.125
2024-06-20 12:38:34,708 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.906e+02 2.032e+02 2.208e+02 3.637e+02, threshold=4.063e+02, percent-clipped=0.0
2024-06-20 12:38:35,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=207020.0, ans=0.0
2024-06-20 12:38:42,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=207020.0, ans=0.0
2024-06-20 12:38:44,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.75 vs. limit=15.0
2024-06-20 12:38:49,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=207038.33333333334, ans=0.125
2024-06-20 12:38:50,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=207038.33333333334, ans=0.0
2024-06-20 12:38:50,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.07 vs. limit=22.5
2024-06-20 12:38:51,571 INFO [train.py:1028] (0/2) Epoch 12, batch 1650, loss[loss=0.2507, simple_loss=0.2836, pruned_loss=0.1089, over 13144.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2812, pruned_loss=0.09469, over 2575545.95 frames. ], batch size: 95, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:38:51,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=207056.66666666666, ans=0.125
2024-06-20 12:38:53,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=207056.66666666666, ans=0.125
2024-06-20 12:38:55,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0
2024-06-20 12:39:01,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=207075.0, ans=0.125
2024-06-20 12:39:05,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207093.33333333334, ans=0.125
2024-06-20 12:39:07,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0
2024-06-20 12:39:10,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=207111.66666666666, ans=0.125
2024-06-20 12:39:10,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=207111.66666666666, ans=0.125
2024-06-20 12:39:13,468 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.24 vs. limit=15.0
2024-06-20 12:39:24,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.86 vs. limit=22.5
2024-06-20 12:39:25,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=207130.0, ans=0.0
2024-06-20 12:39:26,868 INFO [train.py:1028] (0/2) Epoch 12, batch 1700, loss[loss=0.2128, simple_loss=0.2702, pruned_loss=0.0777, over 12292.00 frames. ], tot_loss[loss=0.234, simple_loss=0.2802, pruned_loss=0.09388, over 2580387.97 frames. ], batch size: 25, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:39:34,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=207166.66666666666, ans=0.0
2024-06-20 12:39:37,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=207166.66666666666, ans=0.125
2024-06-20 12:39:37,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.68 vs. limit=22.5
2024-06-20 12:39:44,672 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.502e+02 1.802e+02 1.940e+02 2.070e+02 2.513e+02, threshold=3.881e+02, percent-clipped=0.0
2024-06-20 12:39:45,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=207203.33333333334, ans=0.05
2024-06-20 12:39:49,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=207203.33333333334, ans=0.125
2024-06-20 12:39:53,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=207221.66666666666, ans=0.125
2024-06-20 12:39:58,935 INFO [train.py:1028] (0/2) Epoch 12, batch 1750, loss[loss=0.2184, simple_loss=0.2753, pruned_loss=0.08069, over 12631.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2805, pruned_loss=0.09379, over 2581764.54 frames. ], batch size: 22, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:39:59,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=207240.0, ans=0.0
2024-06-20 12:40:09,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.80 vs. limit=22.5
2024-06-20 12:40:24,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=207295.0, ans=0.2
2024-06-20 12:40:26,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=207313.33333333334, ans=0.125
2024-06-20 12:40:31,751 INFO [train.py:1028] (0/2) Epoch 12, batch 1800, loss[loss=0.2419, simple_loss=0.2888, pruned_loss=0.09747, over 13183.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.2809, pruned_loss=0.09427, over 2581130.39 frames. ], batch size: 67, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:40:34,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=207331.66666666666, ans=0.04949747468305833
2024-06-20 12:40:45,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.87 vs. limit=6.0
2024-06-20 12:40:54,193 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.874e+02 2.062e+02 2.319e+02 3.373e+02, threshold=4.123e+02, percent-clipped=0.0
2024-06-20 12:40:54,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=207386.66666666666, ans=0.125
2024-06-20 12:40:55,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=207386.66666666666, ans=0.125
2024-06-20 12:41:03,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=207405.0, ans=0.025
2024-06-20 12:41:04,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=207405.0, ans=10.0
2024-06-20 12:41:08,307 INFO [train.py:1028] (0/2) Epoch 12, batch 1850, loss[loss=0.2251, simple_loss=0.2789, pruned_loss=0.08565, over 13244.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.2813, pruned_loss=0.09417, over 2582606.77 frames. ], batch size: 83, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:41:11,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=207423.33333333334, ans=0.125
2024-06-20 12:41:13,013 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.745e+00
2024-06-20 12:41:18,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=207441.66666666666, ans=0.0
2024-06-20 12:41:19,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=207441.66666666666, ans=0.125
2024-06-20 12:41:23,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=207460.0, ans=0.1
2024-06-20 12:41:43,706 INFO [train.py:1028] (0/2) Epoch 12, batch 1900, loss[loss=0.2296, simple_loss=0.2745, pruned_loss=0.0923, over 13151.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2802, pruned_loss=0.09399, over 2584936.10 frames. ], batch size: 95, lr: 4.96e-03, grad_scale: 64.0
2024-06-20 12:42:00,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=12.0
2024-06-20 12:42:02,263 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.823e+02 1.954e+02 2.118e+02 2.596e+02, threshold=3.907e+02, percent-clipped=0.0
2024-06-20 12:42:03,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=207570.0, ans=0.07
2024-06-20 12:42:12,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=12.0
2024-06-20 12:42:15,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=207606.66666666666, ans=0.125
2024-06-20 12:42:16,018 INFO [train.py:1028] (0/2) Epoch 12, batch 1950, loss[loss=0.2074, simple_loss=0.2577, pruned_loss=0.07857, over 13307.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2795, pruned_loss=0.09378, over 2590951.18 frames. ], batch size: 52, lr: 4.95e-03, grad_scale: 64.0
], batch size: 52, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:42:22,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=207625.0, ans=0.1 2024-06-20 12:42:22,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.24 vs. limit=10.0 2024-06-20 12:42:26,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=207625.0, ans=0.0 2024-06-20 12:42:28,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=207643.33333333334, ans=0.2 2024-06-20 12:42:50,901 INFO [train.py:1028] (0/2) Epoch 12, batch 2000, loss[loss=0.2425, simple_loss=0.2851, pruned_loss=0.1, over 12496.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2794, pruned_loss=0.09384, over 2586643.09 frames. ], batch size: 22, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:42:51,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=207698.33333333334, ans=0.125 2024-06-20 12:43:09,008 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.796e+02 1.950e+02 2.178e+02 3.040e+02, threshold=3.900e+02, percent-clipped=0.0 2024-06-20 12:43:26,559 INFO [train.py:1028] (0/2) Epoch 12, batch 2050, loss[loss=0.2341, simple_loss=0.2887, pruned_loss=0.08977, over 12577.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2801, pruned_loss=0.09422, over 2581937.78 frames. ], batch size: 29, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:43:38,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=207808.33333333334, ans=0.0 2024-06-20 12:43:47,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=207845.0, ans=0.2 2024-06-20 12:43:47,862 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:43:49,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=207845.0, ans=0.0 2024-06-20 12:43:59,351 INFO [train.py:1028] (0/2) Epoch 12, batch 2100, loss[loss=0.2209, simple_loss=0.2732, pruned_loss=0.08426, over 13240.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2801, pruned_loss=0.09364, over 2584278.62 frames. ], batch size: 59, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:44:02,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=207881.66666666666, ans=0.2 2024-06-20 12:44:10,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=207900.0, ans=0.025 2024-06-20 12:44:17,984 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.813e+02 1.919e+02 2.058e+02 3.055e+02, threshold=3.838e+02, percent-clipped=0.0 2024-06-20 12:44:29,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=207955.0, ans=0.025 2024-06-20 12:44:32,264 INFO [train.py:1028] (0/2) Epoch 12, batch 2150, loss[loss=0.2263, simple_loss=0.2803, pruned_loss=0.08617, over 13277.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2803, pruned_loss=0.09356, over 2587603.84 frames. 
], batch size: 52, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:44:38,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=207973.33333333334, ans=0.07 2024-06-20 12:44:38,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=207973.33333333334, ans=0.125 2024-06-20 12:44:38,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.27 vs. limit=10.0 2024-06-20 12:44:46,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=207991.66666666666, ans=0.1 2024-06-20 12:44:53,810 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.97 vs. limit=10.0 2024-06-20 12:44:59,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=208028.33333333334, ans=0.125 2024-06-20 12:45:06,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=208046.66666666666, ans=0.0 2024-06-20 12:45:08,317 INFO [train.py:1028] (0/2) Epoch 12, batch 2200, loss[loss=0.2203, simple_loss=0.2614, pruned_loss=0.08963, over 13222.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2808, pruned_loss=0.09392, over 2588177.50 frames. ], batch size: 83, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:45:12,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=208065.0, ans=0.0 2024-06-20 12:45:16,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=208083.33333333334, ans=0.025 2024-06-20 12:45:16,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.84 vs. limit=10.0 2024-06-20 12:45:21,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=208101.66666666666, ans=0.0 2024-06-20 12:45:22,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=208101.66666666666, ans=0.125 2024-06-20 12:45:24,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=208101.66666666666, ans=0.025 2024-06-20 12:45:26,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.858e+02 2.045e+02 2.280e+02 3.132e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 12:45:36,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=208138.33333333334, ans=0.125 2024-06-20 12:45:37,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.48 vs. limit=15.0 2024-06-20 12:45:38,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=208138.33333333334, ans=0.125 2024-06-20 12:45:44,097 INFO [train.py:1028] (0/2) Epoch 12, batch 2250, loss[loss=0.2524, simple_loss=0.2975, pruned_loss=0.1036, over 13289.00 frames. 
], tot_loss[loss=0.2341, simple_loss=0.2805, pruned_loss=0.09382, over 2587228.50 frames. ], batch size: 63, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:45:45,508 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.165e-01 2024-06-20 12:45:45,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=208156.66666666666, ans=0.0 2024-06-20 12:45:54,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208175.0, ans=0.1 2024-06-20 12:45:57,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=208193.33333333334, ans=0.2 2024-06-20 12:46:16,538 INFO [train.py:1028] (0/2) Epoch 12, batch 2300, loss[loss=0.2212, simple_loss=0.2681, pruned_loss=0.08717, over 12914.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2799, pruned_loss=0.09353, over 2582444.53 frames. ], batch size: 33, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:46:17,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=208248.33333333334, ans=0.125 2024-06-20 12:46:18,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=208248.33333333334, ans=0.125 2024-06-20 12:46:30,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208285.0, ans=0.1 2024-06-20 12:46:31,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=208285.0, ans=0.04949747468305833 2024-06-20 12:46:35,399 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.851e+02 2.035e+02 2.284e+02 3.226e+02, threshold=4.070e+02, percent-clipped=0.0 2024-06-20 12:46:40,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2024-06-20 12:46:42,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=208303.33333333334, ans=0.125 2024-06-20 12:46:45,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=208321.66666666666, ans=0.125 2024-06-20 12:46:51,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2024-06-20 12:46:54,434 INFO [train.py:1028] (0/2) Epoch 12, batch 2350, loss[loss=0.2465, simple_loss=0.2926, pruned_loss=0.1002, over 13180.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2801, pruned_loss=0.09368, over 2585590.22 frames. 
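The "WithLoss: name=..., loss-sum=..." entries (for instance loss-sum=1.165e-01 on the self-attention weights above) report an auxiliary loss that is attached directly to an intermediate activation, so the main training loop never has to plumb the extra term through. The real scaling.py module is presumably more involved; the sketch below shows only the generic attachment pattern, where logging the running total of aux would produce a line like the one above:

```python
import torch

class WithLoss(torch.autograd.Function):
    """Identity in the forward pass; the backward pass adds the gradient
    of an auxiliary loss on the activation to the incoming gradient."""
    @staticmethod
    def forward(ctx, x, aux_loss_fn):
        ctx.aux_loss_fn = aux_loss_fn
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        with torch.enable_grad():
            xd = grad_out.detach().requires_grad_(True)
            aux = ctx.aux_loss_fn(xd)             # e.g. an entropy penalty
            (aux_grad,) = torch.autograd.grad(aux, xd)
        return grad_out + aux_grad, None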
], batch size: 67, lr: 4.95e-03, grad_scale: 64.0 2024-06-20 12:47:02,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=208358.33333333334, ans=0.95 2024-06-20 12:47:09,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=208376.66666666666, ans=0.125 2024-06-20 12:47:15,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=208395.0, ans=0.125 2024-06-20 12:47:19,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=208395.0, ans=0.09899494936611666 2024-06-20 12:47:22,544 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0 2024-06-20 12:47:27,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.29 vs. limit=10.0 2024-06-20 12:47:30,270 INFO [train.py:1028] (0/2) Epoch 12, batch 2400, loss[loss=0.2352, simple_loss=0.2792, pruned_loss=0.09558, over 13223.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.2795, pruned_loss=0.09357, over 2588708.56 frames. ], batch size: 46, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:47:33,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=208431.66666666666, ans=0.2 2024-06-20 12:47:48,191 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 1.806e+02 1.932e+02 2.070e+02 2.777e+02, threshold=3.864e+02, percent-clipped=0.0 2024-06-20 12:47:54,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=208486.66666666666, ans=15.0 2024-06-20 12:47:57,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.41 vs. limit=15.0 2024-06-20 12:48:02,683 INFO [train.py:1028] (0/2) Epoch 12, batch 2450, loss[loss=0.2196, simple_loss=0.2668, pruned_loss=0.08613, over 13218.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.279, pruned_loss=0.09368, over 2584785.01 frames. 
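The "Whitening: name=..., metric=X vs. limit=Y" entries compare a measured whiteness statistic of an activation against the module's configured limit (values of 7.5, 10.0, 15.0 and 22.5 appear in this stretch). One standard metric of this kind, equal to 1.0 for perfectly "white" (identity-covariance) features and growing as the covariance spectrum gets lopsided, is the ratio of the mean squared eigenvalue to the squared mean eigenvalue. The sketch below uses that formulation; it is my own phrasing of the idea, not necessarily the exact scaling.py computation:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns a value >= 1.0 that equals
    1.0 exactly when the channel covariance is a multiple of the identity."""
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues, ascending
    return (eigs ** 2).mean() / eigs.mean() ** 2

# A training-time module would apply its corrective penalty only when
# this metric exceeds the configured whitening limit.
```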
], batch size: 63, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:48:09,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=208541.66666666666, ans=0.125 2024-06-20 12:48:12,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=208541.66666666666, ans=0.125 2024-06-20 12:48:12,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=208541.66666666666, ans=0.125 2024-06-20 12:48:13,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=208541.66666666666, ans=0.0 2024-06-20 12:48:17,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=208560.0, ans=0.125 2024-06-20 12:48:31,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=208596.66666666666, ans=0.1 2024-06-20 12:48:38,346 INFO [train.py:1028] (0/2) Epoch 12, batch 2500, loss[loss=0.2061, simple_loss=0.2533, pruned_loss=0.07952, over 13175.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2772, pruned_loss=0.0928, over 2588258.44 frames. ], batch size: 83, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:48:44,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2024-06-20 12:48:45,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208633.33333333334, ans=0.1 2024-06-20 12:48:56,297 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.575e+02 1.878e+02 2.066e+02 2.249e+02 3.091e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-20 12:48:57,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=208670.0, ans=0.125 2024-06-20 12:49:01,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=15.0 2024-06-20 12:49:08,518 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2024-06-20 12:49:10,893 INFO [train.py:1028] (0/2) Epoch 12, batch 2550, loss[loss=0.2207, simple_loss=0.2834, pruned_loss=0.07904, over 12604.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2767, pruned_loss=0.09244, over 2588157.63 frames. ], batch size: 22, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:49:31,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=208743.33333333334, ans=15.0 2024-06-20 12:49:33,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=208743.33333333334, ans=0.0 2024-06-20 12:49:39,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=208761.66666666666, ans=0.125 2024-06-20 12:49:41,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.32 vs. 
limit=22.5 2024-06-20 12:49:41,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.92 vs. limit=15.0 2024-06-20 12:49:42,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=208780.0, ans=0.0 2024-06-20 12:49:45,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=208780.0, ans=0.2 2024-06-20 12:49:47,764 INFO [train.py:1028] (0/2) Epoch 12, batch 2600, loss[loss=0.2345, simple_loss=0.2793, pruned_loss=0.09479, over 13281.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2753, pruned_loss=0.09211, over 2587539.61 frames. ], batch size: 52, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:49:50,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=208798.33333333334, ans=0.0 2024-06-20 12:49:59,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=208816.66666666666, ans=0.2 2024-06-20 12:50:06,176 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.777e+02 1.877e+02 2.056e+02 3.218e+02, threshold=3.755e+02, percent-clipped=0.0 2024-06-20 12:50:19,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=208890.0, ans=0.0 2024-06-20 12:50:20,382 INFO [train.py:1028] (0/2) Epoch 12, batch 2650, loss[loss=0.2544, simple_loss=0.2846, pruned_loss=0.1121, over 13050.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2736, pruned_loss=0.09171, over 2588092.51 frames. ], batch size: 144, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:50:20,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=208890.0, ans=0.125 2024-06-20 12:50:29,730 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.04 vs. limit=22.5 2024-06-20 12:50:32,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.89 vs. limit=10.0 2024-06-20 12:50:35,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208926.66666666666, ans=0.1 2024-06-20 12:50:45,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=208945.0, ans=0.125 2024-06-20 12:50:47,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=208945.0, ans=0.0 2024-06-20 12:50:54,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=208963.33333333334, ans=0.0 2024-06-20 12:50:55,888 INFO [train.py:1028] (0/2) Epoch 12, batch 2700, loss[loss=0.2406, simple_loss=0.2762, pruned_loss=0.1025, over 13307.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2722, pruned_loss=0.09134, over 2585902.25 frames. ], batch size: 89, lr: 4.94e-03, grad_scale: 64.0 2024-06-20 12:51:00,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.75 vs. 
limit=15.0 2024-06-20 12:51:04,490 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-20 12:51:07,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=209000.0, ans=0.1 2024-06-20 12:51:11,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.91 vs. limit=10.0 2024-06-20 12:51:12,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=209018.33333333334, ans=0.0 2024-06-20 12:51:13,081 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=15.0 2024-06-20 12:51:17,204 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.768e+02 1.872e+02 2.053e+02 2.762e+02, threshold=3.744e+02, percent-clipped=0.0 2024-06-20 12:51:31,472 INFO [train.py:1028] (0/2) Epoch 12, batch 2750, loss[loss=0.2533, simple_loss=0.2945, pruned_loss=0.1061, over 13257.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2718, pruned_loss=0.09109, over 2581066.95 frames. ], batch size: 43, lr: 4.94e-03, grad_scale: 128.0 2024-06-20 12:51:33,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=209073.33333333334, ans=0.125 2024-06-20 12:51:36,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.04 vs. limit=10.0 2024-06-20 12:51:42,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=209091.66666666666, ans=0.0 2024-06-20 12:52:04,989 INFO [train.py:1028] (0/2) Epoch 12, batch 2800, loss[loss=0.222, simple_loss=0.2634, pruned_loss=0.09034, over 10867.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2715, pruned_loss=0.09116, over 2579030.67 frames. ], batch size: 304, lr: 4.94e-03, grad_scale: 128.0 2024-06-20 12:52:25,408 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.899e+02 2.043e+02 2.244e+02 3.298e+02, threshold=4.085e+02, percent-clipped=0.0 2024-06-20 12:52:38,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=209238.33333333334, ans=0.0 2024-06-20 12:52:39,591 INFO [train.py:1028] (0/2) Epoch 12, batch 2850, loss[loss=0.2273, simple_loss=0.2699, pruned_loss=0.09233, over 13301.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2705, pruned_loss=0.09076, over 2576976.30 frames. ], batch size: 49, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:52:48,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=209275.0, ans=0.125 2024-06-20 12:52:54,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.96 vs. 
limit=22.5 2024-06-20 12:53:07,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209330.0, ans=0.1 2024-06-20 12:53:14,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2024-06-20 12:53:14,749 INFO [train.py:1028] (0/2) Epoch 12, batch 2900, loss[loss=0.2352, simple_loss=0.2808, pruned_loss=0.09478, over 13135.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2685, pruned_loss=0.0898, over 2585254.25 frames. ], batch size: 55, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:53:22,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.83 vs. limit=10.0 2024-06-20 12:53:30,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=209385.0, ans=0.125 2024-06-20 12:53:31,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=209385.0, ans=0.1 2024-06-20 12:53:33,673 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.796e+02 1.937e+02 2.090e+02 2.797e+02, threshold=3.874e+02, percent-clipped=0.0 2024-06-20 12:53:36,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.58 vs. limit=22.5 2024-06-20 12:53:39,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=209403.33333333334, ans=0.04949747468305833 2024-06-20 12:53:47,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.64 vs. limit=10.0 2024-06-20 12:53:47,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=209440.0, ans=0.04949747468305833 2024-06-20 12:53:47,952 INFO [train.py:1028] (0/2) Epoch 12, batch 2950, loss[loss=0.2205, simple_loss=0.2671, pruned_loss=0.08699, over 13260.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2685, pruned_loss=0.08987, over 2580072.22 frames. ], batch size: 43, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:53:54,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=209458.33333333334, ans=0.125 2024-06-20 12:54:22,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209513.33333333334, ans=0.1 2024-06-20 12:54:23,736 INFO [train.py:1028] (0/2) Epoch 12, batch 3000, loss[loss=0.226, simple_loss=0.2762, pruned_loss=0.0879, over 13211.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2675, pruned_loss=0.08927, over 2578892.50 frames. ], batch size: 59, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:54:23,737 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 12:54:28,854 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6542, 1.2210, 1.5988, 1.2056], device='cuda:0') 2024-06-20 12:54:31,692 INFO [train.py:1060] (0/2) Epoch 12, validation: loss=0.1935, simple_loss=0.2575, pruned_loss=0.06476, over 351949.00 frames. 
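The entry above is the periodic validation pass: training pauses, the fixed dev set (351949 frames) is scored, and a frame-weighted loss is reported; the "Maximum memory allocated" line that follows is the kind of figure torch.cuda.max_memory_allocated() returns. A sketch of such a validation loop, with model_forward standing in as an assumed helper for the recipe's actual loss computation:

```python
import torch

def compute_validation_loss(model, dev_loader, device):
    """Frame-weighted average loss over the fixed dev set."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = model_forward(model, batch, device)  # assumed helper
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    mem_mb = torch.cuda.max_memory_allocated(device) // (1024 ** 2)
    return tot_loss / tot_frames, mem_mb
```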
2024-06-20 12:54:31,693 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 12:54:37,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=209531.66666666666, ans=0.0 2024-06-20 12:54:49,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=209568.33333333334, ans=0.125 2024-06-20 12:54:50,416 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.804e+02 1.908e+02 2.121e+02 3.050e+02, threshold=3.816e+02, percent-clipped=0.0 2024-06-20 12:55:04,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=209605.0, ans=0.0 2024-06-20 12:55:07,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.07 vs. limit=22.5 2024-06-20 12:55:08,034 INFO [train.py:1028] (0/2) Epoch 12, batch 3050, loss[loss=0.2385, simple_loss=0.2756, pruned_loss=0.1007, over 13336.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2679, pruned_loss=0.09006, over 2579162.60 frames. ], batch size: 46, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:55:11,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=209623.33333333334, ans=0.125 2024-06-20 12:55:35,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=209696.66666666666, ans=0.2 2024-06-20 12:55:40,276 INFO [train.py:1028] (0/2) Epoch 12, batch 3100, loss[loss=0.2171, simple_loss=0.2591, pruned_loss=0.08761, over 13047.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2672, pruned_loss=0.08961, over 2580154.86 frames. ], batch size: 144, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:55:41,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=209715.0, ans=0.2 2024-06-20 12:55:58,624 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.801e+02 1.919e+02 2.117e+02 2.638e+02, threshold=3.838e+02, percent-clipped=0.0 2024-06-20 12:56:00,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=209770.0, ans=0.95 2024-06-20 12:56:15,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.48 vs. limit=22.5 2024-06-20 12:56:15,505 INFO [train.py:1028] (0/2) Epoch 12, batch 3150, loss[loss=0.2134, simple_loss=0.2549, pruned_loss=0.08596, over 12966.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2655, pruned_loss=0.08859, over 2582721.02 frames. ], batch size: 158, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:56:20,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=209806.66666666666, ans=0.125 2024-06-20 12:56:23,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=209825.0, ans=0.0 2024-06-20 12:56:23,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.78 vs. 
limit=15.0 2024-06-20 12:56:24,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.87 vs. limit=10.0 2024-06-20 12:56:33,291 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 12:56:42,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.18 vs. limit=15.0 2024-06-20 12:56:44,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=209880.0, ans=0.125 2024-06-20 12:56:48,093 INFO [train.py:1028] (0/2) Epoch 12, batch 3200, loss[loss=0.2091, simple_loss=0.258, pruned_loss=0.08009, over 13086.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2649, pruned_loss=0.08834, over 2581739.50 frames. ], batch size: 55, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:56:52,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=209898.33333333334, ans=0.125 2024-06-20 12:56:56,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=209916.66666666666, ans=0.125 2024-06-20 12:57:03,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=209935.0, ans=0.125 2024-06-20 12:57:03,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=209935.0, ans=0.0 2024-06-20 12:57:04,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=209935.0, ans=0.125 2024-06-20 12:57:08,599 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.774e+02 1.903e+02 2.113e+02 2.650e+02, threshold=3.805e+02, percent-clipped=0.0 2024-06-20 12:57:12,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2024-06-20 12:57:13,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=209953.33333333334, ans=0.125 2024-06-20 12:57:15,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.57 vs. limit=12.0 2024-06-20 12:57:15,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=209971.66666666666, ans=0.0 2024-06-20 12:57:16,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209971.66666666666, ans=0.1 2024-06-20 12:57:22,425 INFO [train.py:1028] (0/2) Epoch 12, batch 3250, loss[loss=0.2127, simple_loss=0.256, pruned_loss=0.08469, over 13207.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2637, pruned_loss=0.08792, over 2585775.64 frames. 
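The three numbers in every loss report are not independent: throughout this stretch the combined loss matches 0.5 x simple_loss + pruned_loss, i.e. a pruned-RNN-T objective with the simple (linear-joiner) loss down-weighted by half after warm-up. Checking a few tot_loss entries copied from the surrounding lines:

```python
reports = [
    # (loss, simple_loss, pruned_loss) from nearby tot_loss entries
    (0.2198, 0.2637, 0.08792),   # batch 3250, just above
    (0.2341, 0.2805, 0.09382),   # batch 2250
    (0.2335, 0.2794, 0.09384),   # batch 2000
]
for loss, simple, pruned in reports:
    assert abs(loss - (0.5 * simple + pruned)) < 5e-4  # all hold to rounding
```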
], batch size: 72, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:57:22,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=209990.0, ans=0.2 2024-06-20 12:57:34,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=210008.33333333334, ans=0.125 2024-06-20 12:57:38,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=210026.66666666666, ans=0.125 2024-06-20 12:57:43,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=210045.0, ans=0.09899494936611666 2024-06-20 12:57:45,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=210045.0, ans=0.0 2024-06-20 12:57:47,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=210045.0, ans=0.125 2024-06-20 12:57:51,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=210063.33333333334, ans=0.0 2024-06-20 12:57:55,509 INFO [train.py:1028] (0/2) Epoch 12, batch 3300, loss[loss=0.2394, simple_loss=0.2782, pruned_loss=0.1004, over 12752.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2637, pruned_loss=0.08797, over 2582526.47 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 128.0 2024-06-20 12:58:00,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=210081.66666666666, ans=0.0 2024-06-20 12:58:12,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=210118.33333333334, ans=0.0 2024-06-20 12:58:17,260 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.500e+02 1.887e+02 2.046e+02 2.191e+02 3.145e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 12:58:24,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=210155.0, ans=0.0 2024-06-20 12:58:31,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.51 vs. limit=15.0 2024-06-20 12:58:31,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=210173.33333333334, ans=0.2 2024-06-20 12:58:32,082 INFO [train.py:1028] (0/2) Epoch 12, batch 3350, loss[loss=0.2169, simple_loss=0.255, pruned_loss=0.08937, over 12920.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2635, pruned_loss=0.08814, over 2577684.42 frames. ], batch size: 158, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 12:58:38,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=210191.66666666666, ans=0.125 2024-06-20 12:58:41,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=210191.66666666666, ans=0.05 2024-06-20 12:58:42,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2024-06-20 12:58:43,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=210191.66666666666, ans=0.125 2024-06-20 12:58:52,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2024-06-20 12:58:52,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.54 vs. limit=22.5 2024-06-20 12:59:01,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=210246.66666666666, ans=0.035 2024-06-20 12:59:03,282 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.69 vs. limit=15.0 2024-06-20 12:59:07,941 INFO [train.py:1028] (0/2) Epoch 12, batch 3400, loss[loss=0.2104, simple_loss=0.2582, pruned_loss=0.08127, over 12658.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2631, pruned_loss=0.08799, over 2575615.09 frames. ], batch size: 22, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 12:59:08,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210265.0, ans=0.125 2024-06-20 12:59:12,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=210265.0, ans=0.125 2024-06-20 12:59:13,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=210265.0, ans=0.0 2024-06-20 12:59:13,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=210265.0, ans=0.125 2024-06-20 12:59:17,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.42 vs. limit=15.0 2024-06-20 12:59:19,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210283.33333333334, ans=0.125 2024-06-20 12:59:23,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210301.66666666666, ans=0.1 2024-06-20 12:59:26,225 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.797e+02 1.956e+02 2.193e+02 3.184e+02, threshold=3.911e+02, percent-clipped=0.0 2024-06-20 12:59:40,769 INFO [train.py:1028] (0/2) Epoch 12, batch 3450, loss[loss=0.2265, simple_loss=0.267, pruned_loss=0.09299, over 12724.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2625, pruned_loss=0.08753, over 2576824.61 frames. ], batch size: 176, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 12:59:45,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.76 vs. 
limit=10.0 2024-06-20 12:59:47,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210375.0, ans=0.125 2024-06-20 13:00:01,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=210393.33333333334, ans=0.0 2024-06-20 13:00:13,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=210430.0, ans=0.125 2024-06-20 13:00:16,538 INFO [train.py:1028] (0/2) Epoch 12, batch 3500, loss[loss=0.1979, simple_loss=0.2504, pruned_loss=0.07267, over 12832.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2619, pruned_loss=0.08719, over 2576381.75 frames. ], batch size: 33, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:00:18,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=210448.33333333334, ans=0.125 2024-06-20 13:00:20,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210448.33333333334, ans=0.125 2024-06-20 13:00:22,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.72 vs. limit=10.0 2024-06-20 13:00:36,349 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.745e+02 1.849e+02 2.017e+02 2.416e+02, threshold=3.699e+02, percent-clipped=0.0 2024-06-20 13:00:45,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.87 vs. limit=15.0 2024-06-20 13:00:45,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=210521.66666666666, ans=0.05 2024-06-20 13:00:54,373 INFO [train.py:1028] (0/2) Epoch 12, batch 3550, loss[loss=0.2034, simple_loss=0.2485, pruned_loss=0.07921, over 13213.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2606, pruned_loss=0.0865, over 2576751.83 frames. ], batch size: 95, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:01:02,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=210558.33333333334, ans=15.0 2024-06-20 13:01:17,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210595.0, ans=0.125 2024-06-20 13:01:27,879 INFO [train.py:1028] (0/2) Epoch 12, batch 3600, loss[loss=0.221, simple_loss=0.2712, pruned_loss=0.0854, over 13336.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2594, pruned_loss=0.08616, over 2580371.36 frames. 
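grad_scale is the mixed-precision loss-scaling factor: it sat at 64.0 earlier in this excerpt and doubled to 128.0 around batch 2750, the usual dynamic-scaling policy of growing the scale after a long run of overflow-free steps and halving it on overflow. PyTorch's own GradScaler implements exactly that policy; the growth_interval below is a guess, not necessarily the recipe's setting:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=64.0,       # the grad_scale logged earlier in this epoch
    growth_factor=2.0,     # 64.0 -> 128.0 once enough clean steps accumulate
    backoff_factor=0.5,    # halved whenever a step produces inf/nan grads
    growth_interval=2000,  # hypothetical; the actual interval may differ
)
# per step: scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()
```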
], batch size: 49, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:01:28,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=210631.66666666666, ans=0.2 2024-06-20 13:01:45,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210668.33333333334, ans=0.1 2024-06-20 13:01:47,089 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.767e+02 1.919e+02 2.148e+02 3.157e+02, threshold=3.839e+02, percent-clipped=0.0 2024-06-20 13:01:57,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=210705.0, ans=0.0 2024-06-20 13:01:59,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=210705.0, ans=0.0 2024-06-20 13:02:01,923 INFO [train.py:1028] (0/2) Epoch 12, batch 3650, loss[loss=0.2334, simple_loss=0.2667, pruned_loss=0.1, over 13044.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2595, pruned_loss=0.08629, over 2578610.84 frames. ], batch size: 102, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:02:02,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=210723.33333333334, ans=0.0 2024-06-20 13:02:09,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=210723.33333333334, ans=0.125 2024-06-20 13:02:20,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=210760.0, ans=0.0 2024-06-20 13:02:21,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=210760.0, ans=0.125 2024-06-20 13:02:25,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=210778.33333333334, ans=0.125 2024-06-20 13:02:25,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.92 vs. limit=15.0 2024-06-20 13:02:31,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=210796.66666666666, ans=0.125 2024-06-20 13:02:32,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=210796.66666666666, ans=0.125 2024-06-20 13:02:37,230 INFO [train.py:1028] (0/2) Epoch 12, batch 3700, loss[loss=0.2069, simple_loss=0.2586, pruned_loss=0.07757, over 13246.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.259, pruned_loss=0.08601, over 2584178.94 frames. 
], batch size: 72, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:02:43,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=210833.33333333334, ans=0.125 2024-06-20 13:02:52,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=210851.66666666666, ans=10.0 2024-06-20 13:02:52,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=210851.66666666666, ans=0.0 2024-06-20 13:02:58,854 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.735e+02 1.878e+02 2.094e+02 2.953e+02, threshold=3.756e+02, percent-clipped=0.0 2024-06-20 13:03:00,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=210870.0, ans=0.125 2024-06-20 13:03:08,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210888.33333333334, ans=0.1 2024-06-20 13:03:13,237 INFO [train.py:1028] (0/2) Epoch 12, batch 3750, loss[loss=0.2058, simple_loss=0.2606, pruned_loss=0.07554, over 12556.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2591, pruned_loss=0.08596, over 2585759.95 frames. ], batch size: 22, lr: 4.92e-03, grad_scale: 128.0 2024-06-20 13:03:21,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=210925.0, ans=0.0 2024-06-20 13:03:21,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=210925.0, ans=0.0 2024-06-20 13:03:29,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=210943.33333333334, ans=0.0 2024-06-20 13:03:44,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0 2024-06-20 13:03:44,914 INFO [train.py:1028] (0/2) Epoch 12, batch 3800, loss[loss=0.241, simple_loss=0.2743, pruned_loss=0.1038, over 13207.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2584, pruned_loss=0.08553, over 2583851.18 frames. ], batch size: 83, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:04:02,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=211035.0, ans=0.0 2024-06-20 13:04:03,394 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.511e+02 1.765e+02 1.955e+02 2.089e+02 3.004e+02, threshold=3.909e+02, percent-clipped=0.0 2024-06-20 13:04:10,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211053.33333333334, ans=0.125 2024-06-20 13:04:21,546 INFO [train.py:1028] (0/2) Epoch 12, batch 3850, loss[loss=0.2034, simple_loss=0.2383, pruned_loss=0.08426, over 13012.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2576, pruned_loss=0.085, over 2582918.71 frames. 
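Several of the scheduled values are stochastic-depth rates: conv_skip_rate, ff2_skip_rate and pos_emb_skip_rate have decayed to 0.0 by this point, while bypass.skip_rate still reads about 0.0495 in nearby entries. With that probability a sublayer's output is dropped for the whole batch and only the bypass/residual path is kept, a regularizer that anneals away as the rates decay. A sketch of the idea, with the module wiring assumed:

```python
import torch

def maybe_skip(residual, sublayer_out, skip_rate, training=True):
    """Stochastic-depth bypass: with probability skip_rate, drop the
    sublayer's contribution for this batch and keep only the residual."""
    if training and float(torch.rand(())) < skip_rate:
        return residual
    return residual + sublayer_out
```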
], batch size: 144, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:04:29,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=211108.33333333334, ans=0.0 2024-06-20 13:04:36,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=211126.66666666666, ans=0.125 2024-06-20 13:04:38,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.56 vs. limit=10.0 2024-06-20 13:04:39,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=211126.66666666666, ans=0.0 2024-06-20 13:04:39,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=211126.66666666666, ans=0.125 2024-06-20 13:04:47,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=211163.33333333334, ans=0.2 2024-06-20 13:04:54,263 INFO [train.py:1028] (0/2) Epoch 12, batch 3900, loss[loss=0.2224, simple_loss=0.2616, pruned_loss=0.0916, over 13235.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2578, pruned_loss=0.08527, over 2587562.50 frames. ], batch size: 83, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:04:55,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=211181.66666666666, ans=0.125 2024-06-20 13:04:59,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=15.0 2024-06-20 13:05:08,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=211200.0, ans=0.5 2024-06-20 13:05:14,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=211218.33333333334, ans=0.2 2024-06-20 13:05:15,707 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.792e+02 1.896e+02 2.111e+02 3.584e+02, threshold=3.792e+02, percent-clipped=0.0 2024-06-20 13:05:17,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=211236.66666666666, ans=0.125 2024-06-20 13:05:22,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211236.66666666666, ans=0.1 2024-06-20 13:05:28,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=211255.0, ans=0.04949747468305833 2024-06-20 13:05:30,476 INFO [train.py:1028] (0/2) Epoch 12, batch 3950, loss[loss=0.1926, simple_loss=0.2313, pruned_loss=0.07689, over 13101.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2572, pruned_loss=0.08478, over 2588440.40 frames. 
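Batch sizes in these reports swing from 22 up to roughly 300 while the per-batch frame counts stay near 11k to 13k: the sampler packs utterances by total duration rather than by count, so buckets of short cuts yield large batches and buckets of long cuts small ones. A toy duration-packing sketch of that flavor (not lhotse's actual DynamicBucketingSampler internals; max_duration is left as a parameter):

```python
def duration_batches(cuts, max_duration):
    """Greedily pack (cut_id, seconds) pairs so each batch's summed
    duration stays under max_duration: short cuts => many per batch."""
    batch, total = [], 0.0
    for cut in sorted(cuts, key=lambda c: c[1]):  # crude stand-in for bucketing
        if batch and total + cut[1] > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut)
        total += cut[1]
    if batch:
        yield batch
```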
], batch size: 132, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:05:36,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=211291.66666666666, ans=0.0 2024-06-20 13:05:44,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=211310.0, ans=0.0 2024-06-20 13:05:45,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=211310.0, ans=0.125 2024-06-20 13:06:03,378 INFO [train.py:1028] (0/2) Epoch 12, batch 4000, loss[loss=0.1922, simple_loss=0.248, pruned_loss=0.06821, over 12893.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2564, pruned_loss=0.08436, over 2582652.64 frames. ], batch size: 39, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:06:09,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=211365.0, ans=0.0 2024-06-20 13:06:09,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=211383.33333333334, ans=0.2 2024-06-20 13:06:10,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=211383.33333333334, ans=0.125 2024-06-20 13:06:21,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=211401.66666666666, ans=0.0 2024-06-20 13:06:21,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=211401.66666666666, ans=0.125 2024-06-20 13:06:24,889 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.526e+02 1.825e+02 1.977e+02 2.201e+02 3.215e+02, threshold=3.954e+02, percent-clipped=0.0 2024-06-20 13:06:39,731 INFO [train.py:1028] (0/2) Epoch 12, batch 4050, loss[loss=0.2328, simple_loss=0.2608, pruned_loss=0.1024, over 10871.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2563, pruned_loss=0.08439, over 2580032.56 frames. ], batch size: 303, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:06:59,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=211511.66666666666, ans=0.04949747468305833 2024-06-20 13:07:03,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=211511.66666666666, ans=15.0 2024-06-20 13:07:16,186 INFO [train.py:1028] (0/2) Epoch 12, batch 4100, loss[loss=0.2067, simple_loss=0.2472, pruned_loss=0.08306, over 13055.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2565, pruned_loss=0.08481, over 2577453.88 frames. ], batch size: 102, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:07:16,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=211548.33333333334, ans=0.125 2024-06-20 13:07:16,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=211548.33333333334, ans=0.125 2024-06-20 13:07:17,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.09 vs. limit=15.0 2024-06-20 13:07:18,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.12 vs. 
limit=15.0 2024-06-20 13:07:19,822 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.00 vs. limit=15.0 2024-06-20 13:07:25,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=211566.66666666666, ans=0.125 2024-06-20 13:07:34,452 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.798e+02 1.927e+02 2.096e+02 2.654e+02, threshold=3.854e+02, percent-clipped=0.0 2024-06-20 13:07:34,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=211585.0, ans=0.125 2024-06-20 13:07:36,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=211603.33333333334, ans=0.2 2024-06-20 13:07:42,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211621.66666666666, ans=0.1 2024-06-20 13:07:43,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.69 vs. limit=15.0 2024-06-20 13:07:46,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=211621.66666666666, ans=0.0 2024-06-20 13:07:48,892 INFO [train.py:1028] (0/2) Epoch 12, batch 4150, loss[loss=0.2169, simple_loss=0.257, pruned_loss=0.08836, over 13114.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.256, pruned_loss=0.08449, over 2574574.38 frames. ], batch size: 55, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:07:53,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=211640.0, ans=0.0 2024-06-20 13:07:55,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=211658.33333333334, ans=0.125 2024-06-20 13:07:56,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=211658.33333333334, ans=0.0 2024-06-20 13:08:25,143 INFO [train.py:1028] (0/2) Epoch 12, batch 4200, loss[loss=0.235, simple_loss=0.2668, pruned_loss=0.1016, over 13159.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2558, pruned_loss=0.08462, over 2578049.45 frames. ], batch size: 103, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:08:28,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=211731.66666666666, ans=0.015 2024-06-20 13:08:30,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=211731.66666666666, ans=0.125 2024-06-20 13:08:41,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2024-06-20 13:08:43,398 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.580e+02 1.761e+02 1.905e+02 2.140e+02 2.853e+02, threshold=3.810e+02, percent-clipped=0.0 2024-06-20 13:08:44,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=211786.66666666666, ans=0.2 2024-06-20 13:08:57,926 INFO [train.py:1028] (0/2) Epoch 12, batch 4250, loss[loss=0.2045, simple_loss=0.2481, pruned_loss=0.08048, over 13320.00 frames. 
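The balancer entries (prob values of 0.125 here, with min_positive and max_positive bounds such as the 0.05 and 0.95 values in nearby lines) belong to an activation-statistics regularizer: with probability prob it measures per-channel statistics such as the fraction of positive activations and nudges gradients when that fraction leaves [min_positive, max_positive]. A simplified gradient-hook sketch of the idea; the sign conventions and scaling are illustrative only:

```python
import torch

def attach_balancer(x, min_positive=0.05, max_positive=0.95, scale=0.01):
    """x: activation with requires_grad=True. Adds a small gradient term
    pushing each channel's fraction of positive values back into
    [min_positive, max_positive] when it strays outside."""
    frac_pos = (x > 0).float().mean(dim=tuple(range(x.dim() - 1)))  # per channel
    push_down = (frac_pos > max_positive).float()
    push_up = (frac_pos < min_positive).float()
    def hook(grad):
        # under gradient descent, +grad lowers x: add for push_down, subtract for push_up
        return grad + scale * (push_down - push_up) * grad.abs().mean()
    x.register_hook(hook)
```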
], tot_loss[loss=0.2119, simple_loss=0.2553, pruned_loss=0.08424, over 2580196.51 frames. ], batch size: 46, lr: 4.91e-03, grad_scale: 128.0 2024-06-20 13:08:58,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.27 vs. limit=10.0 2024-06-20 13:09:00,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211823.33333333334, ans=0.1 2024-06-20 13:09:04,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=211841.66666666666, ans=0.0 2024-06-20 13:09:18,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=211860.0, ans=0.2 2024-06-20 13:09:19,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=211860.0, ans=0.025 2024-06-20 13:09:33,360 INFO [train.py:1028] (0/2) Epoch 12, batch 4300, loss[loss=0.2165, simple_loss=0.2621, pruned_loss=0.08548, over 13172.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2552, pruned_loss=0.0843, over 2581107.00 frames. ], batch size: 59, lr: 4.90e-03, grad_scale: 128.0 2024-06-20 13:09:43,731 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.51 vs. limit=15.0 2024-06-20 13:09:47,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.47 vs. limit=22.5 2024-06-20 13:09:50,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211951.66666666666, ans=0.125 2024-06-20 13:09:51,545 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.45 vs. limit=10.0 2024-06-20 13:09:51,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.715e+02 1.873e+02 2.005e+02 2.786e+02, threshold=3.746e+02, percent-clipped=0.0 2024-06-20 13:09:53,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=211970.0, ans=0.05 2024-06-20 13:10:05,746 INFO [train.py:1028] (0/2) Epoch 12, batch 4350, loss[loss=0.205, simple_loss=0.2468, pruned_loss=0.0816, over 13214.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2546, pruned_loss=0.08409, over 2585952.78 frames. ], batch size: 59, lr: 4.90e-03, grad_scale: 128.0 2024-06-20 13:10:18,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=212025.0, ans=0.0 2024-06-20 13:10:27,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.31 vs. 
limit=10.0 2024-06-20 13:10:31,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=212061.66666666666, ans=0.0 2024-06-20 13:10:32,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=212061.66666666666, ans=0.125 2024-06-20 13:10:35,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2024-06-20 13:10:39,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-20 13:10:41,517 INFO [train.py:1028] (0/2) Epoch 12, batch 4400, loss[loss=0.2183, simple_loss=0.2573, pruned_loss=0.08962, over 13217.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2542, pruned_loss=0.08405, over 2586757.63 frames. ], batch size: 83, lr: 4.90e-03, grad_scale: 128.0 2024-06-20 13:10:45,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2024-06-20 13:10:47,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=212116.66666666666, ans=0.125 2024-06-20 13:10:52,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.17 vs. limit=15.0 2024-06-20 13:10:56,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.47 vs. limit=15.0 2024-06-20 13:11:00,157 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.722e+02 1.833e+02 1.981e+02 2.957e+02, threshold=3.665e+02, percent-clipped=0.0 2024-06-20 13:11:12,268 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.30 vs. limit=12.0 2024-06-20 13:11:16,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=212190.0, ans=0.2 2024-06-20 13:11:17,506 INFO [train.py:1028] (0/2) Epoch 12, batch 4450, loss[loss=0.2318, simple_loss=0.2803, pruned_loss=0.09166, over 12902.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2548, pruned_loss=0.08431, over 2581309.56 frames. ], batch size: 33, lr: 4.90e-03, grad_scale: 64.0 2024-06-20 13:11:19,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=212190.0, ans=0.125 2024-06-20 13:11:28,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=8.0 2024-06-20 13:11:28,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2024-06-20 13:11:31,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. 
limit=22.5 2024-06-20 13:11:36,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=212245.0, ans=0.125 2024-06-20 13:11:42,865 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0 2024-06-20 13:11:49,345 INFO [train.py:1028] (0/2) Epoch 12, batch 4500, loss[loss=0.2061, simple_loss=0.2502, pruned_loss=0.08101, over 13273.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2542, pruned_loss=0.08405, over 2584943.64 frames. ], batch size: 89, lr: 4.90e-03, grad_scale: 64.0 2024-06-20 13:12:03,840 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.797e+00 2024-06-20 13:12:07,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=212318.33333333334, ans=0.125 2024-06-20 13:12:08,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=212318.33333333334, ans=0.0 2024-06-20 13:12:08,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.35 vs. limit=15.0 2024-06-20 13:12:09,131 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.728e+02 1.899e+02 2.090e+02 2.847e+02, threshold=3.798e+02, percent-clipped=0.0 2024-06-20 13:12:10,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=212336.66666666666, ans=0.125 2024-06-20 13:12:22,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=212355.0, ans=0.0 2024-06-20 13:12:26,242 INFO [train.py:1028] (0/2) Epoch 12, batch 4550, loss[loss=0.205, simple_loss=0.2488, pruned_loss=0.08059, over 13259.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2545, pruned_loss=0.08408, over 2588564.50 frames. ], batch size: 52, lr: 4.90e-03, grad_scale: 64.0 2024-06-20 13:12:36,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=212391.66666666666, ans=0.0 2024-06-20 13:12:38,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2024-06-20 13:12:51,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.35 vs. limit=15.0 2024-06-20 13:12:55,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=212446.66666666666, ans=0.0 2024-06-20 13:12:59,353 INFO [train.py:1028] (0/2) Epoch 12, batch 4600, loss[loss=0.2162, simple_loss=0.2587, pruned_loss=0.08685, over 12564.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2547, pruned_loss=0.08415, over 2584853.98 frames. 
], batch size: 202, lr: 4.90e-03, grad_scale: 64.0 2024-06-20 13:13:00,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=212465.0, ans=0.2 2024-06-20 13:13:20,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=212501.66666666666, ans=0.125 2024-06-20 13:13:21,757 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.805e+02 1.974e+02 2.252e+02 3.445e+02, threshold=3.949e+02, percent-clipped=0.0 2024-06-20 13:13:34,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=212538.33333333334, ans=0.125 2024-06-20 13:13:35,355 INFO [train.py:1028] (0/2) Epoch 12, batch 4650, loss[loss=0.2015, simple_loss=0.2429, pruned_loss=0.08, over 13055.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.254, pruned_loss=0.08416, over 2588552.47 frames. ], batch size: 132, lr: 4.90e-03, grad_scale: 64.0 2024-06-20 13:13:46,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.11 vs. limit=22.5 2024-06-20 13:13:50,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=212593.33333333334, ans=0.07 2024-06-20 13:14:08,456 INFO [train.py:1028] (0/2) Epoch 12, batch 4700, loss[loss=0.2039, simple_loss=0.2492, pruned_loss=0.0793, over 12780.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2539, pruned_loss=0.08422, over 2583866.70 frames. ], batch size: 26, lr: 4.90e-03, grad_scale: 64.0 2024-06-20 13:14:13,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.22 vs. limit=15.0 2024-06-20 13:14:14,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=212648.33333333334, ans=0.125 2024-06-20 13:14:17,817 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-116000.pt 2024-06-20 13:14:23,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=212666.66666666666, ans=0.025 2024-06-20 13:14:24,895 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:14:35,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.58 vs. limit=15.0 2024-06-20 13:14:35,817 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.510e+02 1.771e+02 1.904e+02 2.043e+02 2.707e+02, threshold=3.809e+02, percent-clipped=0.0 2024-06-20 13:14:37,331 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:14:37,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.27 vs. limit=22.5 2024-06-20 13:14:43,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212721.66666666666, ans=0.1 2024-06-20 13:14:49,732 INFO [train.py:1028] (0/2) Epoch 12, batch 4750, loss[loss=0.2216, simple_loss=0.257, pruned_loss=0.09308, over 12540.00 frames. 
], tot_loss[loss=0.2115, simple_loss=0.254, pruned_loss=0.08451, over 2580900.93 frames. ], batch size: 202, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:14:49,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=212740.0, ans=0.125 2024-06-20 13:14:58,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=212758.33333333334, ans=0.035 2024-06-20 13:15:03,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.87 vs. limit=6.0 2024-06-20 13:15:16,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=212795.0, ans=0.2 2024-06-20 13:15:27,736 INFO [train.py:1028] (0/2) Epoch 12, batch 4800, loss[loss=0.2135, simple_loss=0.2575, pruned_loss=0.08473, over 13306.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2537, pruned_loss=0.08413, over 2578429.88 frames. ], batch size: 63, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:15:31,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212831.66666666666, ans=0.1 2024-06-20 13:15:46,889 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.765e+02 1.878e+02 2.043e+02 2.740e+02, threshold=3.756e+02, percent-clipped=0.0 2024-06-20 13:15:47,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.10 vs. limit=15.0 2024-06-20 13:15:48,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=212886.66666666666, ans=0.125 2024-06-20 13:16:00,563 INFO [train.py:1028] (0/2) Epoch 12, batch 4850, loss[loss=0.2038, simple_loss=0.2471, pruned_loss=0.08032, over 13268.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2533, pruned_loss=0.08397, over 2575852.60 frames. ], batch size: 89, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:16:07,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.80 vs. limit=22.5 2024-06-20 13:16:11,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=15.0 2024-06-20 13:16:13,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=212960.0, ans=0.025 2024-06-20 13:16:19,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=12.0 2024-06-20 13:16:27,450 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.43 vs. limit=22.5 2024-06-20 13:16:37,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=213015.0, ans=0.0 2024-06-20 13:16:37,788 INFO [train.py:1028] (0/2) Epoch 12, batch 4900, loss[loss=0.2056, simple_loss=0.2546, pruned_loss=0.07831, over 13225.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.253, pruned_loss=0.08371, over 2575141.35 frames. 
], batch size: 59, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:16:37,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=213015.0, ans=0.125 2024-06-20 13:16:57,179 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.499e+02 1.785e+02 1.993e+02 2.153e+02 2.876e+02, threshold=3.985e+02, percent-clipped=0.0 2024-06-20 13:17:08,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=213088.33333333334, ans=0.0 2024-06-20 13:17:14,158 INFO [train.py:1028] (0/2) Epoch 12, batch 4950, loss[loss=0.2198, simple_loss=0.2468, pruned_loss=0.09644, over 10995.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.253, pruned_loss=0.08408, over 2569039.80 frames. ], batch size: 304, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:17:14,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213106.66666666666, ans=0.1 2024-06-20 13:17:17,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=213106.66666666666, ans=0.09899494936611666 2024-06-20 13:17:17,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213106.66666666666, ans=0.1 2024-06-20 13:17:27,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=213143.33333333334, ans=15.0 2024-06-20 13:17:28,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.26 vs. limit=22.5 2024-06-20 13:17:28,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=213143.33333333334, ans=0.1 2024-06-20 13:17:44,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.58 vs. limit=15.0 2024-06-20 13:17:47,494 INFO [train.py:1028] (0/2) Epoch 12, batch 5000, loss[loss=0.2094, simple_loss=0.246, pruned_loss=0.08637, over 13172.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2531, pruned_loss=0.08387, over 2574694.62 frames. ], batch size: 95, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:18:01,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=213235.0, ans=0.125 2024-06-20 13:18:04,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.08 vs. limit=15.0 2024-06-20 13:18:07,090 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.729e+02 1.823e+02 2.050e+02 2.953e+02, threshold=3.646e+02, percent-clipped=0.0 2024-06-20 13:18:11,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=213253.33333333334, ans=0.2 2024-06-20 13:18:12,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=213253.33333333334, ans=0.2 2024-06-20 13:18:20,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.79 vs. 
limit=15.0 2024-06-20 13:18:21,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=213271.66666666666, ans=10.0 2024-06-20 13:18:22,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=213271.66666666666, ans=0.0 2024-06-20 13:18:24,879 INFO [train.py:1028] (0/2) Epoch 12, batch 5050, loss[loss=0.1913, simple_loss=0.2369, pruned_loss=0.07282, over 12953.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2532, pruned_loss=0.08358, over 2573000.01 frames. ], batch size: 36, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:18:35,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=213308.33333333334, ans=0.125 2024-06-20 13:18:46,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=213345.0, ans=0.125 2024-06-20 13:18:54,343 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-20 13:18:58,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=213363.33333333334, ans=0.2 2024-06-20 13:18:58,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=213363.33333333334, ans=0.125 2024-06-20 13:18:59,609 INFO [train.py:1028] (0/2) Epoch 12, batch 5100, loss[loss=0.2084, simple_loss=0.2612, pruned_loss=0.07783, over 12923.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2537, pruned_loss=0.0842, over 2569323.87 frames. ], batch size: 39, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:18:59,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=213381.66666666666, ans=0.0 2024-06-20 13:19:03,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=213381.66666666666, ans=0.0 2024-06-20 13:19:06,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213400.0, ans=0.125 2024-06-20 13:19:08,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=213400.0, ans=0.125 2024-06-20 13:19:22,840 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.784e+02 1.973e+02 2.199e+02 2.882e+02, threshold=3.946e+02, percent-clipped=0.0 2024-06-20 13:19:23,865 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2024-06-20 13:19:30,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.76 vs. limit=15.0 2024-06-20 13:19:32,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=213455.0, ans=0.0 2024-06-20 13:19:37,056 INFO [train.py:1028] (0/2) Epoch 12, batch 5150, loss[loss=0.2106, simple_loss=0.2478, pruned_loss=0.08668, over 13118.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2534, pruned_loss=0.08431, over 2571621.01 frames. 
], batch size: 132, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:19:42,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=213473.33333333334, ans=0.125 2024-06-20 13:19:44,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.06 vs. limit=6.0 2024-06-20 13:19:46,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=213491.66666666666, ans=0.125 2024-06-20 13:19:57,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=213528.33333333334, ans=0.125 2024-06-20 13:20:07,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=213546.66666666666, ans=0.04949747468305833 2024-06-20 13:20:07,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=213546.66666666666, ans=0.0 2024-06-20 13:20:15,458 INFO [train.py:1028] (0/2) Epoch 12, batch 5200, loss[loss=0.2167, simple_loss=0.2625, pruned_loss=0.0855, over 13166.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2532, pruned_loss=0.08418, over 2574220.81 frames. ], batch size: 95, lr: 4.89e-03, grad_scale: 64.0 2024-06-20 13:20:18,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=213565.0, ans=0.125 2024-06-20 13:20:22,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=213583.33333333334, ans=0.2 2024-06-20 13:20:28,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2024-06-20 13:20:35,063 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.451e+02 1.758e+02 1.857e+02 2.118e+02 2.911e+02, threshold=3.715e+02, percent-clipped=0.0 2024-06-20 13:20:45,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=213638.33333333334, ans=0.025 2024-06-20 13:20:48,766 INFO [train.py:1028] (0/2) Epoch 12, batch 5250, loss[loss=0.229, simple_loss=0.2706, pruned_loss=0.09364, over 13291.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2536, pruned_loss=0.08452, over 2571104.92 frames. 
], batch size: 52, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:20:53,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=213656.66666666666, ans=0.125 2024-06-20 13:21:04,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=213693.33333333334, ans=0.0 2024-06-20 13:21:04,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=213693.33333333334, ans=0.2 2024-06-20 13:21:16,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=213711.66666666666, ans=0.0 2024-06-20 13:21:18,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=213730.0, ans=0.125 2024-06-20 13:21:20,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=213730.0, ans=0.125 2024-06-20 13:21:22,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2024-06-20 13:21:23,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213730.0, ans=0.1 2024-06-20 13:21:25,066 INFO [train.py:1028] (0/2) Epoch 12, batch 5300, loss[loss=0.207, simple_loss=0.2433, pruned_loss=0.08529, over 12997.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2535, pruned_loss=0.08423, over 2568244.59 frames. ], batch size: 144, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:21:30,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213748.33333333334, ans=0.1 2024-06-20 13:21:34,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=213766.66666666666, ans=0.0 2024-06-20 13:21:38,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=213785.0, ans=0.2 2024-06-20 13:21:44,805 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.546e+02 1.727e+02 1.834e+02 2.014e+02 2.606e+02, threshold=3.667e+02, percent-clipped=0.0 2024-06-20 13:21:47,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=213803.33333333334, ans=0.0 2024-06-20 13:21:57,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.09 vs. limit=15.0 2024-06-20 13:21:59,114 INFO [train.py:1028] (0/2) Epoch 12, batch 5350, loss[loss=0.2447, simple_loss=0.2898, pruned_loss=0.09986, over 12008.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2534, pruned_loss=0.08415, over 2575206.13 frames. 
], batch size: 17, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:22:01,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=213840.0, ans=0.5 2024-06-20 13:22:02,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=213840.0, ans=0.2 2024-06-20 13:22:03,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=213840.0, ans=0.0 2024-06-20 13:22:05,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=213858.33333333334, ans=0.2 2024-06-20 13:22:34,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=213931.66666666666, ans=0.125 2024-06-20 13:22:34,746 INFO [train.py:1028] (0/2) Epoch 12, batch 5400, loss[loss=0.2281, simple_loss=0.2591, pruned_loss=0.09849, over 12178.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2538, pruned_loss=0.08465, over 2566870.50 frames. ], batch size: 240, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:22:54,251 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.570e+02 1.791e+02 1.911e+02 2.068e+02 2.817e+02, threshold=3.822e+02, percent-clipped=0.0 2024-06-20 13:23:11,954 INFO [train.py:1028] (0/2) Epoch 12, batch 5450, loss[loss=0.1928, simple_loss=0.2356, pruned_loss=0.07503, over 12960.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2529, pruned_loss=0.08389, over 2571256.58 frames. ], batch size: 26, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:23:22,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.46 vs. limit=15.0 2024-06-20 13:23:27,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=16.06 vs. limit=15.0 2024-06-20 13:23:35,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=214078.33333333334, ans=0.1 2024-06-20 13:23:45,346 INFO [train.py:1028] (0/2) Epoch 12, batch 5500, loss[loss=0.2466, simple_loss=0.2742, pruned_loss=0.1096, over 12305.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2532, pruned_loss=0.08375, over 2562997.27 frames. 
], batch size: 241, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:23:48,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=214115.0, ans=0.125 2024-06-20 13:23:54,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=214133.33333333334, ans=0.07 2024-06-20 13:24:04,949 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.689e+02 1.784e+02 1.952e+02 2.382e+02, threshold=3.568e+02, percent-clipped=0.0 2024-06-20 13:24:06,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=214170.0, ans=0.05 2024-06-20 13:24:15,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=214188.33333333334, ans=0.125 2024-06-20 13:24:15,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=214188.33333333334, ans=10.0 2024-06-20 13:24:22,645 INFO [train.py:1028] (0/2) Epoch 12, batch 5550, loss[loss=0.205, simple_loss=0.256, pruned_loss=0.07696, over 13216.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2529, pruned_loss=0.08345, over 2566891.06 frames. ], batch size: 43, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:24:34,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=214225.0, ans=0.04949747468305833 2024-06-20 13:24:38,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=214243.33333333334, ans=0.125 2024-06-20 13:24:40,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=214243.33333333334, ans=0.0 2024-06-20 13:24:46,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=214261.66666666666, ans=0.125 2024-06-20 13:24:49,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=214261.66666666666, ans=0.0 2024-06-20 13:24:51,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.10 vs. limit=15.0 2024-06-20 13:24:58,988 INFO [train.py:1028] (0/2) Epoch 12, batch 5600, loss[loss=0.2138, simple_loss=0.2593, pruned_loss=0.08412, over 13193.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.252, pruned_loss=0.08317, over 2569385.67 frames. ], batch size: 89, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:25:15,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=214335.0, ans=0.125 2024-06-20 13:25:20,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.50 vs. 
limit=22.5 2024-06-20 13:25:22,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.701e+02 1.840e+02 1.995e+02 3.174e+02, threshold=3.680e+02, percent-clipped=0.0 2024-06-20 13:25:22,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=214353.33333333334, ans=0.2 2024-06-20 13:25:35,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=12.0 2024-06-20 13:25:35,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=214390.0, ans=0.125 2024-06-20 13:25:35,976 INFO [train.py:1028] (0/2) Epoch 12, batch 5650, loss[loss=0.2194, simple_loss=0.2568, pruned_loss=0.09099, over 12552.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2523, pruned_loss=0.08309, over 2574389.40 frames. ], batch size: 202, lr: 4.88e-03, grad_scale: 64.0 2024-06-20 13:25:39,071 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.91 vs. limit=15.0 2024-06-20 13:25:40,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=214390.0, ans=0.125 2024-06-20 13:25:41,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=214390.0, ans=0.0 2024-06-20 13:25:42,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. limit=10.0 2024-06-20 13:25:44,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=214408.33333333334, ans=0.2 2024-06-20 13:25:45,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.04 vs. limit=15.0 2024-06-20 13:25:58,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=214445.0, ans=0.5 2024-06-20 13:26:02,870 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.38 vs. limit=15.0 2024-06-20 13:26:04,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=214463.33333333334, ans=0.0 2024-06-20 13:26:08,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=214463.33333333334, ans=0.125 2024-06-20 13:26:09,215 INFO [train.py:1028] (0/2) Epoch 12, batch 5700, loss[loss=0.2131, simple_loss=0.2597, pruned_loss=0.08323, over 13231.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2523, pruned_loss=0.08323, over 2578713.94 frames. ], batch size: 63, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:26:12,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.37 vs. 
limit=15.0 2024-06-20 13:26:23,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=214500.0, ans=0.0 2024-06-20 13:26:25,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=214518.33333333334, ans=0.1 2024-06-20 13:26:28,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.52 vs. limit=15.0 2024-06-20 13:26:30,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.83 vs. limit=6.0 2024-06-20 13:26:30,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=214518.33333333334, ans=0.125 2024-06-20 13:26:31,647 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.800e+02 2.007e+02 2.275e+02 3.250e+02, threshold=4.014e+02, percent-clipped=0.0 2024-06-20 13:26:33,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=214536.66666666666, ans=0.125 2024-06-20 13:26:40,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=214555.0, ans=0.0 2024-06-20 13:26:45,196 INFO [train.py:1028] (0/2) Epoch 12, batch 5750, loss[loss=0.2234, simple_loss=0.2552, pruned_loss=0.09582, over 12732.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2534, pruned_loss=0.08379, over 2580074.03 frames. ], batch size: 176, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:26:52,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.42 vs. limit=5.0 2024-06-20 13:26:54,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. limit=6.0 2024-06-20 13:26:58,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=214610.0, ans=0.0 2024-06-20 13:26:59,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=214610.0, ans=0.025 2024-06-20 13:27:15,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=214646.66666666666, ans=0.125 2024-06-20 13:27:17,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=214646.66666666666, ans=6.0 2024-06-20 13:27:21,539 INFO [train.py:1028] (0/2) Epoch 12, batch 5800, loss[loss=0.2232, simple_loss=0.2584, pruned_loss=0.09395, over 12689.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2551, pruned_loss=0.08505, over 2579813.04 frames. 
], batch size: 176, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:27:27,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=214665.0, ans=0.2 2024-06-20 13:27:39,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=214701.66666666666, ans=0.125 2024-06-20 13:27:40,284 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.545e+02 1.780e+02 1.908e+02 2.158e+02 3.257e+02, threshold=3.817e+02, percent-clipped=0.0 2024-06-20 13:27:45,054 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:27:45,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=214720.0, ans=0.2 2024-06-20 13:27:45,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=214720.0, ans=0.125 2024-06-20 13:27:46,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=12.0 2024-06-20 13:27:53,910 INFO [train.py:1028] (0/2) Epoch 12, batch 5850, loss[loss=0.237, simple_loss=0.2703, pruned_loss=0.1019, over 12621.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2567, pruned_loss=0.08575, over 2577891.19 frames. ], batch size: 202, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:27:59,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=15.0 2024-06-20 13:28:02,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2024-06-20 13:28:11,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=214793.33333333334, ans=0.0 2024-06-20 13:28:12,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=214793.33333333334, ans=0.2 2024-06-20 13:28:24,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. limit=6.0 2024-06-20 13:28:30,404 INFO [train.py:1028] (0/2) Epoch 12, batch 5900, loss[loss=0.2169, simple_loss=0.2609, pruned_loss=0.0864, over 13114.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2583, pruned_loss=0.08628, over 2578099.74 frames. 
], batch size: 121, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:28:37,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=214866.66666666666, ans=0.125 2024-06-20 13:28:39,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=214866.66666666666, ans=0.125 2024-06-20 13:28:39,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=214866.66666666666, ans=0.125 2024-06-20 13:28:41,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=214866.66666666666, ans=0.2 2024-06-20 13:28:41,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214866.66666666666, ans=0.1 2024-06-20 13:28:49,338 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2024-06-20 13:28:49,595 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.759e+02 1.871e+02 2.066e+02 2.811e+02, threshold=3.742e+02, percent-clipped=0.0 2024-06-20 13:28:53,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=214903.33333333334, ans=0.0 2024-06-20 13:29:00,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=214921.66666666666, ans=0.0 2024-06-20 13:29:03,387 INFO [train.py:1028] (0/2) Epoch 12, batch 5950, loss[loss=0.2156, simple_loss=0.2518, pruned_loss=0.08973, over 13122.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2597, pruned_loss=0.08702, over 2581962.33 frames. ], batch size: 121, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:29:10,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=214940.0, ans=0.025 2024-06-20 13:29:34,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.24 vs. limit=22.5 2024-06-20 13:29:39,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=215031.66666666666, ans=0.0 2024-06-20 13:29:40,179 INFO [train.py:1028] (0/2) Epoch 12, batch 6000, loss[loss=0.2852, simple_loss=0.3143, pruned_loss=0.1281, over 12215.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2616, pruned_loss=0.08771, over 2575724.08 frames. ], batch size: 241, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:29:40,180 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 13:29:47,413 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.4263, 4.9495, 5.0226, 4.8407], device='cuda:0') 2024-06-20 13:29:48,352 INFO [train.py:1060] (0/2) Epoch 12, validation: loss=0.1938, simple_loss=0.2582, pruned_loss=0.0647, over 351949.00 frames. 
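[Editor's note: the per-batch and validation records above share a fixed layout — "Epoch N, batch M, loss[...], tot_loss[loss=..., simple_loss=..., pruned_loss=..., over F frames. ], batch size: B, lr: L, grad_scale: G" — which makes the running tot_loss easy to extract mechanically. The following is a minimal standalone sketch, not part of icefall: the regex mirrors only the record format visible in this log, and the helper name iter_tot_loss and the file name train.log are made up for illustration.]

# Hypothetical log-mining helper (not from icefall): pull (epoch, batch, tot_loss)
# triples out of records shaped like the ones in this log.
import re

RECORD = re.compile(
    # "Epoch 12, batch 5000, ..." header, then skip the per-batch loss[...] part
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    # the running average: tot_loss[loss=..., simple_loss=..., pruned_loss=..., over F frames. ]
    r"tot_loss\[loss=(?P<loss>[\d.]+), simple_loss=(?P<simple>[\d.]+), "
    r"pruned_loss=(?P<pruned>[\d.]+), over (?P<frames>[\d.]+) frames\.\s*\]",
    re.DOTALL,  # records wrap across lines in this file, so '.' must span newlines
)

def iter_tot_loss(log_text):
    """Yield (epoch, batch, tot_loss) for every training record in log_text."""
    for m in RECORD.finditer(log_text):
        yield int(m.group("epoch")), int(m.group("batch")), float(m.group("loss"))

if __name__ == "__main__":
    with open("train.log") as f:  # assumed file name
        for epoch, batch, loss in iter_tot_loss(f.read()):
            print(epoch, batch, loss)

[Run over this section, such a parser would show tot_loss drifting from about 0.211 at epoch 12, batch 4300 up to about 0.238 by batch 6900–7000, the trend the raw records trace below. End of editor's note; the log resumes.]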
2024-06-20 13:29:48,352 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 13:29:49,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=215031.66666666666, ans=0.1 2024-06-20 13:30:06,126 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:30:08,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.849e+02 2.009e+02 2.193e+02 2.852e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 13:30:23,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=215105.0, ans=0.0 2024-06-20 13:30:26,897 INFO [train.py:1028] (0/2) Epoch 12, batch 6050, loss[loss=0.1943, simple_loss=0.244, pruned_loss=0.07227, over 12900.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2625, pruned_loss=0.08774, over 2578397.35 frames. ], batch size: 39, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:30:32,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=215123.33333333334, ans=0.125 2024-06-20 13:30:42,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=215160.0, ans=0.5 2024-06-20 13:30:49,145 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.23 vs. limit=15.0 2024-06-20 13:30:52,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215178.33333333334, ans=0.1 2024-06-20 13:31:01,032 INFO [train.py:1028] (0/2) Epoch 12, batch 6100, loss[loss=0.2255, simple_loss=0.2642, pruned_loss=0.09338, over 13111.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2636, pruned_loss=0.08836, over 2580141.35 frames. ], batch size: 121, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:31:03,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=215215.0, ans=0.0 2024-06-20 13:31:17,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=215251.66666666666, ans=0.0 2024-06-20 13:31:26,684 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.518e+02 1.840e+02 1.949e+02 2.179e+02 2.779e+02, threshold=3.897e+02, percent-clipped=0.0 2024-06-20 13:31:27,906 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.25 vs. limit=15.0 2024-06-20 13:31:28,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2024-06-20 13:31:39,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=215288.33333333334, ans=0.0 2024-06-20 13:31:40,956 INFO [train.py:1028] (0/2) Epoch 12, batch 6150, loss[loss=0.2466, simple_loss=0.272, pruned_loss=0.1105, over 10715.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2655, pruned_loss=0.08902, over 2578928.23 frames. 
], batch size: 305, lr: 4.87e-03, grad_scale: 64.0 2024-06-20 13:31:44,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.46 vs. limit=15.0 2024-06-20 13:31:49,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=215325.0, ans=0.05 2024-06-20 13:31:52,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215325.0, ans=0.125 2024-06-20 13:31:52,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=215325.0, ans=12.0 2024-06-20 13:31:54,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=215343.33333333334, ans=0.125 2024-06-20 13:32:01,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=215343.33333333334, ans=0.125 2024-06-20 13:32:07,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=215361.66666666666, ans=0.2 2024-06-20 13:32:10,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=215380.0, ans=0.0 2024-06-20 13:32:16,090 INFO [train.py:1028] (0/2) Epoch 12, batch 6200, loss[loss=0.2666, simple_loss=0.3156, pruned_loss=0.1088, over 13246.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2673, pruned_loss=0.09, over 2575904.59 frames. ], batch size: 89, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:32:27,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.09 vs. limit=22.5 2024-06-20 13:32:30,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=215416.66666666666, ans=0.0 2024-06-20 13:32:34,465 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.20 vs. limit=15.0 2024-06-20 13:32:36,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=215435.0, ans=0.125 2024-06-20 13:32:40,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.875e+02 2.035e+02 2.276e+02 3.398e+02, threshold=4.070e+02, percent-clipped=0.0 2024-06-20 13:32:42,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=215453.33333333334, ans=0.025 2024-06-20 13:32:49,769 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:32:52,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=215471.66666666666, ans=0.125 2024-06-20 13:32:53,988 INFO [train.py:1028] (0/2) Epoch 12, batch 6250, loss[loss=0.2203, simple_loss=0.2599, pruned_loss=0.09041, over 13240.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2682, pruned_loss=0.09031, over 2570447.46 frames. 
], batch size: 83, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:33:07,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=215526.66666666666, ans=0.1 2024-06-20 13:33:08,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.56 vs. limit=15.0 2024-06-20 13:33:10,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=215526.66666666666, ans=0.125 2024-06-20 13:33:17,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=215545.0, ans=0.025 2024-06-20 13:33:31,562 INFO [train.py:1028] (0/2) Epoch 12, batch 6300, loss[loss=0.237, simple_loss=0.2805, pruned_loss=0.09673, over 11221.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2699, pruned_loss=0.09107, over 2565376.69 frames. ], batch size: 16, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:33:35,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=215581.66666666666, ans=0.04949747468305833 2024-06-20 13:33:50,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=215618.33333333334, ans=0.125 2024-06-20 13:33:50,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=215618.33333333334, ans=0.0 2024-06-20 13:33:52,172 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.966e+02 2.217e+02 2.508e+02 3.661e+02, threshold=4.434e+02, percent-clipped=0.0 2024-06-20 13:33:59,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=215655.0, ans=0.025 2024-06-20 13:34:07,354 INFO [train.py:1028] (0/2) Epoch 12, batch 6350, loss[loss=0.2737, simple_loss=0.3104, pruned_loss=0.1185, over 12559.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2718, pruned_loss=0.0913, over 2574316.76 frames. ], batch size: 202, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:34:08,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=215673.33333333334, ans=0.125 2024-06-20 13:34:08,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=215673.33333333334, ans=0.04949747468305833 2024-06-20 13:34:11,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=215673.33333333334, ans=0.125 2024-06-20 13:34:12,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.51 vs. 
limit=6.0 2024-06-20 13:34:25,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=215710.0, ans=0.1 2024-06-20 13:34:39,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=215728.33333333334, ans=0.0 2024-06-20 13:34:41,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=215746.66666666666, ans=0.125 2024-06-20 13:34:43,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=215746.66666666666, ans=0.04949747468305833 2024-06-20 13:34:49,067 INFO [train.py:1028] (0/2) Epoch 12, batch 6400, loss[loss=0.2183, simple_loss=0.2678, pruned_loss=0.08437, over 13213.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2731, pruned_loss=0.09197, over 2576058.52 frames. ], batch size: 67, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:34:50,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=215765.0, ans=0.125 2024-06-20 13:35:12,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0 2024-06-20 13:35:12,372 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.866e+02 2.009e+02 2.274e+02 3.021e+02, threshold=4.019e+02, percent-clipped=0.0 2024-06-20 13:35:20,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=215838.33333333334, ans=15.0 2024-06-20 13:35:27,468 INFO [train.py:1028] (0/2) Epoch 12, batch 6450, loss[loss=0.2593, simple_loss=0.2949, pruned_loss=0.1119, over 12496.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2745, pruned_loss=0.09229, over 2582345.37 frames. ], batch size: 202, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:35:52,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=215893.33333333334, ans=0.0 2024-06-20 13:35:56,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2024-06-20 13:35:59,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=215911.66666666666, ans=0.125 2024-06-20 13:36:02,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.02 vs. limit=10.0 2024-06-20 13:36:05,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=215930.0, ans=0.0 2024-06-20 13:36:09,594 INFO [train.py:1028] (0/2) Epoch 12, batch 6500, loss[loss=0.2495, simple_loss=0.2864, pruned_loss=0.1063, over 10554.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2768, pruned_loss=0.09305, over 2584330.79 frames. ], batch size: 303, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:36:15,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=215948.33333333334, ans=0.0 2024-06-20 13:36:29,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.05 vs. 
limit=10.0 2024-06-20 13:36:30,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=215985.0, ans=0.125 2024-06-20 13:36:31,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=216003.33333333334, ans=0.05 2024-06-20 13:36:32,298 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.861e+02 2.032e+02 2.229e+02 3.062e+02, threshold=4.065e+02, percent-clipped=0.0 2024-06-20 13:36:33,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=216003.33333333334, ans=10.0 2024-06-20 13:36:39,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=216021.66666666666, ans=0.125 2024-06-20 13:36:47,173 INFO [train.py:1028] (0/2) Epoch 12, batch 6550, loss[loss=0.22, simple_loss=0.2673, pruned_loss=0.08639, over 12696.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2767, pruned_loss=0.09249, over 2588604.30 frames. ], batch size: 22, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:37:10,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216076.66666666666, ans=0.1 2024-06-20 13:37:15,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=216095.0, ans=0.0 2024-06-20 13:37:22,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.70 vs. limit=15.0 2024-06-20 13:37:28,707 INFO [train.py:1028] (0/2) Epoch 12, batch 6600, loss[loss=0.2162, simple_loss=0.2645, pruned_loss=0.084, over 13303.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2768, pruned_loss=0.09263, over 2590757.61 frames. ], batch size: 72, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:37:30,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=216131.66666666666, ans=0.125 2024-06-20 13:37:41,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=216150.0, ans=0.125 2024-06-20 13:37:46,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=216168.33333333334, ans=10.0 2024-06-20 13:37:47,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.60 vs. limit=15.0 2024-06-20 13:37:52,258 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.934e+02 2.070e+02 2.267e+02 2.911e+02, threshold=4.140e+02, percent-clipped=0.0 2024-06-20 13:38:07,496 INFO [train.py:1028] (0/2) Epoch 12, batch 6650, loss[loss=0.2565, simple_loss=0.3054, pruned_loss=0.1038, over 12921.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2787, pruned_loss=0.09351, over 2584355.80 frames. 
], batch size: 158, lr: 4.86e-03, grad_scale: 32.0 2024-06-20 13:38:16,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=216223.33333333334, ans=0.2 2024-06-20 13:38:34,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216278.33333333334, ans=0.1 2024-06-20 13:38:38,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.10 vs. limit=6.0 2024-06-20 13:38:51,133 INFO [train.py:1028] (0/2) Epoch 12, batch 6700, loss[loss=0.2406, simple_loss=0.2811, pruned_loss=0.1001, over 12816.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2799, pruned_loss=0.09424, over 2584699.01 frames. ], batch size: 177, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:39:03,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=216333.33333333334, ans=0.0 2024-06-20 13:39:04,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=216333.33333333334, ans=0.125 2024-06-20 13:39:09,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=216351.66666666666, ans=0.125 2024-06-20 13:39:14,089 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.928e+02 2.141e+02 2.350e+02 4.353e+02, threshold=4.282e+02, percent-clipped=1.0 2024-06-20 13:39:33,181 INFO [train.py:1028] (0/2) Epoch 12, batch 6750, loss[loss=0.293, simple_loss=0.3278, pruned_loss=0.1291, over 12216.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.2806, pruned_loss=0.09463, over 2579602.80 frames. ], batch size: 241, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:39:35,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=216406.66666666666, ans=0.125 2024-06-20 13:39:58,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=216461.66666666666, ans=0.1 2024-06-20 13:39:59,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=216461.66666666666, ans=0.0 2024-06-20 13:40:02,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=216461.66666666666, ans=0.125 2024-06-20 13:40:03,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=216480.0, ans=0.2 2024-06-20 13:40:04,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=216480.0, ans=0.125 2024-06-20 13:40:09,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=216480.0, ans=0.025 2024-06-20 13:40:11,798 INFO [train.py:1028] (0/2) Epoch 12, batch 6800, loss[loss=0.2475, simple_loss=0.2998, pruned_loss=0.09755, over 13239.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.2824, pruned_loss=0.0953, over 2581627.85 frames. 
], batch size: 67, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:40:27,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=216535.0, ans=0.0 2024-06-20 13:40:38,254 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.917e+02 2.060e+02 2.278e+02 3.379e+02, threshold=4.121e+02, percent-clipped=0.0 2024-06-20 13:40:39,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=216553.33333333334, ans=0.1 2024-06-20 13:40:53,790 INFO [train.py:1028] (0/2) Epoch 12, batch 6850, loss[loss=0.2712, simple_loss=0.3245, pruned_loss=0.1089, over 13261.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2823, pruned_loss=0.09502, over 2584353.11 frames. ], batch size: 63, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:40:59,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=216590.0, ans=0.0 2024-06-20 13:41:08,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.02 vs. limit=15.0 2024-06-20 13:41:21,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=216645.0, ans=0.0 2024-06-20 13:41:29,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=216663.33333333334, ans=0.0 2024-06-20 13:41:32,965 INFO [train.py:1028] (0/2) Epoch 12, batch 6900, loss[loss=0.2591, simple_loss=0.3076, pruned_loss=0.1053, over 13296.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.284, pruned_loss=0.09561, over 2586369.82 frames. ], batch size: 49, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:41:52,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216718.33333333334, ans=0.1 2024-06-20 13:42:00,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=15.0 2024-06-20 13:42:00,291 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.940e+02 2.227e+02 2.484e+02 3.852e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-20 13:42:06,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=216736.66666666666, ans=0.125 2024-06-20 13:42:06,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=216736.66666666666, ans=0.025 2024-06-20 13:42:16,110 INFO [train.py:1028] (0/2) Epoch 12, batch 6950, loss[loss=0.2385, simple_loss=0.2782, pruned_loss=0.0994, over 11573.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.2842, pruned_loss=0.09532, over 2579911.13 frames. ], batch size: 16, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:42:28,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=216791.66666666666, ans=0.125 2024-06-20 13:42:29,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.63 vs. 
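limit=22.5

The per-batch figures in the train.py:1028 lines are consistent with the logged loss being a fixed weighted sum of the two transducer losses, loss = 0.5 * simple_loss + pruned_loss: for the "batch 7000" record just below, 0.5 * 0.2911 + 0.1046 = 0.2502, which is exactly the printed loss, and the same relation holds for the running tot_loss triple. A minimal sketch of that combination, assuming exactly this 0.5 weighting (the helper name combine_losses is illustrative, not a function from train.py):

```python
# Sketch: reconstruct the logged "loss" from its two logged components,
# assuming loss = simple_loss_scale * simple_loss + pruned_loss with
# simple_loss_scale = 0.5. combine_losses is an illustrative name.
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    return simple_loss_scale * simple_loss + pruned_loss

# Checks against the "Epoch 12, batch 7000" record below:
assert abs(combine_losses(0.2911, 0.1046) - 0.2502) < 1e-3   # per-batch loss
assert abs(combine_losses(0.2846, 0.09524) - 0.2376) < 1e-3  # running tot_loss
```

The fractional frame counts in tot_loss[..., over N frames. ] suggest the totals are decayed running averages rather than plain sums; that detail is not reproduced here.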
2024-06-20 13:42:34,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=216810.0, ans=0.0 2024-06-20 13:42:46,712 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:42:52,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=216846.66666666666, ans=0.0 2024-06-20 13:42:52,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=216846.66666666666, ans=0.125 2024-06-20 13:42:54,986 INFO [train.py:1028] (0/2) Epoch 12, batch 7000, loss[loss=0.2502, simple_loss=0.2911, pruned_loss=0.1046, over 12951.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.2846, pruned_loss=0.09524, over 2576858.24 frames. ], batch size: 158, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:43:15,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=216901.66666666666, ans=0.025 2024-06-20 13:43:17,281 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.710e+01 2024-06-20 13:43:22,834 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.899e+02 2.008e+02 2.202e+02 3.785e+02, threshold=4.015e+02, percent-clipped=0.0 2024-06-20 13:43:25,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2024-06-20 13:43:31,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=216938.33333333334, ans=0.025 2024-06-20 13:43:32,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=216938.33333333334, ans=0.025 2024-06-20 13:43:38,980 INFO [train.py:1028] (0/2) Epoch 12, batch 7050, loss[loss=0.2527, simple_loss=0.3014, pruned_loss=0.102, over 12760.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.286, pruned_loss=0.09587, over 2584245.66 frames. ], batch size: 176, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:43:44,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.22 vs. limit=15.0 2024-06-20 13:43:55,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216993.33333333334, ans=0.1 2024-06-20 13:44:00,097 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.00 vs. limit=15.0 2024-06-20 13:44:00,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=217011.66666666666, ans=0.0 2024-06-20 13:44:02,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=12.0 2024-06-20 13:44:08,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=217011.66666666666, ans=0.125 2024-06-20 13:44:21,366 INFO [train.py:1028] (0/2) Epoch 12, batch 7100, loss[loss=0.2411, simple_loss=0.2919, pruned_loss=0.09511, over 13147.00 frames.
], tot_loss[loss=0.2393, simple_loss=0.2863, pruned_loss=0.09609, over 2575392.30 frames. ], batch size: 112, lr: 4.85e-03, grad_scale: 32.0 2024-06-20 13:44:34,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217066.66666666666, ans=0.1 2024-06-20 13:44:41,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=217085.0, ans=0.1 2024-06-20 13:44:41,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=217085.0, ans=0.125 2024-06-20 13:44:44,804 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.931e+02 2.085e+02 2.284e+02 3.010e+02, threshold=4.170e+02, percent-clipped=0.0 2024-06-20 13:44:58,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=217121.66666666666, ans=0.125 2024-06-20 13:44:58,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=217121.66666666666, ans=0.0 2024-06-20 13:44:59,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2024-06-20 13:45:00,134 INFO [train.py:1028] (0/2) Epoch 12, batch 7150, loss[loss=0.2602, simple_loss=0.3026, pruned_loss=0.1089, over 12485.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.2874, pruned_loss=0.09646, over 2573839.91 frames. ], batch size: 202, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:45:00,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=217140.0, ans=0.0 2024-06-20 13:45:25,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217195.0, ans=0.1 2024-06-20 13:45:33,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=217195.0, ans=0.0 2024-06-20 13:45:37,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=217213.33333333334, ans=0.125 2024-06-20 13:45:41,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=217231.66666666666, ans=0.125 2024-06-20 13:45:42,251 INFO [train.py:1028] (0/2) Epoch 12, batch 7200, loss[loss=0.2599, simple_loss=0.3084, pruned_loss=0.1057, over 13168.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2884, pruned_loss=0.09699, over 2578876.97 frames. ], batch size: 112, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:45:43,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=217231.66666666666, ans=0.125 2024-06-20 13:45:47,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.23 vs. 
limit=12.0 2024-06-20 13:45:50,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=217250.0, ans=0.0 2024-06-20 13:45:51,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=217250.0, ans=0.125 2024-06-20 13:45:53,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=217250.0, ans=0.0 2024-06-20 13:45:55,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=217250.0, ans=0.125 2024-06-20 13:45:57,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=217268.33333333334, ans=0.0 2024-06-20 13:45:59,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=217268.33333333334, ans=0.025 2024-06-20 13:46:03,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=217268.33333333334, ans=0.125 2024-06-20 13:46:05,998 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 1.915e+02 2.070e+02 2.223e+02 2.968e+02, threshold=4.139e+02, percent-clipped=0.0 2024-06-20 13:46:19,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.80 vs. limit=22.5 2024-06-20 13:46:22,585 INFO [train.py:1028] (0/2) Epoch 12, batch 7250, loss[loss=0.2309, simple_loss=0.283, pruned_loss=0.08943, over 12896.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.2893, pruned_loss=0.09711, over 2578177.91 frames. ], batch size: 36, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:46:39,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.92 vs. limit=15.0 2024-06-20 13:46:42,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=217360.0, ans=0.2 2024-06-20 13:46:56,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.12 vs. limit=15.0 2024-06-20 13:47:04,805 INFO [train.py:1028] (0/2) Epoch 12, batch 7300, loss[loss=0.2197, simple_loss=0.2729, pruned_loss=0.08321, over 12929.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.2906, pruned_loss=0.09761, over 2577465.49 frames. ], batch size: 36, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:47:14,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.71 vs. 
limit=22.5 2024-06-20 13:47:16,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=217433.33333333334, ans=0.125 2024-06-20 13:47:22,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217451.66666666666, ans=0.125 2024-06-20 13:47:27,851 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.910e+02 2.022e+02 2.185e+02 3.114e+02, threshold=4.044e+02, percent-clipped=0.0 2024-06-20 13:47:42,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217506.66666666666, ans=0.1 2024-06-20 13:47:42,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=217506.66666666666, ans=0.1 2024-06-20 13:47:42,926 INFO [train.py:1028] (0/2) Epoch 12, batch 7350, loss[loss=0.2449, simple_loss=0.2899, pruned_loss=0.09995, over 13279.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2913, pruned_loss=0.09792, over 2579874.82 frames. ], batch size: 46, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:47:53,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2024-06-20 13:47:55,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=217525.0, ans=0.125 2024-06-20 13:47:56,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=217525.0, ans=0.125 2024-06-20 13:48:11,433 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:48:25,870 INFO [train.py:1028] (0/2) Epoch 12, batch 7400, loss[loss=0.2499, simple_loss=0.3037, pruned_loss=0.09808, over 13229.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.2912, pruned_loss=0.09815, over 2585786.05 frames. ], batch size: 63, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:48:32,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=217598.33333333334, ans=0.0 2024-06-20 13:48:37,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=15.0 2024-06-20 13:48:41,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=217635.0, ans=0.125 2024-06-20 13:48:49,624 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 1.978e+02 2.089e+02 2.332e+02 3.208e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-20 13:48:52,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217653.33333333334, ans=0.125 2024-06-20 13:48:54,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. 
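limit=10.0

Each optim.py:487 WARNING summarizes the recent distribution of gradient norms as five quantiles plus a clipping threshold, and the threshold consistently equals Clipping_scale times the logged median: in the WARNING just below, 2.0 * 2.048e+02 = 4.096e+02, exactly the printed threshold, with percent-clipped reporting how often recent batches exceeded it. A sketch of that style of median-based adaptive clipping, assuming a rolling window of per-batch gradient norms (MedianGradClipper is an invented name, not the optim.py implementation):

```python
import collections
import torch

class MedianGradClipper:
    """Illustrative sketch: clip when the global gradient norm exceeds
    clipping_scale * median(recent norms), as the logged numbers suggest.
    NOT the actual icefall optim.py logic."""
    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=window)  # recent grad norms
        self.num_clipped = 0
        self.num_batches = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        # Global L2 norm over all gradients.
        norm = torch.linalg.vector_norm(
            torch.stack([p.grad.detach().norm() for p in params])).item()
        self.norms.append(norm)
        self.num_batches += 1
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        # percent-clipped in the log would be 100 * num_clipped / num_batches.
        return threshold
```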
2024-06-20 13:49:02,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=217671.66666666666, ans=0.035 2024-06-20 13:49:02,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=217671.66666666666, ans=0.2 2024-06-20 13:49:09,487 INFO [train.py:1028] (0/2) Epoch 12, batch 7450, loss[loss=0.2209, simple_loss=0.2658, pruned_loss=0.08794, over 12715.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.2912, pruned_loss=0.09808, over 2581218.53 frames. ], batch size: 29, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:49:21,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=217708.33333333334, ans=0.0 2024-06-20 13:49:27,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2024-06-20 13:49:38,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=217745.0, ans=0.125 2024-06-20 13:49:49,541 INFO [train.py:1028] (0/2) Epoch 12, batch 7500, loss[loss=0.2503, simple_loss=0.2884, pruned_loss=0.1061, over 10618.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.2925, pruned_loss=0.09881, over 2577962.89 frames. ], batch size: 303, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:50:13,181 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.875e+02 2.048e+02 2.243e+02 2.837e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-20 13:50:13,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=217836.66666666666, ans=0.125 2024-06-20 13:50:15,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=217836.66666666666, ans=0.125 2024-06-20 13:50:17,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=217836.66666666666, ans=0.025 2024-06-20 13:50:28,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=217855.0, ans=0.125 2024-06-20 13:50:32,726 INFO [train.py:1028] (0/2) Epoch 12, batch 7550, loss[loss=0.235, simple_loss=0.2691, pruned_loss=0.1004, over 12936.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.2931, pruned_loss=0.0994, over 2576423.50 frames. ], batch size: 158, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:50:34,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217873.33333333334, ans=0.1 2024-06-20 13:50:43,922 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.75 vs. limit=10.0 2024-06-20 13:50:44,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=217891.66666666666, ans=0.0 2024-06-20 13:50:46,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=217891.66666666666, ans=0.125 2024-06-20 13:50:50,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.86 vs.
limit=12.0 2024-06-20 13:50:51,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=217910.0, ans=0.09899494936611666 2024-06-20 13:51:03,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=217928.33333333334, ans=0.0 2024-06-20 13:51:06,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.57 vs. limit=15.0 2024-06-20 13:51:12,540 INFO [train.py:1028] (0/2) Epoch 12, batch 7600, loss[loss=0.2453, simple_loss=0.2926, pruned_loss=0.09899, over 13185.00 frames. ], tot_loss[loss=0.247, simple_loss=0.2942, pruned_loss=0.09991, over 2576104.88 frames. ], batch size: 83, lr: 4.84e-03, grad_scale: 32.0 2024-06-20 13:51:39,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 1.955e+02 2.083e+02 2.239e+02 3.004e+02, threshold=4.166e+02, percent-clipped=0.0 2024-06-20 13:51:43,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=218020.0, ans=0.125 2024-06-20 13:51:48,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=218038.33333333334, ans=0.125 2024-06-20 13:51:49,980 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.16 vs. limit=15.0 2024-06-20 13:51:50,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=218038.33333333334, ans=0.125 2024-06-20 13:51:52,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=218038.33333333334, ans=10.0 2024-06-20 13:51:55,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=218056.66666666666, ans=0.0 2024-06-20 13:51:55,555 INFO [train.py:1028] (0/2) Epoch 12, batch 7650, loss[loss=0.2342, simple_loss=0.282, pruned_loss=0.09325, over 12933.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.2944, pruned_loss=0.09998, over 2572302.52 frames. ], batch size: 33, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:52:00,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=218056.66666666666, ans=0.05 2024-06-20 13:52:01,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=218056.66666666666, ans=12.0 2024-06-20 13:52:11,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=218093.33333333334, ans=0.125 2024-06-20 13:52:13,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=218093.33333333334, ans=0.0 2024-06-20 13:52:15,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=218093.33333333334, ans=0.025 2024-06-20 13:52:35,368 INFO [train.py:1028] (0/2) Epoch 12, batch 7700, loss[loss=0.2703, simple_loss=0.3221, pruned_loss=0.1093, over 13228.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.2947, pruned_loss=0.09986, over 2569012.53 frames. 
], batch size: 63, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:52:52,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=218166.66666666666, ans=22.5 2024-06-20 13:53:02,349 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 1.944e+02 2.075e+02 2.290e+02 3.443e+02, threshold=4.150e+02, percent-clipped=0.0 2024-06-20 13:53:12,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=218221.66666666666, ans=0.2 2024-06-20 13:53:18,039 INFO [train.py:1028] (0/2) Epoch 12, batch 7750, loss[loss=0.2574, simple_loss=0.3111, pruned_loss=0.1019, over 13289.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.295, pruned_loss=0.1003, over 2574292.43 frames. ], batch size: 72, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:53:18,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.58 vs. limit=15.0 2024-06-20 13:53:24,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=218240.0, ans=0.0 2024-06-20 13:53:26,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.35 vs. limit=15.0 2024-06-20 13:53:27,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=218258.33333333334, ans=0.0 2024-06-20 13:53:29,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=218258.33333333334, ans=0.1 2024-06-20 13:53:39,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.79 vs. limit=15.0 2024-06-20 13:53:51,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.29 vs. limit=22.5 2024-06-20 13:53:55,260 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 13:53:59,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=218313.33333333334, ans=0.025 2024-06-20 13:54:01,223 INFO [train.py:1028] (0/2) Epoch 12, batch 7800, loss[loss=0.2601, simple_loss=0.3103, pruned_loss=0.1049, over 13209.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2957, pruned_loss=0.1004, over 2579422.60 frames. 
], batch size: 95, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:54:16,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=218368.33333333334, ans=0.04949747468305833 2024-06-20 13:54:25,309 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.919e+02 2.129e+02 2.375e+02 3.233e+02, threshold=4.258e+02, percent-clipped=0.0 2024-06-20 13:54:28,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=218386.66666666666, ans=0.2 2024-06-20 13:54:30,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=218386.66666666666, ans=0.1 2024-06-20 13:54:35,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=218405.0, ans=0.1 2024-06-20 13:54:35,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=218405.0, ans=0.04949747468305833 2024-06-20 13:54:37,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=218405.0, ans=0.2 2024-06-20 13:54:40,945 INFO [train.py:1028] (0/2) Epoch 12, batch 7850, loss[loss=0.2186, simple_loss=0.2696, pruned_loss=0.08376, over 11045.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.2964, pruned_loss=0.1006, over 2573787.43 frames. ], batch size: 16, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:54:51,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2024-06-20 13:54:56,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=218460.0, ans=0.125 2024-06-20 13:55:04,861 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.520e-03 2024-06-20 13:55:08,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=218478.33333333334, ans=0.0 2024-06-20 13:55:10,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=218478.33333333334, ans=0.125 2024-06-20 13:55:21,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=218496.66666666666, ans=0.1 2024-06-20 13:55:23,824 INFO [train.py:1028] (0/2) Epoch 12, batch 7900, loss[loss=0.2307, simple_loss=0.2846, pruned_loss=0.08847, over 13149.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2962, pruned_loss=0.1006, over 2573240.73 frames. ], batch size: 77, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:55:26,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.75 vs. limit=10.0 2024-06-20 13:55:40,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=218551.66666666666, ans=0.0 2024-06-20 13:55:43,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.58 vs. 
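limit=22.5

Almost every scaling.py:214 line is a ScheduledFloat read-out: a hyperparameter (dropout probability, skip rate, balancer probability, attention rate, and so on) whose current value ans depends on batch_count. The read-outs are consistent with piecewise-linear schedules that stay clamped once batch_count passes the final breakpoint, which is why values such as the dropout_p entries (ans=0.1) and the various skip rates (ans=0.0) no longer move this late in training. A minimal sketch of such a schedule; the breakpoints are invented for illustration (only the final 0.1 matches the logged dropout_p values), and the class is not the scaling.py implementation:

```python
class PiecewiseLinearSchedule:
    """Illustrative stand-in for a ScheduledFloat-style value: linear
    interpolation between (batch_count, value) breakpoints, clamped at
    both ends. The breakpoints passed in below are assumptions."""
    def __init__(self, *points):
        self.points = sorted(points)  # [(batch_count, value), ...]

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(217873.0))  # 0.1 -- far past the last breakpoint, fully decayed
```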
2024-06-20 13:55:46,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.84 vs. limit=12.0 2024-06-20 13:55:47,995 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.904e+02 1.975e+02 2.123e+02 3.024e+02, threshold=3.950e+02, percent-clipped=0.0 2024-06-20 13:55:50,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=218570.0, ans=0.125 2024-06-20 13:55:51,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.73 vs. limit=12.0 2024-06-20 13:56:07,861 INFO [train.py:1028] (0/2) Epoch 12, batch 7950, loss[loss=0.2553, simple_loss=0.2858, pruned_loss=0.1124, over 10662.00 frames. ], tot_loss[loss=0.249, simple_loss=0.2967, pruned_loss=0.1006, over 2575354.40 frames. ], batch size: 304, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:56:12,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=218606.66666666666, ans=0.125 2024-06-20 13:56:16,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2024-06-20 13:56:18,884 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=15.0 2024-06-20 13:56:23,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=218643.33333333334, ans=0.125 2024-06-20 13:56:34,280 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-20 13:56:35,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=218661.66666666666, ans=0.0 2024-06-20 13:56:48,061 INFO [train.py:1028] (0/2) Epoch 12, batch 8000, loss[loss=0.2224, simple_loss=0.2837, pruned_loss=0.08057, over 12607.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2972, pruned_loss=0.1003, over 2572461.11 frames. ], batch size: 29, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:56:50,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.97 vs.
limit=22.5 2024-06-20 13:57:02,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=218735.0, ans=0.125 2024-06-20 13:57:04,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=218735.0, ans=0.1 2024-06-20 13:57:04,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=218735.0, ans=0.0 2024-06-20 13:57:10,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=218753.33333333334, ans=0.125 2024-06-20 13:57:11,159 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 1.996e+02 2.156e+02 2.497e+02 4.158e+02, threshold=4.311e+02, percent-clipped=1.0 2024-06-20 13:57:20,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.63 vs. limit=22.5 2024-06-20 13:57:27,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=218771.66666666666, ans=0.125 2024-06-20 13:57:31,167 INFO [train.py:1028] (0/2) Epoch 12, batch 8050, loss[loss=0.2531, simple_loss=0.3119, pruned_loss=0.0972, over 13206.00 frames. ], tot_loss[loss=0.248, simple_loss=0.2964, pruned_loss=0.09977, over 2572125.99 frames. ], batch size: 83, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:57:57,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=15.0 2024-06-20 13:58:02,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=218863.33333333334, ans=0.2 2024-06-20 13:58:09,433 INFO [train.py:1028] (0/2) Epoch 12, batch 8100, loss[loss=0.2529, simple_loss=0.3035, pruned_loss=0.1012, over 13183.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2971, pruned_loss=0.1001, over 2576576.14 frames. ], batch size: 112, lr: 4.83e-03, grad_scale: 32.0 2024-06-20 13:58:18,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.81 vs. limit=10.0 2024-06-20 13:58:36,512 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 1.927e+02 2.078e+02 2.240e+02 3.380e+02, threshold=4.156e+02, percent-clipped=0.0 2024-06-20 13:58:40,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=218936.66666666666, ans=0.125 2024-06-20 13:58:41,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=218936.66666666666, ans=0.125 2024-06-20 13:58:52,373 INFO [train.py:1028] (0/2) Epoch 12, batch 8150, loss[loss=0.2328, simple_loss=0.2811, pruned_loss=0.09224, over 13113.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2975, pruned_loss=0.1, over 2580683.71 frames. 
], batch size: 121, lr: 4.82e-03, grad_scale: 32.0 2024-06-20 13:58:53,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=218973.33333333334, ans=0.125 2024-06-20 13:59:01,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=218991.66666666666, ans=0.125 2024-06-20 13:59:10,123 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.40 vs. limit=22.5 2024-06-20 13:59:10,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=219010.0, ans=10.0 2024-06-20 13:59:22,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=219046.66666666666, ans=0.125 2024-06-20 13:59:30,905 INFO [train.py:1028] (0/2) Epoch 12, batch 8200, loss[loss=0.2513, simple_loss=0.3027, pruned_loss=0.09991, over 13188.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.2981, pruned_loss=0.1004, over 2584287.42 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 13:59:49,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=219083.33333333334, ans=0.0 2024-06-20 13:59:54,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2024-06-20 13:59:57,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=219101.66666666666, ans=0.0 2024-06-20 13:59:59,237 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.988e+02 2.167e+02 2.477e+02 3.130e+02, threshold=4.334e+02, percent-clipped=0.0 2024-06-20 14:00:01,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219120.0, ans=0.1 2024-06-20 14:00:03,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219120.0, ans=0.1 2024-06-20 14:00:08,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=219138.33333333334, ans=0.125 2024-06-20 14:00:10,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=219138.33333333334, ans=0.125 2024-06-20 14:00:15,316 INFO [train.py:1028] (0/2) Epoch 12, batch 8250, loss[loss=0.2492, simple_loss=0.3091, pruned_loss=0.09463, over 13302.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.2982, pruned_loss=0.1002, over 2584228.71 frames. 
], batch size: 52, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:00:15,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219156.66666666666, ans=0.1 2024-06-20 14:00:40,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=219211.66666666666, ans=0.2 2024-06-20 14:00:40,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=219211.66666666666, ans=0.1 2024-06-20 14:00:40,585 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.54 vs. limit=12.0 2024-06-20 14:00:47,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=219230.0, ans=0.0 2024-06-20 14:00:59,389 INFO [train.py:1028] (0/2) Epoch 12, batch 8300, loss[loss=0.2448, simple_loss=0.2875, pruned_loss=0.1011, over 13021.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.2977, pruned_loss=0.09976, over 2582651.98 frames. ], batch size: 102, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:00:59,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=219248.33333333334, ans=0.2 2024-06-20 14:01:23,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 1.972e+02 2.100e+02 2.238e+02 2.935e+02, threshold=4.200e+02, percent-clipped=0.0 2024-06-20 14:01:38,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=219321.66666666666, ans=0.2 2024-06-20 14:01:40,357 INFO [train.py:1028] (0/2) Epoch 12, batch 8350, loss[loss=0.2539, simple_loss=0.3017, pruned_loss=0.103, over 13163.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.2979, pruned_loss=0.09956, over 2581790.08 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:01:53,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219358.33333333334, ans=0.1 2024-06-20 14:01:59,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=219376.66666666666, ans=0.125 2024-06-20 14:02:00,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=219376.66666666666, ans=10.0 2024-06-20 14:02:10,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=219395.0, ans=0.125 2024-06-20 14:02:19,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=219413.33333333334, ans=0.2 2024-06-20 14:02:21,846 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2024-06-20 14:02:23,914 INFO [train.py:1028] (0/2) Epoch 12, batch 8400, loss[loss=0.2161, simple_loss=0.267, pruned_loss=0.0826, over 12911.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.2985, pruned_loss=0.1002, over 2578464.52 frames. ], batch size: 39, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:02:29,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.95 vs. 
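limit=10.0

The scaling.py:1023 Whitening lines fire when a whiteness metric of a module's output gets close to (or exceeds) its limit, as in the metric=9.95 vs. limit=10.0 entry this record closes. One plausible reading of the metric is the eigenvalue-spread ratio E[lambda^2] / E[lambda]^2 of the feature covariance: it is 1.0 for perfectly white features and grows toward num_channels as a single direction dominates. The sketch below implements that reading as an assumption for illustration; it is not necessarily the exact scaling.py formula, and it uses a single group (the num_groups field in the log suggests channels are split into groups before the metric is computed):

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Assumed reading of the logged metric: eigenvalue spread
    E[lambda^2] / E[lambda]^2 of the covariance of x with shape
    (num_frames, num_channels); 1.0 means perfectly white."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]          # (C, C) covariance
    eigs = torch.linalg.eigvalsh(cov)     # real, non-negative for PSD cov
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

x = torch.randn(1000, 384)                # near-white input: metric ~ 1
print(whitening_metric(x))
x[:, 0] *= 20.0                           # one dominant direction
print(whitening_metric(x))                # metric rises sharply
```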
2024-06-20 14:02:32,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=219450.0, ans=0.1 2024-06-20 14:02:34,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.60 vs. limit=10.0 2024-06-20 14:02:39,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219468.33333333334, ans=0.1 2024-06-20 14:02:44,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=219468.33333333334, ans=0.025 2024-06-20 14:02:47,579 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.004e+02 2.187e+02 2.537e+02 3.691e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-20 14:02:51,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2024-06-20 14:02:55,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219505.0, ans=0.1 2024-06-20 14:03:01,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=219505.0, ans=0.125 2024-06-20 14:03:03,327 INFO [train.py:1028] (0/2) Epoch 12, batch 8450, loss[loss=0.2529, simple_loss=0.3012, pruned_loss=0.1023, over 13114.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.2998, pruned_loss=0.1009, over 2580784.85 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:03:17,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.34 vs. limit=15.0 2024-06-20 14:03:39,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=219596.66666666666, ans=0.125 2024-06-20 14:03:42,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.14 vs. limit=15.0 2024-06-20 14:03:47,028 INFO [train.py:1028] (0/2) Epoch 12, batch 8500, loss[loss=0.2411, simple_loss=0.2951, pruned_loss=0.09358, over 12790.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3008, pruned_loss=0.1014, over 2579361.45 frames.
], batch size: 29, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:04:01,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=219633.33333333334, ans=0.125 2024-06-20 14:04:02,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=219651.66666666666, ans=0.1 2024-06-20 14:04:09,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219651.66666666666, ans=0.1 2024-06-20 14:04:09,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219651.66666666666, ans=0.1 2024-06-20 14:04:11,224 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.986e+02 2.151e+02 2.409e+02 3.291e+02, threshold=4.302e+02, percent-clipped=0.0 2024-06-20 14:04:12,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=219670.0, ans=0.0 2024-06-20 14:04:21,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=219688.33333333334, ans=0.0 2024-06-20 14:04:27,317 INFO [train.py:1028] (0/2) Epoch 12, batch 8550, loss[loss=0.2429, simple_loss=0.2945, pruned_loss=0.0956, over 12613.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.2994, pruned_loss=0.1008, over 2577480.98 frames. ], batch size: 22, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:05:11,147 INFO [train.py:1028] (0/2) Epoch 12, batch 8600, loss[loss=0.2684, simple_loss=0.3091, pruned_loss=0.1139, over 13162.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.2998, pruned_loss=0.101, over 2574618.64 frames. ], batch size: 112, lr: 4.82e-03, grad_scale: 64.0 2024-06-20 14:05:14,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=219798.33333333334, ans=0.0 2024-06-20 14:05:23,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.88 vs. limit=15.0 2024-06-20 14:05:25,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=22.5 2024-06-20 14:05:28,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=219835.0, ans=0.125 2024-06-20 14:05:28,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=15.0 2024-06-20 14:05:35,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=9.26 vs. 
limit=12.0 2024-06-20 14:05:35,710 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.913e+02 2.023e+02 2.263e+02 2.929e+02, threshold=4.046e+02, percent-clipped=0.0 2024-06-20 14:05:39,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=219853.33333333334, ans=0.125 2024-06-20 14:05:52,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=219871.66666666666, ans=0.0 2024-06-20 14:05:55,535 INFO [train.py:1028] (0/2) Epoch 12, batch 8650, loss[loss=0.2441, simple_loss=0.2873, pruned_loss=0.1005, over 13081.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.2999, pruned_loss=0.1008, over 2576904.73 frames. ], batch size: 102, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:06:03,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.40 vs. limit=15.0 2024-06-20 14:06:05,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219908.33333333334, ans=0.1 2024-06-20 14:06:06,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=219908.33333333334, ans=0.125 2024-06-20 14:06:17,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=219926.66666666666, ans=0.125 2024-06-20 14:06:26,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219963.33333333334, ans=0.1 2024-06-20 14:06:28,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=219963.33333333334, ans=0.0 2024-06-20 14:06:31,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=219963.33333333334, ans=0.125 2024-06-20 14:06:34,568 INFO [train.py:1028] (0/2) Epoch 12, batch 8700, loss[loss=0.259, simple_loss=0.3073, pruned_loss=0.1053, over 13203.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3009, pruned_loss=0.1016, over 2573066.90 frames. 
], batch size: 59, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:06:35,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=219981.66666666666, ans=0.125 2024-06-20 14:06:39,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=219981.66666666666, ans=0.125 2024-06-20 14:06:41,990 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-120000.pt 2024-06-20 14:06:47,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220000.0, ans=0.1 2024-06-20 14:06:55,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=220018.33333333334, ans=0.125 2024-06-20 14:06:56,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220018.33333333334, ans=0.1 2024-06-20 14:07:04,084 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 1.921e+02 2.080e+02 2.261e+02 4.273e+02, threshold=4.159e+02, percent-clipped=1.0 2024-06-20 14:07:12,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=220036.66666666666, ans=0.125 2024-06-20 14:07:24,014 INFO [train.py:1028] (0/2) Epoch 12, batch 8750, loss[loss=0.2789, simple_loss=0.3149, pruned_loss=0.1214, over 13114.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3005, pruned_loss=0.1016, over 2568343.78 frames. ], batch size: 121, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:07:26,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=220073.33333333334, ans=0.125 2024-06-20 14:07:35,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=220091.66666666666, ans=0.125 2024-06-20 14:07:41,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=220110.0, ans=0.125 2024-06-20 14:07:44,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=220110.0, ans=0.04949747468305833 2024-06-20 14:08:04,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.98 vs. limit=6.0 2024-06-20 14:08:04,910 INFO [train.py:1028] (0/2) Epoch 12, batch 8800, loss[loss=0.2534, simple_loss=0.3065, pruned_loss=0.1001, over 13265.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.301, pruned_loss=0.1017, over 2573701.61 frames. 
], batch size: 72, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:08:12,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=220165.0, ans=0.125 2024-06-20 14:08:14,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=220165.0, ans=0.2 2024-06-20 14:08:30,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=220201.66666666666, ans=0.125 2024-06-20 14:08:31,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=220201.66666666666, ans=0.2 2024-06-20 14:08:33,114 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 1.949e+02 2.099e+02 2.318e+02 3.000e+02, threshold=4.197e+02, percent-clipped=0.0 2024-06-20 14:08:34,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=220220.0, ans=0.125 2024-06-20 14:08:38,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=220220.0, ans=0.125 2024-06-20 14:08:43,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=220238.33333333334, ans=0.0 2024-06-20 14:08:49,317 INFO [train.py:1028] (0/2) Epoch 12, batch 8850, loss[loss=0.2609, simple_loss=0.3089, pruned_loss=0.1064, over 12588.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3006, pruned_loss=0.1017, over 2560698.47 frames. ], batch size: 202, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:08:59,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=220275.0, ans=0.0 2024-06-20 14:09:00,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=220275.0, ans=0.125 2024-06-20 14:09:03,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=220275.0, ans=0.125 2024-06-20 14:09:03,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220275.0, ans=0.1 2024-06-20 14:09:14,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=220311.66666666666, ans=0.07 2024-06-20 14:09:32,596 INFO [train.py:1028] (0/2) Epoch 12, batch 8900, loss[loss=0.2626, simple_loss=0.3072, pruned_loss=0.109, over 12932.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3008, pruned_loss=0.1018, over 2560532.70 frames. ], batch size: 33, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:09:46,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=220366.66666666666, ans=0.125 2024-06-20 14:09:51,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. 
limit=6.0 2024-06-20 14:09:55,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=220403.33333333334, ans=0.125 2024-06-20 14:09:56,196 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.028e+02 2.239e+02 2.518e+02 3.270e+02, threshold=4.478e+02, percent-clipped=0.0 2024-06-20 14:09:58,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=220403.33333333334, ans=0.125 2024-06-20 14:10:12,117 INFO [train.py:1028] (0/2) Epoch 12, batch 8950, loss[loss=0.276, simple_loss=0.3171, pruned_loss=0.1175, over 12444.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3012, pruned_loss=0.1014, over 2561636.41 frames. ], batch size: 202, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:10:13,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=220440.0, ans=0.125 2024-06-20 14:10:14,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2024-06-20 14:10:37,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0 2024-06-20 14:10:45,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=220495.0, ans=0.0 2024-06-20 14:10:55,826 INFO [train.py:1028] (0/2) Epoch 12, batch 9000, loss[loss=0.2395, simple_loss=0.2914, pruned_loss=0.09374, over 13324.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3006, pruned_loss=0.1005, over 2568439.85 frames. ], batch size: 46, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:10:55,827 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 14:11:04,505 INFO [train.py:1060] (0/2) Epoch 12, validation: loss=0.1927, simple_loss=0.257, pruned_loss=0.06422, over 351949.00 frames. 2024-06-20 14:11:04,505 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 14:11:05,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=220531.66666666666, ans=0.125 2024-06-20 14:11:16,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=220550.0, ans=0.125 2024-06-20 14:11:22,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=220568.33333333334, ans=0.2 2024-06-20 14:11:27,960 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.928e+02 2.043e+02 2.203e+02 2.725e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 14:11:32,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=220586.66666666666, ans=0.0 2024-06-20 14:11:42,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=220605.0, ans=0.1 2024-06-20 14:11:43,471 INFO [train.py:1028] (0/2) Epoch 12, batch 9050, loss[loss=0.2083, simple_loss=0.2693, pruned_loss=0.07371, over 10849.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3012, pruned_loss=0.1009, over 2567145.54 frames. 
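], batch size: 16, lr: 4.81e-03, grad_scale: 64.0

The records around this point show the loop's periodic bookkeeping: a checkpoint was written to zipformer/exp/checkpoint-120000.pt a little earlier, and at batch 9000 above the loop pauses to compute a validation loss (loss=0.1927 over 351949.00 frames) and to report peak GPU memory. A skeleton of that cadence, assuming simple modulo counters; the names, intervals, stand-in loss, and checkpoint path below are illustrative, not taken from train.py:

```python
import torch

def training_loop(model, train_batches, valid_batches, optimizer,
                  valid_interval=3000, save_every_n=4000):
    """Illustrative cadence only; the real loop also handles DDP ranks,
    AMP grad scaling, LR scheduling, logging intervals, and more.
    valid_batches is assumed to be a list of (x, y) pairs."""
    batch_idx_train = 0
    for x, y in train_batches:
        batch_idx_train += 1
        loss = torch.nn.functional.mse_loss(model(x), y)  # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch_idx_train % valid_interval == 0:  # "Computing validation loss"
            model.eval()
            with torch.no_grad():
                valid_loss = sum(
                    torch.nn.functional.mse_loss(model(vx), vy).item()
                    for vx, vy in valid_batches) / max(len(valid_batches), 1)
            model.train()
        if batch_idx_train % save_every_n == 0:    # "Saving checkpoint to ..."
            torch.save(model.state_dict(), f"checkpoint-{batch_idx_train}.pt")
```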
], batch size: 16, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:11:57,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.43 vs. limit=15.0 2024-06-20 14:12:07,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=15.0 2024-06-20 14:12:10,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=15.0 2024-06-20 14:12:11,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-20 14:12:16,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.11 vs. limit=15.0 2024-06-20 14:12:22,862 INFO [train.py:1028] (0/2) Epoch 12, batch 9100, loss[loss=0.276, simple_loss=0.3354, pruned_loss=0.1083, over 13081.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3011, pruned_loss=0.1008, over 2566290.44 frames. ], batch size: 71, lr: 4.81e-03, grad_scale: 64.0 2024-06-20 14:12:36,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220733.33333333334, ans=0.1 2024-06-20 14:12:49,234 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 1.982e+02 2.172e+02 2.488e+02 3.363e+02, threshold=4.344e+02, percent-clipped=0.0 2024-06-20 14:13:04,593 INFO [train.py:1028] (0/2) Epoch 12, batch 9150, loss[loss=0.2368, simple_loss=0.2955, pruned_loss=0.08906, over 13224.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3014, pruned_loss=0.1011, over 2568581.76 frames. ], batch size: 77, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:13:06,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=220806.66666666666, ans=0.125 2024-06-20 14:13:24,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=220843.33333333334, ans=0.0 2024-06-20 14:13:32,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=220861.66666666666, ans=0.125 2024-06-20 14:13:38,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=220880.0, ans=0.125 2024-06-20 14:13:42,517 INFO [train.py:1028] (0/2) Epoch 12, batch 9200, loss[loss=0.2374, simple_loss=0.2979, pruned_loss=0.08841, over 12870.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3006, pruned_loss=0.1003, over 2572092.36 frames. 
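The recurring "Whitening: name=..., num_groups=G, num_channels=C, metric=X vs. limit=Y" lines come from a regularizer in scaling.py that pushes intermediate activations toward a "white" distribution, i.e. a per-group covariance proportional to the identity. The logged metric is scale-invariant, equals 1.0 for perfectly white features, and grows as channels become correlated or unbalanced; entries are printed when it crosses the limit and the penalty engages. An illustrative reconstruction of such a metric (the exact library computation may differ in details):

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns >= 1.0, with equality when
    each channel group's covariance is a multiple of the identity."""
    num_frames, num_channels = x.shape
    c = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, c).permute(1, 0, 2)  # (G, T, c)
    cov = x.transpose(1, 2) @ x / num_frames                   # per-group covariance
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean()             # mean eigenvalue
    mean_eig_sq = (cov ** 2).mean() * c                        # mean squared eigenvalue
    return mean_eig_sq / (mean_eig ** 2 + 1e-20)

white = torch.randn(10000, 384)
print(whitening_metric(white, 1))         # ~1.0: well under a limit like 15.0
collapsed = white[:, :1].repeat(1, 384)   # rank-1, fully correlated channels
print(whitening_metric(collapsed, 1))     # ~384.0: far above the limit
```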
], batch size: 36, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:13:50,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=220916.66666666666, ans=0.025 2024-06-20 14:14:03,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=220953.33333333334, ans=0.0 2024-06-20 14:14:04,434 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.916e+02 2.015e+02 2.198e+02 3.345e+02, threshold=4.031e+02, percent-clipped=0.0 2024-06-20 14:14:19,617 INFO [train.py:1028] (0/2) Epoch 12, batch 9250, loss[loss=0.2437, simple_loss=0.297, pruned_loss=0.09516, over 13235.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3012, pruned_loss=0.1005, over 2574031.22 frames. ], batch size: 67, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:14:21,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.44 vs. limit=22.5 2024-06-20 14:14:26,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=220990.0, ans=0.125 2024-06-20 14:14:35,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=221026.66666666666, ans=0.1 2024-06-20 14:14:35,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=221026.66666666666, ans=0.2 2024-06-20 14:14:49,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=221045.0, ans=0.025 2024-06-20 14:14:51,794 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.23 vs. limit=15.0 2024-06-20 14:14:51,810 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.68 vs. limit=15.0 2024-06-20 14:14:54,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=221063.33333333334, ans=0.0 2024-06-20 14:14:55,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=221063.33333333334, ans=0.0 2024-06-20 14:15:00,833 INFO [train.py:1028] (0/2) Epoch 12, batch 9300, loss[loss=0.2534, simple_loss=0.2982, pruned_loss=0.1043, over 13012.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3012, pruned_loss=0.1007, over 2571172.28 frames. 
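The optim.py WARNINGs are not errors; they are periodic summaries from the optimizer's gradient-clipping logic. The five "grad-norm quartiles" are the min/25%/median/75%/max of a buffer of recent per-batch gradient norms, the threshold is Clipping_scale times the median (2.0 x 2.015e+02 ≈ 4.031e+02 in the entry above), and percent-clipped=0.0 says no recent batch actually exceeded it. A hedged sketch of this bookkeeping (the class name and buffer size are illustrative, not ScaledAdam's internals):

```python
from collections import deque

class GradNormClipper:
    """Median-based clipping summary, as suggested by the optim.py WARNINGs."""
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)   # recent per-batch grad norms
        self.num_clipped = 0
        self.num_steps = 0

    def clip_factor(self, grad_norm: float) -> float:
        self.norms.append(grad_norm)
        self.num_steps += 1
        ordered = sorted(self.norms)
        quartiles = [ordered[int(q * (len(ordered) - 1))]
                     for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]   # 2.0 x median
        if grad_norm > threshold:
            self.num_clipped += 1
            return threshold / grad_norm   # shrink gradients to the threshold
        return 1.0

    def percent_clipped(self) -> float:
        return 100.0 * self.num_clipped / max(1, self.num_steps)
```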
], batch size: 39, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:15:10,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=221100.0, ans=0.125 2024-06-20 14:15:23,715 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.980e+02 2.150e+02 2.290e+02 3.532e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-20 14:15:25,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=221136.66666666666, ans=0.0 2024-06-20 14:15:33,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=221155.0, ans=0.0 2024-06-20 14:15:34,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=221155.0, ans=0.2 2024-06-20 14:15:38,983 INFO [train.py:1028] (0/2) Epoch 12, batch 9350, loss[loss=0.2375, simple_loss=0.2848, pruned_loss=0.0951, over 12404.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3017, pruned_loss=0.1011, over 2567527.71 frames. ], batch size: 22, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:15:39,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=221173.33333333334, ans=0.025 2024-06-20 14:15:43,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=221173.33333333334, ans=0.1 2024-06-20 14:15:50,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=221191.66666666666, ans=0.0 2024-06-20 14:15:57,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221210.0, ans=0.1 2024-06-20 14:16:05,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=221228.33333333334, ans=0.125 2024-06-20 14:16:15,428 INFO [train.py:1028] (0/2) Epoch 12, batch 9400, loss[loss=0.2503, simple_loss=0.3013, pruned_loss=0.09962, over 13255.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3028, pruned_loss=0.1018, over 2567070.46 frames. ], batch size: 52, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:16:19,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=221265.0, ans=0.0 2024-06-20 14:16:21,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=221265.0, ans=0.0 2024-06-20 14:16:22,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=221283.33333333334, ans=0.0 2024-06-20 14:16:28,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=221283.33333333334, ans=0.0 2024-06-20 14:16:31,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=221301.66666666666, ans=0.125 2024-06-20 14:16:37,862 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.973e+02 2.162e+02 2.358e+02 3.495e+02, threshold=4.324e+02, percent-clipped=0.0 2024-06-20 14:16:38,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.40 vs. 
limit=22.5 2024-06-20 14:16:53,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.84 vs. limit=15.0 2024-06-20 14:16:55,355 INFO [train.py:1028] (0/2) Epoch 12, batch 9450, loss[loss=0.2479, simple_loss=0.2943, pruned_loss=0.1007, over 12741.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3027, pruned_loss=0.1019, over 2568339.05 frames. ], batch size: 22, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:17:07,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=221375.0, ans=0.1 2024-06-20 14:17:12,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=221393.33333333334, ans=0.025 2024-06-20 14:17:31,360 INFO [train.py:1028] (0/2) Epoch 12, batch 9500, loss[loss=0.2599, simple_loss=0.3108, pruned_loss=0.1045, over 13306.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.302, pruned_loss=0.1013, over 2577959.47 frames. ], batch size: 43, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:17:49,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=221485.0, ans=0.1 2024-06-20 14:17:52,985 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.998e+02 2.143e+02 2.298e+02 3.211e+02, threshold=4.286e+02, percent-clipped=0.0 2024-06-20 14:17:55,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=221503.33333333334, ans=0.125 2024-06-20 14:18:04,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=221521.66666666666, ans=0.0 2024-06-20 14:18:07,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2024-06-20 14:18:09,752 INFO [train.py:1028] (0/2) Epoch 12, batch 9550, loss[loss=0.244, simple_loss=0.2927, pruned_loss=0.09761, over 12887.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3019, pruned_loss=0.1014, over 2572310.97 frames. ], batch size: 39, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:18:15,315 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=6.612e+00 2024-06-20 14:18:40,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.81 vs. limit=15.0 2024-06-20 14:18:46,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=221631.66666666666, ans=0.0 2024-06-20 14:18:46,623 INFO [train.py:1028] (0/2) Epoch 12, batch 9600, loss[loss=0.2718, simple_loss=0.3116, pruned_loss=0.116, over 10346.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3022, pruned_loss=0.1018, over 2571095.17 frames. 
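By far the most frequent entries, the scaling.py:214 "ScheduledFloat: name=..., batch_count=..., ans=..." lines, show that many regularization constants in the model (dropout probabilities such as encoder_embed.dropout.p above, balancer probs, skip rates, bypass scale minimums) are not fixed numbers but piecewise-linear functions of the global batch count; `ans` is the schedule's current value. A minimal sketch of the idea, assuming two or more (batch_count, value) breakpoints; the real ScheduledFloat in icefall's scaling.py carries extra machinery (defaults, arithmetic operators):

```python
class ScheduledFloat:
    """Piecewise-linear schedule over the global batch count (sketch)."""
    def __init__(self, *points):
        self.points = sorted(points)   # (batch_count, value) breakpoints
        self.batch_count = 0.0

    def __float__(self) -> float:
        x = self.batch_count
        x0, y0 = self.points[0]
        if x <= x0:
            return float(y0)
        for x1, y1 in self.points[1:]:
            if x <= x1:   # interpolate between the surrounding breakpoints
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
            x0, y0 = x1, y1
        return float(y0)   # past the last breakpoint: hold the final value

# e.g. a dropout rate annealed from 0.3 to 0.1 over the first 20k batches:
p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
p.batch_count = 221375.0
print(float(p))   # 0.1, matching "ans=0.1" for dropout entries at this point
```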
], batch size: 303, lr: 4.80e-03, grad_scale: 64.0 2024-06-20 14:18:57,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=221650.0, ans=0.2 2024-06-20 14:19:04,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=221668.33333333334, ans=0.0 2024-06-20 14:19:06,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=221668.33333333334, ans=0.04949747468305833 2024-06-20 14:19:08,049 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.944e+02 2.066e+02 2.300e+02 3.119e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-20 14:19:10,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=221686.66666666666, ans=0.125 2024-06-20 14:19:10,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.10 vs. limit=15.0 2024-06-20 14:19:23,154 INFO [train.py:1028] (0/2) Epoch 12, batch 9650, loss[loss=0.246, simple_loss=0.2843, pruned_loss=0.1039, over 13071.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.302, pruned_loss=0.1023, over 2562357.87 frames. ], batch size: 132, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:19:27,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=221723.33333333334, ans=0.04949747468305833 2024-06-20 14:19:28,573 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2024-06-20 14:19:30,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=221741.66666666666, ans=0.125 2024-06-20 14:19:35,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=221741.66666666666, ans=0.125 2024-06-20 14:19:51,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.01 vs. limit=15.0 2024-06-20 14:20:01,634 INFO [train.py:1028] (0/2) Epoch 12, batch 9700, loss[loss=0.2713, simple_loss=0.3103, pruned_loss=0.1161, over 13049.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3014, pruned_loss=0.1021, over 2556121.04 frames. ], batch size: 144, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:20:06,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=221815.0, ans=0.125 2024-06-20 14:20:18,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=221851.66666666666, ans=0.125 2024-06-20 14:20:23,479 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 1.968e+02 2.235e+02 2.458e+02 3.114e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-20 14:20:26,092 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. 
limit=6.0 2024-06-20 14:20:31,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=221888.33333333334, ans=0.0 2024-06-20 14:20:38,145 INFO [train.py:1028] (0/2) Epoch 12, batch 9750, loss[loss=0.2618, simple_loss=0.2985, pruned_loss=0.1125, over 13135.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.2998, pruned_loss=0.1014, over 2551999.74 frames. ], batch size: 132, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:20:47,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=221925.0, ans=0.125 2024-06-20 14:20:52,026 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.776e+00 2024-06-20 14:20:54,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=221943.33333333334, ans=0.025 2024-06-20 14:21:08,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=221980.0, ans=0.2 2024-06-20 14:21:11,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.96 vs. limit=6.0 2024-06-20 14:21:16,937 INFO [train.py:1028] (0/2) Epoch 12, batch 9800, loss[loss=0.236, simple_loss=0.288, pruned_loss=0.09199, over 12978.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.2996, pruned_loss=0.1011, over 2545089.86 frames. ], batch size: 39, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:21:25,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=222016.66666666666, ans=0.125 2024-06-20 14:21:33,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=222035.0, ans=0.2 2024-06-20 14:21:38,643 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.929e+02 2.070e+02 2.246e+02 2.993e+02, threshold=4.139e+02, percent-clipped=0.0 2024-06-20 14:21:46,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=222071.66666666666, ans=0.125 2024-06-20 14:21:50,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=222071.66666666666, ans=0.0 2024-06-20 14:21:52,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.84 vs. limit=12.0 2024-06-20 14:21:52,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=222090.0, ans=0.125 2024-06-20 14:21:53,325 INFO [train.py:1028] (0/2) Epoch 12, batch 9850, loss[loss=0.2636, simple_loss=0.3104, pruned_loss=0.1084, over 13051.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.2996, pruned_loss=0.1011, over 2536767.90 frames. ], batch size: 102, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:22:11,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.94 vs. 
limit=22.5 2024-06-20 14:22:12,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=222126.66666666666, ans=0.125 2024-06-20 14:22:19,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=222145.0, ans=0.125 2024-06-20 14:22:23,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=222163.33333333334, ans=0.0 2024-06-20 14:22:26,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222163.33333333334, ans=0.125 2024-06-20 14:22:29,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=222163.33333333334, ans=0.0 2024-06-20 14:22:31,196 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=12.0 2024-06-20 14:22:31,553 INFO [train.py:1028] (0/2) Epoch 12, batch 9900, loss[loss=0.2147, simple_loss=0.2674, pruned_loss=0.08095, over 12862.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.2989, pruned_loss=0.101, over 2530886.58 frames. ], batch size: 39, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:22:31,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=222181.66666666666, ans=0.125 2024-06-20 14:22:43,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.21 vs. limit=15.0 2024-06-20 14:22:53,889 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 1.984e+02 2.170e+02 2.409e+02 3.510e+02, threshold=4.341e+02, percent-clipped=0.0 2024-06-20 14:22:59,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=222236.66666666666, ans=0.0 2024-06-20 14:23:08,997 INFO [train.py:1028] (0/2) Epoch 12, batch 9950, loss[loss=0.242, simple_loss=0.2941, pruned_loss=0.09496, over 12561.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.2979, pruned_loss=0.1009, over 2526064.80 frames. ], batch size: 29, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:23:17,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222291.66666666666, ans=0.1 2024-06-20 14:23:22,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=222291.66666666666, ans=0.125 2024-06-20 14:23:23,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0 2024-06-20 14:23:28,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222310.0, ans=0.1 2024-06-20 14:23:28,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=222310.0, ans=0.125 2024-06-20 14:23:30,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=222310.0, ans=0.125 2024-06-20 14:23:47,424 INFO [train.py:1028] (0/2) Epoch 12, batch 10000, loss[loss=0.2691, simple_loss=0.3244, pruned_loss=0.1069, over 12602.00 frames. 
], tot_loss[loss=0.2512, simple_loss=0.299, pruned_loss=0.1017, over 2486814.12 frames. ], batch size: 22, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:23:47,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=222365.0, ans=10.0 2024-06-20 14:23:47,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=222365.0, ans=0.125 2024-06-20 14:23:55,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=222383.33333333334, ans=0.125 2024-06-20 14:23:55,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=222383.33333333334, ans=0.0 2024-06-20 14:24:01,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=222401.66666666666, ans=0.125 2024-06-20 14:24:06,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=222401.66666666666, ans=0.125 2024-06-20 14:24:08,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=222401.66666666666, ans=0.125 2024-06-20 14:24:09,195 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 1.983e+02 2.142e+02 2.370e+02 3.061e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-20 14:24:10,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.75 vs. limit=10.0 2024-06-20 14:24:16,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.07 vs. limit=10.0 2024-06-20 14:24:24,903 INFO [train.py:1028] (0/2) Epoch 12, batch 10050, loss[loss=0.2521, simple_loss=0.312, pruned_loss=0.09615, over 12835.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.2987, pruned_loss=0.1023, over 2444039.98 frames. ], batch size: 22, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:24:28,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.53 vs. limit=10.0 2024-06-20 14:24:35,470 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:24:38,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=222493.33333333334, ans=0.2 2024-06-20 14:24:43,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=222493.33333333334, ans=0.025 2024-06-20 14:25:00,284 INFO [train.py:1028] (0/2) Epoch 12, batch 10100, loss[loss=0.2479, simple_loss=0.2886, pruned_loss=0.1036, over 11316.00 frames. ], tot_loss[loss=0.251, simple_loss=0.2981, pruned_loss=0.1019, over 2424299.04 frames. ], batch size: 16, lr: 4.79e-03, grad_scale: 64.0 2024-06-20 14:25:00,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. 
limit=12.0 2024-06-20 14:25:10,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=222566.66666666666, ans=0.125 2024-06-20 14:25:16,555 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-12.pt 2024-06-20 14:27:34,246 INFO [train.py:1028] (0/2) Epoch 13, batch 0, loss[loss=0.2245, simple_loss=0.2751, pruned_loss=0.08696, over 12908.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2751, pruned_loss=0.08696, over 12908.00 frames. ], batch size: 36, lr: 4.60e-03, grad_scale: 64.0 2024-06-20 14:27:34,247 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 14:27:40,843 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.9977, 2.6479, 2.9150, 2.7799], device='cuda:0') 2024-06-20 14:27:42,253 INFO [train.py:1060] (0/2) Epoch 13, validation: loss=0.1944, simple_loss=0.2592, pruned_loss=0.06477, over 351949.00 frames. 2024-06-20 14:27:42,254 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 14:27:43,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=222579.5, ans=0.05 2024-06-20 14:27:45,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=222579.5, ans=0.0 2024-06-20 14:27:46,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=222579.5, ans=0.125 2024-06-20 14:27:48,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.19 vs. limit=15.0 2024-06-20 14:27:52,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=222597.83333333334, ans=0.125 2024-06-20 14:27:53,144 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.932e+02 2.063e+02 2.347e+02 3.300e+02, threshold=4.126e+02, percent-clipped=0.0 2024-06-20 14:28:09,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=222634.5, ans=0.125 2024-06-20 14:28:23,040 INFO [train.py:1028] (0/2) Epoch 13, batch 50, loss[loss=0.2286, simple_loss=0.2841, pruned_loss=0.0865, over 12693.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2798, pruned_loss=0.09432, over 574713.03 frames. 
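The checkpoint.py:75 line marks the end of epoch 12: the roughly two-minute gap between "Saving checkpoint to zipformer/exp/epoch-12.pt" (14:25:16) and the first entry of epoch 13 (14:27:34) is the time spent serializing state to disk, after which training resumes with a fresh validation pass and the running totals reset. A sketch of such an end-of-epoch save, simplified from what icefall's checkpoint helpers do (the real ones also store the scheduler, grad scaler, sampler state, and the averaged model):

```python
import torch

def save_checkpoint(filename, model, optimizer, epoch, batch_idx_train):
    """Sketch of an end-of-epoch save like zipformer/exp/epoch-12.pt."""
    if hasattr(model, "module"):   # unwrap DDP before saving
        model = model.module
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
```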
], batch size: 29, lr: 4.60e-03, grad_scale: 64.0 2024-06-20 14:28:24,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=222671.16666666666, ans=0.2 2024-06-20 14:28:36,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222689.5, ans=0.1 2024-06-20 14:28:39,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222707.83333333334, ans=0.1 2024-06-20 14:28:41,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=222707.83333333334, ans=0.125 2024-06-20 14:29:00,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222744.5, ans=0.1 2024-06-20 14:29:01,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=222744.5, ans=0.2 2024-06-20 14:29:04,028 INFO [train.py:1028] (0/2) Epoch 13, batch 100, loss[loss=0.2246, simple_loss=0.2777, pruned_loss=0.08573, over 13289.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2764, pruned_loss=0.09212, over 1017704.81 frames. ], batch size: 46, lr: 4.60e-03, grad_scale: 128.0 2024-06-20 14:29:04,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=222762.83333333334, ans=0.2 2024-06-20 14:29:10,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=222762.83333333334, ans=0.0 2024-06-20 14:29:17,417 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 1.875e+02 1.989e+02 2.211e+02 3.304e+02, threshold=3.978e+02, percent-clipped=0.0 2024-06-20 14:29:21,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=222781.16666666666, ans=0.0 2024-06-20 14:29:25,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.13 vs. limit=10.0 2024-06-20 14:29:34,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=222817.83333333334, ans=0.125 2024-06-20 14:29:37,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=222817.83333333334, ans=0.07 2024-06-20 14:29:40,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=222836.16666666666, ans=0.125 2024-06-20 14:29:42,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=222836.16666666666, ans=0.0 2024-06-20 14:29:43,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=222836.16666666666, ans=0.0 2024-06-20 14:29:45,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=222854.5, ans=0.95 2024-06-20 14:29:45,815 INFO [train.py:1028] (0/2) Epoch 13, batch 150, loss[loss=0.2258, simple_loss=0.2791, pruned_loss=0.08625, over 12554.00 frames. 
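Note the "grad_scale" field in the training summaries doubling from 64.0 to 128.0 at epoch 13, batch 100. That is the signature of dynamic fp16 loss scaling: gradients are computed on a scaled loss, the scale is halved whenever a non-finite gradient shows up, and doubled again after a long run of clean steps. A sketch of the usual scheme (the growth interval here is an assumption mirroring common GradScaler defaults, not a value taken from this run):

```python
class LossScaler:
    """Dynamic loss scaling as implied by grad_scale: 64.0 -> 128.0 (sketch)."""
    def __init__(self, init_scale: float = 1.0, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            self.scale /= 2.0        # back off after an overflow
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps % self.growth_interval == 0:
                self.scale *= 2.0    # e.g. 64.0 -> 128.0, as logged above
```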
], tot_loss[loss=0.2291, simple_loss=0.2761, pruned_loss=0.09102, over 1365489.84 frames. ], batch size: 29, lr: 4.60e-03, grad_scale: 128.0 2024-06-20 14:29:47,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=222854.5, ans=0.2 2024-06-20 14:29:49,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=222854.5, ans=0.0 2024-06-20 14:29:51,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222854.5, ans=0.1 2024-06-20 14:30:04,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=222891.16666666666, ans=0.125 2024-06-20 14:30:10,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=222909.5, ans=0.0 2024-06-20 14:30:24,625 INFO [train.py:1028] (0/2) Epoch 13, batch 200, loss[loss=0.2673, simple_loss=0.3047, pruned_loss=0.115, over 12618.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2765, pruned_loss=0.09081, over 1635009.65 frames. ], batch size: 202, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:30:29,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=222946.16666666666, ans=0.125 2024-06-20 14:30:31,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=222946.16666666666, ans=0.07 2024-06-20 14:30:33,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.47 vs. limit=22.5 2024-06-20 14:30:34,744 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.888e+02 2.138e+02 2.350e+02 3.201e+02, threshold=4.276e+02, percent-clipped=0.0 2024-06-20 14:30:34,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=222964.5, ans=0.0 2024-06-20 14:30:39,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=15.0 2024-06-20 14:30:41,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222982.83333333334, ans=0.1 2024-06-20 14:31:03,097 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:31:03,642 INFO [train.py:1028] (0/2) Epoch 13, batch 250, loss[loss=0.237, simple_loss=0.2746, pruned_loss=0.09975, over 13005.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2753, pruned_loss=0.0898, over 1846803.86 frames. ], batch size: 144, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:31:08,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. 
limit=6.0 2024-06-20 14:31:17,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=223056.16666666666, ans=0.2 2024-06-20 14:31:20,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=223056.16666666666, ans=0.125 2024-06-20 14:31:41,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=223111.16666666666, ans=0.125 2024-06-20 14:31:42,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=223111.16666666666, ans=0.125 2024-06-20 14:31:49,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=223129.5, ans=0.125 2024-06-20 14:31:49,654 INFO [train.py:1028] (0/2) Epoch 13, batch 300, loss[loss=0.2184, simple_loss=0.2627, pruned_loss=0.08704, over 13169.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2753, pruned_loss=0.09006, over 2009650.02 frames. ], batch size: 112, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:31:54,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=223129.5, ans=15.0 2024-06-20 14:31:54,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=223129.5, ans=0.0 2024-06-20 14:31:54,649 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.17 vs. limit=12.0 2024-06-20 14:31:55,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.11 vs. limit=12.0 2024-06-20 14:31:59,853 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.804e+02 1.940e+02 2.059e+02 2.624e+02, threshold=3.879e+02, percent-clipped=0.0 2024-06-20 14:32:03,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=223147.83333333334, ans=0.025 2024-06-20 14:32:04,802 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.62 vs. limit=15.0 2024-06-20 14:32:07,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.43 vs. limit=12.0 2024-06-20 14:32:27,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=223221.16666666666, ans=0.125 2024-06-20 14:32:28,420 INFO [train.py:1028] (0/2) Epoch 13, batch 350, loss[loss=0.22, simple_loss=0.2751, pruned_loss=0.0824, over 12946.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2753, pruned_loss=0.08998, over 2138475.11 frames. ], batch size: 33, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:32:53,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.92 vs. 
limit=15.0 2024-06-20 14:32:55,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=223276.16666666666, ans=0.125 2024-06-20 14:33:00,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=223294.5, ans=0.1 2024-06-20 14:33:07,867 INFO [train.py:1028] (0/2) Epoch 13, batch 400, loss[loss=0.2134, simple_loss=0.2746, pruned_loss=0.07612, over 13286.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2753, pruned_loss=0.08964, over 2239417.05 frames. ], batch size: 63, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:33:13,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=223312.83333333334, ans=0.0 2024-06-20 14:33:18,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.850e+02 2.035e+02 2.216e+02 3.237e+02, threshold=4.070e+02, percent-clipped=0.0 2024-06-20 14:33:18,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=223331.16666666666, ans=0.125 2024-06-20 14:33:19,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=223331.16666666666, ans=0.0 2024-06-20 14:33:20,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223331.16666666666, ans=0.125 2024-06-20 14:33:34,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=223367.83333333334, ans=0.0 2024-06-20 14:33:36,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=223367.83333333334, ans=0.125 2024-06-20 14:33:40,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=223386.16666666666, ans=0.125 2024-06-20 14:33:49,659 INFO [train.py:1028] (0/2) Epoch 13, batch 450, loss[loss=0.2168, simple_loss=0.2709, pruned_loss=0.08138, over 13224.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2758, pruned_loss=0.08992, over 2313542.11 frames. ], batch size: 67, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:33:49,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223404.5, ans=0.0 2024-06-20 14:34:12,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=223441.16666666666, ans=0.125 2024-06-20 14:34:21,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=223459.5, ans=0.0 2024-06-20 14:34:42,013 INFO [train.py:1028] (0/2) Epoch 13, batch 500, loss[loss=0.2283, simple_loss=0.2682, pruned_loss=0.09419, over 13100.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2757, pruned_loss=0.08958, over 2375840.51 frames. 
], batch size: 121, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:34:51,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=223514.5, ans=0.0 2024-06-20 14:34:53,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.783e+02 1.899e+02 2.002e+02 2.418e+02, threshold=3.798e+02, percent-clipped=0.0 2024-06-20 14:34:57,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=223514.5, ans=0.07 2024-06-20 14:35:08,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=223551.16666666666, ans=0.125 2024-06-20 14:35:19,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=223569.5, ans=0.025 2024-06-20 14:35:23,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=223569.5, ans=0.125 2024-06-20 14:35:27,436 INFO [train.py:1028] (0/2) Epoch 13, batch 550, loss[loss=0.2205, simple_loss=0.2615, pruned_loss=0.08977, over 12926.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2751, pruned_loss=0.08932, over 2420448.34 frames. ], batch size: 158, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:35:46,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=223624.5, ans=0.025 2024-06-20 14:35:47,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=223624.5, ans=0.125 2024-06-20 14:35:55,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223642.83333333334, ans=0.0 2024-06-20 14:36:03,293 INFO [train.py:1028] (0/2) Epoch 13, batch 600, loss[loss=0.2187, simple_loss=0.2575, pruned_loss=0.08997, over 13034.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2757, pruned_loss=0.08991, over 2457609.89 frames. 
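One way to read the tot_loss summaries: the "over N frames" count climbs steadily at the start of epoch 13 (574713 -> 1017704 -> 1365489 -> ... -> 2457609 above) and then plateaus around 2.5M, which is the fingerprint of a geometrically decayed running sum rather than a plain epoch average. Each batch, the accumulated (loss x frames, frames) totals are multiplied by a decay just under 1 before the new batch is added, so the printed loss is a frame-weighted moving average over roughly the last couple hundred batches. A sketch under that assumption (the decay constant of 1 - 1/200 is a guess consistent with the plateau):

```python
class RunningLoss:
    """Decayed frame-weighted average, as suggested by the tot_loss frame
    counts plateauing near 2.5M (~13k frames/batch x 200 batches)."""
    def __init__(self, reset_interval: int = 200):
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0   # decayed sum of per-frame loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```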
], batch size: 144, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:36:13,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.863e+02 1.978e+02 2.174e+02 3.321e+02, threshold=3.955e+02, percent-clipped=0.0 2024-06-20 14:36:14,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=223697.83333333334, ans=0.0 2024-06-20 14:36:21,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=223716.16666666666, ans=0.125 2024-06-20 14:36:26,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223734.5, ans=0.1 2024-06-20 14:36:27,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=223734.5, ans=0.1 2024-06-20 14:36:32,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=223734.5, ans=0.2 2024-06-20 14:36:36,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=223734.5, ans=0.0 2024-06-20 14:36:39,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=223752.83333333334, ans=0.1 2024-06-20 14:36:45,686 INFO [train.py:1028] (0/2) Epoch 13, batch 650, loss[loss=0.221, simple_loss=0.273, pruned_loss=0.08455, over 13249.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2757, pruned_loss=0.08956, over 2489339.45 frames. ], batch size: 59, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:36:50,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223771.16666666666, ans=0.1 2024-06-20 14:36:54,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=223771.16666666666, ans=0.0 2024-06-20 14:37:12,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=223826.16666666666, ans=0.0 2024-06-20 14:37:23,517 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:37:25,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=223844.5, ans=0.125 2024-06-20 14:37:27,935 INFO [train.py:1028] (0/2) Epoch 13, batch 700, loss[loss=0.2006, simple_loss=0.2626, pruned_loss=0.06928, over 13306.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2748, pruned_loss=0.08912, over 2511458.18 frames. 
], batch size: 46, lr: 4.59e-03, grad_scale: 128.0 2024-06-20 14:37:28,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=223862.83333333334, ans=0.125 2024-06-20 14:37:29,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223862.83333333334, ans=0.0 2024-06-20 14:37:31,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=223862.83333333334, ans=0.125 2024-06-20 14:37:37,573 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.888e+02 2.069e+02 2.370e+02 3.607e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-20 14:37:46,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=223899.5, ans=0.5 2024-06-20 14:37:57,930 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=15.0 2024-06-20 14:38:06,733 INFO [train.py:1028] (0/2) Epoch 13, batch 750, loss[loss=0.1991, simple_loss=0.2568, pruned_loss=0.0707, over 13237.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2749, pruned_loss=0.08877, over 2527861.89 frames. ], batch size: 63, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:38:17,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223972.83333333334, ans=0.125 2024-06-20 14:38:29,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=224009.5, ans=0.125 2024-06-20 14:38:31,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=224009.5, ans=0.5 2024-06-20 14:38:37,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=224027.83333333334, ans=0.125 2024-06-20 14:38:43,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=15.0 2024-06-20 14:38:45,488 INFO [train.py:1028] (0/2) Epoch 13, batch 800, loss[loss=0.2219, simple_loss=0.2796, pruned_loss=0.08204, over 12938.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2747, pruned_loss=0.08875, over 2541548.33 frames. ], batch size: 36, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:38:48,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=224046.16666666666, ans=0.125 2024-06-20 14:38:54,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=224064.5, ans=0.125 2024-06-20 14:38:55,539 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.793e+02 1.891e+02 2.058e+02 2.751e+02, threshold=3.782e+02, percent-clipped=0.0 2024-06-20 14:39:13,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. 
limit=6.0 2024-06-20 14:39:15,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=224101.16666666666, ans=0.125 2024-06-20 14:39:23,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224119.5, ans=0.0 2024-06-20 14:39:25,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.58 vs. limit=22.5 2024-06-20 14:39:28,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=224119.5, ans=0.125 2024-06-20 14:39:31,766 INFO [train.py:1028] (0/2) Epoch 13, batch 850, loss[loss=0.2142, simple_loss=0.2595, pruned_loss=0.08448, over 13184.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2752, pruned_loss=0.08907, over 2551997.72 frames. ], batch size: 95, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:39:35,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2024-06-20 14:39:43,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2024-06-20 14:39:53,289 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=5.180e-03 2024-06-20 14:39:54,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224192.83333333334, ans=0.0 2024-06-20 14:39:56,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=224192.83333333334, ans=0.125 2024-06-20 14:40:10,791 INFO [train.py:1028] (0/2) Epoch 13, batch 900, loss[loss=0.2414, simple_loss=0.2858, pruned_loss=0.09848, over 12853.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2754, pruned_loss=0.08937, over 2557039.77 frames. ], batch size: 36, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:40:11,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224229.5, ans=0.125 2024-06-20 14:40:15,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.94 vs. limit=15.0 2024-06-20 14:40:21,216 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.843e+02 1.973e+02 2.142e+02 2.751e+02, threshold=3.946e+02, percent-clipped=0.0 2024-06-20 14:40:35,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224284.5, ans=0.1 2024-06-20 14:40:37,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.61 vs. limit=22.5 2024-06-20 14:40:49,650 INFO [train.py:1028] (0/2) Epoch 13, batch 950, loss[loss=0.2379, simple_loss=0.2899, pruned_loss=0.09297, over 12964.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2753, pruned_loss=0.08911, over 2559408.71 frames. 
], batch size: 39, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:40:56,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=224321.16666666666, ans=0.125 2024-06-20 14:40:56,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=224339.5, ans=0.2 2024-06-20 14:40:58,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=224339.5, ans=0.125 2024-06-20 14:41:06,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=224357.83333333334, ans=0.1 2024-06-20 14:41:08,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=224357.83333333334, ans=0.0 2024-06-20 14:41:12,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=224376.16666666666, ans=0.125 2024-06-20 14:41:12,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.31 vs. limit=15.0 2024-06-20 14:41:27,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=224394.5, ans=0.125 2024-06-20 14:41:31,801 INFO [train.py:1028] (0/2) Epoch 13, batch 1000, loss[loss=0.236, simple_loss=0.2872, pruned_loss=0.09236, over 13078.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.276, pruned_loss=0.09006, over 2560778.84 frames. ], batch size: 48, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:41:40,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2024-06-20 14:41:44,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224431.16666666666, ans=0.125 2024-06-20 14:41:45,027 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.823e+02 1.926e+02 2.080e+02 3.009e+02, threshold=3.852e+02, percent-clipped=0.0 2024-06-20 14:41:45,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=224431.16666666666, ans=0.0 2024-06-20 14:41:47,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=224431.16666666666, ans=0.0 2024-06-20 14:42:09,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=224486.16666666666, ans=0.0 2024-06-20 14:42:14,180 INFO [train.py:1028] (0/2) Epoch 13, batch 1050, loss[loss=0.2193, simple_loss=0.2711, pruned_loss=0.08374, over 13126.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.277, pruned_loss=0.09043, over 2564313.23 frames. ], batch size: 77, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:42:16,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=224504.5, ans=0.125 2024-06-20 14:42:21,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.93 vs. 
limit=22.5 2024-06-20 14:42:29,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224541.16666666666, ans=0.1 2024-06-20 14:42:29,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=224541.16666666666, ans=0.05 2024-06-20 14:42:36,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=224559.5, ans=0.125 2024-06-20 14:42:44,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2024-06-20 14:42:48,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=224577.83333333334, ans=0.0 2024-06-20 14:42:51,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=224577.83333333334, ans=0.0 2024-06-20 14:42:53,698 INFO [train.py:1028] (0/2) Epoch 13, batch 1100, loss[loss=0.2112, simple_loss=0.2695, pruned_loss=0.07643, over 13239.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2773, pruned_loss=0.0906, over 2568944.52 frames. ], batch size: 52, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:42:57,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.42 vs. limit=15.0 2024-06-20 14:43:00,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.91 vs. limit=15.0 2024-06-20 14:43:03,537 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.862e+02 2.001e+02 2.185e+02 2.725e+02, threshold=4.002e+02, percent-clipped=0.0 2024-06-20 14:43:04,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=224614.5, ans=0.0 2024-06-20 14:43:15,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=224632.83333333334, ans=10.0 2024-06-20 14:43:29,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=224669.5, ans=0.05 2024-06-20 14:43:32,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=224687.83333333334, ans=0.125 2024-06-20 14:43:32,522 INFO [train.py:1028] (0/2) Epoch 13, batch 1150, loss[loss=0.229, simple_loss=0.281, pruned_loss=0.08857, over 13312.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2768, pruned_loss=0.09061, over 2571106.68 frames. ], batch size: 52, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:43:42,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=224687.83333333334, ans=0.125 2024-06-20 14:43:57,889 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:44:17,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.30 vs. 
limit=15.0 2024-06-20 14:44:18,089 INFO [train.py:1028] (0/2) Epoch 13, batch 1200, loss[loss=0.2094, simple_loss=0.2657, pruned_loss=0.07657, over 13210.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2766, pruned_loss=0.09059, over 2573701.66 frames. ], batch size: 77, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:44:19,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224779.5, ans=0.1 2024-06-20 14:44:22,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=224779.5, ans=0.125 2024-06-20 14:44:27,762 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.855e+02 2.025e+02 2.211e+02 2.856e+02, threshold=4.050e+02, percent-clipped=0.0 2024-06-20 14:44:30,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=15.0 2024-06-20 14:44:42,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=224834.5, ans=0.125 2024-06-20 14:44:50,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=224852.83333333334, ans=0.125 2024-06-20 14:44:56,136 INFO [train.py:1028] (0/2) Epoch 13, batch 1250, loss[loss=0.2061, simple_loss=0.2532, pruned_loss=0.07948, over 13135.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2763, pruned_loss=0.09023, over 2583222.92 frames. ], batch size: 112, lr: 4.58e-03, grad_scale: 128.0 2024-06-20 14:45:00,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.85 vs. limit=10.0 2024-06-20 14:45:00,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.33 vs. limit=22.5 2024-06-20 14:45:07,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=224889.5, ans=0.125 2024-06-20 14:45:15,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224907.83333333334, ans=0.1 2024-06-20 14:45:23,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=224926.16666666666, ans=0.125 2024-06-20 14:45:23,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=224926.16666666666, ans=0.025 2024-06-20 14:45:27,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=224944.5, ans=0.125 2024-06-20 14:45:28,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=224944.5, ans=0.05 2024-06-20 14:45:35,675 INFO [train.py:1028] (0/2) Epoch 13, batch 1300, loss[loss=0.2485, simple_loss=0.2873, pruned_loss=0.1048, over 12741.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2767, pruned_loss=0.0906, over 2584089.12 frames. 
], batch size: 176, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:45:44,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224981.16666666666, ans=0.1 2024-06-20 14:45:45,117 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.819e+02 1.917e+02 2.112e+02 2.858e+02, threshold=3.835e+02, percent-clipped=0.0 2024-06-20 14:45:46,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224981.16666666666, ans=0.1 2024-06-20 14:46:06,673 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.14 vs. limit=22.5 2024-06-20 14:46:16,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.75 vs. limit=15.0 2024-06-20 14:46:17,955 INFO [train.py:1028] (0/2) Epoch 13, batch 1350, loss[loss=0.2208, simple_loss=0.2746, pruned_loss=0.08354, over 13219.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2768, pruned_loss=0.09045, over 2585964.20 frames. ], batch size: 59, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:46:26,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=225072.83333333334, ans=0.0 2024-06-20 14:46:33,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=225072.83333333334, ans=0.025 2024-06-20 14:46:34,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=225072.83333333334, ans=15.0 2024-06-20 14:46:51,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=225109.5, ans=0.0 2024-06-20 14:46:58,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=225127.83333333334, ans=0.125 2024-06-20 14:47:00,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=225146.16666666666, ans=0.0 2024-06-20 14:47:00,952 INFO [train.py:1028] (0/2) Epoch 13, batch 1400, loss[loss=0.2152, simple_loss=0.2646, pruned_loss=0.08291, over 12913.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2764, pruned_loss=0.09042, over 2587391.56 frames. ], batch size: 26, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:47:01,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.41 vs. limit=10.0 2024-06-20 14:47:07,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=225146.16666666666, ans=0.125 2024-06-20 14:47:10,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.858e+02 1.963e+02 2.070e+02 2.904e+02, threshold=3.927e+02, percent-clipped=0.0 2024-06-20 14:47:15,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=225182.83333333334, ans=0.125 2024-06-20 14:47:19,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.78 vs. 
limit=15.0 2024-06-20 14:47:25,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225201.16666666666, ans=0.1 2024-06-20 14:47:26,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=225201.16666666666, ans=0.0 2024-06-20 14:47:26,704 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:47:29,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=12.0 2024-06-20 14:47:34,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=225219.5, ans=0.05 2024-06-20 14:47:35,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=225219.5, ans=0.0 2024-06-20 14:47:38,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=225237.83333333334, ans=0.125 2024-06-20 14:47:39,411 INFO [train.py:1028] (0/2) Epoch 13, batch 1450, loss[loss=0.2179, simple_loss=0.2665, pruned_loss=0.08463, over 13075.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2764, pruned_loss=0.09048, over 2587162.46 frames. ], batch size: 121, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:48:03,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=225292.83333333334, ans=0.05 2024-06-20 14:48:07,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=225292.83333333334, ans=0.125 2024-06-20 14:48:18,206 INFO [train.py:1028] (0/2) Epoch 13, batch 1500, loss[loss=0.2636, simple_loss=0.3036, pruned_loss=0.1118, over 13217.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2762, pruned_loss=0.09021, over 2589987.25 frames. ], batch size: 83, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:48:31,439 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.586e+02 1.867e+02 2.008e+02 2.156e+02 2.806e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 14:48:37,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2024-06-20 14:48:43,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=225366.16666666666, ans=0.0 2024-06-20 14:48:47,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=225384.5, ans=0.0 2024-06-20 14:48:51,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=225384.5, ans=0.125 2024-06-20 14:48:59,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225402.83333333334, ans=0.125 2024-06-20 14:49:01,333 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:49:03,103 INFO [train.py:1028] (0/2) Epoch 13, batch 1550, loss[loss=0.233, simple_loss=0.2728, pruned_loss=0.09658, over 13012.00 frames. 
], tot_loss[loss=0.2284, simple_loss=0.2761, pruned_loss=0.09032, over 2585207.38 frames. ], batch size: 102, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:49:06,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=225421.16666666666, ans=0.2 2024-06-20 14:49:12,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=225439.5, ans=0.0 2024-06-20 14:49:15,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.54 vs. limit=15.0 2024-06-20 14:49:21,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=225457.83333333334, ans=0.0 2024-06-20 14:49:29,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=225476.16666666666, ans=0.1 2024-06-20 14:49:36,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=225494.5, ans=0.0 2024-06-20 14:49:36,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=225494.5, ans=0.125 2024-06-20 14:49:42,170 INFO [train.py:1028] (0/2) Epoch 13, batch 1600, loss[loss=0.2138, simple_loss=0.2615, pruned_loss=0.08308, over 13196.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2763, pruned_loss=0.09034, over 2580886.50 frames. ], batch size: 77, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:49:47,124 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:49:52,034 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.814e+02 1.897e+02 2.102e+02 2.746e+02, threshold=3.794e+02, percent-clipped=0.0 2024-06-20 14:49:58,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=225549.5, ans=0.125 2024-06-20 14:50:09,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=225567.83333333334, ans=0.0 2024-06-20 14:50:15,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=225586.16666666666, ans=0.1 2024-06-20 14:50:20,217 INFO [train.py:1028] (0/2) Epoch 13, batch 1650, loss[loss=0.2309, simple_loss=0.2686, pruned_loss=0.09657, over 13174.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2768, pruned_loss=0.09092, over 2577398.40 frames. ], batch size: 95, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:50:26,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=225604.5, ans=0.5 2024-06-20 14:50:37,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.34 vs. limit=10.0 2024-06-20 14:50:44,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=15.0 2024-06-20 14:51:03,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.70 vs. 
limit=15.0 2024-06-20 14:51:04,155 INFO [train.py:1028] (0/2) Epoch 13, batch 1700, loss[loss=0.2269, simple_loss=0.2746, pruned_loss=0.08962, over 12919.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2772, pruned_loss=0.09103, over 2582516.67 frames. ], batch size: 26, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:51:17,413 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.830e+02 1.938e+02 2.073e+02 2.988e+02, threshold=3.876e+02, percent-clipped=0.0 2024-06-20 14:51:28,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225732.83333333334, ans=0.1 2024-06-20 14:51:32,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=225751.16666666666, ans=0.125 2024-06-20 14:51:36,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.04 vs. limit=22.5 2024-06-20 14:51:39,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=225769.5, ans=0.025 2024-06-20 14:51:45,708 INFO [train.py:1028] (0/2) Epoch 13, batch 1750, loss[loss=0.2246, simple_loss=0.2807, pruned_loss=0.0843, over 12479.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2774, pruned_loss=0.09083, over 2582685.93 frames. ], batch size: 22, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:51:50,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=225787.83333333334, ans=0.125 2024-06-20 14:52:06,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=225824.5, ans=0.125 2024-06-20 14:52:07,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225824.5, ans=0.125 2024-06-20 14:52:10,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=225842.83333333334, ans=0.125 2024-06-20 14:52:25,292 INFO [train.py:1028] (0/2) Epoch 13, batch 1800, loss[loss=0.2362, simple_loss=0.2912, pruned_loss=0.09058, over 13223.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2772, pruned_loss=0.09051, over 2582531.51 frames. ], batch size: 67, lr: 4.57e-03, grad_scale: 128.0 2024-06-20 14:52:33,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225897.83333333334, ans=0.125 2024-06-20 14:52:35,053 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.839e+02 1.985e+02 2.144e+02 2.950e+02, threshold=3.970e+02, percent-clipped=0.0 2024-06-20 14:52:36,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2024-06-20 14:53:04,267 INFO [train.py:1028] (0/2) Epoch 13, batch 1850, loss[loss=0.2109, simple_loss=0.2593, pruned_loss=0.08126, over 13184.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2773, pruned_loss=0.0906, over 2583772.07 frames. 
], batch size: 83, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:53:05,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=225971.16666666666, ans=0.125 2024-06-20 14:53:07,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.70 vs. limit=6.0 2024-06-20 14:53:08,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=225971.16666666666, ans=0.125 2024-06-20 14:53:24,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2024-06-20 14:53:27,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226007.83333333334, ans=0.1 2024-06-20 14:53:40,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=226044.5, ans=0.125 2024-06-20 14:53:49,568 INFO [train.py:1028] (0/2) Epoch 13, batch 1900, loss[loss=0.216, simple_loss=0.262, pruned_loss=0.08504, over 13182.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2772, pruned_loss=0.09081, over 2585860.58 frames. ], batch size: 95, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:53:59,772 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.879e+02 2.021e+02 2.157e+02 2.869e+02, threshold=4.043e+02, percent-clipped=0.0 2024-06-20 14:54:07,581 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0 2024-06-20 14:54:13,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=226117.83333333334, ans=0.0 2024-06-20 14:54:22,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=226136.16666666666, ans=0.025 2024-06-20 14:54:23,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=226136.16666666666, ans=0.125 2024-06-20 14:54:25,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.75 vs. limit=15.0 2024-06-20 14:54:28,599 INFO [train.py:1028] (0/2) Epoch 13, batch 1950, loss[loss=0.2114, simple_loss=0.266, pruned_loss=0.07844, over 13260.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2769, pruned_loss=0.09096, over 2592521.54 frames. 
], batch size: 52, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:54:29,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=226154.5, ans=0.125 2024-06-20 14:54:38,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=226172.83333333334, ans=0.0 2024-06-20 14:54:39,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226172.83333333334, ans=0.1 2024-06-20 14:54:39,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=226172.83333333334, ans=0.125 2024-06-20 14:54:49,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=226191.16666666666, ans=0.0 2024-06-20 14:54:56,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=22.5 2024-06-20 14:55:01,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=226227.83333333334, ans=0.0 2024-06-20 14:55:06,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=15.0 2024-06-20 14:55:07,601 INFO [train.py:1028] (0/2) Epoch 13, batch 2000, loss[loss=0.2306, simple_loss=0.2798, pruned_loss=0.09071, over 12751.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.276, pruned_loss=0.09046, over 2588985.43 frames. ], batch size: 22, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:55:17,570 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.838e+02 1.965e+02 2.115e+02 2.659e+02, threshold=3.930e+02, percent-clipped=0.0 2024-06-20 14:55:36,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=226301.16666666666, ans=0.0 2024-06-20 14:55:41,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=226319.5, ans=0.2 2024-06-20 14:55:47,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=226319.5, ans=0.125 2024-06-20 14:55:53,113 INFO [train.py:1028] (0/2) Epoch 13, batch 2050, loss[loss=0.2132, simple_loss=0.2656, pruned_loss=0.08041, over 12612.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2764, pruned_loss=0.09068, over 2583801.04 frames. ], batch size: 29, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:55:55,291 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.84 vs. limit=22.5 2024-06-20 14:56:00,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.85 vs. limit=6.0 2024-06-20 14:56:06,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. 
limit=15.0 2024-06-20 14:56:13,994 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 14:56:17,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=226392.83333333334, ans=0.0 2024-06-20 14:56:32,428 INFO [train.py:1028] (0/2) Epoch 13, batch 2100, loss[loss=0.2168, simple_loss=0.2708, pruned_loss=0.08143, over 13223.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2768, pruned_loss=0.09041, over 2586420.94 frames. ], batch size: 59, lr: 4.56e-03, grad_scale: 256.0 2024-06-20 14:56:37,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.19 vs. limit=15.0 2024-06-20 14:56:41,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=226447.83333333334, ans=0.025 2024-06-20 14:56:42,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.875e+02 2.043e+02 2.266e+02 2.728e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 14:56:58,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=226484.5, ans=0.0 2024-06-20 14:57:05,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=226502.83333333334, ans=0.2 2024-06-20 14:57:11,450 INFO [train.py:1028] (0/2) Epoch 13, batch 2150, loss[loss=0.2213, simple_loss=0.2781, pruned_loss=0.08225, over 13239.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2765, pruned_loss=0.09008, over 2588997.93 frames. ], batch size: 52, lr: 4.56e-03, grad_scale: 256.0 2024-06-20 14:57:23,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=226539.5, ans=0.125 2024-06-20 14:57:29,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=226557.83333333334, ans=0.125 2024-06-20 14:57:41,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=226576.16666666666, ans=0.125 2024-06-20 14:57:41,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-20 14:57:47,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=226594.5, ans=0.0 2024-06-20 14:57:49,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=226594.5, ans=0.125 2024-06-20 14:57:51,666 INFO [train.py:1028] (0/2) Epoch 13, batch 2200, loss[loss=0.2271, simple_loss=0.2767, pruned_loss=0.08873, over 13208.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2771, pruned_loss=0.09051, over 2588485.96 frames. 
], batch size: 83, lr: 4.56e-03, grad_scale: 256.0 2024-06-20 14:57:57,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=226612.83333333334, ans=0.05 2024-06-20 14:58:02,081 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.836e+02 1.991e+02 2.160e+02 3.169e+02, threshold=3.983e+02, percent-clipped=0.0 2024-06-20 14:58:13,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=226649.5, ans=0.125 2024-06-20 14:58:21,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=226667.83333333334, ans=0.125 2024-06-20 14:58:38,464 INFO [train.py:1028] (0/2) Epoch 13, batch 2250, loss[loss=0.2281, simple_loss=0.28, pruned_loss=0.0881, over 13282.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2768, pruned_loss=0.09031, over 2587802.82 frames. ], batch size: 63, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:58:45,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.93 vs. limit=15.0 2024-06-20 14:58:52,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=226722.83333333334, ans=0.025 2024-06-20 14:59:15,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=226777.83333333334, ans=0.125 2024-06-20 14:59:15,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=226777.83333333334, ans=0.125 2024-06-20 14:59:17,235 INFO [train.py:1028] (0/2) Epoch 13, batch 2300, loss[loss=0.2129, simple_loss=0.2642, pruned_loss=0.08082, over 12861.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2774, pruned_loss=0.09052, over 2581659.84 frames. ], batch size: 33, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 14:59:18,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=226796.16666666666, ans=0.0 2024-06-20 14:59:20,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.55 vs. limit=12.0 2024-06-20 14:59:21,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=226796.16666666666, ans=0.1 2024-06-20 14:59:24,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. 
limit=6.0 2024-06-20 14:59:28,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.856e+02 2.002e+02 2.238e+02 2.871e+02, threshold=4.004e+02, percent-clipped=0.0 2024-06-20 14:59:35,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=226832.83333333334, ans=0.2 2024-06-20 14:59:37,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=226832.83333333334, ans=0.0 2024-06-20 14:59:49,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=226869.5, ans=0.2 2024-06-20 14:59:49,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.91 vs. limit=10.0 2024-06-20 14:59:51,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=226869.5, ans=0.2 2024-06-20 14:59:52,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226869.5, ans=0.1 2024-06-20 14:59:56,515 INFO [train.py:1028] (0/2) Epoch 13, batch 2350, loss[loss=0.2113, simple_loss=0.2647, pruned_loss=0.07895, over 13204.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2772, pruned_loss=0.09055, over 2585488.21 frames. ], batch size: 67, lr: 4.56e-03, grad_scale: 128.0 2024-06-20 15:00:04,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226906.16666666666, ans=0.125 2024-06-20 15:00:17,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=226924.5, ans=0.125 2024-06-20 15:00:35,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=226961.16666666666, ans=0.0 2024-06-20 15:00:39,797 INFO [train.py:1028] (0/2) Epoch 13, batch 2400, loss[loss=0.2136, simple_loss=0.2647, pruned_loss=0.08125, over 13252.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2761, pruned_loss=0.09009, over 2588309.70 frames. ], batch size: 46, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:00:40,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=226979.5, ans=0.07 2024-06-20 15:00:49,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=226979.5, ans=0.0 2024-06-20 15:00:53,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=226997.83333333334, ans=0.2 2024-06-20 15:00:54,319 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.862e+02 1.969e+02 2.205e+02 2.678e+02, threshold=3.939e+02, percent-clipped=0.0 2024-06-20 15:00:57,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=226997.83333333334, ans=0.2 2024-06-20 15:01:03,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=227016.16666666666, ans=0.125 2024-06-20 15:01:07,587 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.11 vs. 
limit=15.0 2024-06-20 15:01:10,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.97 vs. limit=22.5 2024-06-20 15:01:17,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=227052.83333333334, ans=0.125 2024-06-20 15:01:21,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2024-06-20 15:01:22,138 INFO [train.py:1028] (0/2) Epoch 13, batch 2450, loss[loss=0.2153, simple_loss=0.2667, pruned_loss=0.08196, over 13312.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2753, pruned_loss=0.09031, over 2584847.15 frames. ], batch size: 63, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:01:23,138 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.139e+01 2024-06-20 15:01:35,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=227089.5, ans=0.125 2024-06-20 15:01:36,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=227107.83333333334, ans=0.125 2024-06-20 15:01:41,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=227107.83333333334, ans=0.95 2024-06-20 15:01:59,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=227144.5, ans=0.2 2024-06-20 15:02:00,930 INFO [train.py:1028] (0/2) Epoch 13, batch 2500, loss[loss=0.2191, simple_loss=0.2643, pruned_loss=0.08695, over 13189.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2737, pruned_loss=0.08978, over 2588880.97 frames. 
], batch size: 83, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:02:01,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=227162.83333333334, ans=0.125 2024-06-20 15:02:07,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=227162.83333333334, ans=0.125 2024-06-20 15:02:10,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=227181.16666666666, ans=0.125 2024-06-20 15:02:11,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=227181.16666666666, ans=0.025 2024-06-20 15:02:12,247 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 1.797e+02 1.907e+02 2.135e+02 3.272e+02, threshold=3.814e+02, percent-clipped=0.0 2024-06-20 15:02:14,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=227181.16666666666, ans=0.5 2024-06-20 15:02:22,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=227199.5, ans=0.125 2024-06-20 15:02:25,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=227217.83333333334, ans=0.025 2024-06-20 15:02:36,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=227236.16666666666, ans=0.125 2024-06-20 15:02:40,436 INFO [train.py:1028] (0/2) Epoch 13, batch 2550, loss[loss=0.2384, simple_loss=0.287, pruned_loss=0.09487, over 12621.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2731, pruned_loss=0.08975, over 2588242.66 frames. ], batch size: 22, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:02:45,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=227254.5, ans=0.125 2024-06-20 15:02:46,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=227254.5, ans=0.125 2024-06-20 15:03:09,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=227309.5, ans=0.0 2024-06-20 15:03:11,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.73 vs. limit=10.0 2024-06-20 15:03:19,523 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-124000.pt 2024-06-20 15:03:29,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.41 vs. limit=22.5 2024-06-20 15:03:31,741 INFO [train.py:1028] (0/2) Epoch 13, batch 2600, loss[loss=0.2116, simple_loss=0.2745, pruned_loss=0.07437, over 13309.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2718, pruned_loss=0.08886, over 2588185.53 frames. ], batch size: 52, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:03:33,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.30 vs. 
limit=15.0 2024-06-20 15:03:36,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=227346.16666666666, ans=0.125 2024-06-20 15:03:40,872 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2024-06-20 15:03:42,720 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.653e+02 1.910e+02 2.041e+02 2.210e+02 2.784e+02, threshold=4.081e+02, percent-clipped=0.0 2024-06-20 15:03:43,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=227364.5, ans=0.0 2024-06-20 15:03:51,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.91 vs. limit=6.0 2024-06-20 15:03:51,092 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.69 vs. limit=22.5 2024-06-20 15:04:00,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=227401.16666666666, ans=0.5 2024-06-20 15:04:06,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2024-06-20 15:04:10,021 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.44 vs. limit=10.0 2024-06-20 15:04:11,134 INFO [train.py:1028] (0/2) Epoch 13, batch 2650, loss[loss=0.2211, simple_loss=0.2625, pruned_loss=0.08984, over 13028.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2708, pruned_loss=0.08862, over 2588778.92 frames. ], batch size: 144, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:04:17,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=227456.16666666666, ans=0.125 2024-06-20 15:04:23,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=227456.16666666666, ans=0.1 2024-06-20 15:04:26,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.22 vs. limit=22.5 2024-06-20 15:04:28,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=227474.5, ans=0.0 2024-06-20 15:04:41,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=227492.83333333334, ans=0.125 2024-06-20 15:04:49,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=227511.16666666666, ans=0.125 2024-06-20 15:04:50,529 INFO [train.py:1028] (0/2) Epoch 13, batch 2700, loss[loss=0.2093, simple_loss=0.2578, pruned_loss=0.08041, over 13294.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2696, pruned_loss=0.08854, over 2586291.77 frames. 
], batch size: 89, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:04:50,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=227529.5, ans=0.0 2024-06-20 15:04:52,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.65 vs. limit=15.0 2024-06-20 15:04:54,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=227529.5, ans=0.07 2024-06-20 15:04:57,438 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.42 vs. limit=5.0 2024-06-20 15:05:01,439 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.895e+02 2.106e+02 2.323e+02 3.880e+02, threshold=4.212e+02, percent-clipped=0.0 2024-06-20 15:05:04,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=227547.83333333334, ans=0.2 2024-06-20 15:05:20,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=227584.5, ans=0.0 2024-06-20 15:05:24,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.85 vs. limit=15.0 2024-06-20 15:05:36,780 INFO [train.py:1028] (0/2) Epoch 13, batch 2750, loss[loss=0.2017, simple_loss=0.2518, pruned_loss=0.07583, over 13252.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2683, pruned_loss=0.08756, over 2581679.78 frames. ], batch size: 43, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:05:40,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=227621.16666666666, ans=0.125 2024-06-20 15:05:49,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=227639.5, ans=0.2 2024-06-20 15:05:56,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=227657.83333333334, ans=0.2 2024-06-20 15:06:10,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=227694.5, ans=0.2 2024-06-20 15:06:10,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=227694.5, ans=0.09899494936611666 2024-06-20 15:06:14,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.59 vs. limit=22.5 2024-06-20 15:06:17,346 INFO [train.py:1028] (0/2) Epoch 13, batch 2800, loss[loss=0.2163, simple_loss=0.2506, pruned_loss=0.09096, over 10745.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2675, pruned_loss=0.08745, over 2579139.51 frames. ], batch size: 304, lr: 4.55e-03, grad_scale: 128.0 2024-06-20 15:06:19,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.41 vs. 
limit=22.5 2024-06-20 15:06:23,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=227712.83333333334, ans=0.1 2024-06-20 15:06:28,585 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.818e+02 1.926e+02 2.112e+02 2.648e+02, threshold=3.852e+02, percent-clipped=0.0 2024-06-20 15:06:35,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=227749.5, ans=0.0 2024-06-20 15:06:45,226 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:06:51,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=227786.16666666666, ans=0.0 2024-06-20 15:06:56,942 INFO [train.py:1028] (0/2) Epoch 13, batch 2850, loss[loss=0.2221, simple_loss=0.2622, pruned_loss=0.09105, over 13264.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2668, pruned_loss=0.08736, over 2577378.01 frames. ], batch size: 49, lr: 4.55e-03, grad_scale: 64.0 2024-06-20 15:06:59,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=227804.5, ans=0.125 2024-06-20 15:07:01,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=227804.5, ans=0.125 2024-06-20 15:07:02,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=227804.5, ans=15.0 2024-06-20 15:07:05,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=227822.83333333334, ans=0.125 2024-06-20 15:07:09,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=227822.83333333334, ans=0.0 2024-06-20 15:07:15,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.57 vs. limit=12.0 2024-06-20 15:07:15,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=227841.16666666666, ans=0.125 2024-06-20 15:07:15,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=227841.16666666666, ans=0.125 2024-06-20 15:07:42,227 INFO [train.py:1028] (0/2) Epoch 13, batch 2900, loss[loss=0.2178, simple_loss=0.2661, pruned_loss=0.08471, over 13101.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2653, pruned_loss=0.08688, over 2585843.88 frames. ], batch size: 55, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:07:54,195 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.790e+02 1.893e+02 2.052e+02 3.270e+02, threshold=3.787e+02, percent-clipped=0.0 2024-06-20 15:08:11,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=227951.16666666666, ans=0.125 2024-06-20 15:08:15,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.62 vs. limit=22.5 2024-06-20 15:08:21,302 INFO [train.py:1028] (0/2) Epoch 13, batch 2950, loss[loss=0.2177, simple_loss=0.2651, pruned_loss=0.08511, over 13244.00 frames. 
], tot_loss[loss=0.2196, simple_loss=0.2652, pruned_loss=0.08698, over 2579283.29 frames. ], batch size: 43, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:08:28,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=228006.16666666666, ans=0.025 2024-06-20 15:08:30,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=228006.16666666666, ans=0.125 2024-06-20 15:08:35,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2024-06-20 15:08:38,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=228024.5, ans=0.125 2024-06-20 15:08:50,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228042.83333333334, ans=0.0 2024-06-20 15:08:55,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228061.16666666666, ans=0.1 2024-06-20 15:08:57,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228061.16666666666, ans=0.1 2024-06-20 15:09:02,005 INFO [train.py:1028] (0/2) Epoch 13, batch 3000, loss[loss=0.216, simple_loss=0.2653, pruned_loss=0.08338, over 13223.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2643, pruned_loss=0.08677, over 2577540.56 frames. ], batch size: 59, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:09:02,006 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 15:09:08,327 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.0418, 3.6332, 3.9596, 3.6880], device='cuda:0') 2024-06-20 15:09:11,010 INFO [train.py:1060] (0/2) Epoch 13, validation: loss=0.192, simple_loss=0.2563, pruned_loss=0.06384, over 351949.00 frames. 2024-06-20 15:09:11,011 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 15:09:14,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=228079.5, ans=0.125 2024-06-20 15:09:22,990 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 1.809e+02 1.925e+02 2.100e+02 4.409e+02, threshold=3.851e+02, percent-clipped=1.0 2024-06-20 15:09:32,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=228116.16666666666, ans=0.125 2024-06-20 15:09:38,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=228134.5, ans=0.0 2024-06-20 15:09:43,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=228134.5, ans=0.125 2024-06-20 15:09:43,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.05 vs. 
limit=15.0 2024-06-20 15:09:46,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=228152.83333333334, ans=0.125 2024-06-20 15:09:52,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=228152.83333333334, ans=0.1 2024-06-20 15:09:54,268 INFO [train.py:1028] (0/2) Epoch 13, batch 3050, loss[loss=0.1955, simple_loss=0.2441, pruned_loss=0.07352, over 13259.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2631, pruned_loss=0.08639, over 2577316.42 frames. ], batch size: 46, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:09:54,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=228171.16666666666, ans=0.5 2024-06-20 15:10:10,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.43 vs. limit=15.0 2024-06-20 15:10:27,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.13 vs. limit=12.0 2024-06-20 15:10:33,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=15.0 2024-06-20 15:10:34,028 INFO [train.py:1028] (0/2) Epoch 13, batch 3100, loss[loss=0.2054, simple_loss=0.2497, pruned_loss=0.08051, over 13045.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2615, pruned_loss=0.08537, over 2578503.37 frames. ], batch size: 144, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:10:34,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228262.83333333334, ans=0.1 2024-06-20 15:10:37,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228262.83333333334, ans=0.1 2024-06-20 15:10:46,084 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.836e+02 1.964e+02 2.137e+02 3.172e+02, threshold=3.928e+02, percent-clipped=0.0 2024-06-20 15:10:46,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=228281.16666666666, ans=0.125 2024-06-20 15:10:48,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=228281.16666666666, ans=0.025 2024-06-20 15:10:52,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=228299.5, ans=0.1 2024-06-20 15:10:55,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=228299.5, ans=0.0 2024-06-20 15:11:00,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=228317.83333333334, ans=0.0 2024-06-20 15:11:02,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228317.83333333334, ans=0.1 2024-06-20 15:11:07,964 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:11:13,172 INFO [train.py:1028] (0/2) Epoch 13, batch 3150, loss[loss=0.201, simple_loss=0.2439, pruned_loss=0.07908, over 12930.00 frames. 
], tot_loss[loss=0.2149, simple_loss=0.2603, pruned_loss=0.08474, over 2580504.31 frames. ], batch size: 158, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:11:16,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=228354.5, ans=0.125 2024-06-20 15:11:31,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.74 vs. limit=12.0 2024-06-20 15:11:33,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=15.0 2024-06-20 15:11:33,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.68 vs. limit=22.5 2024-06-20 15:11:40,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=228409.5, ans=0.0 2024-06-20 15:11:40,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.17 vs. limit=22.5 2024-06-20 15:11:50,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=228427.83333333334, ans=0.0 2024-06-20 15:11:53,841 INFO [train.py:1028] (0/2) Epoch 13, batch 3200, loss[loss=0.2073, simple_loss=0.259, pruned_loss=0.07776, over 13210.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2594, pruned_loss=0.08441, over 2581854.93 frames. ], batch size: 55, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:12:00,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=228446.16666666666, ans=0.025 2024-06-20 15:12:05,392 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.803e+02 1.967e+02 2.199e+02 2.722e+02, threshold=3.934e+02, percent-clipped=0.0 2024-06-20 15:12:28,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=228501.16666666666, ans=0.0 2024-06-20 15:12:32,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=228519.5, ans=0.125 2024-06-20 15:12:37,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=15.0 2024-06-20 15:12:39,100 INFO [train.py:1028] (0/2) Epoch 13, batch 3250, loss[loss=0.2012, simple_loss=0.2434, pruned_loss=0.07947, over 13198.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2593, pruned_loss=0.08448, over 2585655.43 frames. 
], batch size: 72, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:12:39,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=228537.83333333334, ans=0.0 2024-06-20 15:13:00,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=228574.5, ans=0.0 2024-06-20 15:13:04,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228592.83333333334, ans=0.1 2024-06-20 15:13:06,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=228592.83333333334, ans=0.035 2024-06-20 15:13:06,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=228592.83333333334, ans=0.125 2024-06-20 15:13:08,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228592.83333333334, ans=0.125 2024-06-20 15:13:13,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=228611.16666666666, ans=0.025 2024-06-20 15:13:14,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.28 vs. limit=15.0 2024-06-20 15:13:19,484 INFO [train.py:1028] (0/2) Epoch 13, batch 3300, loss[loss=0.2521, simple_loss=0.2884, pruned_loss=0.1079, over 12698.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2587, pruned_loss=0.0842, over 2582509.73 frames. ], batch size: 176, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:13:30,953 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.557e+02 1.829e+02 2.005e+02 2.309e+02 2.933e+02, threshold=4.010e+02, percent-clipped=0.0 2024-06-20 15:13:43,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=228684.5, ans=0.04949747468305833 2024-06-20 15:13:44,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=228684.5, ans=0.09899494936611666 2024-06-20 15:13:49,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=228684.5, ans=0.125 2024-06-20 15:13:58,053 INFO [train.py:1028] (0/2) Epoch 13, batch 3350, loss[loss=0.2255, simple_loss=0.2607, pruned_loss=0.09517, over 12950.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2589, pruned_loss=0.08483, over 2577741.18 frames. ], batch size: 158, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:14:20,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=228757.83333333334, ans=0.125 2024-06-20 15:14:39,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.03 vs. 
limit=6.0 2024-06-20 15:14:39,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=228794.5, ans=0.0 2024-06-20 15:14:39,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228794.5, ans=0.0 2024-06-20 15:14:42,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=22.5 2024-06-20 15:14:44,309 INFO [train.py:1028] (0/2) Epoch 13, batch 3400, loss[loss=0.2412, simple_loss=0.2904, pruned_loss=0.09595, over 12560.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2586, pruned_loss=0.08478, over 2575296.78 frames. ], batch size: 22, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:14:44,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=228812.83333333334, ans=0.125 2024-06-20 15:14:55,782 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.787e+02 1.935e+02 2.073e+02 2.749e+02, threshold=3.870e+02, percent-clipped=0.0 2024-06-20 15:15:09,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=228867.83333333334, ans=0.125 2024-06-20 15:15:13,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=228867.83333333334, ans=0.125 2024-06-20 15:15:23,656 INFO [train.py:1028] (0/2) Epoch 13, batch 3450, loss[loss=0.2206, simple_loss=0.262, pruned_loss=0.08965, over 12725.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2579, pruned_loss=0.08436, over 2576922.00 frames. ], batch size: 176, lr: 4.54e-03, grad_scale: 64.0 2024-06-20 15:15:39,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228941.16666666666, ans=0.1 2024-06-20 15:15:45,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=228959.5, ans=0.025 2024-06-20 15:15:58,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=228977.83333333334, ans=0.125 2024-06-20 15:16:01,714 INFO [train.py:1028] (0/2) Epoch 13, batch 3500, loss[loss=0.2263, simple_loss=0.2666, pruned_loss=0.09302, over 12948.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2575, pruned_loss=0.08389, over 2575997.79 frames. ], batch size: 33, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:16:06,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=228996.16666666666, ans=0.0 2024-06-20 15:16:09,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=229014.5, ans=0.2 2024-06-20 15:16:13,394 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.881e+02 2.115e+02 2.331e+02 4.090e+02, threshold=4.229e+02, percent-clipped=1.0 2024-06-20 15:16:17,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=229032.83333333334, ans=0.09899494936611666 2024-06-20 15:16:27,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.58 vs. 
limit=15.0 2024-06-20 15:16:38,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=229069.5, ans=0.0 2024-06-20 15:16:44,246 INFO [train.py:1028] (0/2) Epoch 13, batch 3550, loss[loss=0.1898, simple_loss=0.2387, pruned_loss=0.07043, over 13210.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2574, pruned_loss=0.08389, over 2577239.46 frames. ], batch size: 95, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:16:49,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.10 vs. limit=15.0 2024-06-20 15:16:50,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229087.83333333334, ans=0.1 2024-06-20 15:17:12,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=229142.83333333334, ans=0.125 2024-06-20 15:17:20,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=229161.16666666666, ans=0.04949747468305833 2024-06-20 15:17:21,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=229161.16666666666, ans=0.0 2024-06-20 15:17:26,558 INFO [train.py:1028] (0/2) Epoch 13, batch 3600, loss[loss=0.2271, simple_loss=0.2625, pruned_loss=0.09584, over 13327.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2569, pruned_loss=0.08376, over 2581597.53 frames. ], batch size: 49, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:17:38,329 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.744e+02 1.894e+02 2.214e+02 2.668e+02, threshold=3.788e+02, percent-clipped=0.0 2024-06-20 15:17:59,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229252.83333333334, ans=0.1 2024-06-20 15:18:02,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=229252.83333333334, ans=0.125 2024-06-20 15:18:04,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.84 vs. limit=15.0 2024-06-20 15:18:06,383 INFO [train.py:1028] (0/2) Epoch 13, batch 3650, loss[loss=0.2191, simple_loss=0.2595, pruned_loss=0.08931, over 13002.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2568, pruned_loss=0.0837, over 2579598.38 frames. ], batch size: 102, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:18:11,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=229271.16666666666, ans=0.0 2024-06-20 15:18:13,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229289.5, ans=0.1 2024-06-20 15:18:17,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=229289.5, ans=0.125 2024-06-20 15:18:26,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=229307.83333333334, ans=0.125 2024-06-20 15:18:28,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.35 vs. 
limit=15.0 2024-06-20 15:18:30,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=229326.16666666666, ans=0.0 2024-06-20 15:18:31,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=229326.16666666666, ans=0.2 2024-06-20 15:18:37,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=229344.5, ans=0.1 2024-06-20 15:18:45,548 INFO [train.py:1028] (0/2) Epoch 13, batch 3700, loss[loss=0.2111, simple_loss=0.2589, pruned_loss=0.08163, over 13243.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2565, pruned_loss=0.08365, over 2583992.11 frames. ], batch size: 72, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:18:47,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=229362.83333333334, ans=15.0 2024-06-20 15:18:57,331 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.760e+02 1.879e+02 2.033e+02 2.967e+02, threshold=3.757e+02, percent-clipped=0.0 2024-06-20 15:18:59,288 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:19:01,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.67 vs. limit=15.0 2024-06-20 15:19:02,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=229399.5, ans=0.125 2024-06-20 15:19:09,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=229399.5, ans=0.0 2024-06-20 15:19:12,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=15.0 2024-06-20 15:19:13,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=229417.83333333334, ans=0.125 2024-06-20 15:19:18,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=229417.83333333334, ans=0.2 2024-06-20 15:19:29,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=229436.16666666666, ans=0.0 2024-06-20 15:19:31,747 INFO [train.py:1028] (0/2) Epoch 13, batch 3750, loss[loss=0.234, simple_loss=0.2799, pruned_loss=0.09411, over 12595.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2559, pruned_loss=0.08322, over 2585349.14 frames. 
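Each ScheduledFloat line from scaling.py prints the value (ans) that a named hyperparameter takes at the current batch_count; by this point in training (batch_count around 229k) they have settled at their late-training values (dropout_p 0.1, skip rates 0.0, balancer probs 0.125, const_attention_rate 0.025). A sketch of a piecewise-linear batch-count schedule in that spirit (the class name and the breakpoints below are made up for illustration; the real module's API may differ):

    class PiecewiseLinearFloat:
        """A float that is piecewise-linear in the training batch count.

        Sketch of the idea behind the ScheduledFloat log lines; breakpoints
        here are illustrative, not the recipe's actual schedules.
        """

        def __init__(self, *points):
            # points: (batch_count, value) pairs.
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # E.g. a skip rate that decays 0.5 -> 0.0 over the first 20k batches
    # (assumed breakpoints) has long since reached 0.0 by batch_count ~229k,
    # matching the ans=0.0 entries above.
    skip_rate = PiecewiseLinearFloat((0.0, 0.5), (20000.0, 0.0))
    assert skip_rate.value(229000.0) == 0.0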
], batch size: 22, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:19:34,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=229454.5, ans=0.0 2024-06-20 15:19:35,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=229454.5, ans=0.125 2024-06-20 15:19:38,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=229472.83333333334, ans=0.0 2024-06-20 15:19:45,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=229472.83333333334, ans=0.125 2024-06-20 15:19:53,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2024-06-20 15:19:57,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=229509.5, ans=0.125 2024-06-20 15:20:02,474 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:20:05,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2024-06-20 15:20:09,015 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.61 vs. limit=15.0 2024-06-20 15:20:10,851 INFO [train.py:1028] (0/2) Epoch 13, batch 3800, loss[loss=0.2172, simple_loss=0.2574, pruned_loss=0.08852, over 13165.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2557, pruned_loss=0.08322, over 2583009.32 frames. ], batch size: 83, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:20:21,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2024-06-20 15:20:22,490 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.766e+02 1.889e+02 2.083e+02 2.735e+02, threshold=3.778e+02, percent-clipped=0.0 2024-06-20 15:20:22,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=229564.5, ans=0.0 2024-06-20 15:20:28,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=229582.83333333334, ans=0.125 2024-06-20 15:20:33,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=229601.16666666666, ans=0.2 2024-06-20 15:20:35,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=229601.16666666666, ans=0.0 2024-06-20 15:20:43,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=229619.5, ans=0.0 2024-06-20 15:20:45,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229619.5, ans=0.125 2024-06-20 15:20:49,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=229637.83333333334, ans=0.2 2024-06-20 15:20:50,360 INFO [train.py:1028] (0/2) Epoch 13, batch 3850, loss[loss=0.1969, simple_loss=0.2427, pruned_loss=0.0755, over 13039.00 frames. 
], tot_loss[loss=0.2107, simple_loss=0.2555, pruned_loss=0.08291, over 2581972.78 frames. ], batch size: 144, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:20:56,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229637.83333333334, ans=0.125 2024-06-20 15:21:12,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229692.83333333334, ans=0.1 2024-06-20 15:21:18,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=229692.83333333334, ans=0.1 2024-06-20 15:21:25,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=229711.16666666666, ans=0.125 2024-06-20 15:21:27,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=229729.5, ans=0.0 2024-06-20 15:21:28,030 INFO [train.py:1028] (0/2) Epoch 13, batch 3900, loss[loss=0.209, simple_loss=0.2482, pruned_loss=0.08486, over 13221.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.255, pruned_loss=0.08284, over 2584597.17 frames. ], batch size: 83, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:21:28,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=229729.5, ans=0.04949747468305833 2024-06-20 15:21:28,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=229729.5, ans=0.125 2024-06-20 15:21:32,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.25 vs. limit=10.0 2024-06-20 15:21:43,211 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.782e+02 1.884e+02 1.984e+02 2.506e+02, threshold=3.768e+02, percent-clipped=0.0 2024-06-20 15:21:58,958 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:21:59,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229784.5, ans=0.125 2024-06-20 15:22:05,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=229784.5, ans=0.1 2024-06-20 15:22:09,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=229802.83333333334, ans=0.04949747468305833 2024-06-20 15:22:09,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=229802.83333333334, ans=0.2 2024-06-20 15:22:12,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=229802.83333333334, ans=0.09899494936611666 2024-06-20 15:22:14,824 INFO [train.py:1028] (0/2) Epoch 13, batch 3950, loss[loss=0.2067, simple_loss=0.2553, pruned_loss=0.07903, over 13137.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2535, pruned_loss=0.08169, over 2587308.31 frames. 
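Each loss[...] entry is the loss on the current batch, normalized over that batch's frames, while tot_loss[...] is an aggregate whose frame count hovers around 2.58M instead of growing with the epoch; that is consistent with an exponentially decayed, frame-weighted running average rather than a plain sum. A sketch under that assumption (the decay constant is a guess chosen only to match the observed steady state):

    class RunningFrameLoss:
        """Frame-weighted running average of losses with exponential decay.

        A sketch consistent with tot_loss[...] staying near ~2.58M frames:
        both the weighted loss sum and the frame count decay each batch, so
        the effective window stays bounded. The decay value is an assumption.
        """

        def __init__(self, decay: float = 0.995):
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of loss * frames
            self.frames = 0.0     # decayed sum of frames

        def update(self, loss: float, num_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + loss * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    # Steady state of the decayed frame count is mean_frames / (1 - decay);
    # with ~12.9k frames per batch (the per-batch figures in this excerpt)
    # and decay 0.995 that is ~2.58M, the same order as the
    # 'over 2587308.31 frames' figure above.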
], batch size: 132, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:22:17,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=229821.16666666666, ans=0.125 2024-06-20 15:22:22,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=229839.5, ans=0.09899494936611666 2024-06-20 15:22:32,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2024-06-20 15:22:47,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229894.5, ans=0.1 2024-06-20 15:22:53,046 INFO [train.py:1028] (0/2) Epoch 13, batch 4000, loss[loss=0.1968, simple_loss=0.2455, pruned_loss=0.07401, over 12992.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2533, pruned_loss=0.08179, over 2582486.48 frames. ], batch size: 39, lr: 4.53e-03, grad_scale: 64.0 2024-06-20 15:22:53,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=229912.83333333334, ans=0.2 2024-06-20 15:23:00,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229931.16666666666, ans=0.1 2024-06-20 15:23:04,996 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.798e+02 1.982e+02 2.122e+02 2.925e+02, threshold=3.963e+02, percent-clipped=0.0 2024-06-20 15:23:06,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.03 vs. limit=10.0 2024-06-20 15:23:11,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=229949.5, ans=0.025 2024-06-20 15:23:12,822 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.00 vs. limit=22.5 2024-06-20 15:23:14,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229949.5, ans=0.1 2024-06-20 15:23:14,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=229949.5, ans=0.125 2024-06-20 15:23:33,025 INFO [train.py:1028] (0/2) Epoch 13, batch 4050, loss[loss=0.2162, simple_loss=0.2472, pruned_loss=0.09258, over 10931.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2537, pruned_loss=0.08198, over 2579702.71 frames. ], batch size: 304, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:23:39,596 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:23:42,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.45 vs. limit=22.5 2024-06-20 15:23:45,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. 
limit=6.0 2024-06-20 15:23:49,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=230041.16666666666, ans=0.04949747468305833 2024-06-20 15:24:18,920 INFO [train.py:1028] (0/2) Epoch 13, batch 4100, loss[loss=0.2327, simple_loss=0.2646, pruned_loss=0.1004, over 13031.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2544, pruned_loss=0.0825, over 2576532.29 frames. ], batch size: 102, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:24:23,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.53 vs. limit=22.5 2024-06-20 15:24:30,318 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 1.793e+02 1.970e+02 2.196e+02 2.662e+02, threshold=3.941e+02, percent-clipped=0.0 2024-06-20 15:24:32,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230114.5, ans=0.1 2024-06-20 15:24:47,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.61 vs. limit=22.5 2024-06-20 15:24:48,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=230151.16666666666, ans=0.125 2024-06-20 15:24:50,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2024-06-20 15:24:58,284 INFO [train.py:1028] (0/2) Epoch 13, batch 4150, loss[loss=0.2018, simple_loss=0.2483, pruned_loss=0.07761, over 13155.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2542, pruned_loss=0.08229, over 2574379.42 frames. ], batch size: 55, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:25:12,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=230206.16666666666, ans=0.125 2024-06-20 15:25:28,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=230242.83333333334, ans=0.125 2024-06-20 15:25:33,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230261.16666666666, ans=0.1 2024-06-20 15:25:36,432 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.35 vs. limit=22.5 2024-06-20 15:25:38,316 INFO [train.py:1028] (0/2) Epoch 13, batch 4200, loss[loss=0.2449, simple_loss=0.2804, pruned_loss=0.1047, over 13107.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.254, pruned_loss=0.08225, over 2577407.17 frames. ], batch size: 103, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:25:42,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.67 vs. 
limit=22.5 2024-06-20 15:25:49,956 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.543e+02 1.769e+02 1.944e+02 2.138e+02 2.693e+02, threshold=3.888e+02, percent-clipped=0.0 2024-06-20 15:25:51,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=230297.83333333334, ans=0.125 2024-06-20 15:26:12,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230352.83333333334, ans=0.1 2024-06-20 15:26:17,580 INFO [train.py:1028] (0/2) Epoch 13, batch 4250, loss[loss=0.2046, simple_loss=0.2515, pruned_loss=0.07884, over 13349.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2537, pruned_loss=0.08203, over 2580243.72 frames. ], batch size: 46, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:26:19,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=230371.16666666666, ans=0.0 2024-06-20 15:26:21,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230371.16666666666, ans=0.1 2024-06-20 15:26:29,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=230389.5, ans=0.2 2024-06-20 15:26:42,159 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:26:43,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=230407.83333333334, ans=0.0 2024-06-20 15:26:52,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=230426.16666666666, ans=0.125 2024-06-20 15:27:04,014 INFO [train.py:1028] (0/2) Epoch 13, batch 4300, loss[loss=0.1993, simple_loss=0.243, pruned_loss=0.07776, over 13192.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2533, pruned_loss=0.0818, over 2580179.35 frames. ], batch size: 59, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:27:06,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=230462.83333333334, ans=0.1 2024-06-20 15:27:08,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=230462.83333333334, ans=0.125 2024-06-20 15:27:12,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=230481.16666666666, ans=0.2 2024-06-20 15:27:15,780 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.565e+02 1.746e+02 1.862e+02 2.031e+02 2.664e+02, threshold=3.723e+02, percent-clipped=0.0 2024-06-20 15:27:29,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.04 vs. limit=6.0 2024-06-20 15:27:42,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=230554.5, ans=0.0 2024-06-20 15:27:43,414 INFO [train.py:1028] (0/2) Epoch 13, batch 4350, loss[loss=0.2032, simple_loss=0.2504, pruned_loss=0.078, over 13215.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2523, pruned_loss=0.08143, over 2584638.31 frames. 
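The scaling.py Whitening lines fire when a module's whitening metric exceeds its limit (e.g. metric=22.67 vs. limit=22.5 just above). A natural metric of this kind is mean(eig^2) / mean(eig)^2 of the channel covariance, which equals 1.0 for perfectly white (isotropic) features and grows as a few directions dominate. The sketch below computes it that way via covariance traces; the logged module's exact normalization and its corrective action are assumptions here:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """How far the channel covariance of x is from 'white' (isotropic).

        Returns mean(eig^2) / mean(eig)^2 of the per-group covariance,
        using trace(C)/n = mean(eig) and trace(C @ C)/n = mean(eig^2).
        One plausible form of such a metric, not necessarily the exact one.
        """
        (num_frames, num_channels) = x.shape
        assert num_channels % num_groups == 0
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)
        metrics = []
        for g in range(num_groups):
            xg = x[:, g, :]                    # (frames, channels_per_group)
            cov = (xg.t() @ xg) / num_frames   # channel covariance
            n = cov.shape[0]
            mean_eig = torch.trace(cov) / n            # mean eigenvalue
            mean_eig_sq = torch.trace(cov @ cov) / n   # mean squared eigenvalue
            metrics.append(mean_eig_sq / (mean_eig ** 2 + 1e-20))
        return torch.stack(metrics).mean().item()

    # By Cauchy-Schwarz this ratio is >= 1, with equality only when all
    # eigenvalues are equal; entries like 'metric=22.67 vs. limit=22.5'
    # indicate a module whose activations have drifted far from white.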
], batch size: 59, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:27:46,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.37 vs. limit=15.0 2024-06-20 15:28:01,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=230591.16666666666, ans=0.125 2024-06-20 15:28:06,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2024-06-20 15:28:10,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=230609.5, ans=0.2 2024-06-20 15:28:10,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=230609.5, ans=0.0 2024-06-20 15:28:11,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.08 vs. limit=10.0 2024-06-20 15:28:22,929 INFO [train.py:1028] (0/2) Epoch 13, batch 4400, loss[loss=0.2045, simple_loss=0.2433, pruned_loss=0.08289, over 13204.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2525, pruned_loss=0.08146, over 2584282.42 frames. ], batch size: 83, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:28:34,533 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.744e+02 1.908e+02 2.040e+02 2.905e+02, threshold=3.816e+02, percent-clipped=0.0 2024-06-20 15:28:39,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.07 vs. limit=22.5 2024-06-20 15:29:04,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=230719.5, ans=0.2 2024-06-20 15:29:04,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=230719.5, ans=0.2 2024-06-20 15:29:09,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=230737.83333333334, ans=0.125 2024-06-20 15:29:09,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=230737.83333333334, ans=0.04949747468305833 2024-06-20 15:29:09,680 INFO [train.py:1028] (0/2) Epoch 13, batch 4450, loss[loss=0.2122, simple_loss=0.2609, pruned_loss=0.08176, over 12954.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2529, pruned_loss=0.08203, over 2578183.25 frames. 
], batch size: 33, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:29:13,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=230737.83333333334, ans=0.125 2024-06-20 15:29:19,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=230756.16666666666, ans=0.125 2024-06-20 15:29:22,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=230756.16666666666, ans=0.125 2024-06-20 15:29:33,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=230792.83333333334, ans=0.125 2024-06-20 15:29:35,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=230792.83333333334, ans=0.0 2024-06-20 15:29:38,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=230792.83333333334, ans=0.0 2024-06-20 15:29:47,955 INFO [train.py:1028] (0/2) Epoch 13, batch 4500, loss[loss=0.1906, simple_loss=0.2342, pruned_loss=0.07348, over 13251.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2525, pruned_loss=0.08215, over 2582475.67 frames. ], batch size: 89, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:29:48,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=230829.5, ans=0.125 2024-06-20 15:29:59,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=230847.83333333334, ans=0.5 2024-06-20 15:29:59,707 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.783e+02 1.901e+02 2.120e+02 2.684e+02, threshold=3.802e+02, percent-clipped=0.0 2024-06-20 15:30:05,683 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-20 15:30:08,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=230866.16666666666, ans=0.125 2024-06-20 15:30:14,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=230884.5, ans=0.0 2024-06-20 15:30:21,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=230902.83333333334, ans=0.125 2024-06-20 15:30:23,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=230902.83333333334, ans=0.125 2024-06-20 15:30:27,832 INFO [train.py:1028] (0/2) Epoch 13, batch 4550, loss[loss=0.1944, simple_loss=0.2396, pruned_loss=0.07461, over 13219.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2525, pruned_loss=0.0823, over 2586999.91 frames. 
], batch size: 52, lr: 4.52e-03, grad_scale: 64.0 2024-06-20 15:30:29,619 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:30:32,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230921.16666666666, ans=0.1 2024-06-20 15:30:39,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.40 vs. limit=15.0 2024-06-20 15:30:43,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=230957.83333333334, ans=0.0 2024-06-20 15:31:00,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230994.5, ans=0.1 2024-06-20 15:31:12,595 INFO [train.py:1028] (0/2) Epoch 13, batch 4600, loss[loss=0.2261, simple_loss=0.2591, pruned_loss=0.09654, over 12604.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2517, pruned_loss=0.08168, over 2582997.77 frames. ], batch size: 202, lr: 4.51e-03, grad_scale: 64.0 2024-06-20 15:31:18,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=231012.83333333334, ans=0.0 2024-06-20 15:31:24,994 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.747e+02 1.893e+02 2.118e+02 3.705e+02, threshold=3.786e+02, percent-clipped=0.0 2024-06-20 15:31:37,813 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:31:42,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.21 vs. limit=22.5 2024-06-20 15:31:51,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=231086.16666666666, ans=0.0 2024-06-20 15:31:57,053 INFO [train.py:1028] (0/2) Epoch 13, batch 4650, loss[loss=0.2058, simple_loss=0.2402, pruned_loss=0.08569, over 13111.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2511, pruned_loss=0.08144, over 2586759.40 frames. 
], batch size: 132, lr: 4.51e-03, grad_scale: 64.0 2024-06-20 15:32:06,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=231122.83333333334, ans=0.2 2024-06-20 15:32:08,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=231122.83333333334, ans=0.125 2024-06-20 15:32:14,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231141.16666666666, ans=0.125 2024-06-20 15:32:16,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=231141.16666666666, ans=0.5 2024-06-20 15:32:19,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231141.16666666666, ans=0.1 2024-06-20 15:32:23,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231159.5, ans=0.1 2024-06-20 15:32:23,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2024-06-20 15:32:26,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=231159.5, ans=0.2 2024-06-20 15:32:32,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=231177.83333333334, ans=0.125 2024-06-20 15:32:37,431 INFO [train.py:1028] (0/2) Epoch 13, batch 4700, loss[loss=0.2022, simple_loss=0.2652, pruned_loss=0.06964, over 12895.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2514, pruned_loss=0.08145, over 2583069.51 frames. ], batch size: 26, lr: 4.51e-03, grad_scale: 64.0 2024-06-20 15:32:39,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=231196.16666666666, ans=0.0 2024-06-20 15:32:43,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=231196.16666666666, ans=0.125 2024-06-20 15:32:49,579 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.826e+02 1.960e+02 2.292e+02 3.430e+02, threshold=3.921e+02, percent-clipped=0.0 2024-06-20 15:32:53,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231232.83333333334, ans=0.1 2024-06-20 15:32:55,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=231232.83333333334, ans=0.0 2024-06-20 15:33:05,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=231251.16666666666, ans=0.0 2024-06-20 15:33:15,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.04 vs. limit=15.0 2024-06-20 15:33:16,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=231287.83333333334, ans=0.02 2024-06-20 15:33:17,368 INFO [train.py:1028] (0/2) Epoch 13, batch 4750, loss[loss=0.2265, simple_loss=0.2643, pruned_loss=0.09436, over 12530.00 frames. 
], tot_loss[loss=0.2067, simple_loss=0.2508, pruned_loss=0.08133, over 2579476.43 frames. ], batch size: 202, lr: 4.51e-03, grad_scale: 64.0 2024-06-20 15:33:28,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=231306.16666666666, ans=0.2 2024-06-20 15:33:38,552 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:33:38,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=231324.5, ans=0.0 2024-06-20 15:33:40,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=231324.5, ans=0.04949747468305833 2024-06-20 15:34:04,787 INFO [train.py:1028] (0/2) Epoch 13, batch 4800, loss[loss=0.1862, simple_loss=0.2359, pruned_loss=0.06828, over 13251.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2504, pruned_loss=0.08095, over 2576641.75 frames. ], batch size: 63, lr: 4.51e-03, grad_scale: 64.0 2024-06-20 15:34:07,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231379.5, ans=0.1 2024-06-20 15:34:17,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.776e+02 1.902e+02 2.059e+02 3.063e+02, threshold=3.804e+02, percent-clipped=0.0 2024-06-20 15:34:28,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=231434.5, ans=0.125 2024-06-20 15:34:28,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=231434.5, ans=0.125 2024-06-20 15:34:30,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=231434.5, ans=0.125 2024-06-20 15:34:41,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2024-06-20 15:34:44,938 INFO [train.py:1028] (0/2) Epoch 13, batch 4850, loss[loss=0.2066, simple_loss=0.2439, pruned_loss=0.08463, over 13257.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2497, pruned_loss=0.08082, over 2575295.40 frames. ], batch size: 89, lr: 4.51e-03, grad_scale: 128.0 2024-06-20 15:34:48,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=231471.16666666666, ans=0.125 2024-06-20 15:34:53,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=231489.5, ans=0.0 2024-06-20 15:35:06,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=231507.83333333334, ans=0.025 2024-06-20 15:35:08,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=231526.16666666666, ans=0.125 2024-06-20 15:35:14,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=231526.16666666666, ans=22.5 2024-06-20 15:35:26,256 INFO [train.py:1028] (0/2) Epoch 13, batch 4900, loss[loss=0.2053, simple_loss=0.2528, pruned_loss=0.07893, over 13215.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2497, pruned_loss=0.08097, over 2576698.15 frames. 
], batch size: 59, lr: 4.51e-03, grad_scale: 128.0 2024-06-20 15:35:32,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.08 vs. limit=12.0 2024-06-20 15:35:38,731 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.770e+02 1.923e+02 2.109e+02 3.113e+02, threshold=3.845e+02, percent-clipped=0.0 2024-06-20 15:35:40,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=231581.16666666666, ans=0.125 2024-06-20 15:35:49,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=231599.5, ans=0.125 2024-06-20 15:35:49,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.88 vs. limit=15.0 2024-06-20 15:35:58,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=231617.83333333334, ans=0.125 2024-06-20 15:36:00,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=231617.83333333334, ans=0.125 2024-06-20 15:36:04,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231636.16666666666, ans=0.1 2024-06-20 15:36:13,765 INFO [train.py:1028] (0/2) Epoch 13, batch 4950, loss[loss=0.2007, simple_loss=0.2353, pruned_loss=0.08307, over 10992.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2501, pruned_loss=0.0813, over 2570871.16 frames. ], batch size: 304, lr: 4.51e-03, grad_scale: 128.0 2024-06-20 15:36:17,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=231654.5, ans=0.125 2024-06-20 15:36:21,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231672.83333333334, ans=0.125 2024-06-20 15:36:26,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=231672.83333333334, ans=0.0 2024-06-20 15:36:27,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231672.83333333334, ans=0.1 2024-06-20 15:36:31,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=231691.16666666666, ans=0.0 2024-06-20 15:36:52,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=231727.83333333334, ans=0.0 2024-06-20 15:36:56,349 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 15:36:56,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.44 vs. limit=10.0 2024-06-20 15:37:00,285 INFO [train.py:1028] (0/2) Epoch 13, batch 5000, loss[loss=0.1972, simple_loss=0.2453, pruned_loss=0.0746, over 13242.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2497, pruned_loss=0.08116, over 2573884.32 frames. 
], batch size: 95, lr: 4.51e-03, grad_scale: 128.0 2024-06-20 15:37:14,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231764.5, ans=0.1 2024-06-20 15:37:15,208 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.807e+02 1.898e+02 2.007e+02 2.497e+02, threshold=3.795e+02, percent-clipped=0.0 2024-06-20 15:37:29,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=231801.16666666666, ans=0.125 2024-06-20 15:37:35,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=231801.16666666666, ans=0.2 2024-06-20 15:37:49,581 INFO [train.py:1028] (0/2) Epoch 13, batch 5050, loss[loss=0.2018, simple_loss=0.2397, pruned_loss=0.08198, over 13012.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2491, pruned_loss=0.08049, over 2572606.59 frames. ], batch size: 36, lr: 4.51e-03, grad_scale: 128.0 2024-06-20 15:37:49,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=231837.83333333334, ans=0.125 2024-06-20 15:38:04,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=231856.16666666666, ans=0.2 2024-06-20 15:38:11,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231874.5, ans=0.125 2024-06-20 15:38:32,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231911.16666666666, ans=0.1 2024-06-20 15:38:39,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=231911.16666666666, ans=0.2 2024-06-20 15:38:42,184 INFO [train.py:1028] (0/2) Epoch 13, batch 5100, loss[loss=0.1955, simple_loss=0.2507, pruned_loss=0.07017, over 12924.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2489, pruned_loss=0.08047, over 2569346.78 frames. ], batch size: 39, lr: 4.51e-03, grad_scale: 128.0 2024-06-20 15:38:45,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231929.5, ans=0.1 2024-06-20 15:38:56,766 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 1.840e+02 2.082e+02 2.334e+02 3.611e+02, threshold=4.164e+02, percent-clipped=0.0 2024-06-20 15:39:23,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=231984.5, ans=0.0 2024-06-20 15:39:24,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231984.5, ans=0.125 2024-06-20 15:39:27,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0 2024-06-20 15:39:36,101 INFO [train.py:1028] (0/2) Epoch 13, batch 5150, loss[loss=0.1942, simple_loss=0.2317, pruned_loss=0.07833, over 13091.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2486, pruned_loss=0.08041, over 2571123.23 frames. 
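grad_scale sits at 64.0 for most of this excerpt and doubles to 128.0 at batch 4850, the signature of dynamic fp16 loss scaling: double the scale after a long run of overflow-free steps, halve it when a step overflows. PyTorch's AMP scaler implements exactly this policy; a sketch using it (whether the recipe drives the scale this way is not shown in the log, and growth_interval here is an assumed value):

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=64.0,       # matches the grad_scale seen earlier in the log
        growth_factor=2.0,     # 64 -> 128, as observed at batch ~4850
        backoff_factor=0.5,    # halve the scale on overflow
        growth_interval=2000,  # assumed run of clean steps before doubling
    )

    def fp16_step(model, optimizer, loss_fn, batch):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = loss_fn(model, batch)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales; skips the step on inf/nan
        scaler.update()                # grow or back off the scale
        return loss.detach()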
], batch size: 132, lr: 4.50e-03, grad_scale: 128.0 2024-06-20 15:39:48,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=232039.5, ans=0.0 2024-06-20 15:39:50,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=232039.5, ans=0.125 2024-06-20 15:40:05,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=232076.16666666666, ans=0.0 2024-06-20 15:40:06,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=232076.16666666666, ans=0.0 2024-06-20 15:40:09,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=232076.16666666666, ans=0.125 2024-06-20 15:40:09,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=232076.16666666666, ans=0.0 2024-06-20 15:40:19,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=232094.5, ans=0.0 2024-06-20 15:40:24,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=232112.83333333334, ans=0.125 2024-06-20 15:40:24,687 INFO [train.py:1028] (0/2) Epoch 13, batch 5200, loss[loss=0.2007, simple_loss=0.2446, pruned_loss=0.07838, over 13131.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2486, pruned_loss=0.08023, over 2574841.30 frames. ], batch size: 95, lr: 4.50e-03, grad_scale: 128.0 2024-06-20 15:40:31,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2024-06-20 15:40:33,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=232131.16666666666, ans=0.125 2024-06-20 15:40:39,113 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.789e+02 1.902e+02 2.145e+02 2.916e+02, threshold=3.804e+02, percent-clipped=0.0 2024-06-20 15:40:50,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.04 vs. limit=22.5 2024-06-20 15:40:58,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=232167.83333333334, ans=0.025 2024-06-20 15:41:02,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=232167.83333333334, ans=0.07 2024-06-20 15:41:04,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=232186.16666666666, ans=0.0 2024-06-20 15:41:11,917 INFO [train.py:1028] (0/2) Epoch 13, batch 5250, loss[loss=0.1901, simple_loss=0.2362, pruned_loss=0.07195, over 13311.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2492, pruned_loss=0.08068, over 2571900.79 frames. 
], batch size: 52, lr: 4.50e-03, grad_scale: 128.0 2024-06-20 15:41:37,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=232241.16666666666, ans=0.125 2024-06-20 15:41:51,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=232277.83333333334, ans=0.0 2024-06-20 15:42:07,597 INFO [train.py:1028] (0/2) Epoch 13, batch 5300, loss[loss=0.1937, simple_loss=0.2382, pruned_loss=0.07462, over 12982.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2485, pruned_loss=0.08039, over 2567842.37 frames. ], batch size: 144, lr: 4.50e-03, grad_scale: 128.0 2024-06-20 15:42:14,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=232296.16666666666, ans=0.125 2024-06-20 15:42:22,542 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.483e+02 1.758e+02 1.862e+02 2.002e+02 2.730e+02, threshold=3.725e+02, percent-clipped=0.0 2024-06-20 15:42:28,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=232332.83333333334, ans=0.0 2024-06-20 15:42:30,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=12.0 2024-06-20 15:42:37,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=15.0 2024-06-20 15:42:40,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=232351.16666666666, ans=0.125 2024-06-20 15:42:40,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=232351.16666666666, ans=0.125 2024-06-20 15:42:57,679 INFO [train.py:1028] (0/2) Epoch 13, batch 5350, loss[loss=0.2246, simple_loss=0.2663, pruned_loss=0.09145, over 11872.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2481, pruned_loss=0.08029, over 2574441.43 frames. ], batch size: 17, lr: 4.50e-03, grad_scale: 128.0 2024-06-20 15:43:09,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.79 vs. limit=10.0 2024-06-20 15:43:15,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=232406.16666666666, ans=0.125 2024-06-20 15:43:17,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=232424.5, ans=0.025 2024-06-20 15:43:23,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=232424.5, ans=0.025 2024-06-20 15:43:28,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=232442.83333333334, ans=0.0 2024-06-20 15:43:34,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=232442.83333333334, ans=0.125 2024-06-20 15:43:46,945 INFO [train.py:1028] (0/2) Epoch 13, batch 5400, loss[loss=0.2163, simple_loss=0.2547, pruned_loss=0.08899, over 12257.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2489, pruned_loss=0.08117, over 2567440.73 frames. 
2024-06-20 15:43:59,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=232497.83333333334, ans=0.125
2024-06-20 15:44:01,823 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.835e+02 1.939e+02 2.149e+02 2.788e+02, threshold=3.879e+02, percent-clipped=0.0
2024-06-20 15:44:04,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232497.83333333334, ans=0.1
2024-06-20 15:44:04,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=232497.83333333334, ans=0.2
2024-06-20 15:44:05,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=232516.16666666666, ans=0.125
2024-06-20 15:44:06,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=232516.16666666666, ans=0.0
2024-06-20 15:44:12,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=232516.16666666666, ans=0.125
2024-06-20 15:44:21,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=232534.5, ans=0.125
2024-06-20 15:44:27,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=232534.5, ans=0.125
2024-06-20 15:44:29,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=232534.5, ans=0.125
2024-06-20 15:44:40,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=232571.16666666666, ans=0.125
2024-06-20 15:44:41,277 INFO [train.py:1028] (0/2) Epoch 13, batch 5450, loss[loss=0.212, simple_loss=0.2523, pruned_loss=0.08585, over 12915.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.249, pruned_loss=0.08079, over 2571148.96 frames. ], batch size: 26, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:44:52,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=232571.16666666666, ans=0.0
2024-06-20 15:44:55,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=232571.16666666666, ans=0.125
2024-06-20 15:45:01,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=232589.5, ans=0.1
2024-06-20 15:45:01,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=232589.5, ans=0.0
2024-06-20 15:45:05,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.83 vs. limit=15.0
2024-06-20 15:45:28,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=232644.5, ans=0.0
2024-06-20 15:45:37,743 INFO [train.py:1028] (0/2) Epoch 13, batch 5500, loss[loss=0.2234, simple_loss=0.2606, pruned_loss=0.09313, over 12159.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2491, pruned_loss=0.08067, over 2564066.12 frames. ], batch size: 240, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:45:38,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=232662.83333333334, ans=0.125
2024-06-20 15:45:41,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=232662.83333333334, ans=0.2
2024-06-20 15:45:42,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=232662.83333333334, ans=0.125
2024-06-20 15:45:46,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=232681.16666666666, ans=0.025
2024-06-20 15:45:52,338 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.797e+02 1.911e+02 2.166e+02 3.072e+02, threshold=3.821e+02, percent-clipped=0.0
2024-06-20 15:46:07,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=232717.83333333334, ans=0.025
2024-06-20 15:46:09,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0
2024-06-20 15:46:19,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=232736.16666666666, ans=0.125
2024-06-20 15:46:23,894 INFO [train.py:1028] (0/2) Epoch 13, batch 5550, loss[loss=0.2097, simple_loss=0.2584, pruned_loss=0.08048, over 13278.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2487, pruned_loss=0.08023, over 2568529.15 frames. ], batch size: 43, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:46:24,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=232754.5, ans=0.125
2024-06-20 15:46:28,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=232754.5, ans=0.125
2024-06-20 15:46:28,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0
2024-06-20 15:46:30,285 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:46:38,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.84 vs. limit=12.0
2024-06-20 15:46:39,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=232772.83333333334, ans=0.125
2024-06-20 15:46:48,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.91 vs. limit=15.0
2024-06-20 15:46:54,559 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.25 vs. limit=15.0
2024-06-20 15:47:00,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=232809.5, ans=0.05
2024-06-20 15:47:20,371 INFO [train.py:1028] (0/2) Epoch 13, batch 5600, loss[loss=0.1862, simple_loss=0.2347, pruned_loss=0.06889, over 13227.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2483, pruned_loss=0.07998, over 2570101.22 frames. ], batch size: 89, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:47:22,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=232846.16666666666, ans=0.125
2024-06-20 15:47:35,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0
2024-06-20 15:47:36,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.745e+02 1.871e+02 2.102e+02 2.698e+02, threshold=3.741e+02, percent-clipped=0.0
2024-06-20 15:47:37,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=15.0
2024-06-20 15:47:56,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232882.83333333334, ans=0.1
2024-06-20 15:47:59,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=232901.16666666666, ans=0.125
2024-06-20 15:48:00,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0
2024-06-20 15:48:15,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=232919.5, ans=0.125
2024-06-20 15:48:17,799 INFO [train.py:1028] (0/2) Epoch 13, batch 5650, loss[loss=0.2169, simple_loss=0.256, pruned_loss=0.08891, over 12531.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2472, pruned_loss=0.07921, over 2575393.47 frames. ], batch size: 202, lr: 4.50e-03, grad_scale: 128.0
2024-06-20 15:48:39,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=232974.5, ans=0.035
2024-06-20 15:48:40,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=232974.5, ans=0.125
2024-06-20 15:48:50,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=232992.83333333334, ans=0.1
2024-06-20 15:48:57,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.51 vs. limit=22.5
2024-06-20 15:49:05,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=233011.16666666666, ans=0.125
2024-06-20 15:49:06,725 INFO [train.py:1028] (0/2) Epoch 13, batch 5700, loss[loss=0.2078, simple_loss=0.2546, pruned_loss=0.08045, over 13224.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2476, pruned_loss=0.07955, over 2579595.04 frames. ], batch size: 63, lr: 4.49e-03, grad_scale: 128.0
2024-06-20 15:49:06,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=233029.5, ans=0.125
2024-06-20 15:49:14,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233029.5, ans=0.1
2024-06-20 15:49:15,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=233047.83333333334, ans=0.125
2024-06-20 15:49:21,297 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.744e+02 1.871e+02 1.989e+02 2.447e+02, threshold=3.741e+02, percent-clipped=0.0
2024-06-20 15:49:30,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233066.16666666666, ans=0.1
2024-06-20 15:49:31,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=233066.16666666666, ans=0.0
2024-06-20 15:49:35,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=233084.5, ans=0.2
2024-06-20 15:49:47,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=233102.83333333334, ans=0.125
2024-06-20 15:49:54,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.59 vs. limit=10.0
2024-06-20 15:49:54,309 INFO [train.py:1028] (0/2) Epoch 13, batch 5750, loss[loss=0.214, simple_loss=0.256, pruned_loss=0.08604, over 12679.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.249, pruned_loss=0.0802, over 2578846.15 frames. ], batch size: 176, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:49:59,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.33 vs. limit=22.5
2024-06-20 15:50:02,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=233121.16666666666, ans=0.125
2024-06-20 15:50:06,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=233139.5, ans=0.125
2024-06-20 15:50:07,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=233139.5, ans=0.125
2024-06-20 15:50:38,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=233194.5, ans=0.125
2024-06-20 15:50:52,844 INFO [train.py:1028] (0/2) Epoch 13, batch 5800, loss[loss=0.2405, simple_loss=0.2742, pruned_loss=0.1034, over 12768.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2505, pruned_loss=0.08136, over 2578249.60 frames. ], batch size: 176, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:50:55,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0
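The scaling.py:1023 Whitening lines compare a per-module statistic against a limit: the metric measures how far the activation covariance is from white (a multiple of the identity), and values above the limit trigger a corrective gradient. One plausible form of such a metric, offered as a paraphrase of the idea rather than the exact scaling.py code:

    import torch

    def whitening_metric(x):
        # x: (..., num_channels) activations; returns ~1.0 for whitened
        # features and larger values for strongly correlated channels.
        x = x.reshape(-1, x.shape[-1]).float()
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        num_channels = cov.shape[0]
        return (cov ** 2).sum() / (cov.diag().mean() ** 2 * num_channels)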
2024-06-20 15:51:03,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=233231.16666666666, ans=0.1
2024-06-20 15:51:03,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=233231.16666666666, ans=0.125
2024-06-20 15:51:05,794 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.599e+02 1.887e+02 2.078e+02 2.312e+02 3.160e+02, threshold=4.157e+02, percent-clipped=0.0
2024-06-20 15:51:13,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.87 vs. limit=22.5
2024-06-20 15:51:13,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=233249.5, ans=0.125
2024-06-20 15:51:23,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0
2024-06-20 15:51:25,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=233267.83333333334, ans=0.125
2024-06-20 15:51:29,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=233286.16666666666, ans=0.2
2024-06-20 15:51:36,241 INFO [train.py:1028] (0/2) Epoch 13, batch 5850, loss[loss=0.231, simple_loss=0.2687, pruned_loss=0.09666, over 12488.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2526, pruned_loss=0.08224, over 2576949.61 frames. ], batch size: 202, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:51:45,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. limit=6.0
2024-06-20 15:51:53,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=233322.83333333334, ans=0.0
2024-06-20 15:51:53,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=233322.83333333334, ans=0.05
2024-06-20 15:51:56,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=233341.16666666666, ans=0.125
2024-06-20 15:51:59,328 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.57 vs. limit=15.0
2024-06-20 15:52:03,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=233341.16666666666, ans=0.125
2024-06-20 15:52:15,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=233377.83333333334, ans=0.125
2024-06-20 15:52:25,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=233396.16666666666, ans=0.125
2024-06-20 15:52:26,109 INFO [train.py:1028] (0/2) Epoch 13, batch 5900, loss[loss=0.1974, simple_loss=0.24, pruned_loss=0.07737, over 13101.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2545, pruned_loss=0.08296, over 2576148.72 frames. ], batch size: 121, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:52:27,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.08 vs. limit=22.5
2024-06-20 15:52:28,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=233396.16666666666, ans=0.0
2024-06-20 15:52:41,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=233414.5, ans=0.125
2024-06-20 15:52:42,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 1.805e+02 1.964e+02 2.143e+02 2.868e+02, threshold=3.928e+02, percent-clipped=0.0
2024-06-20 15:52:47,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=233432.83333333334, ans=0.0
2024-06-20 15:52:47,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=233432.83333333334, ans=0.125
2024-06-20 15:52:48,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=233432.83333333334, ans=0.125
2024-06-20 15:52:50,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.64 vs. limit=22.5
2024-06-20 15:52:55,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233451.16666666666, ans=0.1
2024-06-20 15:53:02,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=233451.16666666666, ans=0.0
2024-06-20 15:53:11,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=233469.5, ans=0.125
2024-06-20 15:53:21,271 INFO [train.py:1028] (0/2) Epoch 13, batch 5950, loss[loss=0.179, simple_loss=0.226, pruned_loss=0.06599, over 13030.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2565, pruned_loss=0.084, over 2580151.18 frames. ], batch size: 121, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:53:52,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=233524.5, ans=0.2
2024-06-20 15:54:07,864 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:54:07,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=233542.83333333334, ans=0.0
2024-06-20 15:54:13,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=233561.16666666666, ans=0.125
2024-06-20 15:54:20,403 INFO [train.py:1028] (0/2) Epoch 13, batch 6000, loss[loss=0.2679, simple_loss=0.2932, pruned_loss=0.1213, over 12202.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2578, pruned_loss=0.08467, over 2573734.81 frames. ], batch size: 240, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:54:20,405 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 15:54:31,920 INFO [train.py:1060] (0/2) Epoch 13, validation: loss=0.1926, simple_loss=0.2569, pruned_loss=0.06418, over 351949.00 frames.
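The "Computing validation loss" / "Epoch 13, validation: ..." pair shows the trainer pausing at a fixed batch interval (it fires at batch 6000 of this epoch) to run the dev set and log a frame-weighted average loss. A minimal sketch of such a loop, assuming plain PyTorch; compute_loss is a hypothetical helper standing in for the recipe's actual loss function:

    import torch

    def validate(model, valid_dl):
        model.eval()
        tot_loss, tot_frames = 0.0, 0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)  # hypothetical helper
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        return tot_loss / tot_frames  # frame-weighted, like the logged numbers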
2024-06-20 15:54:31,920 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-20 15:54:34,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=233579.5, ans=0.0
2024-06-20 15:54:47,823 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 1.852e+02 2.010e+02 2.218e+02 3.854e+02, threshold=4.020e+02, percent-clipped=0.0
2024-06-20 15:54:48,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=233597.83333333334, ans=0.125
2024-06-20 15:54:50,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=233597.83333333334, ans=0.0
2024-06-20 15:55:02,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=233634.5, ans=0.125
2024-06-20 15:55:06,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=233634.5, ans=0.2
2024-06-20 15:55:12,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. limit=6.0
2024-06-20 15:55:12,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.57 vs. limit=15.0
2024-06-20 15:55:22,777 INFO [train.py:1028] (0/2) Epoch 13, batch 6050, loss[loss=0.216, simple_loss=0.2657, pruned_loss=0.0831, over 12839.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2599, pruned_loss=0.08536, over 2575860.23 frames. ], batch size: 39, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:55:36,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=233689.5, ans=0.125
2024-06-20 15:55:37,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0
2024-06-20 15:55:39,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.23 vs. limit=15.0
2024-06-20 15:55:40,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=233689.5, ans=0.1
2024-06-20 15:55:49,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=233707.83333333334, ans=0.125
2024-06-20 15:55:55,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=233726.16666666666, ans=0.05
2024-06-20 15:56:07,225 INFO [train.py:1028] (0/2) Epoch 13, batch 6100, loss[loss=0.1876, simple_loss=0.2307, pruned_loss=0.07226, over 13126.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2612, pruned_loss=0.08565, over 2577899.42 frames. ], batch size: 121, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:56:20,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0
2024-06-20 15:56:22,788 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.680e+00
2024-06-20 15:56:23,504 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 1.907e+02 2.030e+02 2.293e+02 3.277e+02, threshold=4.059e+02, percent-clipped=0.0
2024-06-20 15:56:43,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=233817.83333333334, ans=0.2
2024-06-20 15:57:04,565 INFO [train.py:1028] (0/2) Epoch 13, batch 6150, loss[loss=0.237, simple_loss=0.2657, pruned_loss=0.1041, over 10805.00 frames. ], tot_loss[loss=0.218, simple_loss=0.263, pruned_loss=0.08654, over 2576064.13 frames. ], batch size: 303, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:57:17,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=233872.83333333334, ans=0.125
2024-06-20 15:57:21,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=233872.83333333334, ans=0.07
2024-06-20 15:57:27,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.74 vs. limit=15.0
2024-06-20 15:57:52,652 INFO [train.py:1028] (0/2) Epoch 13, batch 6200, loss[loss=0.2582, simple_loss=0.3042, pruned_loss=0.106, over 13262.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2644, pruned_loss=0.08709, over 2574847.94 frames. ], batch size: 89, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:57:55,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.76 vs. limit=10.0
2024-06-20 15:58:02,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.20 vs. limit=15.0
2024-06-20 15:58:07,083 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.909e+02 2.129e+02 2.448e+02 3.357e+02, threshold=4.258e+02, percent-clipped=0.0
2024-06-20 15:58:29,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=234001.16666666666, ans=0.125
2024-06-20 15:58:35,398 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 15:58:41,351 INFO [train.py:1028] (0/2) Epoch 13, batch 6250, loss[loss=0.1909, simple_loss=0.2356, pruned_loss=0.07308, over 13215.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2648, pruned_loss=0.08714, over 2569231.67 frames. ], batch size: 83, lr: 4.49e-03, grad_scale: 64.0
2024-06-20 15:58:42,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.84 vs. limit=22.5
2024-06-20 15:58:43,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=234037.83333333334, ans=0.125
2024-06-20 15:58:45,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=234037.83333333334, ans=0.95
2024-06-20 15:58:54,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234056.16666666666, ans=0.125
2024-06-20 15:58:55,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=234056.16666666666, ans=0.0
2024-06-20 15:59:23,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=234111.16666666666, ans=0.2
2024-06-20 15:59:23,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=234111.16666666666, ans=0.0
2024-06-20 15:59:31,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.88 vs. limit=12.0
2024-06-20 15:59:33,079 INFO [train.py:1028] (0/2) Epoch 13, batch 6300, loss[loss=0.2118, simple_loss=0.2619, pruned_loss=0.08085, over 11892.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2661, pruned_loss=0.08754, over 2566011.10 frames. ], batch size: 17, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 15:59:46,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=234129.5, ans=0.1
2024-06-20 15:59:48,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=234147.83333333334, ans=0.1
2024-06-20 15:59:53,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=234147.83333333334, ans=0.1
2024-06-20 15:59:54,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=234147.83333333334, ans=0.125
2024-06-20 15:59:55,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.83 vs. limit=10.0
2024-06-20 15:59:55,743 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.631e+02 1.864e+02 1.979e+02 2.097e+02 2.752e+02, threshold=3.958e+02, percent-clipped=0.0
2024-06-20 16:00:11,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=234184.5, ans=0.125
2024-06-20 16:00:13,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.11 vs. limit=15.0
2024-06-20 16:00:25,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234202.83333333334, ans=0.1
2024-06-20 16:00:29,072 INFO [train.py:1028] (0/2) Epoch 13, batch 6350, loss[loss=0.255, simple_loss=0.2964, pruned_loss=0.1068, over 12602.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2675, pruned_loss=0.0878, over 2574347.19 frames. ], batch size: 202, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:00:39,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=234239.5, ans=0.2
2024-06-20 16:00:40,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=234239.5, ans=0.125
2024-06-20 16:01:06,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234294.5, ans=0.1
2024-06-20 16:01:11,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=234294.5, ans=0.015
2024-06-20 16:01:15,842 INFO [train.py:1028] (0/2) Epoch 13, batch 6400, loss[loss=0.2, simple_loss=0.2516, pruned_loss=0.07422, over 13253.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2702, pruned_loss=0.0893, over 2575062.23 frames. ], batch size: 67, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:01:24,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=234312.83333333334, ans=0.125
2024-06-20 16:01:31,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 1.943e+02 2.108e+02 2.389e+02 3.739e+02, threshold=4.215e+02, percent-clipped=0.0
2024-06-20 16:01:33,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=234331.16666666666, ans=0.125
2024-06-20 16:01:36,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=234349.5, ans=0.5
2024-06-20 16:01:42,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=234349.5, ans=0.0
2024-06-20 16:01:46,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=234367.83333333334, ans=0.0
2024-06-20 16:01:51,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=234367.83333333334, ans=0.1
2024-06-20 16:02:09,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=234386.16666666666, ans=0.0
2024-06-20 16:02:10,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=234404.5, ans=0.125
2024-06-20 16:02:10,943 INFO [train.py:1028] (0/2) Epoch 13, batch 6450, loss[loss=0.2913, simple_loss=0.3248, pruned_loss=0.1289, over 12586.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2717, pruned_loss=0.0896, over 2580425.60 frames. ], batch size: 202, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:02:24,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=234422.83333333334, ans=0.0
2024-06-20 16:02:50,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=234477.83333333334, ans=0.125
2024-06-20 16:02:51,356 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.336e+00
2024-06-20 16:02:54,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234477.83333333334, ans=0.1
2024-06-20 16:02:58,636 INFO [train.py:1028] (0/2) Epoch 13, batch 6500, loss[loss=0.25, simple_loss=0.2865, pruned_loss=0.1068, over 10804.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2725, pruned_loss=0.08963, over 2583426.43 frames. ], batch size: 306, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:03:03,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=234496.16666666666, ans=0.0
2024-06-20 16:03:08,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.93 vs. limit=15.0
2024-06-20 16:03:14,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.906e+02 2.069e+02 2.215e+02 3.011e+02, threshold=4.137e+02, percent-clipped=0.0
2024-06-20 16:03:21,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=234532.83333333334, ans=0.0
2024-06-20 16:03:45,898 INFO [train.py:1028] (0/2) Epoch 13, batch 6550, loss[loss=0.2173, simple_loss=0.2683, pruned_loss=0.08316, over 12443.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2736, pruned_loss=0.08964, over 2588931.67 frames. ], batch size: 22, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:04:02,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.15 vs. limit=10.0
2024-06-20 16:04:13,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=234642.83333333334, ans=0.07
2024-06-20 16:04:15,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=234642.83333333334, ans=0.025
2024-06-20 16:04:20,288 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-128000.pt
2024-06-20 16:04:33,324 INFO [train.py:1028] (0/2) Epoch 13, batch 6600, loss[loss=0.2134, simple_loss=0.2626, pruned_loss=0.08208, over 13284.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2734, pruned_loss=0.08935, over 2591148.08 frames. ], batch size: 72, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:04:41,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0
2024-06-20 16:04:44,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.25 vs. limit=10.0
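The checkpoint.py:75 line a few records up writes a batch-indexed checkpoint (checkpoint-128000.pt) in addition to the recipe's per-epoch saves. A sketch of that pattern; the payload fields here are illustrative, not the recipe's full set, which likely also carries scheduler, sampler, and grad-scaler state:

    from pathlib import Path
    import torch

    def save_batch_checkpoint(model, optimizer, exp_dir, batch_idx_train):
        # Checkpoints keyed by global batch index rather than epoch,
        # mirroring the 'Saving checkpoint to .../checkpoint-128000.pt' line.
        path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            path,
        )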
2024-06-20 16:04:45,801 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.535e+02 1.930e+02 2.182e+02 2.480e+02 3.228e+02, threshold=4.364e+02, percent-clipped=0.0
2024-06-20 16:05:17,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0
2024-06-20 16:05:25,454 INFO [train.py:1028] (0/2) Epoch 13, batch 6650, loss[loss=0.2647, simple_loss=0.3028, pruned_loss=0.1133, over 12931.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2755, pruned_loss=0.09043, over 2584499.13 frames. ], batch size: 158, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:05:46,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.50 vs. limit=15.0
2024-06-20 16:05:56,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234807.83333333334, ans=0.125
2024-06-20 16:05:56,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234807.83333333334, ans=0.1
2024-06-20 16:06:00,883 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.07 vs. limit=15.0
2024-06-20 16:06:01,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=234826.16666666666, ans=0.025
2024-06-20 16:06:17,617 INFO [train.py:1028] (0/2) Epoch 13, batch 6700, loss[loss=0.2477, simple_loss=0.2894, pruned_loss=0.103, over 12744.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2772, pruned_loss=0.09139, over 2583752.36 frames. ], batch size: 176, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:06:18,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234862.83333333334, ans=0.1
2024-06-20 16:06:23,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=234862.83333333334, ans=0.025
2024-06-20 16:06:29,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0
2024-06-20 16:06:33,040 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.926e+02 2.056e+02 2.356e+02 3.484e+02, threshold=4.111e+02, percent-clipped=0.0
2024-06-20 16:06:34,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=234881.16666666666, ans=0.125
2024-06-20 16:06:36,919 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.295e+00
2024-06-20 16:06:52,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=234917.83333333334, ans=0.035
2024-06-20 16:06:58,404 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0
2024-06-20 16:07:04,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=234936.16666666666, ans=0.125
2024-06-20 16:07:06,276 INFO [train.py:1028] (0/2) Epoch 13, batch 6750, loss[loss=0.2752, simple_loss=0.3122, pruned_loss=0.119, over 12266.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2783, pruned_loss=0.09205, over 2577462.68 frames. ], batch size: 241, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:07:16,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234972.83333333334, ans=0.1
2024-06-20 16:07:20,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=234972.83333333334, ans=0.125
2024-06-20 16:07:20,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=234972.83333333334, ans=0.2
2024-06-20 16:07:32,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234991.16666666666, ans=0.1
2024-06-20 16:07:39,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=235009.5, ans=0.1
2024-06-20 16:07:59,832 INFO [train.py:1028] (0/2) Epoch 13, batch 6800, loss[loss=0.2237, simple_loss=0.2687, pruned_loss=0.08934, over 13207.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2793, pruned_loss=0.09231, over 2579100.97 frames. ], batch size: 67, lr: 4.48e-03, grad_scale: 64.0
2024-06-20 16:08:04,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0
2024-06-20 16:08:15,718 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.927e+02 2.013e+02 2.200e+02 2.988e+02, threshold=4.025e+02, percent-clipped=0.0
2024-06-20 16:08:21,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=235082.83333333334, ans=0.125
2024-06-20 16:08:30,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=235101.16666666666, ans=0.125
2024-06-20 16:08:30,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=235101.16666666666, ans=0.125
2024-06-20 16:08:34,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=235101.16666666666, ans=0.5
2024-06-20 16:08:35,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=235101.16666666666, ans=0.0
2024-06-20 16:08:51,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=235119.5, ans=0.125
2024-06-20 16:08:54,754 INFO [train.py:1028] (0/2) Epoch 13, batch 6850, loss[loss=0.2284, simple_loss=0.2918, pruned_loss=0.08252, over 13274.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.28, pruned_loss=0.09235, over 2582315.95 frames. ], batch size: 63, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:08:56,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0
2024-06-20 16:09:06,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0
2024-06-20 16:09:17,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235174.5, ans=0.1
2024-06-20 16:09:17,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=235174.5, ans=0.125
2024-06-20 16:09:31,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=235211.16666666666, ans=0.125
2024-06-20 16:09:38,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0
2024-06-20 16:09:40,498 INFO [train.py:1028] (0/2) Epoch 13, batch 6900, loss[loss=0.2203, simple_loss=0.2762, pruned_loss=0.08219, over 13287.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2804, pruned_loss=0.09247, over 2584707.68 frames. ], batch size: 49, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:09:51,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0
2024-06-20 16:09:55,803 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 1.911e+02 2.083e+02 2.290e+02 2.958e+02, threshold=4.167e+02, percent-clipped=0.0
2024-06-20 16:10:05,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=235266.16666666666, ans=0.0
2024-06-20 16:10:26,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.57 vs. limit=15.0
2024-06-20 16:10:29,568 INFO [train.py:1028] (0/2) Epoch 13, batch 6950, loss[loss=0.2228, simple_loss=0.2776, pruned_loss=0.08401, over 11151.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.2807, pruned_loss=0.09228, over 2577773.95 frames. ], batch size: 16, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:10:33,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=235321.16666666666, ans=0.125
2024-06-20 16:10:35,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=235321.16666666666, ans=0.125
2024-06-20 16:10:56,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=235357.83333333334, ans=0.04949747468305833
2024-06-20 16:10:57,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.34 vs. limit=22.5
2024-06-20 16:11:00,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=235376.16666666666, ans=0.0
2024-06-20 16:11:18,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=235394.5, ans=0.95
2024-06-20 16:11:23,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=235412.83333333334, ans=0.125
2024-06-20 16:11:23,918 INFO [train.py:1028] (0/2) Epoch 13, batch 7000, loss[loss=0.254, simple_loss=0.2925, pruned_loss=0.1077, over 12962.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2813, pruned_loss=0.0925, over 2575003.12 frames. ], batch size: 158, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:11:29,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=235412.83333333334, ans=0.125
2024-06-20 16:11:38,622 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.970e+02 2.143e+02 2.425e+02 3.357e+02, threshold=4.286e+02, percent-clipped=0.0
2024-06-20 16:11:46,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.55 vs. limit=15.0
2024-06-20 16:11:59,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0
2024-06-20 16:12:06,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=235467.83333333334, ans=0.125
2024-06-20 16:12:15,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=235486.16666666666, ans=0.125
2024-06-20 16:12:20,942 INFO [train.py:1028] (0/2) Epoch 13, batch 7050, loss[loss=0.233, simple_loss=0.2723, pruned_loss=0.09683, over 12697.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2824, pruned_loss=0.09297, over 2582314.16 frames. ], batch size: 176, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:12:31,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=235522.83333333334, ans=0.125
2024-06-20 16:12:31,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=235522.83333333334, ans=0.1
2024-06-20 16:12:45,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=235541.16666666666, ans=0.125
2024-06-20 16:13:01,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0
2024-06-20 16:13:03,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.01 vs. limit=15.0
2024-06-20 16:13:04,994 INFO [train.py:1028] (0/2) Epoch 13, batch 7100, loss[loss=0.2625, simple_loss=0.3118, pruned_loss=0.1066, over 13153.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2836, pruned_loss=0.09355, over 2574009.50 frames. ], batch size: 112, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:13:16,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=235614.5, ans=0.125
2024-06-20 16:13:19,973 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 2.022e+02 2.228e+02 2.469e+02 3.621e+02, threshold=4.455e+02, percent-clipped=0.0
2024-06-20 16:13:46,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=235669.5, ans=0.125
2024-06-20 16:13:53,773 INFO [train.py:1028] (0/2) Epoch 13, batch 7150, loss[loss=0.2621, simple_loss=0.3015, pruned_loss=0.1113, over 12513.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2847, pruned_loss=0.09386, over 2572372.83 frames. ], batch size: 202, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:13:56,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.54 vs. limit=10.0
2024-06-20 16:14:11,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=235706.16666666666, ans=0.0
2024-06-20 16:14:18,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=235706.16666666666, ans=0.125
2024-06-20 16:14:44,410 INFO [train.py:1028] (0/2) Epoch 13, batch 7200, loss[loss=0.2527, simple_loss=0.3041, pruned_loss=0.1006, over 13157.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2856, pruned_loss=0.09412, over 2577747.59 frames. ], batch size: 112, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:14:44,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=235779.5, ans=0.0
2024-06-20 16:14:46,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=235779.5, ans=0.125
2024-06-20 16:15:02,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=235797.83333333334, ans=0.1
2024-06-20 16:15:03,967 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.965e+02 2.146e+02 2.355e+02 3.295e+02, threshold=4.292e+02, percent-clipped=0.0
2024-06-20 16:15:09,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=235816.16666666666, ans=0.125
2024-06-20 16:15:17,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=235834.5, ans=0.09899494936611666
2024-06-20 16:15:21,101 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0
2024-06-20 16:15:28,005 INFO [train.py:1028] (0/2) Epoch 13, batch 7250, loss[loss=0.2326, simple_loss=0.2832, pruned_loss=0.09095, over 12981.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.2861, pruned_loss=0.09403, over 2578372.30 frames. ], batch size: 36, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:15:36,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=235889.5, ans=0.125
2024-06-20 16:15:52,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.81 vs. limit=15.0
2024-06-20 16:16:00,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=235926.16666666666, ans=0.2
2024-06-20 16:16:07,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=235944.5, ans=0.05
2024-06-20 16:16:13,057 INFO [train.py:1028] (0/2) Epoch 13, batch 7300, loss[loss=0.2174, simple_loss=0.2767, pruned_loss=0.0791, over 12828.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.2872, pruned_loss=0.0945, over 2579196.08 frames. ], batch size: 36, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:16:15,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=235962.83333333334, ans=0.125
2024-06-20 16:16:25,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=235981.16666666666, ans=0.125
2024-06-20 16:16:26,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=235981.16666666666, ans=0.025
2024-06-20 16:16:27,621 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.987e+02 2.159e+02 2.332e+02 3.155e+02, threshold=4.318e+02, percent-clipped=0.0
2024-06-20 16:16:41,592 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:16:46,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236017.83333333334, ans=0.1
2024-06-20 16:17:08,889 INFO [train.py:1028] (0/2) Epoch 13, batch 7350, loss[loss=0.2578, simple_loss=0.3045, pruned_loss=0.1056, over 13241.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.2875, pruned_loss=0.09468, over 2579574.66 frames. ], batch size: 46, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:17:12,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236054.5, ans=0.1
2024-06-20 16:17:26,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=236091.16666666666, ans=0.0
2024-06-20 16:18:03,086 INFO [train.py:1028] (0/2) Epoch 13, batch 7400, loss[loss=0.2352, simple_loss=0.2939, pruned_loss=0.08828, over 13256.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.2875, pruned_loss=0.09445, over 2585255.25 frames. ], batch size: 63, lr: 4.47e-03, grad_scale: 64.0
2024-06-20 16:18:18,685 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.922e+02 2.090e+02 2.344e+02 3.456e+02, threshold=4.181e+02, percent-clipped=0.0
2024-06-20 16:18:20,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=236164.5, ans=0.5
2024-06-20 16:18:42,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=236219.5, ans=0.2
2024-06-20 16:18:50,717 INFO [train.py:1028] (0/2) Epoch 13, batch 7450, loss[loss=0.2436, simple_loss=0.288, pruned_loss=0.09958, over 12537.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.2874, pruned_loss=0.09417, over 2578435.58 frames. ], batch size: 29, lr: 4.46e-03, grad_scale: 64.0
2024-06-20 16:18:50,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=236237.83333333334, ans=0.125
2024-06-20 16:18:57,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=236237.83333333334, ans=0.125
2024-06-20 16:19:09,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=236274.5, ans=0.125
2024-06-20 16:19:22,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=236292.83333333334, ans=0.125
2024-06-20 16:19:25,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=236292.83333333334, ans=0.0
2024-06-20 16:19:28,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=236292.83333333334, ans=0.0
2024-06-20 16:19:29,287 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0
2024-06-20 16:19:32,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=236311.16666666666, ans=0.0
2024-06-20 16:19:34,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=236311.16666666666, ans=0.0
2024-06-20 16:19:39,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=236329.5, ans=0.125
2024-06-20 16:19:39,924 INFO [train.py:1028] (0/2) Epoch 13, batch 7500, loss[loss=0.2426, simple_loss=0.2789, pruned_loss=0.1032, over 10720.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.2888, pruned_loss=0.0949, over 2576566.81 frames. ], batch size: 304, lr: 4.46e-03, grad_scale: 64.0
2024-06-20 16:19:52,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.03 vs. limit=22.5
2024-06-20 16:19:54,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=236347.83333333334, ans=0.125
2024-06-20 16:19:55,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236347.83333333334, ans=0.1
2024-06-20 16:20:00,006 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.951e+02 2.127e+02 2.339e+02 3.812e+02, threshold=4.253e+02, percent-clipped=0.0
2024-06-20 16:20:03,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=236366.16666666666, ans=0.0
2024-06-20 16:20:25,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=8.0
2024-06-20 16:20:27,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.60 vs. limit=15.0
2024-06-20 16:20:30,263 INFO [train.py:1028] (0/2) Epoch 13, batch 7550, loss[loss=0.2566, simple_loss=0.2964, pruned_loss=0.1083, over 12998.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.2886, pruned_loss=0.09519, over 2576149.68 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 64.0
2024-06-20 16:20:47,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=236439.5, ans=0.125
2024-06-20 16:20:54,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=236457.83333333334, ans=0.0
2024-06-20 16:21:00,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.87 vs. limit=15.0
2024-06-20 16:21:03,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=236457.83333333334, ans=0.2
2024-06-20 16:21:05,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=236476.16666666666, ans=0.0
2024-06-20 16:21:19,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=236494.5, ans=0.04949747468305833
2024-06-20 16:21:19,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=236494.5, ans=0.0
2024-06-20 16:21:26,151 INFO [train.py:1028] (0/2) Epoch 13, batch 7600, loss[loss=0.2575, simple_loss=0.3008, pruned_loss=0.1071, over 13212.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2891, pruned_loss=0.09529, over 2575952.73 frames. ], batch size: 83, lr: 4.46e-03, grad_scale: 64.0
2024-06-20 16:21:37,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=236531.16666666666, ans=0.125
2024-06-20 16:21:37,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=15.0
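Note on the optim.py:487 warnings: the five numbers are quantiles (min, 25%, median, 75%, max) of recent gradient norms, and the reported threshold is Clipping_scale times the median, e.g. 2.0 * 2.127e+02 ≈ 4.253e+02 in the warning above; percent-clipped=0.0 means no recent step exceeded it. A sketch of median-based clipping along these lines (the real ScaledAdam logic differs in detail, so treat this as an assumption):

    from collections import deque

    import torch

    recent_norms = deque(maxlen=128)   # rolling window of recent grad norms

    def clip_threshold(grad_norm: float, clipping_scale: float = 2.0) -> float:
        recent_norms.append(grad_norm)
        median = torch.tensor(list(recent_norms)).median().item()
        return clipping_scale * median  # norms above this would be rescaled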
2024-06-20 16:21:42,411 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 1.955e+02 2.099e+02 2.375e+02 4.096e+02, threshold=4.197e+02, percent-clipped=0.0
2024-06-20 16:21:49,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236549.5, ans=0.1
2024-06-20 16:21:53,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=236549.5, ans=0.125
2024-06-20 16:21:59,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236567.83333333334, ans=0.1
2024-06-20 16:22:15,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=236604.5, ans=0.125
2024-06-20 16:22:16,515 INFO [train.py:1028] (0/2) Epoch 13, batch 7650, loss[loss=0.1998, simple_loss=0.2511, pruned_loss=0.07423, over 12942.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2896, pruned_loss=0.09556, over 2573921.32 frames. ], batch size: 33, lr: 4.46e-03, grad_scale: 64.0
2024-06-20 16:22:24,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=236604.5, ans=0.125
2024-06-20 16:22:26,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=236622.83333333334, ans=0.025
2024-06-20 16:22:29,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236622.83333333334, ans=0.1
2024-06-20 16:22:48,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=236641.16666666666, ans=0.1
2024-06-20 16:23:01,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.98 vs. limit=22.5
2024-06-20 16:23:08,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=236677.83333333334, ans=10.0
2024-06-20 16:23:12,121 INFO [train.py:1028] (0/2) Epoch 13, batch 7700, loss[loss=0.2455, simple_loss=0.3089, pruned_loss=0.09107, over 13264.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2902, pruned_loss=0.09605, over 2570647.88 frames. ], batch size: 63, lr: 4.46e-03, grad_scale: 64.0
2024-06-20 16:23:12,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=236696.16666666666, ans=0.09899494936611666
2024-06-20 16:23:20,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236714.5, ans=0.1
2024-06-20 16:23:23,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=236714.5, ans=0.125
2024-06-20 16:23:25,132 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.680e+02 1.955e+02 2.120e+02 2.386e+02 3.256e+02, threshold=4.240e+02, percent-clipped=0.0
2024-06-20 16:23:35,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=16.85 vs. limit=15.0
2024-06-20 16:23:47,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=236751.16666666666, ans=0.125
2024-06-20 16:23:51,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0
2024-06-20 16:24:01,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=236769.5, ans=0.125
2024-06-20 16:24:03,290 INFO [train.py:1028] (0/2) Epoch 13, batch 7750, loss[loss=0.2267, simple_loss=0.2808, pruned_loss=0.08635, over 13268.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.2906, pruned_loss=0.09643, over 2574937.03 frames. ], batch size: 72, lr: 4.46e-03, grad_scale: 128.0
2024-06-20 16:24:20,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=236806.16666666666, ans=15.0
2024-06-20 16:24:21,233 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=15.0
2024-06-20 16:24:52,573 INFO [train.py:1028] (0/2) Epoch 13, batch 7800, loss[loss=0.2456, simple_loss=0.2885, pruned_loss=0.1013, over 13113.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.291, pruned_loss=0.09666, over 2579077.32 frames. ], batch size: 95, lr: 4.46e-03, grad_scale: 128.0
2024-06-20 16:25:00,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236897.83333333334, ans=0.1
2024-06-20 16:25:04,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.916e+02 2.044e+02 2.209e+02 2.950e+02, threshold=4.088e+02, percent-clipped=0.0
2024-06-20 16:25:10,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=236916.16666666666, ans=0.125
2024-06-20 16:25:13,581 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0
2024-06-20 16:25:15,156 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.742e+00
2024-06-20 16:25:35,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.88 vs. limit=15.0
2024-06-20 16:25:36,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=236952.83333333334, ans=0.125
2024-06-20 16:25:44,081 INFO [train.py:1028] (0/2) Epoch 13, batch 7850, loss[loss=0.2201, simple_loss=0.268, pruned_loss=0.08613, over 10889.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.2921, pruned_loss=0.0973, over 2572008.80 frames. ], batch size: 16, lr: 4.46e-03, grad_scale: 128.0
2024-06-20 16:25:57,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=236989.5, ans=0.025
2024-06-20 16:26:36,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=237044.5, ans=0.035
2024-06-20 16:26:38,443 INFO [train.py:1028] (0/2) Epoch 13, batch 7900, loss[loss=0.2496, simple_loss=0.2986, pruned_loss=0.1003, over 13156.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2923, pruned_loss=0.0974, over 2571930.14 frames. ], batch size: 77, lr: 4.46e-03, grad_scale: 128.0
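Note on the grad_scale field: it rises from 64.0 to 128.0 between batches 7700 and 7750 and later falls back to 64.0, the signature of a dynamic fp16 loss scaler that doubles the scale after a run of overflow-free steps and halves it when gradients overflow. A sketch using PyTorch's standard scaler (the constructor arguments are illustrative defaults, not values read from this log):

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=64.0, growth_factor=2.0,
        backoff_factor=0.5, growth_interval=2000,
    )
    # Typical step: scaler.scale(loss).backward()
    #               scaler.step(optimizer)
    #               scaler.update()   # grows or backs off the scale
    # scaler.get_scale() is the number logged here as grad_scale.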
2024-06-20 16:26:39,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=237062.83333333334, ans=0.125
2024-06-20 16:26:50,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 2.034e+02 2.314e+02 2.620e+02 3.741e+02, threshold=4.628e+02, percent-clipped=0.0
2024-06-20 16:26:51,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5
2024-06-20 16:26:53,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=237099.5, ans=0.0
2024-06-20 16:26:54,244 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:27:14,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=237136.16666666666, ans=0.125
2024-06-20 16:27:21,147 INFO [train.py:1028] (0/2) Epoch 13, batch 7950, loss[loss=0.254, simple_loss=0.2982, pruned_loss=0.1049, over 10684.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.2928, pruned_loss=0.09711, over 2576126.35 frames. ], batch size: 303, lr: 4.46e-03, grad_scale: 128.0
2024-06-20 16:27:25,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0
2024-06-20 16:27:40,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237172.83333333334, ans=0.1
2024-06-20 16:27:49,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.91 vs. limit=10.0
2024-06-20 16:27:58,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=237209.5, ans=0.025
2024-06-20 16:28:03,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=237227.83333333334, ans=0.125
2024-06-20 16:28:07,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=237227.83333333334, ans=0.125
2024-06-20 16:28:07,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=237227.83333333334, ans=0.07
2024-06-20 16:28:10,412 INFO [train.py:1028] (0/2) Epoch 13, batch 8000, loss[loss=0.2071, simple_loss=0.2656, pruned_loss=0.07425, over 12707.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2934, pruned_loss=0.09737, over 2572797.23 frames. ], batch size: 29, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:28:25,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=237246.16666666666, ans=0.125
2024-06-20 16:28:31,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=237264.5, ans=0.0
2024-06-20 16:28:33,153 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 1.930e+02 2.131e+02 2.408e+02 3.572e+02, threshold=4.262e+02, percent-clipped=0.0
2024-06-20 16:28:38,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=237282.83333333334, ans=0.2
2024-06-20 16:28:40,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=237282.83333333334, ans=0.0
2024-06-20 16:29:00,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237319.5, ans=0.1
2024-06-20 16:29:05,822 INFO [train.py:1028] (0/2) Epoch 13, batch 8050, loss[loss=0.2648, simple_loss=0.3093, pruned_loss=0.1102, over 13188.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.2934, pruned_loss=0.09723, over 2572468.27 frames. ], batch size: 83, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:29:14,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=6.0
2024-06-20 16:29:16,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=237337.83333333334, ans=0.0
2024-06-20 16:29:22,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=237356.16666666666, ans=0.125
2024-06-20 16:29:24,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=237356.16666666666, ans=0.125
2024-06-20 16:29:27,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=237374.5, ans=0.125
2024-06-20 16:29:32,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=237374.5, ans=0.07
2024-06-20 16:29:56,565 INFO [train.py:1028] (0/2) Epoch 13, batch 8100, loss[loss=0.2607, simple_loss=0.3101, pruned_loss=0.1057, over 13141.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.2939, pruned_loss=0.09725, over 2577436.39 frames. ], batch size: 112, lr: 4.45e-03, grad_scale: 128.0
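Note on the scaling.py:214 lines: each prints the current value (ans) of a named ScheduledFloat hyperparameter (dropout probabilities, skip rates, balancer and whitening limits) at the given fractional batch_count; these are schedules over training batches rather than fixed constants. A sketch of a piecewise-linear schedule of this kind, assuming it works like icefall's ScheduledFloat (the breakpoints below are hypothetical, not taken from the recipe):

    def scheduled_float(batch_count: float, points: list[tuple[float, float]]) -> float:
        # points: [(batch_count_0, value_0), (batch_count_1, value_1), ...]
        x0, y0 = points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in points[1:]:
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # clamp after the final breakpoint

    # e.g. a dropout annealed from 0.3 to 0.1 over the first 20k batches sits
    # at 0.1 by batch_count=236054.5, matching the ans=0.1 lines above.
    assert scheduled_float(236054.5, [(0.0, 0.3), (20000.0, 0.1)]) == 0.1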
2024-06-20 16:30:12,185 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.972e+02 2.080e+02 2.238e+02 3.114e+02, threshold=4.161e+02, percent-clipped=0.0
2024-06-20 16:30:13,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=237447.83333333334, ans=0.0
2024-06-20 16:30:16,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=237466.16666666666, ans=0.0
2024-06-20 16:30:18,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=237466.16666666666, ans=0.125
2024-06-20 16:30:31,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=237484.5, ans=15.0
2024-06-20 16:30:33,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=237484.5, ans=0.025
2024-06-20 16:30:40,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237502.83333333334, ans=0.1
2024-06-20 16:30:43,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=237502.83333333334, ans=0.0
2024-06-20 16:30:45,539 INFO [train.py:1028] (0/2) Epoch 13, batch 8150, loss[loss=0.219, simple_loss=0.2632, pruned_loss=0.08738, over 13060.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.2942, pruned_loss=0.09704, over 2580746.63 frames. ], batch size: 121, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:30:51,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=237521.16666666666, ans=0.0
2024-06-20 16:31:09,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=237557.83333333334, ans=0.125
2024-06-20 16:31:25,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.05 vs. limit=22.5
2024-06-20 16:31:29,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=237576.16666666666, ans=0.04949747468305833
2024-06-20 16:31:30,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5
2024-06-20 16:31:30,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0
2024-06-20 16:31:41,273 INFO [train.py:1028] (0/2) Epoch 13, batch 8200, loss[loss=0.2537, simple_loss=0.3057, pruned_loss=0.1009, over 13180.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.2942, pruned_loss=0.09698, over 2584138.49 frames. ], batch size: 112, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:31:45,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.58 vs. limit=15.0
2024-06-20 16:31:48,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237612.83333333334, ans=0.125
2024-06-20 16:31:50,147 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=12.0
2024-06-20 16:31:50,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=237631.16666666666, ans=0.0
2024-06-20 16:31:58,369 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 1.964e+02 2.094e+02 2.266e+02 2.704e+02, threshold=4.187e+02, percent-clipped=0.0
2024-06-20 16:32:19,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=237667.83333333334, ans=0.0
2024-06-20 16:32:29,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.15 vs. limit=15.0
2024-06-20 16:32:36,359 INFO [train.py:1028] (0/2) Epoch 13, batch 8250, loss[loss=0.2356, simple_loss=0.2871, pruned_loss=0.09204, over 13341.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.2946, pruned_loss=0.09721, over 2583869.97 frames. ], batch size: 52, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:32:47,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=237722.83333333334, ans=0.025
2024-06-20 16:33:03,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=237759.5, ans=0.125
2024-06-20 16:33:16,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237777.83333333334, ans=0.125
2024-06-20 16:33:21,268 INFO [train.py:1028] (0/2) Epoch 13, batch 8300, loss[loss=0.2502, simple_loss=0.3022, pruned_loss=0.09911, over 13033.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.2942, pruned_loss=0.09685, over 2581964.61 frames. ], batch size: 102, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:33:24,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=237796.16666666666, ans=0.07
2024-06-20 16:33:29,067 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:33:37,037 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 1.936e+02 2.073e+02 2.269e+02 3.138e+02, threshold=4.147e+02, percent-clipped=0.0
2024-06-20 16:33:50,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=237851.16666666666, ans=0.0
2024-06-20 16:34:17,259 INFO [train.py:1028] (0/2) Epoch 13, batch 8350, loss[loss=0.2392, simple_loss=0.2928, pruned_loss=0.09275, over 13205.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.294, pruned_loss=0.09674, over 2581542.55 frames. ], batch size: 112, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:34:29,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=15.0
2024-06-20 16:34:34,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=237924.5, ans=0.125
2024-06-20 16:34:49,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=237942.83333333334, ans=0.125
2024-06-20 16:34:51,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.97 vs. limit=15.0
2024-06-20 16:34:56,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=237961.16666666666, ans=0.0
2024-06-20 16:35:13,608 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.62 vs. limit=10.0
2024-06-20 16:35:14,047 INFO [train.py:1028] (0/2) Epoch 13, batch 8400, loss[loss=0.2418, simple_loss=0.2951, pruned_loss=0.0943, over 12971.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.2946, pruned_loss=0.09718, over 2577929.55 frames. ], batch size: 39, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:35:30,452 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 1.993e+02 2.147e+02 2.339e+02 3.017e+02, threshold=4.294e+02, percent-clipped=0.0
2024-06-20 16:35:40,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=238016.16666666666, ans=0.2
2024-06-20 16:35:41,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=238016.16666666666, ans=0.04949747468305833
2024-06-20 16:35:56,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238052.83333333334, ans=0.1
2024-06-20 16:36:00,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=238052.83333333334, ans=0.0
2024-06-20 16:36:02,872 INFO [train.py:1028] (0/2) Epoch 13, batch 8450, loss[loss=0.2681, simple_loss=0.3186, pruned_loss=0.1089, over 13185.00 frames. ], tot_loss[loss=0.245, simple_loss=0.2954, pruned_loss=0.09726, over 2579500.96 frames. ], batch size: 112, lr: 4.45e-03, grad_scale: 128.0
2024-06-20 16:36:03,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=238071.16666666666, ans=0.0
2024-06-20 16:36:10,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=238089.5, ans=0.0
2024-06-20 16:36:15,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=238089.5, ans=0.2
2024-06-20 16:36:19,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=238107.83333333334, ans=0.025
2024-06-20 16:36:25,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238126.16666666666, ans=0.1
2024-06-20 16:36:39,461 INFO [train.py:1028] (0/2) Epoch 13, batch 8500, loss[loss=0.2304, simple_loss=0.282, pruned_loss=0.08937, over 12561.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.2967, pruned_loss=0.09809, over 2577666.57 frames. ], batch size: 29, lr: 4.45e-03, grad_scale: 128.0
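Note on the batch size field: it swings between 16 and 304 in this section because batches are packed to a fixed total duration of audio rather than a fixed number of utterances, so many short cuts or a few long ones fill one batch. A sketch of duration-based bucketing with lhotse (the cuts path is hypothetical; the sampler arguments mirror the values printed in the training configuration):

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler

    cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path
    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=550,  # seconds of audio per batch
        num_buckets=30,    # cuts of similar length share a bucket, less padding
        shuffle=True,
        drop_last=True,
    )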
2024-06-20 16:36:54,789 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.075e+02 2.273e+02 2.463e+02 3.415e+02, threshold=4.547e+02, percent-clipped=0.0
2024-06-20 16:36:59,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5
2024-06-20 16:37:18,516 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0
2024-06-20 16:37:20,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=238217.83333333334, ans=0.95
2024-06-20 16:37:23,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.47 vs. limit=15.0
2024-06-20 16:37:32,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=238236.16666666666, ans=0.2
2024-06-20 16:37:34,890 INFO [train.py:1028] (0/2) Epoch 13, batch 8550, loss[loss=0.2432, simple_loss=0.2962, pruned_loss=0.09515, over 12627.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.2963, pruned_loss=0.09797, over 2575596.46 frames. ], batch size: 22, lr: 4.45e-03, grad_scale: 64.0
2024-06-20 16:37:49,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=238272.83333333334, ans=0.125
2024-06-20 16:37:51,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0
2024-06-20 16:38:31,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=238346.16666666666, ans=0.1
2024-06-20 16:38:31,776 INFO [train.py:1028] (0/2) Epoch 13, batch 8600, loss[loss=0.2546, simple_loss=0.2969, pruned_loss=0.1062, over 13145.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.2964, pruned_loss=0.09792, over 2573124.34 frames. ], batch size: 112, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:38:34,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=238346.16666666666, ans=0.0
2024-06-20 16:38:35,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=238346.16666666666, ans=0.0
2024-06-20 16:38:44,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=238364.5, ans=0.95
2024-06-20 16:38:47,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0
2024-06-20 16:38:47,937 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.022e+02 2.224e+02 2.393e+02 4.177e+02, threshold=4.447e+02, percent-clipped=0.0
2024-06-20 16:38:53,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=238382.83333333334, ans=0.125
2024-06-20 16:38:59,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=238401.16666666666, ans=0.0
2024-06-20 16:39:10,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=238419.5, ans=0.125
2024-06-20 16:39:14,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=238419.5, ans=0.125
2024-06-20 16:39:20,267 INFO [train.py:1028] (0/2) Epoch 13, batch 8650, loss[loss=0.2185, simple_loss=0.269, pruned_loss=0.08396, over 13060.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.2966, pruned_loss=0.09785, over 2576438.46 frames. ], batch size: 102, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:39:22,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=238437.83333333334, ans=0.0
2024-06-20 16:39:38,702 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:40:09,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=238511.16666666666, ans=0.0
2024-06-20 16:40:14,398 INFO [train.py:1028] (0/2) Epoch 13, batch 8700, loss[loss=0.2561, simple_loss=0.3137, pruned_loss=0.09924, over 13189.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.297, pruned_loss=0.09818, over 2573267.28 frames. ], batch size: 59, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:40:21,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=238529.5, ans=0.2
2024-06-20 16:40:24,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=238547.83333333334, ans=0.0
2024-06-20 16:40:30,547 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 1.984e+02 2.125e+02 2.357e+02 3.828e+02, threshold=4.250e+02, percent-clipped=0.0
2024-06-20 16:40:35,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=238566.16666666666, ans=0.0
2024-06-20 16:40:53,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.43 vs. limit=10.0
2024-06-20 16:41:06,643 INFO [train.py:1028] (0/2) Epoch 13, batch 8750, loss[loss=0.2451, simple_loss=0.2997, pruned_loss=0.09523, over 13132.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.2973, pruned_loss=0.09847, over 2569940.27 frames. ], batch size: 121, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:41:25,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.45 vs. limit=22.5
2024-06-20 16:41:44,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=238694.5, ans=0.025
2024-06-20 16:41:51,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=238694.5, ans=0.0
2024-06-20 16:41:53,589 INFO [train.py:1028] (0/2) Epoch 13, batch 8800, loss[loss=0.24, simple_loss=0.2906, pruned_loss=0.09468, over 13198.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2978, pruned_loss=0.09887, over 2574780.35 frames. ], batch size: 72, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:42:03,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=238731.16666666666, ans=0.2
2024-06-20 16:42:04,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=15.0
2024-06-20 16:42:10,179 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 1.958e+02 2.091e+02 2.300e+02 3.040e+02, threshold=4.181e+02, percent-clipped=0.0
2024-06-20 16:42:24,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=238767.83333333334, ans=0.05
2024-06-20 16:42:31,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=238786.16666666666, ans=0.125
2024-06-20 16:42:46,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=238786.16666666666, ans=0.2
2024-06-20 16:42:49,483 INFO [train.py:1028] (0/2) Epoch 13, batch 8850, loss[loss=0.2791, simple_loss=0.3262, pruned_loss=0.116, over 12477.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2978, pruned_loss=0.09889, over 2563124.48 frames. ], batch size: 203, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:42:52,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=238804.5, ans=0.125
2024-06-20 16:43:13,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.32 vs. limit=15.0
2024-06-20 16:43:13,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.86 vs. limit=22.5
2024-06-20 16:43:18,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=238859.5, ans=0.125
2024-06-20 16:43:24,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=238859.5, ans=0.95
2024-06-20 16:43:35,926 INFO [train.py:1028] (0/2) Epoch 13, batch 8900, loss[loss=0.2507, simple_loss=0.298, pruned_loss=0.1017, over 13003.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.2985, pruned_loss=0.09936, over 2561387.42 frames. ], batch size: 33, lr: 4.44e-03, grad_scale: 64.0
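Note on tot_loss: while the per-batch loss jumps around (from 0.1998 at batch 7650 to 0.2791 at batch 8850), tot_loss drifts slowly, because it is accumulated over millions of recent frames, as the "over 2561387.42 frames" annotation shows. A sketch of a frame-weighted running average with exponential forgetting, assuming tot_loss is maintained along these lines (the decay constant is illustrative; the exact bookkeeping in train.py may differ):

    class RunningLoss:
        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frame_sum = 0.0

        def update(self, batch_loss: float, num_frames: float) -> float:
            # Weight each batch by its frame count; forget old batches slowly.
            self.loss_sum = self.decay * self.loss_sum + batch_loss * num_frames
            self.frame_sum = self.decay * self.frame_sum + num_frames
            return self.loss_sum / self.frame_sum  # the logged tot_loss value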
2024-06-20 16:43:57,641 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.031e+02 2.197e+02 2.359e+02 2.909e+02, threshold=4.394e+02, percent-clipped=0.0
2024-06-20 16:43:59,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=238932.83333333334, ans=0.125
2024-06-20 16:44:17,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=238951.16666666666, ans=0.0
2024-06-20 16:44:20,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=238969.5, ans=0.0
2024-06-20 16:44:30,392 INFO [train.py:1028] (0/2) Epoch 13, batch 8950, loss[loss=0.2604, simple_loss=0.3011, pruned_loss=0.1098, over 12434.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.2984, pruned_loss=0.09923, over 2561053.42 frames. ], batch size: 202, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:44:38,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=239006.16666666666, ans=0.1
2024-06-20 16:44:39,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.77 vs. limit=15.0
2024-06-20 16:44:54,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=239042.83333333334, ans=0.125
2024-06-20 16:45:02,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=239042.83333333334, ans=15.0
2024-06-20 16:45:13,370 INFO [train.py:1028] (0/2) Epoch 13, batch 9000, loss[loss=0.2389, simple_loss=0.2955, pruned_loss=0.09114, over 13314.00 frames. ], tot_loss[loss=0.248, simple_loss=0.2986, pruned_loss=0.09873, over 2567437.71 frames. ], batch size: 46, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:45:13,372 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 16:45:22,992 INFO [train.py:1060] (0/2) Epoch 13, validation: loss=0.1913, simple_loss=0.2561, pruned_loss=0.06321, over 351949.00 frames.
2024-06-20 16:45:22,993 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-20 16:45:27,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.08 vs. limit=22.5
2024-06-20 16:45:33,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.44 vs. limit=10.0
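Note on the validation block above: at batch 9000 the loop pauses training, computes a loss over the dev set (0.1913, well below the running training loss of about 0.248), and reports peak GPU memory. The memory figure comes from PyTorch's CUDA allocator counters; a minimal sketch of both pieces (the function name is illustrative, not from train.py):

    import torch

    def log_peak_memory(device: torch.device) -> str:
        peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return f"Maximum memory allocated so far is {peak_mb}MB"

    # Validation at a batch boundary: eval mode, no gradients, then resume.
    # model.eval()
    # with torch.no_grad():
    #     ...accumulate loss over the dev dataloader...
    # model.train()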
2024-06-20 16:45:36,952 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.002e+02 2.128e+02 2.271e+02 3.032e+02, threshold=4.256e+02, percent-clipped=0.0
2024-06-20 16:45:40,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=239116.16666666666, ans=0.125
2024-06-20 16:45:58,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239134.5, ans=0.1
2024-06-20 16:46:13,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=239171.16666666666, ans=0.09899494936611666
2024-06-20 16:46:14,359 INFO [train.py:1028] (0/2) Epoch 13, batch 9050, loss[loss=0.1969, simple_loss=0.2531, pruned_loss=0.07031, over 11211.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2994, pruned_loss=0.09918, over 2566218.52 frames. ], batch size: 16, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:46:15,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=239171.16666666666, ans=10.0
2024-06-20 16:46:19,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=239171.16666666666, ans=0.125
2024-06-20 16:46:38,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=239207.83333333334, ans=0.125
2024-06-20 16:46:58,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=239244.5, ans=0.125
2024-06-20 16:47:02,824 INFO [train.py:1028] (0/2) Epoch 13, batch 9100, loss[loss=0.2579, simple_loss=0.3107, pruned_loss=0.1026, over 13229.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.2997, pruned_loss=0.09907, over 2565076.54 frames. ], batch size: 72, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:47:08,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=239262.83333333334, ans=0.125
2024-06-20 16:47:13,025 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=22.5
2024-06-20 16:47:14,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=239281.16666666666, ans=0.2
2024-06-20 16:47:18,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.996e+02 2.179e+02 2.407e+02 3.355e+02, threshold=4.358e+02, percent-clipped=0.0
2024-06-20 16:47:25,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=239299.5, ans=0.2
2024-06-20 16:47:26,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=239299.5, ans=0.0
2024-06-20 16:47:30,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.01 vs. limit=15.0
2024-06-20 16:47:34,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=239317.83333333334, ans=0.025
2024-06-20 16:47:35,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=239317.83333333334, ans=0.025
2024-06-20 16:47:39,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0
2024-06-20 16:47:44,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=239336.16666666666, ans=0.125
2024-06-20 16:47:53,692 INFO [train.py:1028] (0/2) Epoch 13, batch 9150, loss[loss=0.2267, simple_loss=0.2898, pruned_loss=0.08182, over 13176.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2994, pruned_loss=0.09897, over 2566944.76 frames. ], batch size: 77, lr: 4.44e-03, grad_scale: 64.0
2024-06-20 16:47:59,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.09 vs. limit=15.0
2024-06-20 16:48:00,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=239372.83333333334, ans=0.125
2024-06-20 16:48:04,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=239372.83333333334, ans=0.0
2024-06-20 16:48:06,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239372.83333333334, ans=0.125
2024-06-20 16:48:09,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=239372.83333333334, ans=0.125
2024-06-20 16:48:18,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=239391.16666666666, ans=0.125
2024-06-20 16:48:23,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.14 vs. limit=10.0
2024-06-20 16:48:28,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=239427.83333333334, ans=0.125
2024-06-20 16:48:35,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239427.83333333334, ans=0.125
2024-06-20 16:48:39,031 INFO [train.py:1028] (0/2) Epoch 13, batch 9200, loss[loss=0.2417, simple_loss=0.2999, pruned_loss=0.09175, over 12858.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.2997, pruned_loss=0.09891, over 2570890.46 frames. ], batch size: 36, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:48:51,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=239464.5, ans=0.125
2024-06-20 16:48:55,094 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 1.965e+02 2.099e+02 2.286e+02 3.165e+02, threshold=4.198e+02, percent-clipped=0.0
2024-06-20 16:48:55,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=239464.5, ans=0.2
2024-06-20 16:49:00,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=239482.83333333334, ans=0.125
2024-06-20 16:49:09,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=239501.16666666666, ans=0.0
2024-06-20 16:49:19,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=239519.5, ans=0.1
2024-06-20 16:49:21,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=239519.5, ans=0.09899494936611666
2024-06-20 16:49:25,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=12.0
2024-06-20 16:49:25,598 INFO [train.py:1028] (0/2) Epoch 13, batch 9250, loss[loss=0.2514, simple_loss=0.303, pruned_loss=0.09992, over 13205.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.2987, pruned_loss=0.09824, over 2571305.95 frames. ], batch size: 67, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:49:30,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.69 vs. limit=15.0
2024-06-20 16:49:38,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. limit=15.0
2024-06-20 16:49:39,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=239556.16666666666, ans=0.125
2024-06-20 16:49:47,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.49 vs. limit=6.0
2024-06-20 16:49:48,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239574.5, ans=0.1
2024-06-20 16:50:00,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=239611.16666666666, ans=0.0
2024-06-20 16:50:03,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=239611.16666666666, ans=0.125
2024-06-20 16:50:11,113 INFO [train.py:1028] (0/2) Epoch 13, batch 9300, loss[loss=0.239, simple_loss=0.2943, pruned_loss=0.09188, over 12884.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2986, pruned_loss=0.09808, over 2569346.07 frames. ], batch size: 39, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:50:20,852 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:50:23,233 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.264e+00
2024-06-20 16:50:25,897 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.005e+02 2.145e+02 2.313e+02 3.312e+02, threshold=4.290e+02, percent-clipped=0.0
2024-06-20 16:50:30,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=15.0
2024-06-20 16:50:42,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=239684.5, ans=0.05
2024-06-20 16:50:53,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=239702.83333333334, ans=0.0
2024-06-20 16:50:55,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.17 vs. limit=22.5
2024-06-20 16:50:55,546 INFO [train.py:1028] (0/2) Epoch 13, batch 9350, loss[loss=0.2624, simple_loss=0.3206, pruned_loss=0.1021, over 12376.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2987, pruned_loss=0.09802, over 2566195.43 frames. ], batch size: 22, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:51:02,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.50 vs. limit=10.0
2024-06-20 16:51:09,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0
2024-06-20 16:51:15,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=239757.83333333334, ans=0.125
2024-06-20 16:51:41,281 INFO [train.py:1028] (0/2) Epoch 13, batch 9400, loss[loss=0.2446, simple_loss=0.3055, pruned_loss=0.09186, over 13235.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2989, pruned_loss=0.09838, over 2566752.99 frames. ], batch size: 52, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:51:47,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239812.83333333334, ans=0.1
2024-06-20 16:51:56,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.678e+02 1.992e+02 2.100e+02 2.328e+02 3.409e+02, threshold=4.200e+02, percent-clipped=0.0
2024-06-20 16:52:02,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=239849.5, ans=0.2
2024-06-20 16:52:12,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=239867.83333333334, ans=0.125
2024-06-20 16:52:24,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=239904.5, ans=0.125
2024-06-20 16:52:25,325 INFO [train.py:1028] (0/2) Epoch 13, batch 9450, loss[loss=0.287, simple_loss=0.3458, pruned_loss=0.1141, over 12749.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3004, pruned_loss=0.09941, over 2567947.74 frames. ], batch size: 22, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:52:47,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=239941.16666666666, ans=0.125
2024-06-20 16:53:05,033 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 16:53:11,010 INFO [train.py:1028] (0/2) Epoch 13, batch 9500, loss[loss=0.251, simple_loss=0.3026, pruned_loss=0.09973, over 13253.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.2996, pruned_loss=0.09866, over 2577666.82 frames. ], batch size: 43, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:53:12,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.07 vs. limit=10.0
2024-06-20 16:53:22,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=240014.5, ans=0.125
2024-06-20 16:53:23,496 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.009e+02 2.180e+02 2.373e+02 3.150e+02, threshold=4.359e+02, percent-clipped=0.0
2024-06-20 16:53:25,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=240032.83333333334, ans=0.2
2024-06-20 16:53:28,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240032.83333333334, ans=0.125
2024-06-20 16:53:33,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=240051.16666666666, ans=0.2
2024-06-20 16:53:34,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=240051.16666666666, ans=0.05
2024-06-20 16:53:37,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=240051.16666666666, ans=0.0
2024-06-20 16:53:37,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240051.16666666666, ans=0.125
2024-06-20 16:53:42,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=240069.5, ans=0.125
2024-06-20 16:53:48,280 INFO [train.py:1028] (0/2) Epoch 13, batch 9550, loss[loss=0.222, simple_loss=0.2731, pruned_loss=0.08543, over 12926.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2985, pruned_loss=0.09817, over 2573084.26 frames. ], batch size: 39, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:53:51,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=240087.83333333334, ans=22.5
2024-06-20 16:54:00,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=15.0
2024-06-20 16:54:11,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=240124.5, ans=0.0
2024-06-20 16:54:22,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5
2024-06-20 16:54:31,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=240161.16666666666, ans=0.125
2024-06-20 16:54:32,956 INFO [train.py:1028] (0/2) Epoch 13, batch 9600, loss[loss=0.2758, simple_loss=0.3112, pruned_loss=0.1202, over 10499.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.2985, pruned_loss=0.09849, over 2571040.89 frames. ], batch size: 304, lr: 4.43e-03, grad_scale: 64.0
2024-06-20 16:54:36,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0
2024-06-20 16:54:41,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=240179.5, ans=0.2
2024-06-20 16:54:43,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=240197.83333333334, ans=0.125
2024-06-20 16:54:44,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=240197.83333333334, ans=0.1
2024-06-20 16:54:49,040 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.969e+02 2.119e+02 2.334e+02 3.153e+02, threshold=4.237e+02, percent-clipped=0.0
2024-06-20 16:54:50,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=240216.16666666666, ans=0.2
2024-06-20 16:54:51,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=240216.16666666666, ans=0.0
2024-06-20 16:54:52,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=240216.16666666666, ans=0.0
2024-06-20 16:54:55,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240216.16666666666, ans=0.1
2024-06-20 16:54:56,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=240216.16666666666, ans=0.04949747468305833
2024-06-20 16:55:08,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=240234.5, ans=0.125
2024-06-20 16:55:11,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=240234.5, ans=0.0
2024-06-20 16:55:11,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=240234.5, ans=0.125
2024-06-20 16:55:13,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=240234.5, ans=0.1
2024-06-20 16:55:22,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=240271.16666666666, ans=0.2
2024-06-20 16:55:23,467 INFO [train.py:1028] (0/2) Epoch 13, batch 9650, loss[loss=0.207, simple_loss=0.2564, pruned_loss=0.07882, over 13089.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2986, pruned_loss=0.09895, over 2562849.46 frames. ], batch size: 132, lr: 4.43e-03, grad_scale: 64.0
], batch size: 132, lr: 4.43e-03, grad_scale: 64.0 2024-06-20 16:55:24,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240271.16666666666, ans=0.1 2024-06-20 16:55:39,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=240307.83333333334, ans=0.2 2024-06-20 16:55:43,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=240307.83333333334, ans=0.125 2024-06-20 16:55:48,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=240326.16666666666, ans=0.125 2024-06-20 16:55:52,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=240326.16666666666, ans=0.125 2024-06-20 16:55:52,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=240326.16666666666, ans=0.125 2024-06-20 16:56:06,721 INFO [train.py:1028] (0/2) Epoch 13, batch 9700, loss[loss=0.2862, simple_loss=0.3288, pruned_loss=0.1218, over 13031.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.2975, pruned_loss=0.09862, over 2556763.52 frames. ], batch size: 144, lr: 4.43e-03, grad_scale: 64.0 2024-06-20 16:56:06,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=240362.83333333334, ans=0.5 2024-06-20 16:56:25,112 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 1.994e+02 2.192e+02 2.512e+02 4.078e+02, threshold=4.384e+02, percent-clipped=0.0 2024-06-20 16:56:25,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=240381.16666666666, ans=0.1 2024-06-20 16:56:35,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240417.83333333334, ans=0.125 2024-06-20 16:56:35,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=240417.83333333334, ans=0.0 2024-06-20 16:56:42,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=240417.83333333334, ans=0.025 2024-06-20 16:56:44,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.30 vs. limit=22.5 2024-06-20 16:56:53,777 INFO [train.py:1028] (0/2) Epoch 13, batch 9750, loss[loss=0.2249, simple_loss=0.2719, pruned_loss=0.08891, over 13079.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.2961, pruned_loss=0.09758, over 2552469.49 frames. ], batch size: 132, lr: 4.43e-03, grad_scale: 64.0 2024-06-20 16:56:54,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=240454.5, ans=0.125 2024-06-20 16:56:56,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.07 vs. 
limit=15.0 2024-06-20 16:57:06,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=240472.83333333334, ans=0.035 2024-06-20 16:57:09,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.02 vs. limit=10.0 2024-06-20 16:57:10,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=240491.16666666666, ans=0.05 2024-06-20 16:57:16,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=240491.16666666666, ans=0.1 2024-06-20 16:57:17,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=240509.5, ans=10.0 2024-06-20 16:57:19,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=15.0 2024-06-20 16:57:38,459 INFO [train.py:1028] (0/2) Epoch 13, batch 9800, loss[loss=0.2397, simple_loss=0.2894, pruned_loss=0.09503, over 12984.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.2955, pruned_loss=0.09713, over 2545683.13 frames. ], batch size: 39, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 16:57:38,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=240546.16666666666, ans=0.0 2024-06-20 16:57:46,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=240564.5, ans=0.0 2024-06-20 16:57:46,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=240564.5, ans=0.2 2024-06-20 16:57:52,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.957e+02 2.166e+02 2.353e+02 3.246e+02, threshold=4.333e+02, percent-clipped=0.0 2024-06-20 16:57:54,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=240582.83333333334, ans=0.125 2024-06-20 16:58:08,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=240601.16666666666, ans=0.0 2024-06-20 16:58:12,827 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 16:58:20,433 INFO [train.py:1028] (0/2) Epoch 13, batch 9850, loss[loss=0.2344, simple_loss=0.2815, pruned_loss=0.09366, over 13019.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2951, pruned_loss=0.0969, over 2537023.66 frames. ], batch size: 102, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 16:58:38,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=240656.16666666666, ans=0.1 2024-06-20 16:58:52,998 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.61 vs. limit=22.5 2024-06-20 16:59:02,789 INFO [train.py:1028] (0/2) Epoch 13, batch 9900, loss[loss=0.2435, simple_loss=0.2968, pruned_loss=0.09508, over 12851.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.2944, pruned_loss=0.09706, over 2529242.32 frames. 
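The Whitening entries print a per-module metric against a limit; the metric reads as a measure of how far the module's output covariance is from a scaled identity (fully "white" features score near 1.0, strongly correlated ones score much higher). The sketch below is one way to compute such a metric; the exact grouping and the penalty applied when the limit is exceeded are assumptions here:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Ratio of the covariance's mean squared entry to its squared mean
    diagonal: >= 1, and == 1 only when the covariance is a multiple of I."""
    n, c = x.shape
    cpg = c // num_groups  # channels per group
    xg = x.reshape(n, num_groups, cpg).transpose(0, 1)   # (groups, n, cpg)
    covar = torch.matmul(xg.transpose(1, 2), xg) / n     # (groups, cpg, cpg)
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    mean_sq = (covar ** 2).sum() / (num_groups * cpg)
    return mean_sq / (mean_diag ** 2 + 1e-20)

feats = torch.randn(20000, 384)        # white noise: metric close to 1.0
print(whitening_metric(feats).item())  # correlated features score far higher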
], batch size: 39, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 16:59:03,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0 2024-06-20 16:59:04,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240729.5, ans=0.1 2024-06-20 16:59:17,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.017e+02 2.227e+02 2.496e+02 3.038e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-20 16:59:19,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240747.83333333334, ans=0.1 2024-06-20 16:59:22,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240766.16666666666, ans=0.125 2024-06-20 16:59:23,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=240766.16666666666, ans=0.2 2024-06-20 16:59:47,317 INFO [train.py:1028] (0/2) Epoch 13, batch 9950, loss[loss=0.2223, simple_loss=0.2725, pruned_loss=0.08605, over 12660.00 frames. ], tot_loss[loss=0.244, simple_loss=0.2935, pruned_loss=0.09724, over 2524157.93 frames. ], batch size: 29, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 17:00:02,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240839.5, ans=0.1 2024-06-20 17:00:25,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240894.5, ans=0.125 2024-06-20 17:00:32,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.33 vs. limit=22.5 2024-06-20 17:00:32,377 INFO [train.py:1028] (0/2) Epoch 13, batch 10000, loss[loss=0.241, simple_loss=0.297, pruned_loss=0.09249, over 12406.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2935, pruned_loss=0.09766, over 2484795.72 frames. ], batch size: 22, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 17:00:35,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=240912.83333333334, ans=0.125 2024-06-20 17:00:41,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=240931.16666666666, ans=0.125 2024-06-20 17:00:47,968 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.021e+02 2.165e+02 2.334e+02 2.919e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-20 17:00:53,223 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.95 vs. limit=22.5 2024-06-20 17:00:58,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240949.5, ans=0.1 2024-06-20 17:01:10,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.08 vs. limit=15.0 2024-06-20 17:01:15,891 INFO [train.py:1028] (0/2) Epoch 13, batch 10050, loss[loss=0.2348, simple_loss=0.2909, pruned_loss=0.08938, over 12609.00 frames. 
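The optim.py WARNING lines summarize recent gradient norms as five quantiles (min, 25%, median, 75%, max) plus the active clipping threshold. Note the threshold is consistently 2.0 x the printed median, matching Clipping_scale=2.0 (e.g. 2 x 2.227e+02 = 4.454e+02 just above), and percent-clipped reports how often the norm exceeded it. A simplified stand-in for that logic; the windowing details are assumptions:

from collections import deque
import torch

class MedianGradClipper:
    """Clip the global grad norm at clipping_scale x the running median."""
    def __init__(self, clipping_scale: float = 2.0, window: int = 1024):
        self.scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.clipped = 0

    def clip_(self, params) -> float:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        qs = torch.quantile(torch.tensor(list(self.norms)),
                            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * qs[2].item()  # 2 x median, as in the log
        if norm > threshold:
            self.clipped += 1
            for g in grads:
                g.mul_(threshold / norm)
        return threshold  # logged alongside the quantiles above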
], tot_loss[loss=0.2457, simple_loss=0.294, pruned_loss=0.09871, over 2442758.19 frames. ], batch size: 22, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 17:01:18,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=241004.5, ans=0.0 2024-06-20 17:01:19,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241004.5, ans=0.1 2024-06-20 17:01:20,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=241004.5, ans=0.0 2024-06-20 17:01:32,044 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.05 vs. limit=22.5 2024-06-20 17:01:37,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241041.16666666666, ans=0.1 2024-06-20 17:01:41,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=241059.5, ans=0.125 2024-06-20 17:01:56,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=241077.83333333334, ans=0.0 2024-06-20 17:01:58,419 INFO [train.py:1028] (0/2) Epoch 13, batch 10100, loss[loss=0.2575, simple_loss=0.3051, pruned_loss=0.1049, over 11354.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.2928, pruned_loss=0.09775, over 2424911.93 frames. ], batch size: 17, lr: 4.42e-03, grad_scale: 64.0 2024-06-20 17:02:16,522 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-13.pt 2024-06-20 17:05:20,038 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.960e+02 2.127e+02 2.308e+02 3.996e+02, threshold=4.254e+02, percent-clipped=0.0 2024-06-20 17:05:20,085 INFO [train.py:1028] (0/2) Epoch 14, batch 0, loss[loss=0.2353, simple_loss=0.2831, pruned_loss=0.09377, over 12957.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2831, pruned_loss=0.09377, over 12957.00 frames. ], batch size: 36, lr: 4.26e-03, grad_scale: 64.0 2024-06-20 17:05:20,090 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 17:05:30,722 INFO [train.py:1060] (0/2) Epoch 14, validation: loss=0.193, simple_loss=0.2578, pruned_loss=0.06414, over 351949.00 frames. 2024-06-20 17:05:30,723 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 17:06:15,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=241200.66666666666, ans=0.0 2024-06-20 17:06:27,100 INFO [train.py:1028] (0/2) Epoch 14, batch 50, loss[loss=0.2284, simple_loss=0.2855, pruned_loss=0.08568, over 12658.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2747, pruned_loss=0.09098, over 574693.72 frames. ], batch size: 29, lr: 4.26e-03, grad_scale: 64.0 2024-06-20 17:06:27,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.51 vs. 
limit=12.0 2024-06-20 17:06:30,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=241219.0, ans=0.125 2024-06-20 17:06:41,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241237.33333333334, ans=0.1 2024-06-20 17:06:42,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=241237.33333333334, ans=0.125 2024-06-20 17:06:50,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=241255.66666666666, ans=0.125 2024-06-20 17:06:53,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2024-06-20 17:07:07,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.50 vs. limit=22.5 2024-06-20 17:07:11,554 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 1.999e+02 2.159e+02 2.420e+02 3.006e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-20 17:07:11,590 INFO [train.py:1028] (0/2) Epoch 14, batch 100, loss[loss=0.2348, simple_loss=0.29, pruned_loss=0.08975, over 13243.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2742, pruned_loss=0.08986, over 1017721.50 frames. ], batch size: 46, lr: 4.26e-03, grad_scale: 64.0 2024-06-20 17:07:27,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=241329.0, ans=0.125 2024-06-20 17:07:33,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=15.0 2024-06-20 17:07:53,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241384.0, ans=0.1 2024-06-20 17:08:00,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=241384.0, ans=0.0 2024-06-20 17:08:02,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=241402.33333333334, ans=0.2 2024-06-20 17:08:02,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=241402.33333333334, ans=0.0 2024-06-20 17:08:02,946 INFO [train.py:1028] (0/2) Epoch 14, batch 150, loss[loss=0.1977, simple_loss=0.2609, pruned_loss=0.06723, over 12669.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2733, pruned_loss=0.08796, over 1364880.06 frames. ], batch size: 29, lr: 4.26e-03, grad_scale: 64.0 2024-06-20 17:08:14,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.49 vs. 
limit=15.0 2024-06-20 17:08:15,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=241420.66666666666, ans=0.125 2024-06-20 17:08:28,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=241439.0, ans=0.125 2024-06-20 17:08:32,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=241457.33333333334, ans=0.125 2024-06-20 17:08:33,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=241457.33333333334, ans=0.125 2024-06-20 17:08:36,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=241457.33333333334, ans=0.025 2024-06-20 17:08:38,495 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2024-06-20 17:08:49,765 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.867e+02 2.017e+02 2.217e+02 2.995e+02, threshold=4.033e+02, percent-clipped=0.0 2024-06-20 17:08:49,799 INFO [train.py:1028] (0/2) Epoch 14, batch 200, loss[loss=0.2349, simple_loss=0.2829, pruned_loss=0.09351, over 12510.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.274, pruned_loss=0.08841, over 1634760.08 frames. ], batch size: 202, lr: 4.25e-03, grad_scale: 64.0 2024-06-20 17:08:51,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.81 vs. limit=22.5 2024-06-20 17:08:54,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=241494.0, ans=0.125 2024-06-20 17:08:56,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=241494.0, ans=0.0 2024-06-20 17:09:28,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=241549.0, ans=0.0 2024-06-20 17:09:31,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=241567.33333333334, ans=0.0 2024-06-20 17:09:40,633 INFO [train.py:1028] (0/2) Epoch 14, batch 250, loss[loss=0.2249, simple_loss=0.2723, pruned_loss=0.08878, over 13049.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2746, pruned_loss=0.08849, over 1846094.53 frames. ], batch size: 144, lr: 4.25e-03, grad_scale: 64.0 2024-06-20 17:10:02,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=241622.33333333334, ans=0.125 2024-06-20 17:10:11,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=241640.66666666666, ans=0.0 2024-06-20 17:10:17,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=241659.0, ans=0.125 2024-06-20 17:10:31,512 INFO [train.py:1028] (0/2) Epoch 14, batch 300, loss[loss=0.2555, simple_loss=0.2982, pruned_loss=0.1064, over 13144.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2745, pruned_loss=0.08856, over 2009714.71 frames. 
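The "Computing validation loss" entries at batch 0 of epoch 14 above come from a pass over the dev loader with gradients disabled, with losses averaged per frame (hence "over 351949.00 frames"). A minimal sketch under those assumptions; model, dev_loader, and compute_loss are placeholders, not the script's actual signatures:

import torch

def validate(model, dev_loader, compute_loss, device) -> float:
    """Frame-weighted average loss over the dev set, no gradient tracking."""
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch, device)
            loss_sum += loss.item() * num_frames
            frames += num_frames
    model.train()
    return loss_sum / max(frames, 1.0)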
], batch size: 112, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:10:32,100 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.856e+02 1.968e+02 2.112e+02 2.615e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-20 17:10:43,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=241714.0, ans=0.0 2024-06-20 17:10:47,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=22.5 2024-06-20 17:10:49,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=241714.0, ans=0.125 2024-06-20 17:10:54,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=241732.33333333334, ans=0.125 2024-06-20 17:11:05,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=241750.66666666666, ans=0.125 2024-06-20 17:11:06,731 INFO [train.py:1028] (0/2) Epoch 14, batch 350, loss[loss=0.2053, simple_loss=0.2617, pruned_loss=0.07442, over 12892.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2739, pruned_loss=0.08837, over 2138465.70 frames. ], batch size: 33, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:11:24,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=241805.66666666666, ans=0.125 2024-06-20 17:11:31,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=241824.0, ans=0.1 2024-06-20 17:11:45,226 INFO [train.py:1028] (0/2) Epoch 14, batch 400, loss[loss=0.1978, simple_loss=0.2547, pruned_loss=0.07041, over 13242.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2736, pruned_loss=0.08814, over 2239861.91 frames. ], batch size: 63, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:11:45,938 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.924e+02 2.079e+02 2.305e+02 3.125e+02, threshold=4.157e+02, percent-clipped=0.0 2024-06-20 17:12:02,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=241879.0, ans=0.125 2024-06-20 17:12:08,523 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2024-06-20 17:12:12,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=241915.66666666666, ans=0.2 2024-06-20 17:12:21,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=241934.0, ans=0.125 2024-06-20 17:12:25,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=241934.0, ans=0.125 2024-06-20 17:12:26,946 INFO [train.py:1028] (0/2) Epoch 14, batch 450, loss[loss=0.2101, simple_loss=0.2667, pruned_loss=0.07676, over 13224.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2737, pruned_loss=0.08793, over 2313113.30 frames. ], batch size: 67, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:12:34,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.52 vs. 
limit=6.0 2024-06-20 17:12:49,889 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-132000.pt 2024-06-20 17:13:07,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=242025.66666666666, ans=0.2 2024-06-20 17:13:10,671 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2024-06-20 17:13:14,585 INFO [train.py:1028] (0/2) Epoch 14, batch 500, loss[loss=0.2171, simple_loss=0.2614, pruned_loss=0.08642, over 13084.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2741, pruned_loss=0.08781, over 2375255.60 frames. ], batch size: 121, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:13:15,209 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.863e+02 1.955e+02 2.145e+02 2.797e+02, threshold=3.909e+02, percent-clipped=0.0 2024-06-20 17:13:29,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.38 vs. limit=15.0 2024-06-20 17:13:31,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=242080.66666666666, ans=0.125 2024-06-20 17:13:40,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=242099.0, ans=0.5 2024-06-20 17:13:53,508 INFO [train.py:1028] (0/2) Epoch 14, batch 550, loss[loss=0.2031, simple_loss=0.2483, pruned_loss=0.07893, over 12917.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2739, pruned_loss=0.08763, over 2421009.48 frames. ], batch size: 158, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:13:55,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=242135.66666666666, ans=0.125 2024-06-20 17:13:56,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=242135.66666666666, ans=0.0 2024-06-20 17:14:05,104 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. limit=5.0 2024-06-20 17:14:08,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.86 vs. limit=6.0 2024-06-20 17:14:28,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=242209.0, ans=0.07 2024-06-20 17:14:28,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=242209.0, ans=0.0 2024-06-20 17:14:31,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=242209.0, ans=0.125 2024-06-20 17:14:34,454 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.58 vs. limit=15.0 2024-06-20 17:14:34,891 INFO [train.py:1028] (0/2) Epoch 14, batch 600, loss[loss=0.2207, simple_loss=0.2603, pruned_loss=0.0906, over 13017.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2737, pruned_loss=0.08758, over 2458457.60 frames. 
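Two checkpoint families appear in this stretch of the log: epoch-13.pt at the epoch boundary and checkpoint-132000.pt mid-epoch, i.e. one keyed to epochs and one to the global batch index. A sketch of that cadence; the 4000-batch interval is an assumption (132000 is a multiple of it, but the log does not state the period):

from typing import Optional
import torch

def maybe_save_checkpoint(model, exp_dir: str, batch_idx_train: int,
                          epoch_done: Optional[int] = None,
                          every_n: int = 4000) -> None:
    """epoch-N.pt at each epoch boundary, checkpoint-STEP.pt periodically."""
    if epoch_done is not None:
        torch.save(model.state_dict(), f"{exp_dir}/epoch-{epoch_done}.pt")
    elif batch_idx_train > 0 and batch_idx_train % every_n == 0:
        torch.save(model.state_dict(),
                   f"{exp_dir}/checkpoint-{batch_idx_train}.pt")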
], batch size: 144, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:14:35,673 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.980e+02 2.105e+02 2.345e+02 3.080e+02, threshold=4.210e+02, percent-clipped=0.0 2024-06-20 17:14:38,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=242227.33333333334, ans=0.2 2024-06-20 17:14:38,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=242227.33333333334, ans=0.125 2024-06-20 17:14:44,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=242245.66666666666, ans=0.2 2024-06-20 17:14:48,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242245.66666666666, ans=0.1 2024-06-20 17:14:48,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=242245.66666666666, ans=0.125 2024-06-20 17:15:08,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=242300.66666666666, ans=0.125 2024-06-20 17:15:13,541 INFO [train.py:1028] (0/2) Epoch 14, batch 650, loss[loss=0.2241, simple_loss=0.2709, pruned_loss=0.08864, over 13135.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2732, pruned_loss=0.08703, over 2489913.07 frames. ], batch size: 59, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:15:15,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=242319.0, ans=0.2 2024-06-20 17:15:16,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.99 vs. limit=15.0 2024-06-20 17:15:41,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=242374.0, ans=0.0 2024-06-20 17:15:42,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=242374.0, ans=0.1 2024-06-20 17:15:42,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=242374.0, ans=0.025 2024-06-20 17:15:45,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2024-06-20 17:15:48,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=242392.33333333334, ans=0.2 2024-06-20 17:15:55,715 INFO [train.py:1028] (0/2) Epoch 14, batch 700, loss[loss=0.2256, simple_loss=0.2836, pruned_loss=0.08378, over 13279.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2738, pruned_loss=0.08752, over 2512942.54 frames. 
], batch size: 46, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:15:56,390 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.882e+02 2.032e+02 2.250e+02 3.227e+02, threshold=4.063e+02, percent-clipped=0.0 2024-06-20 17:16:02,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=242410.66666666666, ans=0.0 2024-06-20 17:16:03,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=242429.0, ans=0.025 2024-06-20 17:16:05,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=242429.0, ans=0.015 2024-06-20 17:16:05,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=242429.0, ans=0.0 2024-06-20 17:16:11,487 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:16:13,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=242447.33333333334, ans=0.2 2024-06-20 17:16:24,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.64 vs. limit=15.0 2024-06-20 17:16:26,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=242484.0, ans=0.125 2024-06-20 17:16:31,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=242484.0, ans=0.0 2024-06-20 17:16:33,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.80 vs. limit=5.0 2024-06-20 17:16:33,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=242502.33333333334, ans=0.125 2024-06-20 17:16:33,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.00 vs. limit=12.0 2024-06-20 17:16:34,091 INFO [train.py:1028] (0/2) Epoch 14, batch 750, loss[loss=0.2077, simple_loss=0.2727, pruned_loss=0.07139, over 13300.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2735, pruned_loss=0.08716, over 2528502.01 frames. ], batch size: 63, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:16:34,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.60 vs. limit=10.0 2024-06-20 17:16:40,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2024-06-20 17:16:44,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=242520.66666666666, ans=0.025 2024-06-20 17:16:50,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=15.0 2024-06-20 17:16:54,390 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. 
limit=22.5 2024-06-20 17:17:15,400 INFO [train.py:1028] (0/2) Epoch 14, batch 800, loss[loss=0.2033, simple_loss=0.2564, pruned_loss=0.0751, over 12928.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.273, pruned_loss=0.08716, over 2541122.21 frames. ], batch size: 36, lr: 4.25e-03, grad_scale: 32.0 2024-06-20 17:17:16,019 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 1.895e+02 2.039e+02 2.275e+02 3.849e+02, threshold=4.078e+02, percent-clipped=0.0 2024-06-20 17:17:57,067 INFO [train.py:1028] (0/2) Epoch 14, batch 850, loss[loss=0.2118, simple_loss=0.2634, pruned_loss=0.08009, over 13155.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2719, pruned_loss=0.08636, over 2552296.97 frames. ], batch size: 95, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:18:03,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=242685.66666666666, ans=0.025 2024-06-20 17:18:05,071 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.12 vs. limit=22.5 2024-06-20 17:18:24,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=15.0 2024-06-20 17:18:25,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=242740.66666666666, ans=0.1 2024-06-20 17:18:34,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=242777.33333333334, ans=0.2 2024-06-20 17:18:35,723 INFO [train.py:1028] (0/2) Epoch 14, batch 900, loss[loss=0.2084, simple_loss=0.2586, pruned_loss=0.07906, over 12919.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2719, pruned_loss=0.08645, over 2557227.18 frames. ], batch size: 36, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:18:36,428 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.560e+02 1.817e+02 1.913e+02 2.086e+02 2.995e+02, threshold=3.826e+02, percent-clipped=0.0 2024-06-20 17:18:37,202 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:18:43,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.94 vs. limit=10.0 2024-06-20 17:18:54,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242814.0, ans=0.1 2024-06-20 17:18:58,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=242832.33333333334, ans=0.0 2024-06-20 17:19:11,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2024-06-20 17:19:14,749 INFO [train.py:1028] (0/2) Epoch 14, batch 950, loss[loss=0.2072, simple_loss=0.2642, pruned_loss=0.07507, over 12835.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2719, pruned_loss=0.08635, over 2559878.45 frames. 
], batch size: 39, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:19:15,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=242869.0, ans=0.2 2024-06-20 17:19:21,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=242869.0, ans=0.0 2024-06-20 17:19:26,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=242887.33333333334, ans=0.125 2024-06-20 17:19:33,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=242905.66666666666, ans=0.125 2024-06-20 17:19:33,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=242905.66666666666, ans=0.125 2024-06-20 17:19:40,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=242924.0, ans=0.125 2024-06-20 17:19:47,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=242924.0, ans=0.1 2024-06-20 17:19:47,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.84 vs. limit=15.0 2024-06-20 17:19:54,682 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:19:56,757 INFO [train.py:1028] (0/2) Epoch 14, batch 1000, loss[loss=0.2193, simple_loss=0.2629, pruned_loss=0.08783, over 13066.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2718, pruned_loss=0.08661, over 2562128.35 frames. ], batch size: 48, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:19:57,403 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.868e+02 1.995e+02 2.239e+02 2.729e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-20 17:19:58,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=242960.66666666666, ans=0.0 2024-06-20 17:20:12,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=242979.0, ans=0.1 2024-06-20 17:20:14,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=242979.0, ans=0.2 2024-06-20 17:20:20,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=9.16 vs. limit=12.0 2024-06-20 17:20:21,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=242997.33333333334, ans=0.025 2024-06-20 17:20:22,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.51 vs. limit=10.0 2024-06-20 17:20:37,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=243034.0, ans=0.0 2024-06-20 17:20:39,074 INFO [train.py:1028] (0/2) Epoch 14, batch 1050, loss[loss=0.2006, simple_loss=0.255, pruned_loss=0.07311, over 13132.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2721, pruned_loss=0.08664, over 2564310.51 frames. 
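Each batch line reports two losses: loss[...] over the frames of that batch alone, and tot_loss[...] over roughly 2.5M frames, which behaves like an exponentially-decayed, frame-weighted running average (at ~13k frames per batch, 2.5M frames is an effective window of about 200 batches). A sketch with that decay; the constant is inferred from the frame counts, not read from the code:

class RunningLoss:
    """Frame-weighted running average with exponential forgetting."""
    def __init__(self, decay: float = 0.995):  # ~1/(1-decay) = 200-batch window
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def tot_loss(self) -> float:
        # steady state: frames -> batch_frames / (1 - decay),
        # i.e. ~2.6M for 13k-frame batches, matching the log
        return self.loss_sum / max(self.frames, 1.0)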
], batch size: 77, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:20:49,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.46 vs. limit=22.5 2024-06-20 17:20:55,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=243089.0, ans=22.5 2024-06-20 17:20:56,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=243089.0, ans=0.0 2024-06-20 17:21:03,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=243107.33333333334, ans=0.125 2024-06-20 17:21:10,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=243125.66666666666, ans=0.0 2024-06-20 17:21:17,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243144.0, ans=0.1 2024-06-20 17:21:18,264 INFO [train.py:1028] (0/2) Epoch 14, batch 1100, loss[loss=0.2364, simple_loss=0.2833, pruned_loss=0.09478, over 13230.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2726, pruned_loss=0.0868, over 2569913.71 frames. ], batch size: 52, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:21:18,949 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.890e+02 2.001e+02 2.214e+02 2.965e+02, threshold=4.001e+02, percent-clipped=0.0 2024-06-20 17:21:59,882 INFO [train.py:1028] (0/2) Epoch 14, batch 1150, loss[loss=0.2109, simple_loss=0.2634, pruned_loss=0.07921, over 13228.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2724, pruned_loss=0.08673, over 2571393.36 frames. ], batch size: 52, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:22:03,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0 2024-06-20 17:22:06,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=243235.66666666666, ans=0.125 2024-06-20 17:22:23,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=243272.33333333334, ans=0.125 2024-06-20 17:22:26,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=1.79 vs. limit=15.0 2024-06-20 17:22:40,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=243309.0, ans=0.125 2024-06-20 17:22:41,855 INFO [train.py:1028] (0/2) Epoch 14, batch 1200, loss[loss=0.1981, simple_loss=0.2554, pruned_loss=0.07039, over 13153.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2723, pruned_loss=0.08685, over 2573849.58 frames. 
], batch size: 77, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:22:42,739 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.901e+02 2.026e+02 2.241e+02 2.826e+02, threshold=4.053e+02, percent-clipped=0.0 2024-06-20 17:22:44,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243327.33333333334, ans=0.1 2024-06-20 17:22:51,335 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:22:52,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=243345.66666666666, ans=0.2 2024-06-20 17:22:57,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=243364.0, ans=0.125 2024-06-20 17:22:57,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=243364.0, ans=0.125 2024-06-20 17:23:01,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=12.0 2024-06-20 17:23:06,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2024-06-20 17:23:16,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=243400.66666666666, ans=0.125 2024-06-20 17:23:20,416 INFO [train.py:1028] (0/2) Epoch 14, batch 1250, loss[loss=0.2354, simple_loss=0.2744, pruned_loss=0.0982, over 13164.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2726, pruned_loss=0.08698, over 2583155.59 frames. ], batch size: 112, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:23:21,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=243419.0, ans=0.125 2024-06-20 17:23:24,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=243419.0, ans=0.2 2024-06-20 17:23:35,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.41 vs. 
limit=15.0 2024-06-20 17:23:37,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=243455.66666666666, ans=12.0 2024-06-20 17:23:42,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=243455.66666666666, ans=0.2 2024-06-20 17:23:44,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=243474.0, ans=0.125 2024-06-20 17:23:53,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=243492.33333333334, ans=0.5 2024-06-20 17:23:53,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=243492.33333333334, ans=0.125 2024-06-20 17:23:57,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=243492.33333333334, ans=0.125 2024-06-20 17:23:57,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=243492.33333333334, ans=0.1 2024-06-20 17:23:59,548 INFO [train.py:1028] (0/2) Epoch 14, batch 1300, loss[loss=0.2234, simple_loss=0.2671, pruned_loss=0.08985, over 12767.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.273, pruned_loss=0.08704, over 2583603.89 frames. ], batch size: 177, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:24:00,249 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.853e+02 1.966e+02 2.092e+02 3.282e+02, threshold=3.931e+02, percent-clipped=0.0 2024-06-20 17:24:05,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=243510.66666666666, ans=0.125 2024-06-20 17:24:08,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=243529.0, ans=0.0 2024-06-20 17:24:26,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243565.66666666666, ans=0.1 2024-06-20 17:24:27,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=243565.66666666666, ans=0.0 2024-06-20 17:24:34,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=243584.0, ans=0.07 2024-06-20 17:24:35,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=243584.0, ans=0.125 2024-06-20 17:24:41,471 INFO [train.py:1028] (0/2) Epoch 14, batch 1350, loss[loss=0.2168, simple_loss=0.2734, pruned_loss=0.08016, over 13245.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2727, pruned_loss=0.08679, over 2585023.62 frames. 
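The lr column decays smoothly with the batch index inside an epoch (4.43e-03 -> 4.42e-03 earlier) and steps down again at each epoch boundary (4.26e-03 at the start of epoch 14, reaching 4.23e-03 here). That shape matches an Eden-style schedule with separate batch and epoch damping factors. The sketch below shows the shape only, with placeholder constants; it is not claimed to reproduce the printed values exactly, which likely include additional scaling:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """base_lr damped by two inverse-quartic-root factors, one per axis."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor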
], batch size: 59, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:24:46,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=243602.33333333334, ans=0.0 2024-06-20 17:25:12,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=243657.33333333334, ans=0.0 2024-06-20 17:25:12,695 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:25:15,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243657.33333333334, ans=0.1 2024-06-20 17:25:16,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=243675.66666666666, ans=0.2 2024-06-20 17:25:24,018 INFO [train.py:1028] (0/2) Epoch 14, batch 1400, loss[loss=0.2424, simple_loss=0.2938, pruned_loss=0.09549, over 12222.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2735, pruned_loss=0.08719, over 2586389.58 frames. ], batch size: 25, lr: 4.24e-03, grad_scale: 32.0 2024-06-20 17:25:24,670 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.610e+02 1.858e+02 1.973e+02 2.145e+02 2.652e+02, threshold=3.946e+02, percent-clipped=0.0 2024-06-20 17:25:28,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=22.5 2024-06-20 17:25:39,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=243730.66666666666, ans=0.125 2024-06-20 17:26:03,155 INFO [train.py:1028] (0/2) Epoch 14, batch 1450, loss[loss=0.2124, simple_loss=0.2557, pruned_loss=0.08454, over 13064.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2735, pruned_loss=0.08751, over 2586194.47 frames. ], batch size: 121, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:26:11,730 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=12.0 2024-06-20 17:26:21,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=243822.33333333334, ans=0.05 2024-06-20 17:26:24,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2024-06-20 17:26:33,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. 
limit=6.0 2024-06-20 17:26:34,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=243840.66666666666, ans=0.125 2024-06-20 17:26:34,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=243840.66666666666, ans=0.0 2024-06-20 17:26:36,585 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:26:38,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243859.0, ans=0.125 2024-06-20 17:26:46,730 INFO [train.py:1028] (0/2) Epoch 14, batch 1500, loss[loss=0.2204, simple_loss=0.2629, pruned_loss=0.089, over 13213.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2738, pruned_loss=0.08808, over 2589075.59 frames. ], batch size: 83, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:26:47,360 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.877e+02 1.995e+02 2.132e+02 3.022e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-20 17:26:52,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=243877.33333333334, ans=15.0 2024-06-20 17:26:54,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2024-06-20 17:26:55,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=243895.66666666666, ans=0.07 2024-06-20 17:26:59,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=243895.66666666666, ans=0.125 2024-06-20 17:27:01,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=243914.0, ans=0.2 2024-06-20 17:27:04,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=243914.0, ans=0.0 2024-06-20 17:27:06,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243914.0, ans=0.1 2024-06-20 17:27:15,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=243932.33333333334, ans=0.0 2024-06-20 17:27:16,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2024-06-20 17:27:26,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243950.66666666666, ans=0.1 2024-06-20 17:27:28,622 INFO [train.py:1028] (0/2) Epoch 14, batch 1550, loss[loss=0.2112, simple_loss=0.2619, pruned_loss=0.08029, over 13033.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2732, pruned_loss=0.08796, over 2583493.59 frames. 
], batch size: 102, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:27:31,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=243969.0, ans=0.1 2024-06-20 17:27:35,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=243987.33333333334, ans=0.125 2024-06-20 17:27:48,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=244005.66666666666, ans=0.125 2024-06-20 17:27:51,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=244005.66666666666, ans=0.125 2024-06-20 17:27:55,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2024-06-20 17:27:59,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=244024.0, ans=0.125 2024-06-20 17:28:06,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=244042.33333333334, ans=0.1 2024-06-20 17:28:07,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=244060.66666666666, ans=0.2 2024-06-20 17:28:08,213 INFO [train.py:1028] (0/2) Epoch 14, batch 1600, loss[loss=0.2082, simple_loss=0.2612, pruned_loss=0.07761, over 13144.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2739, pruned_loss=0.08818, over 2578615.40 frames. ], batch size: 77, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:28:08,955 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.888e+02 2.045e+02 2.263e+02 3.518e+02, threshold=4.089e+02, percent-clipped=0.0 2024-06-20 17:28:09,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244060.66666666666, ans=0.1 2024-06-20 17:28:46,738 INFO [train.py:1028] (0/2) Epoch 14, batch 1650, loss[loss=0.2284, simple_loss=0.2712, pruned_loss=0.09282, over 13179.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2736, pruned_loss=0.08787, over 2575474.21 frames. ], batch size: 95, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:28:54,910 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.03 vs. limit=15.0 2024-06-20 17:28:58,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=244170.66666666666, ans=0.0 2024-06-20 17:29:00,273 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.25 vs. limit=22.5 2024-06-20 17:29:26,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=244225.66666666666, ans=0.125 2024-06-20 17:29:29,391 INFO [train.py:1028] (0/2) Epoch 14, batch 1700, loss[loss=0.214, simple_loss=0.2683, pruned_loss=0.07986, over 12445.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2733, pruned_loss=0.0875, over 2580757.83 frames. 
], batch size: 25, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:29:30,023 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 1.901e+02 2.097e+02 2.403e+02 3.396e+02, threshold=4.194e+02, percent-clipped=0.0 2024-06-20 17:29:55,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=244299.0, ans=0.125 2024-06-20 17:30:00,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=244299.0, ans=0.125 2024-06-20 17:30:03,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=244317.33333333334, ans=0.0 2024-06-20 17:30:11,994 INFO [train.py:1028] (0/2) Epoch 14, batch 1750, loss[loss=0.2552, simple_loss=0.2989, pruned_loss=0.1058, over 12723.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2735, pruned_loss=0.08756, over 2582200.30 frames. ], batch size: 22, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:30:12,155 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.945e+00 2024-06-20 17:30:26,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=244354.0, ans=0.2 2024-06-20 17:30:29,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=244372.33333333334, ans=0.0 2024-06-20 17:30:32,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=244372.33333333334, ans=0.0 2024-06-20 17:30:40,011 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.25 vs. limit=22.5 2024-06-20 17:30:47,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.47 vs. limit=12.0 2024-06-20 17:30:53,205 INFO [train.py:1028] (0/2) Epoch 14, batch 1800, loss[loss=0.2226, simple_loss=0.2771, pruned_loss=0.08406, over 13248.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2736, pruned_loss=0.0877, over 2582364.61 frames. ], batch size: 67, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:30:53,945 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 1.831e+02 1.987e+02 2.105e+02 3.039e+02, threshold=3.973e+02, percent-clipped=0.0 2024-06-20 17:31:00,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.38 vs. limit=22.5 2024-06-20 17:31:07,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244445.66666666666, ans=0.1 2024-06-20 17:31:16,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=244482.33333333334, ans=0.125 2024-06-20 17:31:23,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=244482.33333333334, ans=0.125 2024-06-20 17:31:24,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=244482.33333333334, ans=0.125 2024-06-20 17:31:36,888 INFO [train.py:1028] (0/2) Epoch 14, batch 1850, loss[loss=0.2396, simple_loss=0.2839, pruned_loss=0.09769, over 13201.00 frames. 
], tot_loss[loss=0.2243, simple_loss=0.2737, pruned_loss=0.0875, over 2583655.66 frames. ], batch size: 83, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:32:12,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.16 vs. limit=15.0 2024-06-20 17:32:19,631 INFO [train.py:1028] (0/2) Epoch 14, batch 1900, loss[loss=0.2051, simple_loss=0.2546, pruned_loss=0.07782, over 13169.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2735, pruned_loss=0.08741, over 2585873.72 frames. ], batch size: 95, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:32:19,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=244610.66666666666, ans=0.0 2024-06-20 17:32:20,461 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.830e+02 1.923e+02 2.091e+02 2.625e+02, threshold=3.846e+02, percent-clipped=0.0 2024-06-20 17:32:21,635 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.25 vs. limit=10.0 2024-06-20 17:32:32,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=244629.0, ans=0.125 2024-06-20 17:32:45,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=244665.66666666666, ans=0.2 2024-06-20 17:32:55,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244684.0, ans=0.1 2024-06-20 17:32:59,429 INFO [train.py:1028] (0/2) Epoch 14, batch 1950, loss[loss=0.2077, simple_loss=0.263, pruned_loss=0.07624, over 13254.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2728, pruned_loss=0.0873, over 2591955.00 frames. ], batch size: 52, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:33:10,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=244720.66666666666, ans=0.2 2024-06-20 17:33:13,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=244720.66666666666, ans=0.125 2024-06-20 17:33:20,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=244739.0, ans=6.0 2024-06-20 17:33:21,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=244739.0, ans=0.125 2024-06-20 17:33:32,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=244775.66666666666, ans=0.0 2024-06-20 17:33:33,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.13 vs. limit=15.0 2024-06-20 17:33:33,989 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:33:37,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244775.66666666666, ans=0.1 2024-06-20 17:33:38,795 INFO [train.py:1028] (0/2) Epoch 14, batch 2000, loss[loss=0.2339, simple_loss=0.2848, pruned_loss=0.09149, over 12578.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2733, pruned_loss=0.08778, over 2588101.00 frames. 
], batch size: 22, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:33:42,875 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.909e+02 2.046e+02 2.282e+02 3.258e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-20 17:33:51,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=244812.33333333334, ans=0.05 2024-06-20 17:33:53,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2024-06-20 17:34:20,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.58 vs. limit=6.0 2024-06-20 17:34:24,371 INFO [train.py:1028] (0/2) Epoch 14, batch 2050, loss[loss=0.245, simple_loss=0.2996, pruned_loss=0.09525, over 12565.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2738, pruned_loss=0.08798, over 2583429.19 frames. ], batch size: 29, lr: 4.23e-03, grad_scale: 32.0 2024-06-20 17:34:30,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=244885.66666666666, ans=0.0 2024-06-20 17:34:44,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0 2024-06-20 17:34:59,259 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2024-06-20 17:35:03,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=244977.33333333334, ans=0.125 2024-06-20 17:35:04,204 INFO [train.py:1028] (0/2) Epoch 14, batch 2100, loss[loss=0.2154, simple_loss=0.2742, pruned_loss=0.07833, over 13188.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.274, pruned_loss=0.08786, over 2585362.33 frames. ], batch size: 59, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:35:04,938 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 1.880e+02 1.989e+02 2.121e+02 2.737e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 17:35:06,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.51 vs. limit=22.5 2024-06-20 17:35:09,117 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.992e+01 2024-06-20 17:35:20,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245014.0, ans=0.1 2024-06-20 17:35:42,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=245069.0, ans=0.04949747468305833 2024-06-20 17:35:43,340 INFO [train.py:1028] (0/2) Epoch 14, batch 2150, loss[loss=0.2164, simple_loss=0.2719, pruned_loss=0.08041, over 13244.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2731, pruned_loss=0.08712, over 2587497.54 frames. 
], batch size: 52, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:36:01,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=245105.66666666666, ans=0.125 2024-06-20 17:36:04,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=245105.66666666666, ans=0.0 2024-06-20 17:36:11,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=245124.0, ans=0.125 2024-06-20 17:36:14,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=245142.33333333334, ans=0.125 2024-06-20 17:36:15,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=245142.33333333334, ans=0.015 2024-06-20 17:36:17,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=245142.33333333334, ans=0.0 2024-06-20 17:36:23,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=245142.33333333334, ans=0.125 2024-06-20 17:36:26,469 INFO [train.py:1028] (0/2) Epoch 14, batch 2200, loss[loss=0.2097, simple_loss=0.2581, pruned_loss=0.08062, over 13171.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2736, pruned_loss=0.08741, over 2587362.68 frames. ], batch size: 83, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:36:27,266 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.848e+02 1.980e+02 2.133e+02 3.134e+02, threshold=3.960e+02, percent-clipped=0.0 2024-06-20 17:36:30,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=245160.66666666666, ans=0.125 2024-06-20 17:36:38,977 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0 2024-06-20 17:36:55,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=245215.66666666666, ans=0.2 2024-06-20 17:37:01,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=245234.0, ans=0.025 2024-06-20 17:37:02,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=245234.0, ans=0.125 2024-06-20 17:37:05,810 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2024-06-20 17:37:09,202 INFO [train.py:1028] (0/2) Epoch 14, batch 2250, loss[loss=0.2266, simple_loss=0.2761, pruned_loss=0.08856, over 13285.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2731, pruned_loss=0.08718, over 2586476.51 frames. 
], batch size: 63, lr: 4.22e-03, grad_scale: 32.0 2024-06-20 17:37:15,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=245252.33333333334, ans=0.125 2024-06-20 17:37:20,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245270.66666666666, ans=0.125 2024-06-20 17:37:22,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=245270.66666666666, ans=0.125 2024-06-20 17:37:24,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245289.0, ans=0.1 2024-06-20 17:37:29,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=245289.0, ans=0.125 2024-06-20 17:37:31,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=245289.0, ans=0.125 2024-06-20 17:37:48,735 INFO [train.py:1028] (0/2) Epoch 14, batch 2300, loss[loss=0.2113, simple_loss=0.262, pruned_loss=0.08027, over 12871.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2738, pruned_loss=0.08748, over 2581152.33 frames. ], batch size: 33, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:37:49,426 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.698e+02 1.849e+02 1.988e+02 2.212e+02 3.408e+02, threshold=3.976e+02, percent-clipped=0.0 2024-06-20 17:37:52,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=245344.0, ans=0.125 2024-06-20 17:38:11,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=245399.0, ans=0.2 2024-06-20 17:38:22,800 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=15.0 2024-06-20 17:38:27,368 INFO [train.py:1028] (0/2) Epoch 14, batch 2350, loss[loss=0.2236, simple_loss=0.2792, pruned_loss=0.08396, over 13232.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2742, pruned_loss=0.08781, over 2584430.14 frames. ], batch size: 67, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:38:31,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=245435.66666666666, ans=0.125 2024-06-20 17:38:47,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.07 vs. limit=12.0 2024-06-20 17:38:58,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=245490.66666666666, ans=0.0 2024-06-20 17:39:03,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=245509.0, ans=0.125 2024-06-20 17:39:08,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.47 vs. limit=22.5 2024-06-20 17:39:11,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=245509.0, ans=0.125 2024-06-20 17:39:13,578 INFO [train.py:1028] (0/2) Epoch 14, batch 2400, loss[loss=0.2182, simple_loss=0.2709, pruned_loss=0.0827, over 13286.00 frames. 
], tot_loss[loss=0.224, simple_loss=0.2733, pruned_loss=0.08739, over 2587161.97 frames. ], batch size: 46, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:39:14,263 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.862e+02 2.005e+02 2.168e+02 2.894e+02, threshold=4.011e+02, percent-clipped=0.0 2024-06-20 17:39:16,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245527.33333333334, ans=0.1 2024-06-20 17:39:17,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. limit=6.0 2024-06-20 17:39:41,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2024-06-20 17:39:52,066 INFO [train.py:1028] (0/2) Epoch 14, batch 2450, loss[loss=0.2236, simple_loss=0.2709, pruned_loss=0.08813, over 13241.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2729, pruned_loss=0.08769, over 2583130.30 frames. ], batch size: 63, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:39:54,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=245619.0, ans=0.125 2024-06-20 17:40:03,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.83 vs. limit=15.0 2024-06-20 17:40:30,257 INFO [train.py:1028] (0/2) Epoch 14, batch 2500, loss[loss=0.2164, simple_loss=0.2714, pruned_loss=0.08072, over 13192.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2715, pruned_loss=0.08706, over 2587326.04 frames. ], batch size: 83, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:40:30,888 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 1.909e+02 2.063e+02 2.265e+02 3.669e+02, threshold=4.126e+02, percent-clipped=0.0 2024-06-20 17:40:39,600 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:40:39,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=245729.0, ans=0.0 2024-06-20 17:40:50,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.34 vs. limit=15.0 2024-06-20 17:40:53,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.21 vs. limit=22.5 2024-06-20 17:40:53,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=245765.66666666666, ans=0.05 2024-06-20 17:41:00,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=245765.66666666666, ans=0.0 2024-06-20 17:41:06,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=245784.0, ans=0.125 2024-06-20 17:41:12,730 INFO [train.py:1028] (0/2) Epoch 14, batch 2550, loss[loss=0.2461, simple_loss=0.2927, pruned_loss=0.09971, over 12564.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2702, pruned_loss=0.08641, over 2587699.45 frames. 
], batch size: 22, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:41:15,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.43 vs. limit=15.0 2024-06-20 17:41:19,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=245820.66666666666, ans=0.0 2024-06-20 17:41:26,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.93 vs. limit=22.5 2024-06-20 17:41:31,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=245839.0, ans=0.025 2024-06-20 17:41:34,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=245839.0, ans=0.125 2024-06-20 17:41:39,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.82 vs. limit=10.0 2024-06-20 17:41:45,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=245857.33333333334, ans=0.125 2024-06-20 17:41:53,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=245894.0, ans=0.2 2024-06-20 17:41:54,269 INFO [train.py:1028] (0/2) Epoch 14, batch 2600, loss[loss=0.2075, simple_loss=0.2647, pruned_loss=0.07521, over 13252.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2686, pruned_loss=0.08588, over 2588076.87 frames. ], batch size: 52, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:41:54,989 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.645e+02 1.867e+02 1.977e+02 2.136e+02 2.762e+02, threshold=3.955e+02, percent-clipped=0.0 2024-06-20 17:41:59,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.81 vs. limit=10.0 2024-06-20 17:41:59,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=245894.0, ans=0.125 2024-06-20 17:42:08,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=245912.33333333334, ans=0.125 2024-06-20 17:42:18,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=245949.0, ans=0.0 2024-06-20 17:42:29,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.36 vs. limit=10.0 2024-06-20 17:42:30,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=245967.33333333334, ans=0.05 2024-06-20 17:42:31,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=245967.33333333334, ans=0.0 2024-06-20 17:42:31,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.27 vs. limit=22.5 2024-06-20 17:42:33,739 INFO [train.py:1028] (0/2) Epoch 14, batch 2650, loss[loss=0.2202, simple_loss=0.2573, pruned_loss=0.09152, over 13018.00 frames. 
], tot_loss[loss=0.2193, simple_loss=0.2675, pruned_loss=0.08556, over 2588022.63 frames. ], batch size: 144, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:42:35,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=245985.66666666666, ans=0.125 2024-06-20 17:42:49,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=246022.33333333334, ans=0.125 2024-06-20 17:42:53,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=246022.33333333334, ans=0.0 2024-06-20 17:42:59,444 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:43:15,969 INFO [train.py:1028] (0/2) Epoch 14, batch 2700, loss[loss=0.2189, simple_loss=0.2617, pruned_loss=0.08802, over 13253.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2662, pruned_loss=0.08542, over 2586118.86 frames. ], batch size: 89, lr: 4.22e-03, grad_scale: 64.0 2024-06-20 17:43:16,829 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.870e+02 2.008e+02 2.247e+02 2.788e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-20 17:43:19,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246077.33333333334, ans=0.1 2024-06-20 17:43:24,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=246095.66666666666, ans=0.1 2024-06-20 17:43:32,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=246114.0, ans=0.0 2024-06-20 17:43:52,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=246150.66666666666, ans=0.2 2024-06-20 17:43:54,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2024-06-20 17:43:55,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=246150.66666666666, ans=0.125 2024-06-20 17:44:00,217 INFO [train.py:1028] (0/2) Epoch 14, batch 2750, loss[loss=0.2408, simple_loss=0.2773, pruned_loss=0.1021, over 13285.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2649, pruned_loss=0.08475, over 2584702.67 frames. ], batch size: 43, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:44:02,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=246169.0, ans=0.05 2024-06-20 17:44:08,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.41 vs. limit=15.0 2024-06-20 17:44:09,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. 
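limit=12.0

The scaling.py "Whitening" entries throughout this log compare a per-module whitening metric against a fixed limit (12.0 for the module just above); the metric measures how far a module's activations are from an isotropic ("white") distribution. A minimal sketch of the kind of statistic these lines plausibly report, assuming the metric is the scale-invariant ratio mean(eig^2) / mean(eig)^2 of each channel group's covariance; the function below is illustrative, not the exact scaling.py code:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels). Split the channels into groups and
    # measure how far each group's covariance is from a multiple of the
    # identity: the ratio is 1.0 for perfectly whitened activations and
    # grows with the eigenvalue spread of the covariance.
    num_frames, num_channels = x.shape
    per_group = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, per_group).transpose(0, 1)
    covar = torch.matmul(x.transpose(1, 2), x)          # (groups, c, c)
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()   # mean eigenvalue
    mean_diag_sq = (covar ** 2).sum() / (num_groups * per_group)  # mean tr(C^2)/c
    return (mean_diag_sq / (mean_diag ** 2 + 1e-20)).item()

White activations give values near 1.0, while a rank-one covariance drives the ratio toward the number of channels in the group, which would explain why the logged limits range from 5.0 up to 22.5 depending on the module.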
2024-06-20 17:44:20,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=246205.66666666666, ans=0.125 2024-06-20 17:44:22,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=246205.66666666666, ans=0.0 2024-06-20 17:44:40,485 INFO [train.py:1028] (0/2) Epoch 14, batch 2800, loss[loss=0.2234, simple_loss=0.257, pruned_loss=0.09488, over 10836.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2644, pruned_loss=0.0848, over 2580962.25 frames. ], batch size: 303, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:44:41,198 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.851e+02 2.043e+02 2.242e+02 2.881e+02, threshold=4.086e+02, percent-clipped=0.0 2024-06-20 17:44:41,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=246260.66666666666, ans=0.2 2024-06-20 17:44:43,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=246260.66666666666, ans=0.125 2024-06-20 17:44:51,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246279.0, ans=0.1 2024-06-20 17:44:58,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.99 vs. limit=15.0 2024-06-20 17:45:02,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=246297.33333333334, ans=0.2 2024-06-20 17:45:07,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=246315.66666666666, ans=0.05 2024-06-20 17:45:10,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=246315.66666666666, ans=0.95 2024-06-20 17:45:10,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=246334.0, ans=0.025 2024-06-20 17:45:22,892 INFO [train.py:1028] (0/2) Epoch 14, batch 2850, loss[loss=0.2046, simple_loss=0.2552, pruned_loss=0.07698, over 13098.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2632, pruned_loss=0.0843, over 2578545.46 frames. ], batch size: 48, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:45:24,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.01 vs. limit=6.0 2024-06-20 17:45:39,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=246389.0, ans=0.125 2024-06-20 17:45:48,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=246389.0, ans=0.125 2024-06-20 17:45:48,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=246407.33333333334, ans=0.125 2024-06-20 17:45:48,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=246407.33333333334, ans=0.125 2024-06-20 17:46:05,278 INFO [train.py:1028] (0/2) Epoch 14, batch 2900, loss[loss=0.2288, simple_loss=0.2766, pruned_loss=0.09046, over 13131.00 frames.
], tot_loss[loss=0.2143, simple_loss=0.2613, pruned_loss=0.08364, over 2585559.76 frames. ], batch size: 55, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:46:05,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.83 vs. limit=15.0 2024-06-20 17:46:05,949 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.816e+02 1.932e+02 2.083e+02 2.989e+02, threshold=3.864e+02, percent-clipped=0.0 2024-06-20 17:46:06,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=246444.0, ans=0.2 2024-06-20 17:46:15,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=246462.33333333334, ans=0.05 2024-06-20 17:46:21,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=246480.66666666666, ans=0.125 2024-06-20 17:46:26,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246480.66666666666, ans=0.1 2024-06-20 17:46:39,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.01 vs. limit=15.0 2024-06-20 17:46:45,154 INFO [train.py:1028] (0/2) Epoch 14, batch 2950, loss[loss=0.1964, simple_loss=0.2466, pruned_loss=0.07309, over 13284.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2608, pruned_loss=0.08337, over 2579786.37 frames. ], batch size: 43, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:47:01,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246572.33333333334, ans=0.1 2024-06-20 17:47:25,115 INFO [train.py:1028] (0/2) Epoch 14, batch 3000, loss[loss=0.2281, simple_loss=0.2713, pruned_loss=0.09246, over 13216.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2601, pruned_loss=0.08308, over 2579005.75 frames. ], batch size: 59, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:47:25,118 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 17:47:34,049 INFO [train.py:1060] (0/2) Epoch 14, validation: loss=0.1902, simple_loss=0.2549, pruned_loss=0.06279, over 351949.00 frames. 2024-06-20 17:47:34,050 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 17:47:34,753 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.824e+02 1.907e+02 2.070e+02 3.060e+02, threshold=3.814e+02, percent-clipped=0.0 2024-06-20 17:47:45,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=246645.66666666666, ans=0.125 2024-06-20 17:47:45,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=246645.66666666666, ans=0.05 2024-06-20 17:48:20,358 INFO [train.py:1028] (0/2) Epoch 14, batch 3050, loss[loss=0.2184, simple_loss=0.2662, pruned_loss=0.08532, over 13259.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2594, pruned_loss=0.08327, over 2579666.30 frames. ], batch size: 46, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:48:21,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.59 vs. 
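limit=6.0

The loss triplets logged by train.py in this section are internally consistent with a pruned-transducer objective of the form loss = 0.5 * simple_loss + pruned_loss: for the validation entry at 17:47:34 above, 0.5 * 0.2549 + 0.06279 = 0.19024, which matches the logged loss=0.1902, and the running tot_loss entries obey the same identity. A hedged check of that arithmetic (the 0.5 weight is inferred from the logged numbers; the actual train.py may phase these weights in during warm-up):

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Weighted sum matching the logged values: the cheap "simple"
    # transducer loss is down-weighted against the pruned RNN-T loss.
    return simple_loss_scale * simple_loss + pruned_loss

# Reproduce the validation entry above and the batch 3000 tot_loss entry:
assert abs(combined_loss(0.2549, 0.06279) - 0.1902) < 5e-4
assert abs(combined_loss(0.2601, 0.08308) - 0.2132) < 5e-4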
2024-06-20 17:48:26,883 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.23 vs. limit=5.0 2024-06-20 17:48:31,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=246737.33333333334, ans=0.035 2024-06-20 17:48:58,531 INFO [train.py:1028] (0/2) Epoch 14, batch 3100, loss[loss=0.2044, simple_loss=0.2477, pruned_loss=0.08049, over 13008.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.259, pruned_loss=0.0831, over 2580249.26 frames. ], batch size: 144, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:48:59,347 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.877e+02 2.020e+02 2.192e+02 2.658e+02, threshold=4.041e+02, percent-clipped=0.0 2024-06-20 17:49:00,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=246810.66666666666, ans=0.125 2024-06-20 17:49:14,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=246847.33333333334, ans=0.125 2024-06-20 17:49:21,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=246865.66666666666, ans=0.125 2024-06-20 17:49:36,775 INFO [train.py:1028] (0/2) Epoch 14, batch 3150, loss[loss=0.2315, simple_loss=0.2755, pruned_loss=0.09381, over 12898.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2584, pruned_loss=0.08291, over 2580600.08 frames. ], batch size: 158, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:49:41,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=246902.33333333334, ans=0.2 2024-06-20 17:49:47,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=246920.66666666666, ans=0.125 2024-06-20 17:49:48,828 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:50:02,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=246939.0, ans=0.025 2024-06-20 17:50:04,689 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.77 vs. limit=15.0 2024-06-20 17:50:08,094 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:50:12,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246975.66666666666, ans=0.1 2024-06-20 17:50:12,226 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:50:18,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=246975.66666666666, ans=0.125 2024-06-20 17:50:19,867 INFO [train.py:1028] (0/2) Epoch 14, batch 3200, loss[loss=0.198, simple_loss=0.2473, pruned_loss=0.07433, over 13147.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2574, pruned_loss=0.08213, over 2582317.57 frames.
], batch size: 55, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:50:20,677 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.739e+02 1.839e+02 1.976e+02 2.345e+02, threshold=3.678e+02, percent-clipped=0.0 2024-06-20 17:50:49,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=247049.0, ans=0.07 2024-06-20 17:50:53,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=247067.33333333334, ans=0.2 2024-06-20 17:50:55,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-20 17:50:56,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=247067.33333333334, ans=0.2 2024-06-20 17:51:02,116 INFO [train.py:1028] (0/2) Epoch 14, batch 3250, loss[loss=0.1869, simple_loss=0.2405, pruned_loss=0.06666, over 13272.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.257, pruned_loss=0.08196, over 2586091.99 frames. ], batch size: 72, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:51:06,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=247085.66666666666, ans=0.125 2024-06-20 17:51:38,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=247159.0, ans=0.0 2024-06-20 17:51:43,113 INFO [train.py:1028] (0/2) Epoch 14, batch 3300, loss[loss=0.2188, simple_loss=0.263, pruned_loss=0.08726, over 12765.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2574, pruned_loss=0.08214, over 2582432.87 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 64.0 2024-06-20 17:51:43,917 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 1.809e+02 1.956e+02 2.108e+02 2.658e+02, threshold=3.912e+02, percent-clipped=0.0 2024-06-20 17:52:04,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.89 vs. limit=15.0 2024-06-20 17:52:05,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=247214.0, ans=0.0 2024-06-20 17:52:12,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=247232.33333333334, ans=0.125 2024-06-20 17:52:17,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=247250.66666666666, ans=0.0 2024-06-20 17:52:18,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=247250.66666666666, ans=0.125 2024-06-20 17:52:25,706 INFO [train.py:1028] (0/2) Epoch 14, batch 3350, loss[loss=0.2042, simple_loss=0.2469, pruned_loss=0.08079, over 12994.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2568, pruned_loss=0.08216, over 2576865.63 frames. ], batch size: 158, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:52:31,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.28 vs. 
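limit=15.0

The very frequent scaling.py "ScheduledFloat" entries show that regularization knobs such as dropout_p, the various skip rates and the balancer probabilities are not constants but piecewise-linear functions of batch_count; "ans" is the value in effect at that point in training. A minimal sketch of such a schedule, assuming a breakpoint-list constructor (the names and API here are illustrative; the real ScheduledFloat lives in icefall's scaling.py):

class ScheduledFloat:
    """Piecewise-linear schedule over batch_count, flat outside the
    first and last breakpoints."""

    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                # Linear interpolation between adjacent breakpoints.
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        return pts[-1][1]

# e.g. a dropout that decays from 0.3 to 0.1 over the first 20000 batches
# (hypothetical breakpoints) and then stays flat, consistent with the
# dropout_p entry at batch_count=246975.66666666666, ans=0.1 above:
dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
assert dropout_p.value(246975.66666666666) == 0.1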
2024-06-20 17:52:39,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=247287.33333333334, ans=0.2 2024-06-20 17:52:46,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=247305.66666666666, ans=0.125 2024-06-20 17:53:06,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=247342.33333333334, ans=0.0 2024-06-20 17:53:08,243 INFO [train.py:1028] (0/2) Epoch 14, batch 3400, loss[loss=0.2423, simple_loss=0.2796, pruned_loss=0.1025, over 12636.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2568, pruned_loss=0.08224, over 2575136.53 frames. ], batch size: 22, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:53:08,878 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.844e+02 1.985e+02 2.161e+02 2.619e+02, threshold=3.970e+02, percent-clipped=0.0 2024-06-20 17:53:27,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=247397.33333333334, ans=0.0 2024-06-20 17:53:29,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.69 vs. limit=10.0 2024-06-20 17:53:34,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=247415.66666666666, ans=0.015 2024-06-20 17:53:42,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=247434.0, ans=0.0 2024-06-20 17:53:47,477 INFO [train.py:1028] (0/2) Epoch 14, batch 3450, loss[loss=0.2077, simple_loss=0.2524, pruned_loss=0.08153, over 12668.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.256, pruned_loss=0.08183, over 2575231.72 frames. ], batch size: 176, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:54:21,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247525.66666666666, ans=0.1 2024-06-20 17:54:27,548 INFO [train.py:1028] (0/2) Epoch 14, batch 3500, loss[loss=0.2264, simple_loss=0.2693, pruned_loss=0.09178, over 12949.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2556, pruned_loss=0.08177, over 2574507.87 frames. ], batch size: 33, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:54:28,261 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.800e+02 1.877e+02 2.003e+02 3.203e+02, threshold=3.754e+02, percent-clipped=0.0 2024-06-20 17:54:30,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=247544.0, ans=0.1 2024-06-20 17:54:32,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=247544.0, ans=0.125 2024-06-20 17:54:43,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=247562.33333333334, ans=0.125 2024-06-20 17:54:45,884 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.22 vs.
limit=15.0 2024-06-20 17:54:55,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=247599.0, ans=0.125 2024-06-20 17:55:11,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=247617.33333333334, ans=0.0 2024-06-20 17:55:13,872 INFO [train.py:1028] (0/2) Epoch 14, batch 3550, loss[loss=0.2029, simple_loss=0.2455, pruned_loss=0.08018, over 13118.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2542, pruned_loss=0.08107, over 2576726.05 frames. ], batch size: 95, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:55:14,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=247635.66666666666, ans=0.0 2024-06-20 17:55:21,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=247654.0, ans=0.2 2024-06-20 17:55:34,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=247672.33333333334, ans=0.125 2024-06-20 17:55:36,878 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:55:44,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=247709.0, ans=0.125 2024-06-20 17:55:47,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=247709.0, ans=0.07 2024-06-20 17:55:51,700 INFO [train.py:1028] (0/2) Epoch 14, batch 3600, loss[loss=0.2091, simple_loss=0.2588, pruned_loss=0.07971, over 13200.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2537, pruned_loss=0.08081, over 2579863.77 frames. ], batch size: 49, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:55:52,437 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.568e+02 1.829e+02 1.993e+02 2.209e+02 3.683e+02, threshold=3.987e+02, percent-clipped=0.0 2024-06-20 17:56:01,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=247745.66666666666, ans=0.125 2024-06-20 17:56:12,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=247764.0, ans=0.0 2024-06-20 17:56:21,162 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.21 vs. limit=15.0 2024-06-20 17:56:27,810 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:56:30,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=247819.0, ans=0.0 2024-06-20 17:56:31,504 INFO [train.py:1028] (0/2) Epoch 14, batch 3650, loss[loss=0.1941, simple_loss=0.2428, pruned_loss=0.07268, over 13050.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2539, pruned_loss=0.08087, over 2578480.47 frames. ], batch size: 102, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:56:39,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.46 vs. 
limit=15.0 2024-06-20 17:56:53,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=247855.66666666666, ans=0.0 2024-06-20 17:57:10,353 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 17:57:14,939 INFO [train.py:1028] (0/2) Epoch 14, batch 3700, loss[loss=0.2041, simple_loss=0.2547, pruned_loss=0.07679, over 13265.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2531, pruned_loss=0.08024, over 2584225.22 frames. ], batch size: 72, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:57:15,656 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.485e+02 1.802e+02 1.902e+02 2.098e+02 3.278e+02, threshold=3.804e+02, percent-clipped=0.0 2024-06-20 17:57:31,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=247947.33333333334, ans=0.0 2024-06-20 17:57:48,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=247965.66666666666, ans=0.0 2024-06-20 17:57:49,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=247984.0, ans=0.1 2024-06-20 17:57:50,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=247984.0, ans=0.125 2024-06-20 17:57:54,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=247984.0, ans=0.0 2024-06-20 17:57:57,804 INFO [train.py:1028] (0/2) Epoch 14, batch 3750, loss[loss=0.2032, simple_loss=0.2519, pruned_loss=0.07724, over 12684.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2525, pruned_loss=0.08004, over 2586574.21 frames. ], batch size: 22, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:58:09,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248020.66666666666, ans=0.1 2024-06-20 17:58:17,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=15.0 2024-06-20 17:58:22,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=248057.33333333334, ans=0.04949747468305833 2024-06-20 17:58:29,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=248075.66666666666, ans=0.05 2024-06-20 17:58:31,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=248075.66666666666, ans=0.0 2024-06-20 17:58:36,468 INFO [train.py:1028] (0/2) Epoch 14, batch 3800, loss[loss=0.1997, simple_loss=0.2483, pruned_loss=0.07553, over 13261.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2523, pruned_loss=0.07976, over 2584801.79 frames. 
], batch size: 83, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:58:37,170 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.591e+02 1.805e+02 1.945e+02 2.090e+02 2.810e+02, threshold=3.890e+02, percent-clipped=0.0 2024-06-20 17:58:37,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=248094.0, ans=10.0 2024-06-20 17:58:50,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=248112.33333333334, ans=0.0 2024-06-20 17:59:01,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=248149.0, ans=0.125 2024-06-20 17:59:06,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2024-06-20 17:59:13,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=248167.33333333334, ans=0.125 2024-06-20 17:59:14,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=248167.33333333334, ans=0.125 2024-06-20 17:59:16,306 INFO [train.py:1028] (0/2) Epoch 14, batch 3850, loss[loss=0.1957, simple_loss=0.235, pruned_loss=0.07822, over 13037.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2518, pruned_loss=0.0795, over 2584177.23 frames. ], batch size: 144, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:59:17,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=248185.66666666666, ans=0.0 2024-06-20 17:59:21,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=248185.66666666666, ans=0.0 2024-06-20 17:59:25,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=248204.0, ans=0.125 2024-06-20 17:59:36,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=248222.33333333334, ans=0.0 2024-06-20 17:59:47,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=248240.66666666666, ans=0.125 2024-06-20 17:59:57,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2024-06-20 17:59:58,662 INFO [train.py:1028] (0/2) Epoch 14, batch 3900, loss[loss=0.2058, simple_loss=0.2461, pruned_loss=0.0827, over 13191.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2514, pruned_loss=0.07954, over 2587340.60 frames. ], batch size: 83, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 17:59:59,349 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.746e+02 1.871e+02 2.041e+02 2.633e+02, threshold=3.742e+02, percent-clipped=0.0 2024-06-20 18:00:06,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=248295.66666666666, ans=0.2 2024-06-20 18:00:20,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. 
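limit=6.0

Each optim.py WARNING in this log summarizes the recent distribution of gradient norms as five order statistics (min, 25%, median, 75%, max). In every entry the reported threshold equals Clipping_scale times the median, e.g. 2.0 * 1.871e+02 = 3.742e+02 in the batch 3900 warning above, and percent-clipped reports how often recent gradient norms exceeded the threshold. A sketch of that bookkeeping, purely for illustration (ScaledAdam keeps this state inside the optimizer rather than in a free function like this):

import torch

def clipping_summary(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # grad_norms: 1-D tensor of recent per-batch gradient norms.
    q = torch.quantile(grad_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]  # scale times the median norm
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

Gradients above the threshold get scaled down to it, and the mostly-zero percent-clipped values show the clip engaging only rarely at this stage of training.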
2024-06-20 18:00:27,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=248332.33333333334, ans=0.125 2024-06-20 18:00:29,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=248332.33333333334, ans=0.2 2024-06-20 18:00:31,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=12.0 2024-06-20 18:00:32,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=248332.33333333334, ans=0.125 2024-06-20 18:00:35,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=248350.66666666666, ans=0.125 2024-06-20 18:00:42,954 INFO [train.py:1028] (0/2) Epoch 14, batch 3950, loss[loss=0.2002, simple_loss=0.2382, pruned_loss=0.08109, over 13099.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2512, pruned_loss=0.07932, over 2588540.26 frames. ], batch size: 132, lr: 4.20e-03, grad_scale: 64.0 2024-06-20 18:00:46,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=248369.0, ans=0.0 2024-06-20 18:01:02,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=248405.66666666666, ans=0.0 2024-06-20 18:01:13,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=248424.0, ans=0.025 2024-06-20 18:01:17,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=248442.33333333334, ans=0.09899494936611666 2024-06-20 18:01:17,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=248442.33333333334, ans=0.125 2024-06-20 18:01:23,463 INFO [train.py:1028] (0/2) Epoch 14, batch 4000, loss[loss=0.2112, simple_loss=0.2607, pruned_loss=0.08085, over 12971.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2509, pruned_loss=0.07947, over 2584017.93 frames. ], batch size: 39, lr: 4.19e-03, grad_scale: 64.0 2024-06-20 18:01:24,236 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.829e+02 2.110e+02 2.284e+02 3.853e+02, threshold=4.220e+02, percent-clipped=1.0 2024-06-20 18:01:59,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=248534.0, ans=0.025 2024-06-20 18:02:00,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=248534.0, ans=0.125 2024-06-20 18:02:04,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.whiten.whitening_limit, batch_count=248552.33333333334, ans=12.0 2024-06-20 18:02:04,821 INFO [train.py:1028] (0/2) Epoch 14, batch 4050, loss[loss=0.2439, simple_loss=0.2746, pruned_loss=0.1066, over 10964.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2514, pruned_loss=0.07978, over 2580785.29 frames.
2024-06-20 18:02:14,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=248552.33333333334, ans=0.2
2024-06-20 18:02:33,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=248607.33333333334, ans=0.0
2024-06-20 18:02:42,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=248625.66666666666, ans=0.2
2024-06-20 18:02:46,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=248625.66666666666, ans=0.0
2024-06-20 18:02:52,181 INFO [train.py:1028] (0/2) Epoch 14, batch 4100, loss[loss=0.214, simple_loss=0.2523, pruned_loss=0.08785, over 13118.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2511, pruned_loss=0.07982, over 2576678.76 frames. ], batch size: 103, lr: 4.19e-03, grad_scale: 64.0
2024-06-20 18:02:52,890 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.810e+02 1.962e+02 2.166e+02 3.347e+02, threshold=3.924e+02, percent-clipped=0.0
2024-06-20 18:02:55,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=248644.0, ans=0.0
2024-06-20 18:03:07,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=248680.66666666666, ans=0.0
2024-06-20 18:03:08,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248680.66666666666, ans=0.1
2024-06-20 18:03:15,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=248699.0, ans=0.1
2024-06-20 18:03:20,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.93 vs. limit=15.0
2024-06-20 18:03:21,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248699.0, ans=0.1
2024-06-20 18:03:23,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=248699.0, ans=0.125
2024-06-20 18:03:24,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=248717.33333333334, ans=0.125
2024-06-20 18:03:32,676 INFO [train.py:1028] (0/2) Epoch 14, batch 4150, loss[loss=0.1939, simple_loss=0.2409, pruned_loss=0.07342, over 13118.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2506, pruned_loss=0.07975, over 2574248.96 frames. ], batch size: 55, lr: 4.19e-03, grad_scale: 64.0
2024-06-20 18:03:36,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=248735.66666666666, ans=0.0
2024-06-20 18:03:45,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=248754.0, ans=0.1
2024-06-20 18:04:02,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0
2024-06-20 18:04:07,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=248809.0, ans=0.0
2024-06-20 18:04:11,926 INFO [train.py:1028] (0/2) Epoch 14, batch 4200, loss[loss=0.1967, simple_loss=0.2384, pruned_loss=0.07755, over 13012.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2506, pruned_loss=0.07992, over 2577895.72 frames. ], batch size: 102, lr: 4.19e-03, grad_scale: 64.0
2024-06-20 18:04:12,602 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.774e+02 1.842e+02 1.990e+02 2.594e+02, threshold=3.684e+02, percent-clipped=0.0
2024-06-20 18:04:26,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=248864.0, ans=0.125
2024-06-20 18:04:32,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=248864.0, ans=0.125
2024-06-20 18:04:33,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=248864.0, ans=0.2
2024-06-20 18:04:45,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=248882.33333333334, ans=0.0
2024-06-20 18:04:46,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0
2024-06-20 18:04:53,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.87 vs. limit=15.0
2024-06-20 18:04:55,675 INFO [train.py:1028] (0/2) Epoch 14, batch 4250, loss[loss=0.206, simple_loss=0.2593, pruned_loss=0.07631, over 13316.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2507, pruned_loss=0.0798, over 2580383.63 frames. ], batch size: 46, lr: 4.19e-03, grad_scale: 64.0
2024-06-20 18:05:02,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.96 vs. limit=10.0
2024-06-20 18:05:13,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.32 vs. limit=12.0
2024-06-20 18:05:23,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=248974.0, ans=0.07
2024-06-20 18:05:23,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=248974.0, ans=0.025
2024-06-20 18:05:25,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=248974.0, ans=15.0
2024-06-20 18:05:38,362 INFO [train.py:1028] (0/2) Epoch 14, batch 4300, loss[loss=0.2022, simple_loss=0.2496, pruned_loss=0.0774, over 13146.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2507, pruned_loss=0.08002, over 2580847.50 frames. ], batch size: 59, lr: 4.19e-03, grad_scale: 128.0
2024-06-20 18:05:39,072 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.829e+02 1.977e+02 2.267e+02 3.051e+02, threshold=3.953e+02, percent-clipped=0.0
2024-06-20 18:05:48,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.53 vs. limit=22.5
2024-06-20 18:05:55,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=249047.33333333334, ans=0.2
2024-06-20 18:06:08,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=249084.0, ans=0.125
2024-06-20 18:06:16,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=249102.33333333334, ans=0.04949747468305833
2024-06-20 18:06:16,821 INFO [train.py:1028] (0/2) Epoch 14, batch 4350, loss[loss=0.1782, simple_loss=0.2323, pruned_loss=0.06204, over 13221.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2501, pruned_loss=0.0798, over 2585344.19 frames. ], batch size: 59, lr: 4.19e-03, grad_scale: 128.0
2024-06-20 18:06:37,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=249139.0, ans=0.125
2024-06-20 18:06:44,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0
2024-06-20 18:07:00,205 INFO [train.py:1028] (0/2) Epoch 14, batch 4400, loss[loss=0.1832, simple_loss=0.2339, pruned_loss=0.06624, over 13240.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2494, pruned_loss=0.07939, over 2585228.09 frames. ], batch size: 83, lr: 4.19e-03, grad_scale: 128.0
2024-06-20 18:07:01,101 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.737e+02 1.848e+02 2.010e+02 2.772e+02, threshold=3.696e+02, percent-clipped=0.0
2024-06-20 18:07:04,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.97 vs. limit=6.0
2024-06-20 18:07:05,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=249194.0, ans=0.125
2024-06-20 18:07:07,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=249212.33333333334, ans=0.2
2024-06-20 18:07:30,623 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 18:07:37,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=249267.33333333334, ans=0.025
2024-06-20 18:07:43,394 INFO [train.py:1028] (0/2) Epoch 14, batch 4450, loss[loss=0.206, simple_loss=0.2504, pruned_loss=0.08083, over 12836.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2496, pruned_loss=0.07953, over 2579728.16 frames. ], batch size: 33, lr: 4.19e-03, grad_scale: 128.0
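The ScheduledFloat records from scaling.py trace hyperparameters (skip rates, balancer probabilities, dropout values) that change as a function of batch_count; ans is the value in effect at the logged batch_count. A rough sketch of a piecewise-linear schedule of this kind is below; the breakpoints are hypothetical examples, and the real ScheduledFloat class in icefall's scaling.py carries additional machinery beyond this.

    def scheduled_float(points, batch_count):
        # points: sorted (batch_count, value) breakpoints; the value is held
        # flat outside the breakpoints and interpolated linearly between them.
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # By batch_count ~249000 a schedule such as [(0.0, 0.2), (4000.0, 0.0)]
    # (hypothetical) has long since reached its final value, which is consistent
    # with the many "ans=0.0" entries above.
    print(scheduled_float([(0.0, 0.2), (4000.0, 0.0)], 249267.33))  # -> 0.0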
2024-06-20 18:08:02,648 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-136000.pt
2024-06-20 18:08:17,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=249340.66666666666, ans=0.0
2024-06-20 18:08:19,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=249359.0, ans=0.0
2024-06-20 18:08:19,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=249359.0, ans=0.125
2024-06-20 18:08:22,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=249359.0, ans=0.125
2024-06-20 18:08:27,677 INFO [train.py:1028] (0/2) Epoch 14, batch 4500, loss[loss=0.2093, simple_loss=0.2558, pruned_loss=0.08143, over 13217.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2484, pruned_loss=0.0788, over 2584620.05 frames. ], batch size: 89, lr: 4.19e-03, grad_scale: 128.0
2024-06-20 18:08:28,420 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.796e+02 1.938e+02 2.168e+02 3.017e+02, threshold=3.877e+02, percent-clipped=0.0
2024-06-20 18:08:28,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=249377.33333333334, ans=0.0
2024-06-20 18:08:58,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.47 vs. limit=22.5
2024-06-20 18:09:04,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=249450.66666666666, ans=0.125
2024-06-20 18:09:07,294 INFO [train.py:1028] (0/2) Epoch 14, batch 4550, loss[loss=0.2083, simple_loss=0.254, pruned_loss=0.08127, over 13292.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2489, pruned_loss=0.07896, over 2588550.97 frames. ], batch size: 52, lr: 4.19e-03, grad_scale: 128.0
2024-06-20 18:09:07,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=249469.0, ans=0.025
2024-06-20 18:09:16,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=15.0
2024-06-20 18:09:18,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249487.33333333334, ans=0.1
2024-06-20 18:09:22,104 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5
2024-06-20 18:09:44,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=249542.33333333334, ans=0.0
2024-06-20 18:09:46,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249542.33333333334, ans=0.1
2024-06-20 18:09:49,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=249542.33333333334, ans=0.2
2024-06-20 18:09:51,208 INFO [train.py:1028] (0/2) Epoch 14, batch 4600, loss[loss=0.2271, simple_loss=0.2632, pruned_loss=0.0955, over 12549.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.249, pruned_loss=0.07879, over 2585024.88 frames. ], batch size: 202, lr: 4.19e-03, grad_scale: 128.0
2024-06-20 18:09:51,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.845e+02 1.990e+02 2.231e+02 3.373e+02, threshold=3.979e+02, percent-clipped=0.0
2024-06-20 18:09:59,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=249560.66666666666, ans=0.125
2024-06-20 18:10:03,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=249579.0, ans=0.0
2024-06-20 18:10:04,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=249579.0, ans=0.2
2024-06-20 18:10:10,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.33 vs. limit=15.0
2024-06-20 18:10:22,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=249615.66666666666, ans=0.125
2024-06-20 18:10:23,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=249615.66666666666, ans=0.125
2024-06-20 18:10:33,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.20 vs. limit=22.5
2024-06-20 18:10:33,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=249652.33333333334, ans=0.125
2024-06-20 18:10:34,398 INFO [train.py:1028] (0/2) Epoch 14, batch 4650, loss[loss=0.1784, simple_loss=0.223, pruned_loss=0.06694, over 13133.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2488, pruned_loss=0.07879, over 2587838.02 frames. ], batch size: 132, lr: 4.18e-03, grad_scale: 128.0
2024-06-20 18:10:57,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=249689.0, ans=0.0
2024-06-20 18:11:15,417 INFO [train.py:1028] (0/2) Epoch 14, batch 4700, loss[loss=0.2171, simple_loss=0.2682, pruned_loss=0.08301, over 12361.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2487, pruned_loss=0.07877, over 2584221.04 frames. ], batch size: 25, lr: 4.18e-03, grad_scale: 128.0
2024-06-20 18:11:16,188 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.841e+02 1.961e+02 2.198e+02 2.972e+02, threshold=3.922e+02, percent-clipped=0.0
2024-06-20 18:11:24,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=249762.33333333334, ans=0.125
2024-06-20 18:11:26,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=249762.33333333334, ans=0.125
2024-06-20 18:11:28,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=249762.33333333334, ans=0.125
2024-06-20 18:11:37,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=249780.66666666666, ans=0.125
2024-06-20 18:11:38,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=249799.0, ans=0.125
2024-06-20 18:11:49,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=15.0
2024-06-20 18:12:06,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249835.66666666666, ans=0.125
2024-06-20 18:12:06,889 INFO [train.py:1028] (0/2) Epoch 14, batch 4750, loss[loss=0.2235, simple_loss=0.2617, pruned_loss=0.09269, over 12489.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2491, pruned_loss=0.07934, over 2581105.87 frames. ], batch size: 202, lr: 4.18e-03, grad_scale: 128.0
2024-06-20 18:12:08,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=249835.66666666666, ans=0.1
2024-06-20 18:12:09,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=249835.66666666666, ans=0.0
2024-06-20 18:12:43,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=249890.66666666666, ans=0.0
2024-06-20 18:12:55,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=249909.0, ans=0.1
2024-06-20 18:13:02,199 INFO [train.py:1028] (0/2) Epoch 14, batch 4800, loss[loss=0.1806, simple_loss=0.2279, pruned_loss=0.06665, over 13247.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2486, pruned_loss=0.07909, over 2577650.31 frames. ], batch size: 63, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:13:02,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.79 vs. limit=22.5
2024-06-20 18:13:04,161 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 1.855e+02 2.078e+02 2.342e+02 3.031e+02, threshold=4.156e+02, percent-clipped=0.0
2024-06-20 18:13:13,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=249945.66666666666, ans=0.5
2024-06-20 18:13:15,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0
2024-06-20 18:13:17,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=249945.66666666666, ans=0.125
2024-06-20 18:13:23,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249964.0, ans=0.1
2024-06-20 18:13:25,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.03 vs. limit=6.0
2024-06-20 18:13:32,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249982.33333333334, ans=0.1
2024-06-20 18:13:33,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=249982.33333333334, ans=0.125
2024-06-20 18:13:49,178 INFO [train.py:1028] (0/2) Epoch 14, batch 4850, loss[loss=0.1938, simple_loss=0.2392, pruned_loss=0.07418, over 13259.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2485, pruned_loss=0.07911, over 2575204.19 frames. ], batch size: 89, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:13:49,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=250019.0, ans=0.125
2024-06-20 18:13:49,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.26 vs. limit=15.0
2024-06-20 18:13:54,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=250019.0, ans=0.5
2024-06-20 18:13:54,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=250019.0, ans=0.125
2024-06-20 18:13:56,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=250019.0, ans=0.125
2024-06-20 18:14:02,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250037.33333333334, ans=0.1
2024-06-20 18:14:06,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=250037.33333333334, ans=0.125
2024-06-20 18:14:31,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=250092.33333333334, ans=0.0
2024-06-20 18:14:34,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=250092.33333333334, ans=0.125
2024-06-20 18:14:37,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0
2024-06-20 18:14:38,460 INFO [train.py:1028] (0/2) Epoch 14, batch 4900, loss[loss=0.1821, simple_loss=0.24, pruned_loss=0.06214, over 13242.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.248, pruned_loss=0.07884, over 2575721.06 frames. ], batch size: 59, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:14:40,340 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.761e+02 1.892e+02 2.045e+02 2.723e+02, threshold=3.784e+02, percent-clipped=0.0
2024-06-20 18:14:41,873 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0
2024-06-20 18:14:44,457 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.57 vs. limit=15.0
2024-06-20 18:14:45,046 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 18:14:54,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=250129.0, ans=0.025
2024-06-20 18:14:59,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=250129.0, ans=0.02
2024-06-20 18:15:07,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=250147.33333333334, ans=0.125
2024-06-20 18:15:13,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.45 vs. limit=22.5
2024-06-20 18:15:28,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=250165.66666666666, ans=0.2
2024-06-20 18:15:29,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=250165.66666666666, ans=0.0
2024-06-20 18:15:35,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=250184.0, ans=0.1
2024-06-20 18:15:35,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250184.0, ans=0.1
2024-06-20 18:15:40,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=250202.33333333334, ans=0.025
2024-06-20 18:15:41,460 INFO [train.py:1028] (0/2) Epoch 14, batch 4950, loss[loss=0.2038, simple_loss=0.243, pruned_loss=0.08232, over 11091.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2486, pruned_loss=0.07929, over 2569868.86 frames. ], batch size: 304, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:15:43,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=250202.33333333334, ans=0.125
2024-06-20 18:16:19,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250275.66666666666, ans=0.1
2024-06-20 18:16:23,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=250275.66666666666, ans=0.125
2024-06-20 18:16:27,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250294.0, ans=0.1
2024-06-20 18:16:28,127 INFO [train.py:1028] (0/2) Epoch 14, batch 5000, loss[loss=0.1947, simple_loss=0.2402, pruned_loss=0.07462, over 13104.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2482, pruned_loss=0.07877, over 2574735.75 frames. ], batch size: 95, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:16:29,748 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.773e+02 1.909e+02 2.061e+02 2.857e+02, threshold=3.818e+02, percent-clipped=0.0
2024-06-20 18:16:30,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.28 vs. limit=10.0
2024-06-20 18:16:33,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250294.0, ans=0.1
2024-06-20 18:16:35,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=250294.0, ans=0.125
2024-06-20 18:16:39,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=250312.33333333334, ans=0.2
2024-06-20 18:16:56,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=250330.66666666666, ans=0.2
2024-06-20 18:17:02,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=250349.0, ans=0.07
2024-06-20 18:17:20,215 INFO [train.py:1028] (0/2) Epoch 14, batch 5050, loss[loss=0.2123, simple_loss=0.2593, pruned_loss=0.08271, over 12901.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2478, pruned_loss=0.07851, over 2573149.70 frames. ], batch size: 36, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:17:46,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.02 vs. limit=15.0
2024-06-20 18:17:47,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=250422.33333333334, ans=0.125
2024-06-20 18:18:08,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=250459.0, ans=0.125
2024-06-20 18:18:19,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=250477.33333333334, ans=0.0
2024-06-20 18:18:19,947 INFO [train.py:1028] (0/2) Epoch 14, batch 5100, loss[loss=0.2185, simple_loss=0.2701, pruned_loss=0.08344, over 12982.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2473, pruned_loss=0.07854, over 2568157.76 frames. ], batch size: 39, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:18:21,749 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.779e+02 1.925e+02 2.149e+02 3.112e+02, threshold=3.850e+02, percent-clipped=0.0
2024-06-20 18:18:34,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=250495.66666666666, ans=0.0
2024-06-20 18:18:38,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=250514.0, ans=0.125
2024-06-20 18:19:04,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=250550.66666666666, ans=0.1
2024-06-20 18:19:08,570 INFO [train.py:1028] (0/2) Epoch 14, batch 5150, loss[loss=0.1983, simple_loss=0.2334, pruned_loss=0.08156, over 13100.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2473, pruned_loss=0.07883, over 2570429.50 frames. ], batch size: 132, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:19:22,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=250587.33333333334, ans=0.125
2024-06-20 18:19:30,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=250605.66666666666, ans=0.0
2024-06-20 18:19:31,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=250605.66666666666, ans=0.125
2024-06-20 18:19:51,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=250642.33333333334, ans=0.05
2024-06-20 18:19:55,803 INFO [train.py:1028] (0/2) Epoch 14, batch 5200, loss[loss=0.1722, simple_loss=0.2175, pruned_loss=0.06349, over 13125.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.247, pruned_loss=0.07866, over 2574632.37 frames. ], batch size: 95, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:19:57,630 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.828e+02 1.960e+02 2.126e+02 3.107e+02, threshold=3.919e+02, percent-clipped=0.0
2024-06-20 18:19:57,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=250660.66666666666, ans=0.125
2024-06-20 18:20:05,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=250679.0, ans=0.04949747468305833
2024-06-20 18:20:13,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=250697.33333333334, ans=0.2
2024-06-20 18:20:15,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.78 vs. limit=22.5
2024-06-20 18:20:17,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=250697.33333333334, ans=0.125
2024-06-20 18:20:21,792 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.48 vs. limit=10.0
2024-06-20 18:20:22,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250697.33333333334, ans=0.1
2024-06-20 18:20:40,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250734.0, ans=0.1
2024-06-20 18:20:51,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=250752.33333333334, ans=0.2
2024-06-20 18:20:52,204 INFO [train.py:1028] (0/2) Epoch 14, batch 5250, loss[loss=0.1866, simple_loss=0.2382, pruned_loss=0.06749, over 13295.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2469, pruned_loss=0.07841, over 2571886.62 frames. ], batch size: 52, lr: 4.18e-03, grad_scale: 64.0
2024-06-20 18:21:00,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=250770.66666666666, ans=0.125
2024-06-20 18:21:10,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=250770.66666666666, ans=0.125
2024-06-20 18:21:17,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=250789.0, ans=0.125
2024-06-20 18:21:18,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.64 vs. limit=22.5
2024-06-20 18:21:39,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250825.66666666666, ans=0.1
2024-06-20 18:21:42,759 INFO [train.py:1028] (0/2) Epoch 14, batch 5300, loss[loss=0.1926, simple_loss=0.2368, pruned_loss=0.0742, over 13026.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2463, pruned_loss=0.07806, over 2568058.56 frames. ], batch size: 144, lr: 4.17e-03, grad_scale: 64.0
2024-06-20 18:21:44,943 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.809e+02 1.959e+02 2.124e+02 3.333e+02, threshold=3.918e+02, percent-clipped=0.0
2024-06-20 18:21:52,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=250862.33333333334, ans=0.125
2024-06-20 18:21:53,165 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.89 vs. limit=6.0
2024-06-20 18:21:54,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=250862.33333333334, ans=0.0
2024-06-20 18:21:57,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250862.33333333334, ans=0.1
2024-06-20 18:22:02,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250880.66666666666, ans=0.1
2024-06-20 18:22:15,393 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.231e+00
2024-06-20 18:22:17,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=12.0
2024-06-20 18:22:22,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=250917.33333333334, ans=0.0
2024-06-20 18:22:26,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=250917.33333333334, ans=0.0
2024-06-20 18:22:32,925 INFO [train.py:1028] (0/2) Epoch 14, batch 5350, loss[loss=0.2298, simple_loss=0.2772, pruned_loss=0.09123, over 11641.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2465, pruned_loss=0.07865, over 2574411.23 frames. ], batch size: 16, lr: 4.17e-03, grad_scale: 64.0
2024-06-20 18:22:50,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=250954.0, ans=0.125
2024-06-20 18:22:51,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=250972.33333333334, ans=0.2
2024-06-20 18:22:56,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=250972.33333333334, ans=0.125
2024-06-20 18:23:02,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=250990.66666666666, ans=0.125
2024-06-20 18:23:19,066 INFO [train.py:1028] (0/2) Epoch 14, batch 5400, loss[loss=0.2124, simple_loss=0.2492, pruned_loss=0.08778, over 12267.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2467, pruned_loss=0.07903, over 2566804.82 frames. ], batch size: 241, lr: 4.17e-03, grad_scale: 64.0
2024-06-20 18:23:21,046 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.763e+02 1.882e+02 2.068e+02 2.790e+02, threshold=3.764e+02, percent-clipped=0.0
2024-06-20 18:23:25,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=251027.33333333334, ans=0.125
2024-06-20 18:23:27,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.12 vs. limit=10.0
2024-06-20 18:23:42,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=251045.66666666666, ans=0.2
2024-06-20 18:24:06,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=251082.33333333334, ans=0.0
2024-06-20 18:24:08,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=251082.33333333334, ans=0.125
2024-06-20 18:24:18,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=251100.66666666666, ans=0.2
2024-06-20 18:24:20,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=251100.66666666666, ans=10.0
2024-06-20 18:24:22,458 INFO [train.py:1028] (0/2) Epoch 14, batch 5450, loss[loss=0.2126, simple_loss=0.2568, pruned_loss=0.08425, over 12454.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2469, pruned_loss=0.07931, over 2569917.90 frames. ], batch size: 25, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:24:58,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.97 vs. limit=22.5
2024-06-20 18:25:04,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0
2024-06-20 18:25:08,391 INFO [train.py:1028] (0/2) Epoch 14, batch 5500, loss[loss=0.2374, simple_loss=0.268, pruned_loss=0.1034, over 12115.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2469, pruned_loss=0.07923, over 2561929.00 frames. ], batch size: 240, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:25:09,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0
2024-06-20 18:25:11,322 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.547e+02 1.761e+02 1.888e+02 2.090e+02 2.925e+02, threshold=3.776e+02, percent-clipped=0.0
2024-06-20 18:25:25,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251229.0, ans=0.1
2024-06-20 18:25:56,951 INFO [train.py:1028] (0/2) Epoch 14, batch 5550, loss[loss=0.2105, simple_loss=0.2582, pruned_loss=0.08146, over 13256.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2468, pruned_loss=0.07854, over 2566551.13 frames. ], batch size: 43, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:25:58,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.82 vs. limit=22.5
2024-06-20 18:25:59,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=251302.33333333334, ans=0.2
2024-06-20 18:26:04,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=251302.33333333334, ans=0.0
2024-06-20 18:26:27,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=251339.0, ans=0.0
2024-06-20 18:26:35,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=251357.33333333334, ans=0.07
2024-06-20 18:26:37,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=251357.33333333334, ans=0.125
2024-06-20 18:26:50,693 INFO [train.py:1028] (0/2) Epoch 14, batch 5600, loss[loss=0.1965, simple_loss=0.2463, pruned_loss=0.07332, over 13205.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2463, pruned_loss=0.07842, over 2569579.83 frames. ], batch size: 89, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:26:53,342 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.561e+02 1.786e+02 1.897e+02 2.090e+02 3.050e+02, threshold=3.794e+02, percent-clipped=0.0
2024-06-20 18:26:54,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=251394.0, ans=0.0
2024-06-20 18:27:07,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=251412.33333333334, ans=0.0
2024-06-20 18:27:19,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=15.0
2024-06-20 18:27:20,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.16 vs. limit=15.0
2024-06-20 18:27:34,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=251467.33333333334, ans=0.125
2024-06-20 18:27:44,428 INFO [train.py:1028] (0/2) Epoch 14, batch 5650, loss[loss=0.2111, simple_loss=0.2591, pruned_loss=0.08156, over 12511.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2462, pruned_loss=0.078, over 2575458.06 frames. ], batch size: 202, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:27:46,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.29 vs. limit=22.5
2024-06-20 18:27:50,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=251485.66666666666, ans=0.0
2024-06-20 18:27:56,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.70 vs. limit=15.0
2024-06-20 18:28:15,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=251540.66666666666, ans=0.0
2024-06-20 18:28:16,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=251540.66666666666, ans=0.1
2024-06-20 18:28:30,448 INFO [train.py:1028] (0/2) Epoch 14, batch 5700, loss[loss=0.201, simple_loss=0.25, pruned_loss=0.07603, over 13288.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2463, pruned_loss=0.07822, over 2579573.66 frames. ], batch size: 63, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:28:32,741 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.761e+02 1.841e+02 2.020e+02 2.713e+02, threshold=3.682e+02, percent-clipped=0.0
2024-06-20 18:28:43,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.73 vs. limit=15.0
2024-06-20 18:28:45,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251595.66666666666, ans=0.1
2024-06-20 18:28:50,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.69 vs. limit=15.0
2024-06-20 18:29:07,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.73 vs. limit=12.0
2024-06-20 18:29:08,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251650.66666666666, ans=0.1
2024-06-20 18:29:14,867 INFO [train.py:1028] (0/2) Epoch 14, batch 5750, loss[loss=0.2012, simple_loss=0.2458, pruned_loss=0.07832, over 12735.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.247, pruned_loss=0.07834, over 2579822.50 frames. ], batch size: 176, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:29:49,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=251705.66666666666, ans=0.05
2024-06-20 18:30:14,646 INFO [train.py:1028] (0/2) Epoch 14, batch 5800, loss[loss=0.2487, simple_loss=0.28, pruned_loss=0.1087, over 12750.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2483, pruned_loss=0.07923, over 2578258.96 frames. ], batch size: 176, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:30:14,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251760.66666666666, ans=0.1
2024-06-20 18:30:17,325 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.606e+02 1.822e+02 1.912e+02 2.053e+02 2.592e+02, threshold=3.823e+02, percent-clipped=0.0
2024-06-20 18:30:20,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=251760.66666666666, ans=0.0
2024-06-20 18:30:21,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=251760.66666666666, ans=0.125
2024-06-20 18:30:28,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=251779.0, ans=0.125
2024-06-20 18:30:48,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=251815.66666666666, ans=0.125
2024-06-20 18:30:49,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=251834.0, ans=0.1
2024-06-20 18:30:57,284 INFO [train.py:1028] (0/2) Epoch 14, batch 5850, loss[loss=0.2152, simple_loss=0.2551, pruned_loss=0.08761, over 12590.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2501, pruned_loss=0.08021, over 2576462.37 frames. ], batch size: 202, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:31:01,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251852.33333333334, ans=0.1
2024-06-20 18:31:02,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.18 vs. limit=15.0
2024-06-20 18:31:05,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=251870.66666666666, ans=0.2
2024-06-20 18:31:06,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=251870.66666666666, ans=0.125
2024-06-20 18:31:06,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=251870.66666666666, ans=0.1
2024-06-20 18:31:10,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=251870.66666666666, ans=15.0
2024-06-20 18:31:23,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=251907.33333333334, ans=0.2
2024-06-20 18:31:27,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=251907.33333333334, ans=0.025
2024-06-20 18:31:27,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0
2024-06-20 18:31:38,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=251925.66666666666, ans=0.0
2024-06-20 18:31:43,622 INFO [train.py:1028] (0/2) Epoch 14, batch 5900, loss[loss=0.1864, simple_loss=0.2303, pruned_loss=0.07123, over 13075.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.252, pruned_loss=0.0808, over 2577109.11 frames. ], batch size: 121, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:31:47,010 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 1.859e+02 2.040e+02 2.234e+02 3.471e+02, threshold=4.080e+02, percent-clipped=0.0
2024-06-20 18:32:21,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=251999.0, ans=0.125
2024-06-20 18:32:24,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=251999.0, ans=0.0
2024-06-20 18:32:46,961 INFO [train.py:1028] (0/2) Epoch 14, batch 5950, loss[loss=0.1847, simple_loss=0.2336, pruned_loss=0.06787, over 13107.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2533, pruned_loss=0.08099, over 2581428.23 frames. ], batch size: 121, lr: 4.17e-03, grad_scale: 32.0
2024-06-20 18:32:51,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=252035.66666666666, ans=0.025
2024-06-20 18:32:52,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=252035.66666666666, ans=0.125
2024-06-20 18:32:53,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=252035.66666666666, ans=0.125
2024-06-20 18:33:01,029 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.72 vs. limit=22.5
2024-06-20 18:33:02,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=252054.0, ans=0.125
2024-06-20 18:33:10,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=252072.33333333334, ans=0.0
2024-06-20 18:33:10,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=252072.33333333334, ans=0.125
2024-06-20 18:33:13,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=252072.33333333334, ans=0.125
2024-06-20 18:33:20,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=252090.66666666666, ans=0.2
2024-06-20 18:33:33,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=252109.0, ans=0.125
2024-06-20 18:33:34,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.29 vs. limit=22.5
2024-06-20 18:33:37,161 INFO [train.py:1028] (0/2) Epoch 14, batch 6000, loss[loss=0.255, simple_loss=0.2901, pruned_loss=0.1099, over 12264.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2549, pruned_loss=0.0817, over 2574893.21 frames. ], batch size: 240, lr: 4.16e-03, grad_scale: 32.0
2024-06-20 18:33:37,162 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 18:33:48,560 INFO [train.py:1060] (0/2) Epoch 14, validation: loss=0.1905, simple_loss=0.255, pruned_loss=0.06294, over 351949.00 frames.
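Each train.py:1028 record pairs a single-batch loss ("over N frames") with a running tot_loss aggregated over millions of frames, and the validation record just above is the same per-frame statistic computed over the full dev set (351949 frames, the same count as in epoch 1). Below is a sketch of plain frame-weighted averaging consistent with those numbers; it is an assumption that this matches icefall's actual aggregation, which may also down-weight older batches when forming tot_loss.

    class FrameWeightedLoss:
        # Running frame-weighted average of a per-frame loss (simplified stand-in).
        def __init__(self):
            self.loss_sum = 0.0  # sum over batches of (per-frame loss * frames)
            self.frames = 0.0

        def update(self, loss, frames):
            # `loss` is the per-frame average for one batch, as in the
            # "loss[loss=..., over 13107.00 frames. ]" fields above.
            self.loss_sum += loss * frames
            self.frames += frames

        def average(self):
            return self.loss_sum / max(self.frames, 1.0)

    tracker = FrameWeightedLoss()
    tracker.update(0.1905, 351949.0)  # the validation record above
    print(tracker.average())          # -> 0.1905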
2024-06-20 18:33:48,561 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-20 18:33:51,849 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.650e+02 1.829e+02 1.927e+02 2.147e+02 2.908e+02, threshold=3.854e+02, percent-clipped=0.0
2024-06-20 18:34:05,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.85 vs. limit=15.0
2024-06-20 18:34:06,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=252145.66666666666, ans=0.95
2024-06-20 18:34:29,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=252200.66666666666, ans=0.125
2024-06-20 18:34:37,928 INFO [train.py:1028] (0/2) Epoch 14, batch 6050, loss[loss=0.2061, simple_loss=0.2533, pruned_loss=0.0794, over 12963.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2571, pruned_loss=0.08237, over 2577747.70 frames. ], batch size: 39, lr: 4.16e-03, grad_scale: 32.0
2024-06-20 18:34:44,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=252219.0, ans=0.2
2024-06-20 18:35:14,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=252274.0, ans=0.0
2024-06-20 18:35:19,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.09 vs. limit=10.0
2024-06-20 18:35:35,582 INFO [train.py:1028] (0/2) Epoch 14, batch 6100, loss[loss=0.1794, simple_loss=0.2275, pruned_loss=0.06571, over 13106.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2573, pruned_loss=0.0823, over 2579696.63 frames. ], batch size: 121, lr: 4.16e-03, grad_scale: 32.0
2024-06-20 18:35:36,488 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=12.0
2024-06-20 18:35:38,347 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.865e+02 2.039e+02 2.256e+02 2.881e+02, threshold=4.078e+02, percent-clipped=0.0
2024-06-20 18:36:00,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=252347.33333333334, ans=0.1
2024-06-20 18:36:13,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=252365.66666666666, ans=0.0
2024-06-20 18:36:15,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.46 vs. limit=15.0
2024-06-20 18:36:28,944 INFO [train.py:1028] (0/2) Epoch 14, batch 6150, loss[loss=0.2351, simple_loss=0.2723, pruned_loss=0.09893, over 10950.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2589, pruned_loss=0.08317, over 2577773.98 frames. ], batch size: 304, lr: 4.16e-03, grad_scale: 32.0
2024-06-20 18:36:58,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=252457.33333333334, ans=0.125
2024-06-20 18:37:12,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5
2024-06-20 18:37:16,280 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.16 vs. limit=15.0
2024-06-20 18:37:18,169 INFO [train.py:1028] (0/2) Epoch 14, batch 6200, loss[loss=0.2477, simple_loss=0.2916, pruned_loss=0.102, over 13228.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2605, pruned_loss=0.08394, over 2575507.93 frames. ], batch size: 89, lr: 4.16e-03, grad_scale: 32.0
2024-06-20 18:37:21,169 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.916e+02 2.075e+02 2.395e+02 3.221e+02, threshold=4.151e+02, percent-clipped=0.0
2024-06-20 18:37:26,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.27 vs. limit=22.5
2024-06-20 18:37:32,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=252512.33333333334, ans=0.0
2024-06-20 18:37:34,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=252512.33333333334, ans=0.125
2024-06-20 18:37:38,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=252530.66666666666, ans=0.125
2024-06-20 18:37:45,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0
2024-06-20 18:37:49,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=252549.0, ans=0.0
2024-06-20 18:37:54,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=252567.33333333334, ans=0.0
2024-06-20 18:37:55,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=252567.33333333334, ans=0.125
2024-06-20 18:38:03,086 INFO [train.py:1028] (0/2) Epoch 14, batch 6250, loss[loss=0.2261, simple_loss=0.2722, pruned_loss=0.09004, over 13215.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2615, pruned_loss=0.08451, over 2567690.89 frames. ], batch size: 83, lr: 4.16e-03, grad_scale: 32.0
2024-06-20 18:38:04,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=252585.66666666666, ans=10.0
2024-06-20 18:38:04,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=252585.66666666666, ans=0.07
2024-06-20 18:38:09,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0
2024-06-20 18:38:18,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=252585.66666666666, ans=0.125
2024-06-20 18:38:50,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=252640.66666666666, ans=0.05
2024-06-20 18:38:54,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.47 vs. limit=22.5
limit=22.5 2024-06-20 18:39:04,917 INFO [train.py:1028] (0/2) Epoch 14, batch 6300, loss[loss=0.216, simple_loss=0.2571, pruned_loss=0.08745, over 11318.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2632, pruned_loss=0.08527, over 2562793.32 frames. ], batch size: 16, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:39:07,393 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.662e+02 1.936e+02 2.114e+02 2.455e+02 3.862e+02, threshold=4.229e+02, percent-clipped=0.0 2024-06-20 18:39:17,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=252695.66666666666, ans=0.1 2024-06-20 18:39:22,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=252714.0, ans=0.0 2024-06-20 18:39:39,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.03 vs. limit=15.0 2024-06-20 18:39:45,843 INFO [train.py:1028] (0/2) Epoch 14, batch 6350, loss[loss=0.2241, simple_loss=0.2712, pruned_loss=0.08849, over 12531.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2648, pruned_loss=0.0856, over 2572992.38 frames. ], batch size: 202, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:39:59,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=252787.33333333334, ans=0.2 2024-06-20 18:40:03,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=12.0 2024-06-20 18:40:03,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.29 vs. limit=5.0 2024-06-20 18:40:03,561 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.555e+00 2024-06-20 18:40:11,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=252824.0, ans=0.125 2024-06-20 18:40:15,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=252824.0, ans=0.125 2024-06-20 18:40:18,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=252824.0, ans=0.2 2024-06-20 18:40:27,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.18 vs. limit=22.5 2024-06-20 18:40:30,012 INFO [train.py:1028] (0/2) Epoch 14, batch 6400, loss[loss=0.2145, simple_loss=0.2576, pruned_loss=0.08575, over 13235.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2667, pruned_loss=0.08631, over 2573790.68 frames. ], batch size: 67, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:40:32,803 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.590e+02 1.934e+02 2.095e+02 2.383e+02 3.300e+02, threshold=4.190e+02, percent-clipped=0.0 2024-06-20 18:40:32,971 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:41:02,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.69 vs. limit=15.0 2024-06-20 18:41:27,412 INFO [train.py:1028] (0/2) Epoch 14, batch 6450, loss[loss=0.2715, simple_loss=0.304, pruned_loss=0.1195, over 12524.00 frames. 
], tot_loss[loss=0.2213, simple_loss=0.2686, pruned_loss=0.08698, over 2579940.98 frames. ], batch size: 202, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:41:49,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=252989.0, ans=0.125 2024-06-20 18:41:59,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=253007.33333333334, ans=0.125 2024-06-20 18:42:02,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=253007.33333333334, ans=0.125 2024-06-20 18:42:09,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.03 vs. limit=10.0 2024-06-20 18:42:14,914 INFO [train.py:1028] (0/2) Epoch 14, batch 6500, loss[loss=0.2285, simple_loss=0.2627, pruned_loss=0.09714, over 10682.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2706, pruned_loss=0.08737, over 2583659.49 frames. ], batch size: 303, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:42:18,013 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.907e+02 2.117e+02 2.353e+02 3.080e+02, threshold=4.235e+02, percent-clipped=0.0 2024-06-20 18:42:30,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=253062.33333333334, ans=0.0 2024-06-20 18:42:35,820 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.322e+00 2024-06-20 18:42:44,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=253099.0, ans=0.125 2024-06-20 18:42:45,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=253099.0, ans=0.0 2024-06-20 18:42:56,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.58 vs. limit=10.0 2024-06-20 18:43:00,315 INFO [train.py:1028] (0/2) Epoch 14, batch 6550, loss[loss=0.2097, simple_loss=0.2673, pruned_loss=0.07608, over 12513.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2714, pruned_loss=0.08731, over 2587983.97 frames. ], batch size: 22, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:43:08,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2024-06-20 18:43:16,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=253154.0, ans=0.125 2024-06-20 18:43:18,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=253172.33333333334, ans=0.0 2024-06-20 18:43:30,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=253190.66666666666, ans=0.125 2024-06-20 18:43:47,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=253227.33333333334, ans=0.125 2024-06-20 18:43:47,911 INFO [train.py:1028] (0/2) Epoch 14, batch 6600, loss[loss=0.2082, simple_loss=0.2596, pruned_loss=0.07844, over 13237.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.2717, pruned_loss=0.08738, over 2589834.01 frames. ], batch size: 72, lr: 4.16e-03, grad_scale: 32.0 2024-06-20 18:43:50,955 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.959e+02 2.119e+02 2.253e+02 3.068e+02, threshold=4.238e+02, percent-clipped=0.0 2024-06-20 18:43:54,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.16 vs. limit=22.5 2024-06-20 18:43:56,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=253245.66666666666, ans=0.0 2024-06-20 18:44:08,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=253264.0, ans=0.1 2024-06-20 18:44:31,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=253282.33333333334, ans=0.0 2024-06-20 18:44:47,947 INFO [train.py:1028] (0/2) Epoch 14, batch 6650, loss[loss=0.2467, simple_loss=0.2905, pruned_loss=0.1015, over 12922.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2729, pruned_loss=0.08808, over 2584482.33 frames. ], batch size: 158, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:44:53,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=253319.0, ans=0.2 2024-06-20 18:44:58,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=253337.33333333334, ans=0.125 2024-06-20 18:45:08,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.65 vs. limit=22.5 2024-06-20 18:45:30,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=253392.33333333334, ans=0.0 2024-06-20 18:45:37,471 INFO [train.py:1028] (0/2) Epoch 14, batch 6700, loss[loss=0.2534, simple_loss=0.2992, pruned_loss=0.1038, over 12695.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2743, pruned_loss=0.08888, over 2583859.93 frames. ], batch size: 176, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:45:38,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=12.0 2024-06-20 18:45:40,192 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.976e+02 2.151e+02 2.424e+02 4.483e+02, threshold=4.302e+02, percent-clipped=1.0 2024-06-20 18:45:50,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=253429.0, ans=0.125 2024-06-20 18:45:58,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=253447.33333333334, ans=0.125 2024-06-20 18:46:00,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=253447.33333333334, ans=10.0 2024-06-20 18:46:19,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2024-06-20 18:46:26,109 INFO [train.py:1028] (0/2) Epoch 14, batch 6750, loss[loss=0.2911, simple_loss=0.3245, pruned_loss=0.1289, over 12236.00 frames. 
], tot_loss[loss=0.2273, simple_loss=0.2751, pruned_loss=0.08968, over 2578320.36 frames. ], batch size: 241, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:46:51,321 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.72 vs. limit=15.0 2024-06-20 18:46:57,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253557.33333333334, ans=0.1 2024-06-20 18:46:57,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253557.33333333334, ans=0.1 2024-06-20 18:46:59,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.32 vs. limit=22.5 2024-06-20 18:47:00,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=253557.33333333334, ans=0.0 2024-06-20 18:47:06,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=253575.66666666666, ans=0.125 2024-06-20 18:47:21,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=253594.0, ans=0.0 2024-06-20 18:47:21,789 INFO [train.py:1028] (0/2) Epoch 14, batch 6800, loss[loss=0.2115, simple_loss=0.2633, pruned_loss=0.07987, over 13240.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2763, pruned_loss=0.0898, over 2580241.46 frames. ], batch size: 67, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:47:24,652 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.916e+02 2.061e+02 2.317e+02 3.955e+02, threshold=4.123e+02, percent-clipped=0.0 2024-06-20 18:47:35,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=15.0 2024-06-20 18:48:05,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.36 vs. limit=15.0 2024-06-20 18:48:06,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=253667.33333333334, ans=0.125 2024-06-20 18:48:07,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=253667.33333333334, ans=0.125 2024-06-20 18:48:09,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=253667.33333333334, ans=0.125 2024-06-20 18:48:14,028 INFO [train.py:1028] (0/2) Epoch 14, batch 6850, loss[loss=0.2495, simple_loss=0.3036, pruned_loss=0.0977, over 13294.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2763, pruned_loss=0.08938, over 2584225.15 frames. 
], batch size: 63, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:48:28,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253704.0, ans=0.1 2024-06-20 18:48:51,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=253740.66666666666, ans=0.0 2024-06-20 18:48:54,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=253759.0, ans=0.05 2024-06-20 18:49:03,140 INFO [train.py:1028] (0/2) Epoch 14, batch 6900, loss[loss=0.2444, simple_loss=0.2904, pruned_loss=0.09916, over 13035.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2775, pruned_loss=0.09003, over 2585260.80 frames. ], batch size: 48, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:49:05,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 1.943e+02 2.084e+02 2.319e+02 3.046e+02, threshold=4.169e+02, percent-clipped=0.0 2024-06-20 18:49:06,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=253777.33333333334, ans=0.125 2024-06-20 18:49:11,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=253795.66666666666, ans=0.2 2024-06-20 18:49:19,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=253795.66666666666, ans=0.125 2024-06-20 18:49:21,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2024-06-20 18:49:28,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=253814.0, ans=0.125 2024-06-20 18:49:36,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2024-06-20 18:49:38,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253832.33333333334, ans=0.1 2024-06-20 18:49:44,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=253850.66666666666, ans=0.2 2024-06-20 18:49:47,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.51 vs. limit=15.0 2024-06-20 18:49:48,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=253850.66666666666, ans=0.125 2024-06-20 18:49:49,498 INFO [train.py:1028] (0/2) Epoch 14, batch 6950, loss[loss=0.237, simple_loss=0.2834, pruned_loss=0.09532, over 11686.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2774, pruned_loss=0.08969, over 2578954.28 frames. ], batch size: 17, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:49:50,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253869.0, ans=0.1 2024-06-20 18:49:55,732 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.72 vs. 
limit=15.0 2024-06-20 18:50:02,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2024-06-20 18:50:06,391 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.653e+00 2024-06-20 18:50:10,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.73 vs. limit=5.0 2024-06-20 18:50:22,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=253924.0, ans=0.125 2024-06-20 18:50:28,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253924.0, ans=0.1 2024-06-20 18:50:33,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=253942.33333333334, ans=0.125 2024-06-20 18:50:37,123 INFO [train.py:1028] (0/2) Epoch 14, batch 7000, loss[loss=0.2522, simple_loss=0.2961, pruned_loss=0.1041, over 12923.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2774, pruned_loss=0.08925, over 2575382.10 frames. ], batch size: 158, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:50:39,098 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.963e+02 2.089e+02 2.283e+02 3.555e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-20 18:51:12,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=254015.66666666666, ans=0.0 2024-06-20 18:51:21,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=12.0 2024-06-20 18:51:21,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=15.0 2024-06-20 18:51:23,594 INFO [train.py:1028] (0/2) Epoch 14, batch 7050, loss[loss=0.2537, simple_loss=0.2998, pruned_loss=0.1038, over 12796.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2787, pruned_loss=0.08966, over 2582609.25 frames. ], batch size: 177, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:51:24,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=254052.33333333334, ans=0.125 2024-06-20 18:51:28,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=254052.33333333334, ans=0.125 2024-06-20 18:51:32,576 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:51:40,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=254070.66666666666, ans=0.125 2024-06-20 18:51:47,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=254089.0, ans=0.025 2024-06-20 18:51:54,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.41 vs. limit=15.0 2024-06-20 18:52:06,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.46 vs. 
limit=15.0 2024-06-20 18:52:10,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=254144.0, ans=0.125 2024-06-20 18:52:11,364 INFO [train.py:1028] (0/2) Epoch 14, batch 7100, loss[loss=0.2523, simple_loss=0.3025, pruned_loss=0.1011, over 13176.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2797, pruned_loss=0.09066, over 2574350.65 frames. ], batch size: 112, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:52:14,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.013e+02 2.250e+02 2.530e+02 3.490e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-20 18:52:20,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=254162.33333333334, ans=0.0 2024-06-20 18:52:23,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.85 vs. limit=22.5 2024-06-20 18:52:31,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=254180.66666666666, ans=0.125 2024-06-20 18:52:46,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=254199.0, ans=0.125 2024-06-20 18:52:48,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=254217.33333333334, ans=0.0 2024-06-20 18:52:58,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=254235.66666666666, ans=0.125 2024-06-20 18:52:59,112 INFO [train.py:1028] (0/2) Epoch 14, batch 7150, loss[loss=0.2747, simple_loss=0.3188, pruned_loss=0.1154, over 12587.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2801, pruned_loss=0.09062, over 2572964.02 frames. ], batch size: 202, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:52:59,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254235.66666666666, ans=0.1 2024-06-20 18:53:04,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=254235.66666666666, ans=0.125 2024-06-20 18:53:31,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=254254.0, ans=0.125 2024-06-20 18:53:49,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=254290.66666666666, ans=0.125 2024-06-20 18:54:02,618 INFO [train.py:1028] (0/2) Epoch 14, batch 7200, loss[loss=0.2443, simple_loss=0.295, pruned_loss=0.09681, over 13155.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.282, pruned_loss=0.09127, over 2578576.50 frames. 
], batch size: 112, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:54:04,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=254327.33333333334, ans=0.125 2024-06-20 18:54:05,249 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.013e+02 2.149e+02 2.413e+02 3.309e+02, threshold=4.299e+02, percent-clipped=0.0 2024-06-20 18:54:06,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=254327.33333333334, ans=0.125 2024-06-20 18:54:11,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=254345.66666666666, ans=0.125 2024-06-20 18:54:20,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=254364.0, ans=0.125 2024-06-20 18:54:31,206 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 18:54:33,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=254382.33333333334, ans=0.1 2024-06-20 18:54:36,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=254382.33333333334, ans=0.05 2024-06-20 18:54:37,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=12.0 2024-06-20 18:54:49,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.71 vs. limit=15.0 2024-06-20 18:54:50,779 INFO [train.py:1028] (0/2) Epoch 14, batch 7250, loss[loss=0.2222, simple_loss=0.2756, pruned_loss=0.08442, over 12934.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2824, pruned_loss=0.09117, over 2579225.33 frames. ], batch size: 36, lr: 4.15e-03, grad_scale: 32.0 2024-06-20 18:54:52,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=254419.0, ans=0.125 2024-06-20 18:55:05,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=254437.33333333334, ans=0.0 2024-06-20 18:55:12,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254455.66666666666, ans=0.1 2024-06-20 18:55:25,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2024-06-20 18:55:37,692 INFO [train.py:1028] (0/2) Epoch 14, batch 7300, loss[loss=0.2273, simple_loss=0.2798, pruned_loss=0.08734, over 12837.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.283, pruned_loss=0.09118, over 2579649.25 frames. 
], batch size: 36, lr: 4.14e-03, grad_scale: 32.0 2024-06-20 18:55:40,203 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.911e+02 2.056e+02 2.245e+02 3.705e+02, threshold=4.112e+02, percent-clipped=0.0 2024-06-20 18:55:42,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=254510.66666666666, ans=0.125 2024-06-20 18:55:59,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=254547.33333333334, ans=0.1 2024-06-20 18:56:38,306 INFO [train.py:1028] (0/2) Epoch 14, batch 7350, loss[loss=0.2257, simple_loss=0.2785, pruned_loss=0.08648, over 13378.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2835, pruned_loss=0.09145, over 2581986.94 frames. ], batch size: 46, lr: 4.14e-03, grad_scale: 32.0 2024-06-20 18:56:45,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.94 vs. limit=15.0 2024-06-20 18:56:46,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=254602.33333333334, ans=0.125 2024-06-20 18:57:00,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=254639.0, ans=15.0 2024-06-20 18:57:01,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=254639.0, ans=0.0 2024-06-20 18:57:27,226 INFO [train.py:1028] (0/2) Epoch 14, batch 7400, loss[loss=0.2522, simple_loss=0.311, pruned_loss=0.09673, over 13257.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2835, pruned_loss=0.09165, over 2587475.67 frames. ], batch size: 63, lr: 4.14e-03, grad_scale: 32.0 2024-06-20 18:57:30,121 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.941e+02 2.018e+02 2.195e+02 3.257e+02, threshold=4.036e+02, percent-clipped=0.0 2024-06-20 18:57:40,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.81 vs. limit=22.5 2024-06-20 18:57:44,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=254712.33333333334, ans=0.125 2024-06-20 18:57:49,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2024-06-20 18:57:49,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=254730.66666666666, ans=0.5 2024-06-20 18:57:57,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=254749.0, ans=0.125 2024-06-20 18:58:19,444 INFO [train.py:1028] (0/2) Epoch 14, batch 7450, loss[loss=0.2295, simple_loss=0.2799, pruned_loss=0.08953, over 12684.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2841, pruned_loss=0.09184, over 2581056.95 frames. ], batch size: 29, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 18:58:20,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.91 vs. 
limit=12.0 2024-06-20 18:58:27,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=254785.66666666666, ans=0.0 2024-06-20 18:58:42,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=254822.33333333334, ans=0.125 2024-06-20 18:58:49,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=254840.66666666666, ans=0.0 2024-06-20 18:58:51,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=254840.66666666666, ans=0.125 2024-06-20 18:59:07,408 INFO [train.py:1028] (0/2) Epoch 14, batch 7500, loss[loss=0.2479, simple_loss=0.2855, pruned_loss=0.1051, over 10541.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2848, pruned_loss=0.09223, over 2578691.32 frames. ], batch size: 303, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 18:59:10,264 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.914e+02 2.015e+02 2.177e+02 3.048e+02, threshold=4.030e+02, percent-clipped=0.0 2024-06-20 18:59:44,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=254914.0, ans=0.125 2024-06-20 18:59:45,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=254914.0, ans=0.125 2024-06-20 18:59:58,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=254950.66666666666, ans=0.125 2024-06-20 19:00:00,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254950.66666666666, ans=0.1 2024-06-20 19:00:07,232 INFO [train.py:1028] (0/2) Epoch 14, batch 7550, loss[loss=0.2574, simple_loss=0.2973, pruned_loss=0.1088, over 12946.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2852, pruned_loss=0.09266, over 2578106.95 frames. ], batch size: 158, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:00:42,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=255042.33333333334, ans=0.0 2024-06-20 19:00:52,202 INFO [train.py:1028] (0/2) Epoch 14, batch 7600, loss[loss=0.2307, simple_loss=0.2752, pruned_loss=0.09312, over 13215.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.2866, pruned_loss=0.09356, over 2577877.29 frames. ], batch size: 83, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:00:55,325 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 1.979e+02 2.176e+02 2.438e+02 3.465e+02, threshold=4.352e+02, percent-clipped=0.0 2024-06-20 19:00:59,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=255060.66666666666, ans=0.125 2024-06-20 19:01:09,712 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.29 vs. limit=22.5 2024-06-20 19:01:12,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=255097.33333333334, ans=0.125 2024-06-20 19:01:43,859 INFO [train.py:1028] (0/2) Epoch 14, batch 7650, loss[loss=0.2593, simple_loss=0.3138, pruned_loss=0.1024, over 12958.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.2867, pruned_loss=0.0935, over 2573446.04 frames. ], batch size: 33, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:02:20,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2024-06-20 19:02:31,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.65 vs. limit=15.0 2024-06-20 19:02:38,857 INFO [train.py:1028] (0/2) Epoch 14, batch 7700, loss[loss=0.2412, simple_loss=0.3019, pruned_loss=0.09021, over 13288.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.2873, pruned_loss=0.09355, over 2569404.67 frames. ], batch size: 63, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:02:41,954 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.996e+02 2.113e+02 2.316e+02 3.307e+02, threshold=4.226e+02, percent-clipped=0.0 2024-06-20 19:03:21,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=255317.33333333334, ans=0.0 2024-06-20 19:03:26,875 INFO [train.py:1028] (0/2) Epoch 14, batch 7750, loss[loss=0.2576, simple_loss=0.307, pruned_loss=0.1041, over 13293.00 frames. ], tot_loss[loss=0.238, simple_loss=0.2879, pruned_loss=0.09399, over 2574739.04 frames. ], batch size: 72, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:03:32,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=255335.66666666666, ans=0.125 2024-06-20 19:03:54,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=255372.33333333334, ans=0.0 2024-06-20 19:04:00,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=255390.66666666666, ans=0.125 2024-06-20 19:04:15,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2024-06-20 19:04:15,321 INFO [train.py:1028] (0/2) Epoch 14, batch 7800, loss[loss=0.2682, simple_loss=0.3112, pruned_loss=0.1126, over 13126.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.2885, pruned_loss=0.0938, over 2579685.15 frames. ], batch size: 95, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:04:17,990 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.003e+02 2.206e+02 2.477e+02 3.692e+02, threshold=4.412e+02, percent-clipped=0.0 2024-06-20 19:04:25,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.70 vs. limit=15.0 2024-06-20 19:04:26,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=255445.66666666666, ans=0.0 2024-06-20 19:04:38,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=255464.0, ans=0.125 2024-06-20 19:05:07,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=255500.66666666666, ans=0.0 2024-06-20 19:05:13,591 INFO [train.py:1028] (0/2) Epoch 14, batch 7850, loss[loss=0.2299, simple_loss=0.2773, pruned_loss=0.09128, over 11157.00 frames. 
], tot_loss[loss=0.2394, simple_loss=0.2898, pruned_loss=0.09449, over 2573906.80 frames. ], batch size: 16, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:05:15,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=255519.0, ans=0.125 2024-06-20 19:05:26,707 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:05:39,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=255555.66666666666, ans=0.125 2024-06-20 19:06:02,118 INFO [train.py:1028] (0/2) Epoch 14, batch 7900, loss[loss=0.2111, simple_loss=0.2692, pruned_loss=0.0765, over 13115.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.2896, pruned_loss=0.09464, over 2573813.51 frames. ], batch size: 77, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:06:04,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=255610.66666666666, ans=0.125 2024-06-20 19:06:05,195 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.009e+02 2.195e+02 2.375e+02 3.072e+02, threshold=4.391e+02, percent-clipped=0.0 2024-06-20 19:06:07,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=15.66 vs. limit=15.0 2024-06-20 19:06:10,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=255610.66666666666, ans=0.125 2024-06-20 19:06:22,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=255647.33333333334, ans=0.0 2024-06-20 19:06:51,362 INFO [train.py:1028] (0/2) Epoch 14, batch 7950, loss[loss=0.243, simple_loss=0.2815, pruned_loss=0.1023, over 10552.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2898, pruned_loss=0.09432, over 2576258.78 frames. ], batch size: 303, lr: 4.14e-03, grad_scale: 64.0 2024-06-20 19:06:51,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=255702.33333333334, ans=0.0 2024-06-20 19:07:39,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=255775.66666666666, ans=0.125 2024-06-20 19:07:40,823 INFO [train.py:1028] (0/2) Epoch 14, batch 8000, loss[loss=0.2254, simple_loss=0.2784, pruned_loss=0.08619, over 12631.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.2905, pruned_loss=0.09457, over 2573316.35 frames. 
], batch size: 29, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:07:43,844 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.074e+02 2.204e+02 2.421e+02 3.264e+02, threshold=4.407e+02, percent-clipped=0.0 2024-06-20 19:08:13,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=255830.66666666666, ans=0.09899494936611666 2024-06-20 19:08:20,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=255849.0, ans=0.125 2024-06-20 19:08:31,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=255867.33333333334, ans=0.125 2024-06-20 19:08:37,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=255867.33333333334, ans=0.2 2024-06-20 19:08:42,076 INFO [train.py:1028] (0/2) Epoch 14, batch 8050, loss[loss=0.2384, simple_loss=0.2822, pruned_loss=0.09731, over 13176.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2902, pruned_loss=0.0946, over 2573023.15 frames. ], batch size: 83, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:09:08,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=15.0 2024-06-20 19:09:20,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2024-06-20 19:09:28,703 INFO [train.py:1028] (0/2) Epoch 14, batch 8100, loss[loss=0.2353, simple_loss=0.287, pruned_loss=0.09177, over 13138.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2905, pruned_loss=0.09466, over 2577473.60 frames. ], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:09:30,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=255977.33333333334, ans=0.125 2024-06-20 19:09:31,748 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.952e+02 2.074e+02 2.243e+02 3.015e+02, threshold=4.147e+02, percent-clipped=0.0 2024-06-20 19:09:37,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.22 vs. limit=15.0 2024-06-20 19:09:53,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=256014.0, ans=0.0 2024-06-20 19:10:06,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2024-06-20 19:10:06,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=256050.66666666666, ans=0.125 2024-06-20 19:10:07,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.50 vs. limit=6.0 2024-06-20 19:10:12,800 INFO [train.py:1028] (0/2) Epoch 14, batch 8150, loss[loss=0.2348, simple_loss=0.282, pruned_loss=0.09382, over 13118.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.291, pruned_loss=0.09476, over 2581326.47 frames. 
], batch size: 121, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:10:15,242 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=1.671e+00 2024-06-20 19:10:18,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=256069.0, ans=0.5 2024-06-20 19:10:20,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=256069.0, ans=0.2 2024-06-20 19:10:21,698 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:10:25,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.17 vs. limit=15.0 2024-06-20 19:10:36,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=256105.66666666666, ans=0.0 2024-06-20 19:10:53,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=256124.0, ans=22.5 2024-06-20 19:11:11,897 INFO [train.py:1028] (0/2) Epoch 14, batch 8200, loss[loss=0.2705, simple_loss=0.3117, pruned_loss=0.1146, over 13159.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.2919, pruned_loss=0.09517, over 2584846.80 frames. ], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:11:14,583 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.958e+02 2.114e+02 2.345e+02 3.112e+02, threshold=4.228e+02, percent-clipped=0.0 2024-06-20 19:11:23,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=256179.0, ans=0.125 2024-06-20 19:11:24,801 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.35 vs. limit=15.0 2024-06-20 19:11:28,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=256179.0, ans=0.125 2024-06-20 19:11:34,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=256197.33333333334, ans=0.1 2024-06-20 19:11:40,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=256215.66666666666, ans=0.07 2024-06-20 19:11:42,559 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.44 vs. limit=22.5 2024-06-20 19:11:43,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=256215.66666666666, ans=0.125 2024-06-20 19:11:47,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=256234.0, ans=0.1 2024-06-20 19:11:49,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=256234.0, ans=0.05 2024-06-20 19:11:51,077 INFO [train.py:1028] (0/2) Epoch 14, batch 8250, loss[loss=0.2492, simple_loss=0.3018, pruned_loss=0.09832, over 13227.00 frames. ], tot_loss[loss=0.242, simple_loss=0.2927, pruned_loss=0.09559, over 2584769.21 frames. 
], batch size: 52, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:11:53,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=256252.33333333334, ans=0.125 2024-06-20 19:12:04,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=256270.66666666666, ans=0.125 2024-06-20 19:12:17,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=256307.33333333334, ans=0.0 2024-06-20 19:12:33,841 INFO [train.py:1028] (0/2) Epoch 14, batch 8300, loss[loss=0.235, simple_loss=0.2784, pruned_loss=0.09578, over 13021.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.2917, pruned_loss=0.09501, over 2580738.58 frames. ], batch size: 102, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:12:36,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.000e+02 2.132e+02 2.417e+02 3.636e+02, threshold=4.264e+02, percent-clipped=0.0 2024-06-20 19:12:50,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.68 vs. limit=10.0 2024-06-20 19:12:51,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=256380.66666666666, ans=0.125 2024-06-20 19:13:09,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=256399.0, ans=0.125 2024-06-20 19:13:15,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. limit=10.0 2024-06-20 19:13:29,535 INFO [train.py:1028] (0/2) Epoch 14, batch 8350, loss[loss=0.2478, simple_loss=0.2963, pruned_loss=0.09959, over 13164.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.2915, pruned_loss=0.09456, over 2580896.95 frames. ], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:13:35,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=256435.66666666666, ans=0.0 2024-06-20 19:13:38,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.58 vs. limit=5.0 2024-06-20 19:13:48,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=256454.0, ans=0.0 2024-06-20 19:13:55,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.23 vs. 
limit=15.0 2024-06-20 19:14:02,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=256490.66666666666, ans=0.0 2024-06-20 19:14:04,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=256490.66666666666, ans=0.125 2024-06-20 19:14:07,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=256490.66666666666, ans=0.125 2024-06-20 19:14:18,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=256509.0, ans=0.125 2024-06-20 19:14:19,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=256509.0, ans=0.09899494936611666 2024-06-20 19:14:20,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=256509.0, ans=0.125 2024-06-20 19:14:21,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=256527.33333333334, ans=0.125 2024-06-20 19:14:21,939 INFO [train.py:1028] (0/2) Epoch 14, batch 8400, loss[loss=0.2384, simple_loss=0.2905, pruned_loss=0.09309, over 12862.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2917, pruned_loss=0.0948, over 2577514.07 frames. ], batch size: 39, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:14:24,452 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 1.985e+02 2.142e+02 2.354e+02 3.037e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-20 19:14:33,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=256545.66666666666, ans=0.125 2024-06-20 19:14:37,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=256545.66666666666, ans=0.125 2024-06-20 19:14:38,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=256545.66666666666, ans=0.2 2024-06-20 19:14:40,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=256564.0, ans=0.125 2024-06-20 19:14:43,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=256564.0, ans=0.125 2024-06-20 19:14:44,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=256564.0, ans=0.125 2024-06-20 19:14:51,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=256582.33333333334, ans=0.1 2024-06-20 19:14:59,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=256600.66666666666, ans=0.025 2024-06-20 19:15:07,196 INFO [train.py:1028] (0/2) Epoch 14, batch 8450, loss[loss=0.2336, simple_loss=0.2881, pruned_loss=0.08953, over 13204.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.2922, pruned_loss=0.09473, over 2579078.66 frames. 
], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:15:16,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=256637.33333333334, ans=0.125 2024-06-20 19:15:23,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.33 vs. limit=15.0 2024-06-20 19:15:25,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256637.33333333334, ans=0.1 2024-06-20 19:15:29,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=256655.66666666666, ans=0.2 2024-06-20 19:15:32,287 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-140000.pt 2024-06-20 19:15:43,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.64 vs. limit=15.0 2024-06-20 19:16:01,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=256692.33333333334, ans=0.125 2024-06-20 19:16:02,877 INFO [train.py:1028] (0/2) Epoch 14, batch 8500, loss[loss=0.2506, simple_loss=0.3027, pruned_loss=0.0993, over 12717.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.2931, pruned_loss=0.09517, over 2576471.85 frames. ], batch size: 29, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:16:05,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.023e+02 2.176e+02 2.396e+02 3.327e+02, threshold=4.352e+02, percent-clipped=0.0 2024-06-20 19:16:06,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=256710.66666666666, ans=0.0 2024-06-20 19:16:31,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2024-06-20 19:16:51,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=256784.0, ans=0.125 2024-06-20 19:16:57,109 INFO [train.py:1028] (0/2) Epoch 14, batch 8550, loss[loss=0.2371, simple_loss=0.2867, pruned_loss=0.09373, over 12683.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2924, pruned_loss=0.0948, over 2575747.11 frames. ], batch size: 22, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:17:04,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=256802.33333333334, ans=0.125 2024-06-20 19:17:04,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.08 vs. 
limit=15.0 2024-06-20 19:17:19,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=256839.0, ans=0.0 2024-06-20 19:17:22,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=256839.0, ans=0.2 2024-06-20 19:17:27,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=256857.33333333334, ans=0.2 2024-06-20 19:17:34,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=256857.33333333334, ans=0.2 2024-06-20 19:17:38,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=256875.66666666666, ans=0.2 2024-06-20 19:17:43,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=256875.66666666666, ans=0.125 2024-06-20 19:17:45,768 INFO [train.py:1028] (0/2) Epoch 14, batch 8600, loss[loss=0.2512, simple_loss=0.2936, pruned_loss=0.1044, over 13085.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2925, pruned_loss=0.09446, over 2573659.13 frames. ], batch size: 112, lr: 4.13e-03, grad_scale: 64.0 2024-06-20 19:17:48,593 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.985e+02 2.114e+02 2.262e+02 3.380e+02, threshold=4.228e+02, percent-clipped=0.0 2024-06-20 19:17:51,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=256894.0, ans=0.0 2024-06-20 19:17:52,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=256894.0, ans=0.125 2024-06-20 19:17:58,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.06 vs. limit=15.0 2024-06-20 19:18:31,403 INFO [train.py:1028] (0/2) Epoch 14, batch 8650, loss[loss=0.2299, simple_loss=0.28, pruned_loss=0.08988, over 13041.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2931, pruned_loss=0.09474, over 2576805.45 frames. ], batch size: 102, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:18:51,017 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:18:56,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=257022.33333333334, ans=0.035 2024-06-20 19:19:12,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=257059.0, ans=0.2 2024-06-20 19:19:19,855 INFO [train.py:1028] (0/2) Epoch 14, batch 8700, loss[loss=0.2405, simple_loss=0.2915, pruned_loss=0.09476, over 13185.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.2938, pruned_loss=0.09528, over 2574170.32 frames. 
], batch size: 59, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:19:22,567 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 1.970e+02 2.087e+02 2.261e+02 2.793e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 19:20:06,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257132.33333333334, ans=0.125 2024-06-20 19:20:17,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257150.66666666666, ans=0.125 2024-06-20 19:20:19,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=257150.66666666666, ans=0.125 2024-06-20 19:20:19,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=257150.66666666666, ans=0.2 2024-06-20 19:20:21,398 INFO [train.py:1028] (0/2) Epoch 14, batch 8750, loss[loss=0.2404, simple_loss=0.2801, pruned_loss=0.1003, over 13097.00 frames. ], tot_loss[loss=0.242, simple_loss=0.2936, pruned_loss=0.0952, over 2569493.31 frames. ], batch size: 121, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:20:31,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=257187.33333333334, ans=0.125 2024-06-20 19:20:32,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=257187.33333333334, ans=0.125 2024-06-20 19:20:34,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=257187.33333333334, ans=0.1 2024-06-20 19:20:36,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=257187.33333333334, ans=0.0 2024-06-20 19:20:38,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=257205.66666666666, ans=0.125 2024-06-20 19:20:51,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=257224.0, ans=0.1 2024-06-20 19:20:55,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=257242.33333333334, ans=0.025 2024-06-20 19:21:04,142 INFO [train.py:1028] (0/2) Epoch 14, batch 8800, loss[loss=0.2109, simple_loss=0.2697, pruned_loss=0.07608, over 13240.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2944, pruned_loss=0.09557, over 2574071.65 frames. ], batch size: 72, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:21:06,736 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.005e+02 2.127e+02 2.367e+02 3.102e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-20 19:21:23,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. 
limit=6.0 2024-06-20 19:21:32,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=257297.33333333334, ans=0.0 2024-06-20 19:21:47,338 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:21:55,310 INFO [train.py:1028] (0/2) Epoch 14, batch 8850, loss[loss=0.2604, simple_loss=0.3116, pruned_loss=0.1046, over 12566.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.2941, pruned_loss=0.09556, over 2565959.83 frames. ], batch size: 202, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:22:11,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=257352.33333333334, ans=0.04949747468305833 2024-06-20 19:22:12,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=257370.66666666666, ans=0.0 2024-06-20 19:22:33,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=15.0 2024-06-20 19:22:35,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=257389.0, ans=0.125 2024-06-20 19:22:55,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=257425.66666666666, ans=15.0 2024-06-20 19:22:58,215 INFO [train.py:1028] (0/2) Epoch 14, batch 8900, loss[loss=0.2455, simple_loss=0.2986, pruned_loss=0.09624, over 12913.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.2949, pruned_loss=0.09597, over 2563343.23 frames. ], batch size: 33, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:23:00,880 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.021e+02 2.216e+02 2.416e+02 3.364e+02, threshold=4.433e+02, percent-clipped=0.0 2024-06-20 19:23:05,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257444.0, ans=0.125 2024-06-20 19:23:19,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=257480.66666666666, ans=0.02 2024-06-20 19:23:32,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=257517.33333333334, ans=0.2 2024-06-20 19:23:43,400 INFO [train.py:1028] (0/2) Epoch 14, batch 8950, loss[loss=0.2407, simple_loss=0.2891, pruned_loss=0.09612, over 12606.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.2945, pruned_loss=0.09542, over 2563348.90 frames. ], batch size: 202, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:23:48,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.49 vs. 
limit=22.5 2024-06-20 19:24:00,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=257554.0, ans=0.025 2024-06-20 19:24:00,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=257554.0, ans=0.125 2024-06-20 19:24:27,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257609.0, ans=0.1 2024-06-20 19:24:33,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=257609.0, ans=0.025 2024-06-20 19:24:34,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257627.33333333334, ans=0.1 2024-06-20 19:24:35,247 INFO [train.py:1028] (0/2) Epoch 14, batch 9000, loss[loss=0.2383, simple_loss=0.2955, pruned_loss=0.0906, over 13268.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.2947, pruned_loss=0.0952, over 2569638.09 frames. ], batch size: 46, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:24:35,248 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 19:24:46,179 INFO [train.py:1060] (0/2) Epoch 14, validation: loss=0.1901, simple_loss=0.255, pruned_loss=0.06264, over 351949.00 frames. 2024-06-20 19:24:46,179 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 19:24:47,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=257627.33333333334, ans=0.0 2024-06-20 19:24:49,200 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.039e+02 2.186e+02 2.416e+02 3.658e+02, threshold=4.372e+02, percent-clipped=0.0 2024-06-20 19:24:54,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.26 vs. limit=22.5 2024-06-20 19:24:57,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=17.04 vs. limit=15.0 2024-06-20 19:25:03,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=257645.66666666666, ans=0.125 2024-06-20 19:25:15,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=257682.33333333334, ans=0.125 2024-06-20 19:25:16,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=257682.33333333334, ans=0.09899494936611666 2024-06-20 19:25:20,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=8.17 vs. limit=12.0 2024-06-20 19:25:39,990 INFO [train.py:1028] (0/2) Epoch 14, batch 9050, loss[loss=0.2046, simple_loss=0.2655, pruned_loss=0.07182, over 11039.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2952, pruned_loss=0.09527, over 2569110.12 frames. 
], batch size: 16, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:25:56,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=257737.33333333334, ans=0.0 2024-06-20 19:26:03,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=257737.33333333334, ans=0.125 2024-06-20 19:26:04,129 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:26:09,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.84 vs. limit=10.0 2024-06-20 19:26:18,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=257774.0, ans=0.09899494936611666 2024-06-20 19:26:27,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=257792.33333333334, ans=0.125 2024-06-20 19:26:32,989 INFO [train.py:1028] (0/2) Epoch 14, batch 9100, loss[loss=0.2303, simple_loss=0.2859, pruned_loss=0.08735, over 13266.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.294, pruned_loss=0.09472, over 2568289.10 frames. ], batch size: 72, lr: 4.12e-03, grad_scale: 64.0 2024-06-20 19:26:36,015 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 1.932e+02 2.085e+02 2.234e+02 3.364e+02, threshold=4.170e+02, percent-clipped=0.0 2024-06-20 19:26:40,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=257810.66666666666, ans=0.07 2024-06-20 19:26:44,532 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:26:54,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=257847.33333333334, ans=0.125 2024-06-20 19:26:58,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257847.33333333334, ans=0.125 2024-06-20 19:27:00,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=257865.66666666666, ans=15.0 2024-06-20 19:27:03,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=257865.66666666666, ans=0.1 2024-06-20 19:27:19,243 INFO [train.py:1028] (0/2) Epoch 14, batch 9150, loss[loss=0.2485, simple_loss=0.3024, pruned_loss=0.09733, over 13107.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.2945, pruned_loss=0.09515, over 2569250.05 frames. ], batch size: 77, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:27:27,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257902.33333333334, ans=0.125 2024-06-20 19:27:31,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.00 vs. limit=22.5 2024-06-20 19:27:31,953 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.28 vs. 
limit=15.0 2024-06-20 19:27:33,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=257920.66666666666, ans=0.125 2024-06-20 19:27:50,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257957.33333333334, ans=0.1 2024-06-20 19:28:07,143 INFO [train.py:1028] (0/2) Epoch 14, batch 9200, loss[loss=0.2203, simple_loss=0.2814, pruned_loss=0.07965, over 12915.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.2941, pruned_loss=0.09442, over 2571272.36 frames. ], batch size: 36, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:28:10,608 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.086e+02 2.246e+02 2.501e+02 3.798e+02, threshold=4.492e+02, percent-clipped=0.0 2024-06-20 19:28:11,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=257994.0, ans=0.0 2024-06-20 19:28:12,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257994.0, ans=0.1 2024-06-20 19:28:23,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=12.0 2024-06-20 19:28:44,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.82 vs. limit=22.5 2024-06-20 19:28:48,512 INFO [train.py:1028] (0/2) Epoch 14, batch 9250, loss[loss=0.2374, simple_loss=0.2962, pruned_loss=0.08929, over 13271.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.2941, pruned_loss=0.09434, over 2572721.64 frames. ], batch size: 67, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:29:00,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=258104.0, ans=0.0 2024-06-20 19:29:06,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=258122.33333333334, ans=0.025 2024-06-20 19:29:14,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=258140.66666666666, ans=0.0 2024-06-20 19:29:31,521 INFO [train.py:1028] (0/2) Epoch 14, batch 9300, loss[loss=0.2322, simple_loss=0.2836, pruned_loss=0.09044, over 12972.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.2945, pruned_loss=0.0945, over 2570542.79 frames. ], batch size: 39, lr: 4.12e-03, grad_scale: 32.0 2024-06-20 19:29:34,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.58 vs. limit=15.0 2024-06-20 19:29:35,235 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.044e+02 2.199e+02 2.506e+02 3.756e+02, threshold=4.397e+02, percent-clipped=0.0 2024-06-20 19:29:40,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=258195.66666666666, ans=0.125 2024-06-20 19:29:48,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=12.0 2024-06-20 19:29:50,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.54 vs. 
limit=22.5 2024-06-20 19:29:59,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=258232.33333333334, ans=0.0 2024-06-20 19:30:01,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=258232.33333333334, ans=0.2 2024-06-20 19:30:05,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=258232.33333333334, ans=0.125 2024-06-20 19:30:10,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258250.66666666666, ans=0.1 2024-06-20 19:30:12,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=258250.66666666666, ans=0.04949747468305833 2024-06-20 19:30:16,867 INFO [train.py:1028] (0/2) Epoch 14, batch 9350, loss[loss=0.2422, simple_loss=0.3001, pruned_loss=0.09217, over 12737.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.2938, pruned_loss=0.09418, over 2567992.09 frames. ], batch size: 22, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:30:17,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=258269.0, ans=0.2 2024-06-20 19:30:30,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=258287.33333333334, ans=0.0 2024-06-20 19:30:37,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=258305.66666666666, ans=0.125 2024-06-20 19:30:56,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=258342.33333333334, ans=0.025 2024-06-20 19:30:58,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=258342.33333333334, ans=0.125 2024-06-20 19:31:05,925 INFO [train.py:1028] (0/2) Epoch 14, batch 9400, loss[loss=0.2525, simple_loss=0.3076, pruned_loss=0.09871, over 13322.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2951, pruned_loss=0.09528, over 2566856.42 frames. ], batch size: 52, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:31:06,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=258360.66666666666, ans=0.125 2024-06-20 19:31:14,227 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 1.993e+02 2.233e+02 2.411e+02 3.028e+02, threshold=4.466e+02, percent-clipped=0.0 2024-06-20 19:31:36,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=258415.66666666666, ans=0.125 2024-06-20 19:31:47,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=258434.0, ans=0.025 2024-06-20 19:31:53,894 INFO [train.py:1028] (0/2) Epoch 14, batch 9450, loss[loss=0.2607, simple_loss=0.3129, pruned_loss=0.1043, over 12598.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.2964, pruned_loss=0.09617, over 2567796.49 frames. 
], batch size: 22, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:32:09,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=258470.66666666666, ans=0.2 2024-06-20 19:32:10,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=258489.0, ans=0.125 2024-06-20 19:32:12,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=258489.0, ans=0.125 2024-06-20 19:32:27,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=258507.33333333334, ans=0.025 2024-06-20 19:32:33,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=258525.66666666666, ans=0.0 2024-06-20 19:32:35,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=258525.66666666666, ans=0.05 2024-06-20 19:32:36,637 INFO [train.py:1028] (0/2) Epoch 14, batch 9500, loss[loss=0.2366, simple_loss=0.295, pruned_loss=0.08907, over 13288.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.2959, pruned_loss=0.09567, over 2577745.84 frames. ], batch size: 43, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:32:39,570 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.974e+02 2.064e+02 2.227e+02 2.891e+02, threshold=4.129e+02, percent-clipped=0.0 2024-06-20 19:32:49,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=258562.33333333334, ans=0.125 2024-06-20 19:32:49,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=258562.33333333334, ans=0.125 2024-06-20 19:32:52,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.0 2024-06-20 19:32:56,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=258580.66666666666, ans=0.09899494936611666 2024-06-20 19:33:02,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2024-06-20 19:33:17,821 INFO [train.py:1028] (0/2) Epoch 14, batch 9550, loss[loss=0.2284, simple_loss=0.2806, pruned_loss=0.08812, over 12936.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.2951, pruned_loss=0.09564, over 2572093.95 frames. 
], batch size: 39, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:33:21,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=258635.66666666666, ans=0.025 2024-06-20 19:33:23,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=258635.66666666666, ans=0.1 2024-06-20 19:33:26,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=258654.0, ans=0.02 2024-06-20 19:33:32,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=258654.0, ans=0.125 2024-06-20 19:33:45,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=258690.66666666666, ans=0.2 2024-06-20 19:33:47,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=258690.66666666666, ans=0.125 2024-06-20 19:34:01,576 INFO [train.py:1028] (0/2) Epoch 14, batch 9600, loss[loss=0.2579, simple_loss=0.2975, pruned_loss=0.1092, over 10496.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.2948, pruned_loss=0.09536, over 2571160.69 frames. ], batch size: 305, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:34:05,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.982e+02 2.122e+02 2.367e+02 3.147e+02, threshold=4.243e+02, percent-clipped=0.0 2024-06-20 19:34:17,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=15.0 2024-06-20 19:34:18,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=258764.0, ans=0.1 2024-06-20 19:34:19,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=258764.0, ans=0.025 2024-06-20 19:34:20,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=258764.0, ans=0.0 2024-06-20 19:34:20,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=258764.0, ans=0.2 2024-06-20 19:34:43,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=258800.66666666666, ans=0.125 2024-06-20 19:34:49,243 INFO [train.py:1028] (0/2) Epoch 14, batch 9650, loss[loss=0.2453, simple_loss=0.2864, pruned_loss=0.102, over 13046.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.2949, pruned_loss=0.0957, over 2560967.12 frames. ], batch size: 132, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:34:53,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=258819.0, ans=0.1 2024-06-20 19:34:57,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.21 vs. 
limit=15.0 2024-06-20 19:34:59,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=258837.33333333334, ans=0.125 2024-06-20 19:35:02,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.58 vs. limit=10.0 2024-06-20 19:35:08,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=258855.66666666666, ans=0.0 2024-06-20 19:35:27,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=258892.33333333334, ans=0.125 2024-06-20 19:35:27,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=258892.33333333334, ans=0.0 2024-06-20 19:35:31,659 INFO [train.py:1028] (0/2) Epoch 14, batch 9700, loss[loss=0.2488, simple_loss=0.2912, pruned_loss=0.1032, over 13052.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.2943, pruned_loss=0.09574, over 2556387.45 frames. ], batch size: 144, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:35:34,856 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 2.017e+02 2.177e+02 2.428e+02 3.338e+02, threshold=4.355e+02, percent-clipped=0.0 2024-06-20 19:36:04,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=258984.0, ans=0.125 2024-06-20 19:36:13,737 INFO [train.py:1028] (0/2) Epoch 14, batch 9750, loss[loss=0.2369, simple_loss=0.288, pruned_loss=0.09291, over 13038.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.2929, pruned_loss=0.0948, over 2552490.66 frames. ], batch size: 132, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:36:17,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.09 vs. limit=15.0 2024-06-20 19:36:27,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=259020.66666666666, ans=0.125 2024-06-20 19:36:28,965 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=12.0 2024-06-20 19:36:29,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=259020.66666666666, ans=0.125 2024-06-20 19:36:38,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-20 19:36:49,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=259075.66666666666, ans=0.0 2024-06-20 19:36:54,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=259075.66666666666, ans=0.025 2024-06-20 19:36:56,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259075.66666666666, ans=0.125 2024-06-20 19:36:59,457 INFO [train.py:1028] (0/2) Epoch 14, batch 9800, loss[loss=0.2415, simple_loss=0.2951, pruned_loss=0.09397, over 12877.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.2925, pruned_loss=0.09435, over 2545260.67 frames. 
], batch size: 39, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:37:03,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.016e+02 2.191e+02 2.436e+02 3.252e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-20 19:37:38,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=259167.33333333334, ans=0.015 2024-06-20 19:37:38,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=259167.33333333334, ans=0.125 2024-06-20 19:37:47,542 INFO [train.py:1028] (0/2) Epoch 14, batch 9850, loss[loss=0.2504, simple_loss=0.3022, pruned_loss=0.09933, over 13029.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.2922, pruned_loss=0.09435, over 2538095.90 frames. ], batch size: 102, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:38:03,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=259204.0, ans=0.0 2024-06-20 19:38:04,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=259222.33333333334, ans=0.95 2024-06-20 19:38:04,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=259222.33333333334, ans=0.125 2024-06-20 19:38:17,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=259240.66666666666, ans=0.125 2024-06-20 19:38:24,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.84 vs. limit=22.5 2024-06-20 19:38:28,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=259259.0, ans=0.2 2024-06-20 19:38:31,385 INFO [train.py:1028] (0/2) Epoch 14, batch 9900, loss[loss=0.2436, simple_loss=0.2988, pruned_loss=0.09422, over 12901.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.291, pruned_loss=0.09402, over 2529724.31 frames. ], batch size: 39, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:38:34,425 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 2.009e+02 2.184e+02 2.415e+02 3.341e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-20 19:38:36,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=259277.33333333334, ans=0.025 2024-06-20 19:38:36,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=259277.33333333334, ans=0.125 2024-06-20 19:38:42,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=259295.66666666666, ans=0.125 2024-06-20 19:38:49,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=259314.0, ans=0.125 2024-06-20 19:38:52,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=259314.0, ans=0.2 2024-06-20 19:39:10,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. 
limit=22.5 2024-06-20 19:39:13,207 INFO [train.py:1028] (0/2) Epoch 14, batch 9950, loss[loss=0.251, simple_loss=0.3018, pruned_loss=0.1001, over 12630.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2901, pruned_loss=0.09418, over 2525712.54 frames. ], batch size: 29, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:39:14,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=259369.0, ans=0.125 2024-06-20 19:39:23,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259387.33333333334, ans=0.1 2024-06-20 19:39:25,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=259387.33333333334, ans=0.125 2024-06-20 19:39:36,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=259405.66666666666, ans=0.125 2024-06-20 19:39:52,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=259442.33333333334, ans=0.125 2024-06-20 19:39:55,523 INFO [train.py:1028] (0/2) Epoch 14, batch 10000, loss[loss=0.2618, simple_loss=0.3176, pruned_loss=0.103, over 12686.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.2908, pruned_loss=0.0947, over 2489730.67 frames. ], batch size: 22, lr: 4.11e-03, grad_scale: 32.0 2024-06-20 19:40:00,745 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.067e+02 2.229e+02 2.383e+02 3.389e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-20 19:40:18,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=259497.33333333334, ans=0.025 2024-06-20 19:40:22,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=259497.33333333334, ans=0.2 2024-06-20 19:40:41,479 INFO [train.py:1028] (0/2) Epoch 14, batch 10050, loss[loss=0.2479, simple_loss=0.3101, pruned_loss=0.09289, over 12292.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.2913, pruned_loss=0.09564, over 2445937.36 frames. ], batch size: 22, lr: 4.10e-03, grad_scale: 32.0 2024-06-20 19:40:43,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=259552.33333333334, ans=0.2 2024-06-20 19:40:45,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=259552.33333333334, ans=0.0 2024-06-20 19:40:45,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=259552.33333333334, ans=0.125 2024-06-20 19:40:56,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=259589.0, ans=0.95 2024-06-20 19:40:57,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.01 vs. limit=15.0 2024-06-20 19:41:25,659 INFO [train.py:1028] (0/2) Epoch 14, batch 10100, loss[loss=0.2292, simple_loss=0.2846, pruned_loss=0.08693, over 11291.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.2905, pruned_loss=0.09465, over 2426353.79 frames. 
], batch size: 17, lr: 4.10e-03, grad_scale: 32.0 2024-06-20 19:41:27,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=259644.0, ans=0.125 2024-06-20 19:41:29,425 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.556e+02 2.008e+02 2.174e+02 2.390e+02 3.197e+02, threshold=4.349e+02, percent-clipped=0.0 2024-06-20 19:41:39,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.39 vs. limit=15.0 2024-06-20 19:41:44,990 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-14.pt 2024-06-20 19:44:43,769 INFO [train.py:1028] (0/2) Epoch 15, batch 0, loss[loss=0.2034, simple_loss=0.2528, pruned_loss=0.07699, over 12915.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2528, pruned_loss=0.07699, over 12915.00 frames. ], batch size: 36, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:44:43,771 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 19:44:55,098 INFO [train.py:1060] (0/2) Epoch 15, validation: loss=0.1909, simple_loss=0.2562, pruned_loss=0.06283, over 351949.00 frames. 2024-06-20 19:44:55,099 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 19:44:57,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=259673.33333333334, ans=0.125 2024-06-20 19:44:58,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=259673.33333333334, ans=0.2 2024-06-20 19:45:24,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=259710.0, ans=15.0 2024-06-20 19:45:26,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=259710.0, ans=0.0 2024-06-20 19:45:34,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=259728.33333333334, ans=0.0 2024-06-20 19:45:48,124 INFO [train.py:1028] (0/2) Epoch 15, batch 50, loss[loss=0.2056, simple_loss=0.2604, pruned_loss=0.07542, over 12781.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2703, pruned_loss=0.08649, over 574269.11 frames. ], batch size: 29, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:45:48,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=259765.0, ans=0.2 2024-06-20 19:45:51,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=259765.0, ans=0.125 2024-06-20 19:45:57,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0 2024-06-20 19:46:03,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.52 vs. 
limit=15.0 2024-06-20 19:46:16,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=259820.0, ans=0.125 2024-06-20 19:46:20,173 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.926e+02 2.151e+02 2.418e+02 3.077e+02, threshold=4.303e+02, percent-clipped=0.0 2024-06-20 19:46:28,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=259838.33333333334, ans=0.125 2024-06-20 19:46:31,711 INFO [train.py:1028] (0/2) Epoch 15, batch 100, loss[loss=0.2143, simple_loss=0.2738, pruned_loss=0.07738, over 13309.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2703, pruned_loss=0.086, over 1016486.84 frames. ], batch size: 46, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:46:36,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259856.66666666666, ans=0.125 2024-06-20 19:46:42,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=259875.0, ans=0.125 2024-06-20 19:47:18,224 INFO [train.py:1028] (0/2) Epoch 15, batch 150, loss[loss=0.233, simple_loss=0.2794, pruned_loss=0.09326, over 12635.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2697, pruned_loss=0.0852, over 1364234.91 frames. ], batch size: 29, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:47:21,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=259948.33333333334, ans=0.125 2024-06-20 19:47:26,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=259966.66666666666, ans=0.2 2024-06-20 19:47:31,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.09 vs. limit=15.0 2024-06-20 19:47:31,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=259966.66666666666, ans=0.05 2024-06-20 19:47:41,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=259985.0, ans=0.125 2024-06-20 19:48:03,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=260003.33333333334, ans=0.125 2024-06-20 19:48:06,289 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.621e+02 1.899e+02 2.029e+02 2.225e+02 2.992e+02, threshold=4.058e+02, percent-clipped=0.0 2024-06-20 19:48:14,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=260021.66666666666, ans=0.125 2024-06-20 19:48:16,821 INFO [train.py:1028] (0/2) Epoch 15, batch 200, loss[loss=0.2315, simple_loss=0.2724, pruned_loss=0.09526, over 12617.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2695, pruned_loss=0.08493, over 1634163.61 frames. 
], batch size: 202, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:48:34,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=260076.66666666666, ans=0.2 2024-06-20 19:48:34,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=260076.66666666666, ans=0.025 2024-06-20 19:48:44,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260095.0, ans=0.1 2024-06-20 19:49:01,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=260131.66666666666, ans=0.2 2024-06-20 19:49:02,215 INFO [train.py:1028] (0/2) Epoch 15, batch 250, loss[loss=0.2309, simple_loss=0.2625, pruned_loss=0.09964, over 13032.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2691, pruned_loss=0.08449, over 1845461.58 frames. ], batch size: 144, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:49:13,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.51 vs. limit=10.0 2024-06-20 19:49:28,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=260168.33333333334, ans=0.0 2024-06-20 19:49:30,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=260168.33333333334, ans=0.125 2024-06-20 19:49:39,307 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.669e+02 1.915e+02 2.085e+02 2.392e+02 3.311e+02, threshold=4.169e+02, percent-clipped=0.0 2024-06-20 19:49:41,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=260205.0, ans=0.125 2024-06-20 19:49:51,256 INFO [train.py:1028] (0/2) Epoch 15, batch 300, loss[loss=0.2299, simple_loss=0.275, pruned_loss=0.0924, over 13169.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2708, pruned_loss=0.08552, over 2008564.65 frames. ], batch size: 112, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:49:55,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=260223.33333333334, ans=0.125 2024-06-20 19:50:15,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=260260.0, ans=0.125 2024-06-20 19:50:20,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2024-06-20 19:50:33,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.80 vs. limit=22.5 2024-06-20 19:50:37,227 INFO [train.py:1028] (0/2) Epoch 15, batch 350, loss[loss=0.2411, simple_loss=0.2925, pruned_loss=0.09488, over 12968.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2709, pruned_loss=0.08571, over 2137780.32 frames. ], batch size: 33, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:51:03,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. 
limit=6.0 2024-06-20 19:51:08,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=260351.66666666666, ans=0.2 2024-06-20 19:51:24,625 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.858e+02 1.982e+02 2.170e+02 2.582e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 19:51:36,029 INFO [train.py:1028] (0/2) Epoch 15, batch 400, loss[loss=0.1985, simple_loss=0.2563, pruned_loss=0.07039, over 13334.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2704, pruned_loss=0.0849, over 2239031.84 frames. ], batch size: 63, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:51:38,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=260406.66666666666, ans=0.2 2024-06-20 19:51:44,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=260425.0, ans=0.125 2024-06-20 19:51:59,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=260443.33333333334, ans=0.0 2024-06-20 19:52:19,423 INFO [train.py:1028] (0/2) Epoch 15, batch 450, loss[loss=0.197, simple_loss=0.2533, pruned_loss=0.07039, over 13251.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2699, pruned_loss=0.08482, over 2313561.80 frames. ], batch size: 67, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:52:23,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=260498.33333333334, ans=0.2 2024-06-20 19:52:26,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=260516.66666666666, ans=0.2 2024-06-20 19:52:30,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2024-06-20 19:52:53,585 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.904e+02 2.022e+02 2.165e+02 2.681e+02, threshold=4.044e+02, percent-clipped=0.0 2024-06-20 19:53:04,202 INFO [train.py:1028] (0/2) Epoch 15, batch 500, loss[loss=0.1961, simple_loss=0.2465, pruned_loss=0.07285, over 13104.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2701, pruned_loss=0.08463, over 2375186.63 frames. ], batch size: 121, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:53:07,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=260590.0, ans=0.2 2024-06-20 19:53:10,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.68 vs. 
limit=22.5 2024-06-20 19:53:14,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=260608.33333333334, ans=0.125 2024-06-20 19:53:18,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=260608.33333333334, ans=0.125 2024-06-20 19:53:21,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=260626.66666666666, ans=0.125 2024-06-20 19:53:45,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=260645.0, ans=0.0 2024-06-20 19:53:52,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2024-06-20 19:53:53,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=260663.33333333334, ans=0.125 2024-06-20 19:54:03,743 INFO [train.py:1028] (0/2) Epoch 15, batch 550, loss[loss=0.2243, simple_loss=0.2699, pruned_loss=0.08936, over 12963.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2689, pruned_loss=0.0841, over 2420278.43 frames. ], batch size: 158, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:54:04,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=260681.66666666666, ans=0.125 2024-06-20 19:54:05,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=260681.66666666666, ans=0.015 2024-06-20 19:54:05,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0 2024-06-20 19:54:15,653 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.92 vs. limit=6.0 2024-06-20 19:54:18,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=260700.0, ans=0.125 2024-06-20 19:54:21,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=260718.33333333334, ans=0.2 2024-06-20 19:54:26,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=260718.33333333334, ans=0.1 2024-06-20 19:54:35,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.848e+02 1.989e+02 2.135e+02 3.037e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 19:54:40,435 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.40 vs. limit=6.0 2024-06-20 19:54:44,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=260755.0, ans=0.0 2024-06-20 19:54:47,073 INFO [train.py:1028] (0/2) Epoch 15, batch 600, loss[loss=0.2207, simple_loss=0.268, pruned_loss=0.08668, over 13047.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2691, pruned_loss=0.0841, over 2457988.44 frames. 
], batch size: 144, lr: 3.96e-03, grad_scale: 32.0 2024-06-20 19:54:58,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=260791.66666666666, ans=0.125 2024-06-20 19:55:00,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=260791.66666666666, ans=0.125 2024-06-20 19:55:05,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=260810.0, ans=0.1 2024-06-20 19:55:24,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=260846.66666666666, ans=0.2 2024-06-20 19:55:33,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=260846.66666666666, ans=0.125 2024-06-20 19:55:34,869 INFO [train.py:1028] (0/2) Epoch 15, batch 650, loss[loss=0.2372, simple_loss=0.2872, pruned_loss=0.09356, over 13161.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2695, pruned_loss=0.08416, over 2489836.46 frames. ], batch size: 59, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:55:41,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.80 vs. limit=10.0 2024-06-20 19:55:42,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=260865.0, ans=0.0 2024-06-20 19:55:44,067 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:55:45,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260883.33333333334, ans=0.1 2024-06-20 19:55:56,177 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 19:56:05,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=260920.0, ans=0.125 2024-06-20 19:56:07,226 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.924e+02 2.087e+02 2.236e+02 3.152e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 19:56:07,822 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.17 vs. limit=15.0 2024-06-20 19:56:17,558 INFO [train.py:1028] (0/2) Epoch 15, batch 700, loss[loss=0.2206, simple_loss=0.284, pruned_loss=0.07859, over 13272.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2691, pruned_loss=0.08391, over 2512393.25 frames. 
], batch size: 46, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:56:24,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260956.66666666666, ans=0.1 2024-06-20 19:56:32,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=260975.0, ans=0.0 2024-06-20 19:56:38,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=260993.33333333334, ans=10.0 2024-06-20 19:56:38,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=260993.33333333334, ans=0.125 2024-06-20 19:56:46,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=260993.33333333334, ans=0.0 2024-06-20 19:56:49,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=260993.33333333334, ans=0.125 2024-06-20 19:57:11,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261030.0, ans=0.1 2024-06-20 19:57:17,730 INFO [train.py:1028] (0/2) Epoch 15, batch 750, loss[loss=0.2157, simple_loss=0.2684, pruned_loss=0.0815, over 13269.00 frames. ], tot_loss[loss=0.218, simple_loss=0.269, pruned_loss=0.08348, over 2527716.58 frames. ], batch size: 63, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:57:30,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=261066.66666666666, ans=0.125 2024-06-20 19:57:39,391 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.59 vs. limit=15.0 2024-06-20 19:57:41,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=261085.0, ans=0.125 2024-06-20 19:57:41,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=261085.0, ans=0.2 2024-06-20 19:57:49,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.65 vs. limit=15.0 2024-06-20 19:57:51,520 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.880e+02 1.964e+02 2.128e+02 3.494e+02, threshold=3.927e+02, percent-clipped=0.0 2024-06-20 19:58:02,800 INFO [train.py:1028] (0/2) Epoch 15, batch 800, loss[loss=0.2145, simple_loss=0.2724, pruned_loss=0.07836, over 12928.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2687, pruned_loss=0.08357, over 2540625.30 frames. ], batch size: 36, lr: 3.95e-03, grad_scale: 32.0 2024-06-20 19:58:05,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=261140.0, ans=0.125 2024-06-20 19:58:06,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.26 vs. limit=22.5 2024-06-20 19:58:18,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.79 vs. 
2024-06-20 19:58:18,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=261158.33333333334, ans=0.125
2024-06-20 19:58:36,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=15.0
2024-06-20 19:58:40,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=261213.33333333334, ans=0.125
2024-06-20 19:58:50,650 INFO [train.py:1028] (0/2) Epoch 15, batch 850, loss[loss=0.2269, simple_loss=0.2734, pruned_loss=0.09021, over 13160.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2685, pruned_loss=0.0835, over 2550394.86 frames. ], batch size: 95, lr: 3.95e-03, grad_scale: 32.0
2024-06-20 19:58:53,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.38 vs. limit=22.5
2024-06-20 19:58:56,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=261231.66666666666, ans=0.125
2024-06-20 19:59:00,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=261250.0, ans=0.2
2024-06-20 19:59:08,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.70 vs. limit=12.0
2024-06-20 19:59:11,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=261268.33333333334, ans=0.125
2024-06-20 19:59:14,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=261268.33333333334, ans=0.0
2024-06-20 19:59:25,707 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.625e+02 1.894e+02 2.082e+02 2.268e+02 3.992e+02, threshold=4.164e+02, percent-clipped=1.0
2024-06-20 19:59:33,654 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 19:59:34,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=261323.33333333334, ans=0.0
2024-06-20 19:59:35,084 INFO [train.py:1028] (0/2) Epoch 15, batch 900, loss[loss=0.2225, simple_loss=0.2802, pruned_loss=0.08237, over 12899.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2681, pruned_loss=0.08342, over 2555606.85 frames. ], batch size: 36, lr: 3.95e-03, grad_scale: 32.0
2024-06-20 19:59:55,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=261341.66666666666, ans=0.0
2024-06-20 20:00:01,488 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0
2024-06-20 20:00:16,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=261396.66666666666, ans=0.125
2024-06-20 20:00:17,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=261396.66666666666, ans=0.0
2024-06-20 20:00:21,436 INFO [train.py:1028] (0/2) Epoch 15, batch 950, loss[loss=0.197, simple_loss=0.2526, pruned_loss=0.07073, over 12956.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.268, pruned_loss=0.0832, over 2558971.32 frames. ], batch size: 39, lr: 3.95e-03, grad_scale: 32.0
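The Whitening lines compare a per-module statistic of the activations against a limit (metric=7.79 vs. limit=15.0 and so on); while the metric stays under the limit the module is inert. A sketch of one way such a metric can be defined, paraphrased from memory of scaling.py, so treat the exact formula as an assumption: it measures how far the (grouped) feature covariance is from a multiple of the identity, equalling 1.0 for perfectly "white" activations and growing as channels become correlated or unevenly scaled.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (..., num_channels); flatten all leading dims into "frames".
    x = x.reshape(-1, x.shape[-1])
    num_frames, num_channels = x.shape
    channels_per_group = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, channels_per_group).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)          # center per group
    covar = torch.matmul(x.transpose(1, 2), x)   # (groups, c, c)
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    mean_sq = (covar ** 2).sum() / (num_groups * channels_per_group)
    # 1.0 when covar is a multiple of the identity; larger otherwise.
    return mean_sq / (mean_diag ** 2 + 1e-20)

print(whitening_metric(torch.randn(4000, 192)))  # close to 1.0 for white noise
```

The num_groups=8, num_channels=256 entries (whiten_keys) apply the same idea per group of attention-key channels rather than across the whole feature vector.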
2024-06-20 20:00:22,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=261415.0, ans=0.125
2024-06-20 20:00:29,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=261433.33333333334, ans=0.125
2024-06-20 20:00:40,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=261451.66666666666, ans=0.2
2024-06-20 20:00:45,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.22 vs. limit=15.0
2024-06-20 20:00:46,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=261451.66666666666, ans=0.125
2024-06-20 20:00:56,555 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.926e+02 2.023e+02 2.207e+02 2.761e+02, threshold=4.046e+02, percent-clipped=0.0
2024-06-20 20:01:03,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=261488.33333333334, ans=0.2
2024-06-20 20:01:05,789 INFO [train.py:1028] (0/2) Epoch 15, batch 1000, loss[loss=0.2213, simple_loss=0.2777, pruned_loss=0.08246, over 13027.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2682, pruned_loss=0.08367, over 2561257.00 frames. ], batch size: 48, lr: 3.95e-03, grad_scale: 32.0
2024-06-20 20:01:17,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=261525.0, ans=10.0
2024-06-20 20:01:42,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=261580.0, ans=0.125
2024-06-20 20:01:48,465 INFO [train.py:1028] (0/2) Epoch 15, batch 1050, loss[loss=0.2012, simple_loss=0.2625, pruned_loss=0.06993, over 13215.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2695, pruned_loss=0.08419, over 2564520.10 frames. ], batch size: 77, lr: 3.95e-03, grad_scale: 64.0
2024-06-20 20:01:54,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=12.0
2024-06-20 20:02:05,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=261635.0, ans=0.0
2024-06-20 20:02:07,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=261635.0, ans=0.2
2024-06-20 20:02:33,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0
2024-06-20 20:02:36,114 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 1.913e+02 2.045e+02 2.185e+02 3.085e+02, threshold=4.089e+02, percent-clipped=0.0
2024-06-20 20:02:48,044 INFO [train.py:1028] (0/2) Epoch 15, batch 1100, loss[loss=0.2205, simple_loss=0.2761, pruned_loss=0.0824, over 13308.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.27, pruned_loss=0.08438, over 2569264.72 frames. ], batch size: 52, lr: 3.95e-03, grad_scale: 64.0
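Between the batch-1000 and batch-1050 records the logged grad_scale doubles from 32.0 to 64.0, and later in this section it drops back to 32.0 (around batch 2650): the signature of dynamic fp16 loss scaling, which grows the scale after a run of overflow-free steps and backs off when gradients overflow. A generic torch.cuda.amp sketch; the growth interval and factors shown are PyTorch defaults, not values read from this run:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,       # matches the grad_scale first seen above
    growth_factor=2.0,     # doubles the scale...
    growth_interval=2000,  # ...after this many overflow-free steps
    backoff_factor=0.5,    # and halves it when a step overflows
)

def training_step(model, optimizer, criterion, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # skips the update on inf/nan grads
    scaler.update()                # grow or back off grad_scale
```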
2024-06-20 20:02:55,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=261690.0, ans=0.125
2024-06-20 20:02:56,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=261708.33333333334, ans=0.125
2024-06-20 20:02:58,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=261708.33333333334, ans=0.125
2024-06-20 20:03:01,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=261708.33333333334, ans=0.0
2024-06-20 20:03:20,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=261745.0, ans=0.2
2024-06-20 20:03:26,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=261763.33333333334, ans=0.125
2024-06-20 20:03:31,471 INFO [train.py:1028] (0/2) Epoch 15, batch 1150, loss[loss=0.2064, simple_loss=0.2629, pruned_loss=0.07496, over 13243.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2702, pruned_loss=0.08452, over 2570822.18 frames. ], batch size: 52, lr: 3.95e-03, grad_scale: 64.0
2024-06-20 20:03:36,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261781.66666666666, ans=0.1
2024-06-20 20:03:58,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=261836.66666666666, ans=0.125
2024-06-20 20:04:06,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.935e+02 2.092e+02 2.352e+02 3.092e+02, threshold=4.184e+02, percent-clipped=0.0
2024-06-20 20:04:18,338 INFO [train.py:1028] (0/2) Epoch 15, batch 1200, loss[loss=0.2006, simple_loss=0.257, pruned_loss=0.07216, over 13160.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2705, pruned_loss=0.08475, over 2572952.61 frames. ], batch size: 77, lr: 3.95e-03, grad_scale: 64.0
2024-06-20 20:04:21,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=261873.33333333334, ans=0.125
2024-06-20 20:04:24,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=261873.33333333334, ans=0.04949747468305833
2024-06-20 20:04:32,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=261891.66666666666, ans=0.0
2024-06-20 20:04:47,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=261928.33333333334, ans=0.125
2024-06-20 20:04:51,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=261946.66666666666, ans=0.0
2024-06-20 20:05:05,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=261946.66666666666, ans=0.0
2024-06-20 20:05:07,683 INFO [train.py:1028] (0/2) Epoch 15, batch 1250, loss[loss=0.2118, simple_loss=0.2604, pruned_loss=0.08167, over 13144.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2704, pruned_loss=0.08482, over 2582413.38 frames. ], batch size: 112, lr: 3.95e-03, grad_scale: 64.0
2024-06-20 20:05:19,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=261983.33333333334, ans=0.0
2024-06-20 20:05:49,142 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.894e+02 1.992e+02 2.183e+02 2.926e+02, threshold=3.983e+02, percent-clipped=0.0
2024-06-20 20:05:50,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.19 vs. limit=15.0
2024-06-20 20:05:57,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0
2024-06-20 20:05:58,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262056.66666666666, ans=0.1
2024-06-20 20:05:59,643 INFO [train.py:1028] (0/2) Epoch 15, batch 1300, loss[loss=0.2273, simple_loss=0.273, pruned_loss=0.0908, over 12746.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2708, pruned_loss=0.0849, over 2582655.29 frames. ], batch size: 177, lr: 3.95e-03, grad_scale: 64.0
2024-06-20 20:05:59,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=262056.66666666666, ans=0.0
2024-06-20 20:06:04,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=262056.66666666666, ans=0.05
2024-06-20 20:06:43,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=262148.3333333333, ans=0.0
2024-06-20 20:06:43,948 INFO [train.py:1028] (0/2) Epoch 15, batch 1350, loss[loss=0.2333, simple_loss=0.2899, pruned_loss=0.08836, over 13245.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.271, pruned_loss=0.08494, over 2585298.73 frames. ], batch size: 59, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:06:47,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=262148.3333333333, ans=0.125
2024-06-20 20:07:08,270 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0
2024-06-20 20:07:16,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.20 vs. limit=15.0
2024-06-20 20:07:20,326 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.887e+02 2.003e+02 2.204e+02 2.767e+02, threshold=4.006e+02, percent-clipped=0.0
2024-06-20 20:07:25,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=262221.6666666667, ans=0.0
2024-06-20 20:07:29,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=262221.6666666667, ans=0.125
2024-06-20 20:07:31,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=262240.0, ans=0.2
2024-06-20 20:07:32,013 INFO [train.py:1028] (0/2) Epoch 15, batch 1400, loss[loss=0.2019, simple_loss=0.2542, pruned_loss=0.07476, over 12909.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.271, pruned_loss=0.08518, over 2587274.77 frames. ], batch size: 26, lr: 3.94e-03, grad_scale: 64.0
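The learning rate in these records decays smoothly mid-epoch (3.95e-03 above, 3.94e-03 from the batch-1350 record on). A schedule with exactly this shape is icefall's Eden scheduler; the formula below is reconstructed from memory, so treat it as a sketch to be checked against optim.py rather than a definitive implementation:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Power-law decay in both the batch index and the epoch count.
    # The default constants here are assumptions for illustration.
    return (base_lr
            * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)

# With base_lr=0.035 (assumed) and a zero-based epoch counter, batch
# ~144000 (cf. checkpoint-144000.pt below) in epoch 15 gives:
print(f"{eden_lr(0.035, 144000, 14):.2e}")  # 3.93e-03, matching the log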
2024-06-20 20:07:43,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=262258.3333333333, ans=0.0
2024-06-20 20:07:51,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=262276.6666666667, ans=0.125
2024-06-20 20:08:03,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=262295.0, ans=0.125
2024-06-20 20:08:19,027 INFO [train.py:1028] (0/2) Epoch 15, batch 1450, loss[loss=0.2099, simple_loss=0.2516, pruned_loss=0.0841, over 13124.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2709, pruned_loss=0.08519, over 2588143.05 frames. ], batch size: 121, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:08:27,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=262350.0, ans=15.0
2024-06-20 20:08:31,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.86 vs. limit=6.0
2024-06-20 20:08:49,860 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.940e+02 2.041e+02 2.225e+02 3.024e+02, threshold=4.082e+02, percent-clipped=0.0
2024-06-20 20:08:59,969 INFO [train.py:1028] (0/2) Epoch 15, batch 1500, loss[loss=0.2256, simple_loss=0.2725, pruned_loss=0.08934, over 13222.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2707, pruned_loss=0.08488, over 2590381.80 frames. ], batch size: 83, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:09:00,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=262423.3333333333, ans=0.0
2024-06-20 20:09:07,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=262441.6666666667, ans=0.125
2024-06-20 20:09:37,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.84 vs. limit=15.0
2024-06-20 20:09:38,081 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-20 20:09:40,008 INFO [train.py:1028] (0/2) Epoch 15, batch 1550, loss[loss=0.2146, simple_loss=0.2618, pruned_loss=0.08374, over 13018.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2713, pruned_loss=0.08531, over 2584930.72 frames. ], batch size: 102, lr: 3.94e-03, grad_scale: 64.0
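Each train.py record carries two loss triples: loss[...] for the current batch (noisy, over roughly 13k frames) and tot_loss[...] over about 2.59M frames. That frame count levels off instead of growing without bound, so tot_loss behaves like an exponentially decayed, frame-weighted average; a decay of 1 - 1/200 per batch would settle near 200 x 13k = 2.6M frames, which is what the records show. A sketch of that bookkeeping (class name and decay are assumptions):

```python
class DecayingLossTracker:
    """Exponentially decayed, frame-weighted loss, like tot_loss[...]."""
    def __init__(self, reset_interval: int = 200):
        self.keep = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.keep + batch_loss * batch_frames
        self.frames = self.frames * self.keep + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tot = DecayingLossTracker()
tot.update(0.2019, 12909.0)  # the batch-1400 per-batch loss above
print(f"{tot.value:.4f}")
```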
2024-06-20 20:09:52,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=262533.3333333333, ans=0.0
2024-06-20 20:10:06,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=262551.6666666667, ans=0.125
2024-06-20 20:10:08,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=262570.0, ans=0.125
2024-06-20 20:10:09,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=262570.0, ans=0.2
2024-06-20 20:10:22,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 1.886e+02 2.023e+02 2.217e+02 2.941e+02, threshold=4.046e+02, percent-clipped=0.0
2024-06-20 20:10:30,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262588.3333333333, ans=0.1
2024-06-20 20:10:33,999 INFO [train.py:1028] (0/2) Epoch 15, batch 1600, loss[loss=0.2061, simple_loss=0.259, pruned_loss=0.07663, over 13254.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2719, pruned_loss=0.0854, over 2581072.49 frames. ], batch size: 77, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:10:55,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0
2024-06-20 20:10:55,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=262625.0, ans=0.0
2024-06-20 20:11:00,702 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 20:11:02,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=262643.3333333333, ans=0.125
2024-06-20 20:11:14,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=262661.6666666667, ans=0.2
2024-06-20 20:11:20,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=262680.0, ans=0.025
2024-06-20 20:11:26,154 INFO [train.py:1028] (0/2) Epoch 15, batch 1650, loss[loss=0.2198, simple_loss=0.2643, pruned_loss=0.08762, over 13152.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.272, pruned_loss=0.08574, over 2577260.96 frames. ], batch size: 95, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:11:34,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=262716.6666666667, ans=0.025
2024-06-20 20:11:38,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=262716.6666666667, ans=0.125
2024-06-20 20:11:42,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=262735.0, ans=0.0
2024-06-20 20:11:56,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=262753.3333333333, ans=0.125
2024-06-20 20:11:59,445 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.935e+02 2.155e+02 2.394e+02 3.573e+02, threshold=4.310e+02, percent-clipped=0.0
2024-06-20 20:12:11,174 INFO [train.py:1028] (0/2) Epoch 15, batch 1700, loss[loss=0.2353, simple_loss=0.2919, pruned_loss=0.08929, over 12539.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2718, pruned_loss=0.08505, over 2582837.24 frames. ], batch size: 25, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:12:24,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.48 vs. limit=22.5
2024-06-20 20:12:24,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=262808.3333333333, ans=0.1
2024-06-20 20:12:25,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=262808.3333333333, ans=0.125
2024-06-20 20:12:26,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=262808.3333333333, ans=0.125
2024-06-20 20:12:27,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=262808.3333333333, ans=0.025
2024-06-20 20:12:27,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262808.3333333333, ans=0.1
2024-06-20 20:12:40,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=262845.0, ans=0.05
2024-06-20 20:12:44,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=262845.0, ans=0.125
2024-06-20 20:12:59,158 INFO [train.py:1028] (0/2) Epoch 15, batch 1750, loss[loss=0.2262, simple_loss=0.2959, pruned_loss=0.07829, over 12491.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2714, pruned_loss=0.08474, over 2583180.51 frames. ], batch size: 22, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:13:01,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=262881.6666666667, ans=6.0
2024-06-20 20:13:02,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=262881.6666666667, ans=0.09899494936611666
2024-06-20 20:13:03,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=262881.6666666667, ans=0.035
2024-06-20 20:13:13,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=262900.0, ans=0.0
2024-06-20 20:13:30,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=262936.6666666667, ans=0.125
2024-06-20 20:13:33,875 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.52 vs. limit=22.5
2024-06-20 20:13:34,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=262936.6666666667, ans=0.125
2024-06-20 20:13:34,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=262936.6666666667, ans=0.125
2024-06-20 20:13:35,020 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.928e+02 2.050e+02 2.259e+02 3.168e+02, threshold=4.101e+02, percent-clipped=0.0
2024-06-20 20:13:43,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=262955.0, ans=0.0
2024-06-20 20:13:54,760 INFO [train.py:1028] (0/2) Epoch 15, batch 1800, loss[loss=0.1996, simple_loss=0.2539, pruned_loss=0.07264, over 13232.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.271, pruned_loss=0.08458, over 2582893.79 frames. ], batch size: 67, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:13:59,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=262973.3333333333, ans=0.125
2024-06-20 20:14:00,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.66 vs. limit=22.5
2024-06-20 20:14:22,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=263010.0, ans=0.125
2024-06-20 20:14:33,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=263046.6666666667, ans=0.1
2024-06-20 20:14:40,767 INFO [train.py:1028] (0/2) Epoch 15, batch 1850, loss[loss=0.2369, simple_loss=0.274, pruned_loss=0.09987, over 13264.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2713, pruned_loss=0.0847, over 2583435.84 frames. ], batch size: 83, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:15:00,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=263101.6666666667, ans=0.2
2024-06-20 20:15:04,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=263101.6666666667, ans=0.125
2024-06-20 20:15:15,926 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.904e+02 2.041e+02 2.213e+02 2.706e+02, threshold=4.081e+02, percent-clipped=0.0
2024-06-20 20:15:16,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263120.0, ans=0.0
2024-06-20 20:15:18,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=263138.3333333333, ans=0.125
2024-06-20 20:15:26,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.35 vs. limit=15.0
2024-06-20 20:15:27,976 INFO [train.py:1028] (0/2) Epoch 15, batch 1900, loss[loss=0.2166, simple_loss=0.2651, pruned_loss=0.08406, over 13130.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2712, pruned_loss=0.08502, over 2585733.24 frames. ], batch size: 95, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:15:34,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0
2024-06-20 20:15:37,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=263175.0, ans=0.2
2024-06-20 20:15:47,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=263193.3333333333, ans=0.125
2024-06-20 20:15:51,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=12.0
2024-06-20 20:15:57,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0
2024-06-20 20:16:20,278 INFO [train.py:1028] (0/2) Epoch 15, batch 1950, loss[loss=0.2263, simple_loss=0.282, pruned_loss=0.08531, over 13301.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2704, pruned_loss=0.08505, over 2591479.47 frames. ], batch size: 52, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:16:25,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263248.3333333333, ans=0.1
2024-06-20 20:16:40,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=263285.0, ans=0.125
2024-06-20 20:16:51,987 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.924e+02 2.038e+02 2.224e+02 2.942e+02, threshold=4.076e+02, percent-clipped=0.0
2024-06-20 20:16:52,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=263303.3333333333, ans=0.125
2024-06-20 20:17:01,447 INFO [train.py:1028] (0/2) Epoch 15, batch 2000, loss[loss=0.2259, simple_loss=0.2796, pruned_loss=0.08607, over 12551.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2704, pruned_loss=0.08516, over 2587186.74 frames. ], batch size: 22, lr: 3.94e-03, grad_scale: 64.0
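Records in this format are regular enough to scrape for plotting loss curves. A small self-contained sketch; the log path is a guess, so adjust it to wherever this file lives:

```python
import re

PAT = re.compile(
    r"Epoch (\d+), batch (\d+), .*?"
    r"tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)",
    re.S,
)

def scrape(path: str = "zipformer/exp/log-train"):  # hypothetical path
    with open(path) as f:
        text = f.read()
    # (epoch, batch, tot_loss) triples, e.g. (15, 2000, 0.2203) from above
    return [(int(m.group(1)), int(m.group(2)), float(m.group(3)))
            for m in PAT.finditer(text)]
```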
2024-06-20 20:17:09,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263358.3333333333, ans=0.0
2024-06-20 20:17:14,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=263358.3333333333, ans=0.0
2024-06-20 20:17:17,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.50 vs. limit=22.5
2024-06-20 20:17:38,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=263413.3333333333, ans=0.125
2024-06-20 20:17:41,675 INFO [train.py:1028] (0/2) Epoch 15, batch 2050, loss[loss=0.2073, simple_loss=0.2603, pruned_loss=0.07717, over 12452.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2703, pruned_loss=0.08509, over 2582643.79 frames. ], batch size: 29, lr: 3.94e-03, grad_scale: 64.0
2024-06-20 20:17:57,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=263450.0, ans=0.125
2024-06-20 20:18:07,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=263468.3333333333, ans=0.025
2024-06-20 20:18:07,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.27 vs. limit=22.5
2024-06-20 20:18:17,376 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.889e+02 2.063e+02 2.171e+02 2.727e+02, threshold=4.126e+02, percent-clipped=0.0
2024-06-20 20:18:17,818 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.38 vs. limit=12.0
2024-06-20 20:18:19,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=263505.0, ans=0.125
2024-06-20 20:18:24,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.09 vs. limit=15.0
2024-06-20 20:18:28,513 INFO [train.py:1028] (0/2) Epoch 15, batch 2100, loss[loss=0.219, simple_loss=0.2763, pruned_loss=0.08079, over 13207.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2706, pruned_loss=0.08502, over 2585036.43 frames. ], batch size: 59, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:18:42,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=263541.6666666667, ans=0.125
2024-06-20 20:18:44,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=263541.6666666667, ans=0.0
2024-06-20 20:18:48,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=263560.0, ans=0.125
2024-06-20 20:19:11,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263578.3333333333, ans=0.1
2024-06-20 20:19:18,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=263596.6666666667, ans=0.125
2024-06-20 20:19:22,952 INFO [train.py:1028] (0/2) Epoch 15, batch 2150, loss[loss=0.204, simple_loss=0.2608, pruned_loss=0.07361, over 13199.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2704, pruned_loss=0.08463, over 2587889.99 frames. ], batch size: 52, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:19:35,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=263615.0, ans=0.0
2024-06-20 20:19:43,075 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.91 vs. limit=22.5
2024-06-20 20:19:47,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=263651.6666666667, ans=0.0
2024-06-20 20:19:52,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.44 vs. limit=22.5
2024-06-20 20:19:54,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=263670.0, ans=0.125
2024-06-20 20:19:56,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.92 vs. limit=6.0
2024-06-20 20:20:00,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=263670.0, ans=0.0
2024-06-20 20:20:01,813 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.960e+02 2.157e+02 2.365e+02 2.902e+02, threshold=4.314e+02, percent-clipped=0.0
2024-06-20 20:20:03,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=263688.3333333333, ans=0.09899494936611666
2024-06-20 20:20:05,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=263688.3333333333, ans=0.07
2024-06-20 20:20:05,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=263688.3333333333, ans=0.0
2024-06-20 20:20:08,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=263688.3333333333, ans=0.125
2024-06-20 20:20:13,701 INFO [train.py:1028] (0/2) Epoch 15, batch 2200, loss[loss=0.22, simple_loss=0.2591, pruned_loss=0.09039, over 13220.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2705, pruned_loss=0.08497, over 2588153.79 frames. ], batch size: 83, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:20:16,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=263706.6666666667, ans=0.0
2024-06-20 20:20:19,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.35 vs. limit=10.0
2024-06-20 20:20:23,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=263725.0, ans=0.0
2024-06-20 20:20:32,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=263743.3333333333, ans=0.125
2024-06-20 20:20:46,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=263761.6666666667, ans=0.1
2024-06-20 20:20:49,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=263761.6666666667, ans=0.2
2024-06-20 20:21:01,256 INFO [train.py:1028] (0/2) Epoch 15, batch 2250, loss[loss=0.2178, simple_loss=0.2704, pruned_loss=0.08257, over 13279.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2709, pruned_loss=0.08497, over 2586865.27 frames. ], batch size: 63, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:21:09,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=263816.6666666667, ans=0.025
2024-06-20 20:21:14,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=263816.6666666667, ans=0.125
2024-06-20 20:21:30,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=263853.3333333333, ans=0.0
2024-06-20 20:21:31,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=263853.3333333333, ans=0.125
2024-06-20 20:21:34,981 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.883e+02 2.003e+02 2.199e+02 2.864e+02, threshold=4.005e+02, percent-clipped=0.0
2024-06-20 20:21:38,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.71 vs. limit=15.0
2024-06-20 20:21:46,756 INFO [train.py:1028] (0/2) Epoch 15, batch 2300, loss[loss=0.2067, simple_loss=0.2536, pruned_loss=0.07995, over 12898.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2704, pruned_loss=0.08464, over 2581460.02 frames. ], batch size: 33, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:21:50,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=263890.0, ans=0.125
2024-06-20 20:22:14,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.90 vs. limit=10.0
2024-06-20 20:22:15,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=263926.6666666667, ans=0.1
2024-06-20 20:22:24,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0
2024-06-20 20:22:33,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=263963.3333333333, ans=0.025
2024-06-20 20:22:44,030 INFO [train.py:1028] (0/2) Epoch 15, batch 2350, loss[loss=0.2139, simple_loss=0.2693, pruned_loss=0.07924, over 13240.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2705, pruned_loss=0.08518, over 2585993.33 frames. ], batch size: 67, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:22:48,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.01 vs. limit=15.0
2024-06-20 20:22:53,450 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-144000.pt
2024-06-20 20:23:05,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264000.0, ans=0.1
2024-06-20 20:23:06,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=264000.0, ans=0.125
2024-06-20 20:23:13,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264018.3333333333, ans=0.1
2024-06-20 20:23:19,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=264036.6666666667, ans=0.0
2024-06-20 20:23:27,974 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.676e+02 1.876e+02 2.011e+02 2.183e+02 3.099e+02, threshold=4.022e+02, percent-clipped=0.0
2024-06-20 20:23:35,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.40 vs. limit=22.5
2024-06-20 20:23:39,499 INFO [train.py:1028] (0/2) Epoch 15, batch 2400, loss[loss=0.2177, simple_loss=0.2673, pruned_loss=0.08401, over 13316.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2698, pruned_loss=0.08494, over 2589221.59 frames. ], batch size: 46, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:23:40,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264073.3333333333, ans=0.1
2024-06-20 20:24:00,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=264110.0, ans=0.125
2024-06-20 20:24:01,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=264110.0, ans=0.0
2024-06-20 20:24:07,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264128.3333333333, ans=0.125
2024-06-20 20:24:15,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264146.6666666667, ans=0.1
2024-06-20 20:24:19,280 INFO [train.py:1028] (0/2) Epoch 15, batch 2450, loss[loss=0.183, simple_loss=0.2367, pruned_loss=0.06464, over 13295.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2689, pruned_loss=0.08504, over 2584902.90 frames. ], batch size: 63, lr: 3.93e-03, grad_scale: 64.0
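The checkpoint.py line above writes checkpoint-144000.pt, a snapshot keyed on the global batch counter rather than the epoch. A generic sketch of batch-indexed checkpointing with plain torch.save; the helper name and the 4000-batch interval are assumptions for illustration, not icefall's exact code:

```python
import torch
from pathlib import Path

def maybe_save_checkpoint(model, optimizer, scheduler, scaler,
                          batch_idx_train: int, save_every_n: int = 4000,
                          exp_dir: Path = Path("zipformer/exp")) -> None:
    # Mirrors "Saving checkpoint to zipformer/exp/checkpoint-144000.pt".
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "grad_scaler": scaler.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        exp_dir / f"checkpoint-{batch_idx_train}.pt",
    )
```

Saving optimizer, scheduler, and grad-scaler state alongside the model is what makes a later restart bitwise-continuable rather than a cold fine-tune.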
2024-06-20 20:24:43,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=264201.6666666667, ans=0.125
2024-06-20 20:24:52,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=15.0
2024-06-20 20:24:53,808 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.953e+02 2.081e+02 2.249e+02 3.079e+02, threshold=4.162e+02, percent-clipped=0.0
2024-06-20 20:24:54,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=264238.3333333333, ans=0.0
2024-06-20 20:24:54,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=264238.3333333333, ans=0.125
2024-06-20 20:24:57,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=264238.3333333333, ans=0.125
2024-06-20 20:25:01,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=264238.3333333333, ans=0.0
2024-06-20 20:25:01,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5
2024-06-20 20:25:03,863 INFO [train.py:1028] (0/2) Epoch 15, batch 2500, loss[loss=0.2099, simple_loss=0.2556, pruned_loss=0.08206, over 13244.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2679, pruned_loss=0.08464, over 2588651.11 frames. ], batch size: 83, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:25:09,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264256.6666666667, ans=0.125
2024-06-20 20:25:23,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=264293.3333333333, ans=0.0
2024-06-20 20:25:28,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=12.0
2024-06-20 20:25:32,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=264311.6666666667, ans=0.125
2024-06-20 20:25:32,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=264311.6666666667, ans=0.2
2024-06-20 20:25:34,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.37 vs. limit=10.0
2024-06-20 20:25:39,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=264330.0, ans=0.0
2024-06-20 20:25:39,762 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0
2024-06-20 20:25:42,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=264330.0, ans=0.125
2024-06-20 20:25:45,692 INFO [train.py:1028] (0/2) Epoch 15, batch 2550, loss[loss=0.2038, simple_loss=0.2597, pruned_loss=0.07394, over 12555.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2669, pruned_loss=0.08428, over 2588118.84 frames. ], batch size: 22, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:25:51,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=264348.3333333333, ans=0.125
2024-06-20 20:25:52,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=264348.3333333333, ans=0.125
2024-06-20 20:25:58,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=264366.6666666667, ans=0.025
2024-06-20 20:26:04,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=264385.0, ans=0.125
2024-06-20 20:26:06,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=264385.0, ans=15.0
2024-06-20 20:26:22,651 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 1.954e+02 2.071e+02 2.323e+02 3.838e+02, threshold=4.141e+02, percent-clipped=0.0
2024-06-20 20:26:25,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.08 vs. limit=15.0
2024-06-20 20:26:25,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=264421.6666666667, ans=0.0
2024-06-20 20:26:33,481 INFO [train.py:1028] (0/2) Epoch 15, batch 2600, loss[loss=0.2482, simple_loss=0.2973, pruned_loss=0.09955, over 13283.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2658, pruned_loss=0.08402, over 2588643.85 frames. ], batch size: 52, lr: 3.93e-03, grad_scale: 64.0
2024-06-20 20:26:36,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=15.0
2024-06-20 20:26:50,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264458.3333333333, ans=0.1
2024-06-20 20:26:51,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264476.6666666667, ans=0.125
2024-06-20 20:27:17,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=264513.3333333333, ans=0.07
2024-06-20 20:27:26,788 INFO [train.py:1028] (0/2) Epoch 15, batch 2650, loss[loss=0.2001, simple_loss=0.2425, pruned_loss=0.07891, over 13001.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2641, pruned_loss=0.08317, over 2588913.77 frames. ], batch size: 144, lr: 3.93e-03, grad_scale: 32.0
2024-06-20 20:27:28,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0
2024-06-20 20:27:29,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264531.6666666667, ans=0.1
2024-06-20 20:27:29,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=264531.6666666667, ans=0.07
2024-06-20 20:27:31,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=264531.6666666667, ans=0.5
2024-06-20 20:27:37,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=264550.0, ans=0.09899494936611666
2024-06-20 20:27:39,071 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0
2024-06-20 20:27:39,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=264550.0, ans=0.0
2024-06-20 20:27:48,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=264568.3333333333, ans=0.2
2024-06-20 20:27:55,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=264586.6666666667, ans=0.0
2024-06-20 20:27:59,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 1.920e+02 2.110e+02 2.307e+02 2.909e+02, threshold=4.220e+02, percent-clipped=0.0
2024-06-20 20:28:10,856 INFO [train.py:1028] (0/2) Epoch 15, batch 2700, loss[loss=0.2179, simple_loss=0.2591, pruned_loss=0.08834, over 13241.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2627, pruned_loss=0.0828, over 2586606.44 frames. ], batch size: 89, lr: 3.93e-03, grad_scale: 32.0
2024-06-20 20:28:18,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.98 vs. limit=22.5
2024-06-20 20:28:32,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=264660.0, ans=0.2
2024-06-20 20:28:40,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=264678.3333333333, ans=0.025
2024-06-20 20:28:51,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=264696.6666666667, ans=0.0
2024-06-20 20:28:58,434 INFO [train.py:1028] (0/2) Epoch 15, batch 2750, loss[loss=0.2073, simple_loss=0.2628, pruned_loss=0.07587, over 13318.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2622, pruned_loss=0.08244, over 2583348.03 frames. ], batch size: 43, lr: 3.93e-03, grad_scale: 32.0
2024-06-20 20:28:58,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=264715.0, ans=0.025
2024-06-20 20:29:09,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0
2024-06-20 20:29:11,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=264733.3333333333, ans=0.125
2024-06-20 20:29:29,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264770.0, ans=0.1
2024-06-20 20:29:32,436 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.865e+02 1.995e+02 2.156e+02 3.232e+02, threshold=3.990e+02, percent-clipped=0.0
2024-06-20 20:29:41,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=264788.3333333333, ans=0.125
2024-06-20 20:29:48,984 INFO [train.py:1028] (0/2) Epoch 15, batch 2800, loss[loss=0.2207, simple_loss=0.2528, pruned_loss=0.09429, over 10926.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2617, pruned_loss=0.08249, over 2580917.45 frames. ], batch size: 304, lr: 3.93e-03, grad_scale: 32.0
2024-06-20 20:29:59,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.46 vs. limit=15.0
2024-06-20 20:30:04,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=264825.0, ans=0.025
2024-06-20 20:30:15,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=264843.3333333333, ans=0.125
2024-06-20 20:30:18,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0
2024-06-20 20:30:20,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.24 vs. limit=15.0
2024-06-20 20:30:28,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=264861.6666666667, ans=0.125
2024-06-20 20:30:33,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=264880.0, ans=10.0
2024-06-20 20:30:43,492 INFO [train.py:1028] (0/2) Epoch 15, batch 2850, loss[loss=0.2127, simple_loss=0.266, pruned_loss=0.07967, over 13075.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2605, pruned_loss=0.08192, over 2578455.80 frames. ], batch size: 48, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:30:49,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264898.3333333333, ans=0.125
2024-06-20 20:30:59,717 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0
2024-06-20 20:31:06,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.91 vs. limit=15.0
2024-06-20 20:31:18,136 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.907e+02 2.062e+02 2.339e+02 3.362e+02, threshold=4.124e+02, percent-clipped=0.0
2024-06-20 20:31:23,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=264971.6666666667, ans=0.0
2024-06-20 20:31:28,114 INFO [train.py:1028] (0/2) Epoch 15, batch 2900, loss[loss=0.1993, simple_loss=0.2477, pruned_loss=0.07542, over 13095.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2589, pruned_loss=0.08137, over 2586008.30 frames. ], batch size: 55, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:31:48,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=265026.6666666667, ans=0.2
2024-06-20 20:31:50,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.94 vs. limit=22.5
2024-06-20 20:31:52,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=8.0
2024-06-20 20:31:52,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265026.6666666667, ans=0.125
2024-06-20 20:31:52,819 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.603e+01
2024-06-20 20:32:06,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=265045.0, ans=0.5
2024-06-20 20:32:11,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.70 vs. limit=6.0
2024-06-20 20:32:15,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265063.3333333333, ans=0.125
2024-06-20 20:32:17,843 INFO [train.py:1028] (0/2) Epoch 15, batch 2950, loss[loss=0.2024, simple_loss=0.2617, pruned_loss=0.07155, over 13178.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2585, pruned_loss=0.08137, over 2580805.58 frames. ], batch size: 43, lr: 3.92e-03, grad_scale: 32.0
2024-06-20 20:32:49,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=265118.3333333333, ans=0.5
2024-06-20 20:32:51,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265118.3333333333, ans=0.1
2024-06-20 20:33:01,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.47 vs. limit=15.0
2024-06-20 20:33:01,648 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.831e+02 1.923e+02 2.084e+02 2.737e+02, threshold=3.847e+02, percent-clipped=0.0
2024-06-20 20:33:10,889 INFO [train.py:1028] (0/2) Epoch 15, batch 3000, loss[loss=0.1965, simple_loss=0.2525, pruned_loss=0.07022, over 13175.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2575, pruned_loss=0.08084, over 2578823.48 frames. ], batch size: 59, lr: 3.92e-03, grad_scale: 32.0
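Most WithLoss lines in this log report loss-sum=0.000e+00, but just above one layer's self-attention weights pick up a nonzero auxiliary penalty (loss-sum=1.603e+01). The pattern suggests a regularizer attached to an intermediate tensor that stays at zero while some statistic is under its limit and only contributes when the limit is exceeded; the sketch below shows that shape generically and is not scaling.py's actual mechanism:

```python
import torch

def auxiliary_penalty(x: torch.Tensor, limit: float,
                      scale: float = 1.0e-04) -> torch.Tensor:
    # Zero while the statistic stays under `limit`, mirroring the mostly
    # 0.000e+00 loss-sum entries; the statistic here is a stand-in.
    metric = (x ** 2).mean()
    excess = torch.clamp(metric - limit, min=0.0)
    return scale * excess * x.numel()  # added to the training loss

attn = torch.softmax(torch.randn(8, 100, 100), dim=-1)
print(f"loss-sum={auxiliary_penalty(attn, limit=0.5).item():.3e}")
```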
], batch size: 59, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:33:10,891 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 20:33:17,002 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.0415, 3.5477, 3.9857, 3.6959], device='cuda:0') 2024-06-20 20:33:20,974 INFO [train.py:1060] (0/2) Epoch 15, validation: loss=0.1888, simple_loss=0.2537, pruned_loss=0.06193, over 351949.00 frames. 2024-06-20 20:33:20,977 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 20:33:28,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=265173.3333333333, ans=0.125 2024-06-20 20:33:43,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=15.0 2024-06-20 20:33:48,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=265210.0, ans=0.2 2024-06-20 20:33:48,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.76 vs. limit=10.0 2024-06-20 20:34:12,551 INFO [train.py:1028] (0/2) Epoch 15, batch 3050, loss[loss=0.2171, simple_loss=0.2604, pruned_loss=0.08693, over 13286.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2571, pruned_loss=0.08107, over 2578669.72 frames. ], batch size: 46, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:34:12,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=265265.0, ans=0.0 2024-06-20 20:34:30,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=265283.3333333333, ans=0.0 2024-06-20 20:34:42,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=265320.0, ans=0.125 2024-06-20 20:34:42,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.69 vs. limit=15.0 2024-06-20 20:34:50,784 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.642e+02 1.906e+02 2.004e+02 2.143e+02 3.096e+02, threshold=4.008e+02, percent-clipped=0.0 2024-06-20 20:34:55,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=265338.3333333333, ans=0.0 2024-06-20 20:34:59,743 INFO [train.py:1028] (0/2) Epoch 15, batch 3100, loss[loss=0.1857, simple_loss=0.2315, pruned_loss=0.06992, over 13073.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2559, pruned_loss=0.08026, over 2579674.90 frames. ], batch size: 144, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:35:19,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265393.3333333333, ans=0.1 2024-06-20 20:35:52,009 INFO [train.py:1028] (0/2) Epoch 15, batch 3150, loss[loss=0.1955, simple_loss=0.244, pruned_loss=0.07352, over 13064.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2551, pruned_loss=0.08001, over 2581765.67 frames. 
], batch size: 159, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:35:54,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=265448.3333333333, ans=0.0 2024-06-20 20:35:54,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=265448.3333333333, ans=0.125 2024-06-20 20:36:05,086 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=12.0 2024-06-20 20:36:10,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=265485.0, ans=10.0 2024-06-20 20:36:16,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=265485.0, ans=0.125 2024-06-20 20:36:16,527 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.46 vs. limit=22.5 2024-06-20 20:36:18,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.30 vs. limit=22.5 2024-06-20 20:36:29,502 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 1.853e+02 1.979e+02 2.200e+02 2.868e+02, threshold=3.958e+02, percent-clipped=0.0 2024-06-20 20:36:31,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=265521.6666666667, ans=0.125 2024-06-20 20:36:40,297 INFO [train.py:1028] (0/2) Epoch 15, batch 3200, loss[loss=0.2027, simple_loss=0.2514, pruned_loss=0.07694, over 13099.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2546, pruned_loss=0.07984, over 2581621.02 frames. ], batch size: 55, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:36:42,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=265540.0, ans=0.125 2024-06-20 20:36:48,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=265540.0, ans=0.125 2024-06-20 20:36:48,499 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:37:01,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=15.0 2024-06-20 20:37:12,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=265595.0, ans=0.025 2024-06-20 20:37:15,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2024-06-20 20:37:25,063 INFO [train.py:1028] (0/2) Epoch 15, batch 3250, loss[loss=0.1926, simple_loss=0.2423, pruned_loss=0.07145, over 13313.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2546, pruned_loss=0.07994, over 2585954.09 frames. ], batch size: 72, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:37:56,523 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.18 vs. 
limit=22.5 2024-06-20 20:38:06,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265686.6666666667, ans=0.1 2024-06-20 20:38:08,215 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.923e+02 2.066e+02 2.257e+02 3.249e+02, threshold=4.133e+02, percent-clipped=0.0 2024-06-20 20:38:19,383 INFO [train.py:1028] (0/2) Epoch 15, batch 3300, loss[loss=0.2318, simple_loss=0.2751, pruned_loss=0.09422, over 12731.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2542, pruned_loss=0.07953, over 2581784.60 frames. ], batch size: 176, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:38:22,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=265723.3333333333, ans=0.1 2024-06-20 20:38:34,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=265741.6666666667, ans=0.125 2024-06-20 20:38:44,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=265760.0, ans=0.0 2024-06-20 20:38:47,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265760.0, ans=0.125 2024-06-20 20:38:49,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.01 vs. limit=15.0 2024-06-20 20:38:52,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=265778.3333333333, ans=0.125 2024-06-20 20:39:08,941 INFO [train.py:1028] (0/2) Epoch 15, batch 3350, loss[loss=0.1903, simple_loss=0.2361, pruned_loss=0.0723, over 12882.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2535, pruned_loss=0.0798, over 2577194.33 frames. ], batch size: 158, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:39:15,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=265815.0, ans=0.125 2024-06-20 20:39:22,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=265833.3333333333, ans=0.025 2024-06-20 20:39:44,477 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.852e+02 2.022e+02 2.197e+02 3.049e+02, threshold=4.045e+02, percent-clipped=0.0 2024-06-20 20:39:47,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=265888.3333333333, ans=0.125 2024-06-20 20:39:54,075 INFO [train.py:1028] (0/2) Epoch 15, batch 3400, loss[loss=0.1789, simple_loss=0.2358, pruned_loss=0.06104, over 12441.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2532, pruned_loss=0.08019, over 2574799.00 frames. 
], batch size: 22, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:39:57,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=265906.6666666667, ans=0.035 2024-06-20 20:40:02,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265925.0, ans=0.1 2024-06-20 20:40:05,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=265925.0, ans=0.0 2024-06-20 20:40:06,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=265925.0, ans=0.2 2024-06-20 20:40:27,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.51 vs. limit=10.0 2024-06-20 20:40:36,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=265980.0, ans=0.125 2024-06-20 20:40:45,946 INFO [train.py:1028] (0/2) Epoch 15, batch 3450, loss[loss=0.2235, simple_loss=0.2636, pruned_loss=0.09168, over 12774.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2524, pruned_loss=0.07978, over 2575763.86 frames. ], batch size: 176, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:40:49,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=265998.3333333333, ans=0.0 2024-06-20 20:40:58,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=15.0 2024-06-20 20:41:03,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=266035.0, ans=0.95 2024-06-20 20:41:18,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2024-06-20 20:41:26,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.68 vs. limit=10.0 2024-06-20 20:41:28,984 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.795e+02 1.918e+02 2.065e+02 2.612e+02, threshold=3.836e+02, percent-clipped=0.0 2024-06-20 20:41:39,776 INFO [train.py:1028] (0/2) Epoch 15, batch 3500, loss[loss=0.1891, simple_loss=0.2425, pruned_loss=0.06791, over 12901.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2518, pruned_loss=0.07915, over 2576431.41 frames. ], batch size: 33, lr: 3.92e-03, grad_scale: 32.0 2024-06-20 20:42:06,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266145.0, ans=0.1 2024-06-20 20:42:17,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=266163.3333333333, ans=0.0 2024-06-20 20:42:20,711 INFO [train.py:1028] (0/2) Epoch 15, batch 3550, loss[loss=0.1978, simple_loss=0.2425, pruned_loss=0.07657, over 13180.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2514, pruned_loss=0.07854, over 2578474.47 frames. 
], batch size: 95, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:42:33,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=266200.0, ans=0.125 2024-06-20 20:42:34,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=266200.0, ans=0.125 2024-06-20 20:42:51,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=266236.6666666667, ans=0.125 2024-06-20 20:42:56,762 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.804e+02 1.962e+02 2.114e+02 2.570e+02, threshold=3.923e+02, percent-clipped=0.0 2024-06-20 20:43:04,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=266255.0, ans=0.1 2024-06-20 20:43:07,586 INFO [train.py:1028] (0/2) Epoch 15, batch 3600, loss[loss=0.2199, simple_loss=0.2682, pruned_loss=0.08582, over 13254.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2515, pruned_loss=0.07872, over 2581589.05 frames. ], batch size: 49, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:43:22,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=266291.6666666667, ans=0.125 2024-06-20 20:43:36,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2024-06-20 20:43:36,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=266310.0, ans=0.125 2024-06-20 20:43:47,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=266328.3333333333, ans=0.0 2024-06-20 20:43:52,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=266346.6666666667, ans=0.0 2024-06-20 20:43:56,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.16 vs. limit=15.0 2024-06-20 20:44:02,998 INFO [train.py:1028] (0/2) Epoch 15, batch 3650, loss[loss=0.1896, simple_loss=0.237, pruned_loss=0.07112, over 13053.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2514, pruned_loss=0.07831, over 2579803.58 frames. ], batch size: 102, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:44:06,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=266365.0, ans=0.0 2024-06-20 20:44:29,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=266420.0, ans=0.5 2024-06-20 20:44:29,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=266420.0, ans=0.07 2024-06-20 20:44:35,343 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.851e+02 1.940e+02 2.110e+02 2.656e+02, threshold=3.879e+02, percent-clipped=0.0 2024-06-20 20:44:43,477 INFO [train.py:1028] (0/2) Epoch 15, batch 3700, loss[loss=0.1872, simple_loss=0.2414, pruned_loss=0.06652, over 13266.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2502, pruned_loss=0.07761, over 2584731.37 frames. 
], batch size: 72, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:44:49,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=266475.0, ans=0.2 2024-06-20 20:44:51,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.95 vs. limit=12.0 2024-06-20 20:44:54,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=266475.0, ans=0.0 2024-06-20 20:44:57,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.20 vs. limit=15.0 2024-06-20 20:45:01,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0 2024-06-20 20:45:04,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=266511.6666666667, ans=0.125 2024-06-20 20:45:12,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=266530.0, ans=0.0 2024-06-20 20:45:15,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.65 vs. limit=15.0 2024-06-20 20:45:19,030 INFO [train.py:1028] (0/2) Epoch 15, batch 3750, loss[loss=0.1979, simple_loss=0.2479, pruned_loss=0.07392, over 12568.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2493, pruned_loss=0.07725, over 2586489.23 frames. ], batch size: 22, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:45:20,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.28 vs. limit=15.0 2024-06-20 20:45:23,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=266548.3333333333, ans=0.0 2024-06-20 20:45:36,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=266585.0, ans=0.125 2024-06-20 20:45:45,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2024-06-20 20:45:54,355 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.847e+02 2.007e+02 2.248e+02 3.608e+02, threshold=4.015e+02, percent-clipped=0.0 2024-06-20 20:45:55,761 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.34 vs. limit=15.0 2024-06-20 20:46:09,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266621.6666666667, ans=0.0 2024-06-20 20:46:10,869 INFO [train.py:1028] (0/2) Epoch 15, batch 3800, loss[loss=0.1919, simple_loss=0.2434, pruned_loss=0.07026, over 13211.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2502, pruned_loss=0.0778, over 2584500.16 frames. 
], batch size: 83, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:46:14,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=266640.0, ans=0.2 2024-06-20 20:46:30,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=266676.6666666667, ans=0.125 2024-06-20 20:46:42,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=266695.0, ans=0.125 2024-06-20 20:46:43,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266695.0, ans=0.0 2024-06-20 20:46:46,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.70 vs. limit=15.0 2024-06-20 20:46:57,225 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.63 vs. limit=15.0 2024-06-20 20:47:02,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2024-06-20 20:47:04,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=266731.6666666667, ans=0.04949747468305833 2024-06-20 20:47:04,985 INFO [train.py:1028] (0/2) Epoch 15, batch 3850, loss[loss=0.1875, simple_loss=0.2333, pruned_loss=0.07091, over 13065.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2496, pruned_loss=0.07749, over 2584090.56 frames. ], batch size: 144, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:47:06,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=266731.6666666667, ans=0.5 2024-06-20 20:47:09,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=266731.6666666667, ans=22.5 2024-06-20 20:47:10,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2024-06-20 20:47:14,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=266750.0, ans=0.0 2024-06-20 20:47:33,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2024-06-20 20:47:36,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266786.6666666667, ans=0.1 2024-06-20 20:47:38,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266786.6666666667, ans=0.1 2024-06-20 20:47:40,123 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.804e+02 1.942e+02 2.106e+02 2.490e+02, threshold=3.884e+02, percent-clipped=0.0 2024-06-20 20:47:40,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=266805.0, ans=0.1 2024-06-20 20:47:45,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.42 vs. 
limit=12.0 2024-06-20 20:47:51,250 INFO [train.py:1028] (0/2) Epoch 15, batch 3900, loss[loss=0.2019, simple_loss=0.2528, pruned_loss=0.07551, over 13177.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2494, pruned_loss=0.07771, over 2587798.13 frames. ], batch size: 83, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:47:52,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=266823.3333333333, ans=0.125 2024-06-20 20:48:07,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=266841.6666666667, ans=0.125 2024-06-20 20:48:08,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266860.0, ans=0.1 2024-06-20 20:48:10,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=266860.0, ans=0.0 2024-06-20 20:48:15,838 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:48:17,011 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 20:48:39,457 INFO [train.py:1028] (0/2) Epoch 15, batch 3950, loss[loss=0.216, simple_loss=0.2519, pruned_loss=0.09008, over 13082.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2487, pruned_loss=0.07704, over 2588390.60 frames. ], batch size: 132, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:48:50,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=266933.3333333333, ans=0.0 2024-06-20 20:48:52,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266933.3333333333, ans=0.125 2024-06-20 20:48:59,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=266933.3333333333, ans=0.0 2024-06-20 20:49:01,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2024-06-20 20:49:25,827 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 1.908e+02 2.063e+02 2.324e+02 3.106e+02, threshold=4.127e+02, percent-clipped=0.0 2024-06-20 20:49:26,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=266988.3333333333, ans=0.2 2024-06-20 20:49:33,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=266988.3333333333, ans=0.125 2024-06-20 20:49:36,423 INFO [train.py:1028] (0/2) Epoch 15, batch 4000, loss[loss=0.1922, simple_loss=0.2351, pruned_loss=0.07463, over 12872.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2478, pruned_loss=0.07682, over 2583181.23 frames. 
], batch size: 39, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:49:40,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=267006.6666666667, ans=0.09899494936611666 2024-06-20 20:50:00,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267043.3333333333, ans=0.1 2024-06-20 20:50:08,544 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.38 vs. limit=10.0 2024-06-20 20:50:11,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=267061.6666666667, ans=0.125 2024-06-20 20:50:15,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=267061.6666666667, ans=0.125 2024-06-20 20:50:30,843 INFO [train.py:1028] (0/2) Epoch 15, batch 4050, loss[loss=0.2122, simple_loss=0.2446, pruned_loss=0.08988, over 10907.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2474, pruned_loss=0.0767, over 2581631.65 frames. ], batch size: 304, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:50:36,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267098.3333333333, ans=0.125 2024-06-20 20:50:38,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=267098.3333333333, ans=0.125 2024-06-20 20:50:41,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=267116.6666666667, ans=0.0 2024-06-20 20:50:46,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267116.6666666667, ans=0.125 2024-06-20 20:50:55,372 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2024-06-20 20:51:00,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267153.3333333333, ans=0.1 2024-06-20 20:51:01,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.78 vs. limit=15.0 2024-06-20 20:51:08,895 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.633e+02 1.896e+02 2.070e+02 2.284e+02 3.019e+02, threshold=4.140e+02, percent-clipped=0.0 2024-06-20 20:51:14,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.37 vs. limit=6.0 2024-06-20 20:51:19,748 INFO [train.py:1028] (0/2) Epoch 15, batch 4100, loss[loss=0.2139, simple_loss=0.2536, pruned_loss=0.08705, over 13030.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2483, pruned_loss=0.07754, over 2579369.00 frames. 
], batch size: 102, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:51:21,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=267190.0, ans=0.0 2024-06-20 20:51:38,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=267226.6666666667, ans=0.125 2024-06-20 20:51:55,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=267263.3333333333, ans=0.0 2024-06-20 20:52:10,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2024-06-20 20:52:11,810 INFO [train.py:1028] (0/2) Epoch 15, batch 4150, loss[loss=0.2166, simple_loss=0.2656, pruned_loss=0.08377, over 13157.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2481, pruned_loss=0.07736, over 2576852.03 frames. ], batch size: 55, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:52:31,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. limit=10.0 2024-06-20 20:52:43,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=267336.6666666667, ans=0.125 2024-06-20 20:52:46,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=267336.6666666667, ans=0.09899494936611666 2024-06-20 20:52:47,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2024-06-20 20:52:48,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267336.6666666667, ans=0.1 2024-06-20 20:52:50,989 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.494e+02 1.794e+02 1.959e+02 2.119e+02 3.540e+02, threshold=3.917e+02, percent-clipped=0.0 2024-06-20 20:53:03,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=267355.0, ans=0.125 2024-06-20 20:53:06,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=15.0 2024-06-20 20:53:07,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=15.0 2024-06-20 20:53:08,965 INFO [train.py:1028] (0/2) Epoch 15, batch 4200, loss[loss=0.1885, simple_loss=0.2364, pruned_loss=0.07028, over 13046.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2471, pruned_loss=0.07717, over 2579933.85 frames. ], batch size: 102, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:53:11,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=267373.3333333333, ans=0.125 2024-06-20 20:53:51,848 INFO [train.py:1028] (0/2) Epoch 15, batch 4250, loss[loss=0.1913, simple_loss=0.2408, pruned_loss=0.07086, over 13301.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.247, pruned_loss=0.07694, over 2581649.93 frames. 
], batch size: 46, lr: 3.91e-03, grad_scale: 32.0 2024-06-20 20:53:55,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=267465.0, ans=0.0 2024-06-20 20:54:01,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=267483.3333333333, ans=0.125 2024-06-20 20:54:20,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=267520.0, ans=0.0 2024-06-20 20:54:23,953 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.869e+02 1.988e+02 2.186e+02 3.425e+02, threshold=3.977e+02, percent-clipped=0.0 2024-06-20 20:54:35,066 INFO [train.py:1028] (0/2) Epoch 15, batch 4300, loss[loss=0.2068, simple_loss=0.2515, pruned_loss=0.08103, over 13219.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2471, pruned_loss=0.07707, over 2583023.20 frames. ], batch size: 59, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:54:37,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=267556.6666666667, ans=0.125 2024-06-20 20:54:46,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=15.0 2024-06-20 20:54:50,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2024-06-20 20:54:57,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=267575.0, ans=0.0 2024-06-20 20:55:27,298 INFO [train.py:1028] (0/2) Epoch 15, batch 4350, loss[loss=0.2084, simple_loss=0.2556, pruned_loss=0.08067, over 13202.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2469, pruned_loss=0.07709, over 2586890.89 frames. ], batch size: 59, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:55:53,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=267685.0, ans=0.2 2024-06-20 20:56:10,652 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.816e+02 1.995e+02 2.176e+02 2.755e+02, threshold=3.991e+02, percent-clipped=0.0 2024-06-20 20:56:15,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=267721.6666666667, ans=0.0 2024-06-20 20:56:20,924 INFO [train.py:1028] (0/2) Epoch 15, batch 4400, loss[loss=0.2203, simple_loss=0.2601, pruned_loss=0.09021, over 13240.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.247, pruned_loss=0.07707, over 2586899.39 frames. ], batch size: 83, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:56:22,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=267740.0, ans=0.125 2024-06-20 20:56:25,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.53 vs. 
limit=15.0 2024-06-20 20:56:33,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=267758.3333333333, ans=0.125 2024-06-20 20:56:36,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=267758.3333333333, ans=0.0 2024-06-20 20:56:36,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=267758.3333333333, ans=0.125 2024-06-20 20:56:37,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267776.6666666667, ans=0.1 2024-06-20 20:56:39,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267776.6666666667, ans=0.1 2024-06-20 20:57:04,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=267813.3333333333, ans=0.1 2024-06-20 20:57:05,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=267813.3333333333, ans=0.125 2024-06-20 20:57:08,037 INFO [train.py:1028] (0/2) Epoch 15, batch 4450, loss[loss=0.1829, simple_loss=0.231, pruned_loss=0.06744, over 12881.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2473, pruned_loss=0.07735, over 2581318.79 frames. ], batch size: 33, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:57:17,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=267850.0, ans=0.2 2024-06-20 20:57:25,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.99 vs. limit=15.0 2024-06-20 20:57:25,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.64 vs. limit=5.0 2024-06-20 20:57:26,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267868.3333333333, ans=0.1 2024-06-20 20:57:31,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267868.3333333333, ans=0.1 2024-06-20 20:57:33,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.69 vs. limit=22.5 2024-06-20 20:57:44,570 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.851e+02 1.990e+02 2.169e+02 3.063e+02, threshold=3.981e+02, percent-clipped=0.0 2024-06-20 20:58:03,551 INFO [train.py:1028] (0/2) Epoch 15, batch 4500, loss[loss=0.1929, simple_loss=0.2403, pruned_loss=0.0727, over 13246.00 frames. ], tot_loss[loss=0.201, simple_loss=0.247, pruned_loss=0.07752, over 2585499.97 frames. ], batch size: 89, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:58:05,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=267923.3333333333, ans=0.125 2024-06-20 20:58:06,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. 
limit=15.0 2024-06-20 20:58:09,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2024-06-20 20:58:09,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.93 vs. limit=10.0 2024-06-20 20:58:11,655 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.81 vs. limit=10.0 2024-06-20 20:58:13,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=267941.6666666667, ans=0.025 2024-06-20 20:58:33,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=267978.3333333333, ans=0.125 2024-06-20 20:58:38,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=267978.3333333333, ans=0.125 2024-06-20 20:58:56,033 INFO [train.py:1028] (0/2) Epoch 15, batch 4550, loss[loss=0.184, simple_loss=0.232, pruned_loss=0.06804, over 13246.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2466, pruned_loss=0.07726, over 2589903.54 frames. ], batch size: 52, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:59:02,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=268015.0, ans=0.125 2024-06-20 20:59:13,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=268033.3333333333, ans=0.0 2024-06-20 20:59:20,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2024-06-20 20:59:34,366 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.781e+02 1.882e+02 1.992e+02 2.555e+02, threshold=3.764e+02, percent-clipped=0.0 2024-06-20 20:59:42,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=23.08 vs. limit=22.5 2024-06-20 20:59:44,893 INFO [train.py:1028] (0/2) Epoch 15, batch 4600, loss[loss=0.2245, simple_loss=0.2632, pruned_loss=0.09296, over 12567.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2463, pruned_loss=0.07697, over 2585779.08 frames. ], batch size: 202, lr: 3.90e-03, grad_scale: 32.0 2024-06-20 20:59:46,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=268106.6666666667, ans=0.0 2024-06-20 20:59:49,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=268106.6666666667, ans=0.125 2024-06-20 20:59:54,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=268125.0, ans=0.0 2024-06-20 20:59:56,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.54 vs. limit=15.0 2024-06-20 20:59:58,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. 
limit=15.0 2024-06-20 21:00:07,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=268143.3333333333, ans=0.0 2024-06-20 21:00:32,000 INFO [train.py:1028] (0/2) Epoch 15, batch 4650, loss[loss=0.1918, simple_loss=0.2253, pruned_loss=0.07917, over 13082.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2458, pruned_loss=0.07676, over 2588288.90 frames. ], batch size: 132, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:00:55,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268216.6666666667, ans=0.1 2024-06-20 21:00:56,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2024-06-20 21:01:05,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=268235.0, ans=0.0 2024-06-20 21:01:17,899 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 1.822e+02 1.940e+02 2.148e+02 3.190e+02, threshold=3.880e+02, percent-clipped=0.0 2024-06-20 21:01:19,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=268271.6666666667, ans=0.0 2024-06-20 21:01:28,932 INFO [train.py:1028] (0/2) Epoch 15, batch 4700, loss[loss=0.1911, simple_loss=0.2402, pruned_loss=0.07101, over 12553.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2455, pruned_loss=0.07648, over 2583100.73 frames. ], batch size: 25, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:01:44,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=268308.3333333333, ans=0.0 2024-06-20 21:01:51,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268326.6666666667, ans=0.1 2024-06-20 21:01:59,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=268345.0, ans=0.125 2024-06-20 21:02:03,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.69 vs. limit=22.5 2024-06-20 21:02:03,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268345.0, ans=0.1 2024-06-20 21:02:13,266 INFO [train.py:1028] (0/2) Epoch 15, batch 4750, loss[loss=0.2056, simple_loss=0.2497, pruned_loss=0.08079, over 12513.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2445, pruned_loss=0.07629, over 2580669.66 frames. 
], batch size: 202, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:02:19,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268400.0, ans=0.0 2024-06-20 21:02:21,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=268400.0, ans=0.025 2024-06-20 21:02:35,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=268436.6666666667, ans=0.125 2024-06-20 21:02:39,817 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.868e+02 2.024e+02 2.212e+02 3.155e+02, threshold=4.048e+02, percent-clipped=0.0 2024-06-20 21:02:42,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268455.0, ans=0.1 2024-06-20 21:02:44,516 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2024-06-20 21:02:46,258 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.40 vs. limit=6.0 2024-06-20 21:02:48,165 INFO [train.py:1028] (0/2) Epoch 15, batch 4800, loss[loss=0.2135, simple_loss=0.2612, pruned_loss=0.08285, over 13255.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.245, pruned_loss=0.07654, over 2577524.66 frames. ], batch size: 63, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:02:54,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268473.3333333333, ans=0.0 2024-06-20 21:03:02,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.74 vs. limit=10.0 2024-06-20 21:03:05,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=268510.0, ans=0.125 2024-06-20 21:03:07,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268510.0, ans=0.1 2024-06-20 21:03:21,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=268528.3333333333, ans=0.2 2024-06-20 21:03:21,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=15.0 2024-06-20 21:03:27,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=268546.6666666667, ans=0.0 2024-06-20 21:03:31,542 INFO [train.py:1028] (0/2) Epoch 15, batch 4850, loss[loss=0.1921, simple_loss=0.2382, pruned_loss=0.07294, over 13276.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2446, pruned_loss=0.07636, over 2576484.66 frames. 
], batch size: 89, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:03:37,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=268565.0, ans=0.0 2024-06-20 21:03:44,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=268583.3333333333, ans=0.0 2024-06-20 21:03:45,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=268583.3333333333, ans=0.05 2024-06-20 21:04:08,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 1.849e+02 1.965e+02 2.191e+02 3.139e+02, threshold=3.929e+02, percent-clipped=0.0 2024-06-20 21:04:16,750 INFO [train.py:1028] (0/2) Epoch 15, batch 4900, loss[loss=0.197, simple_loss=0.2518, pruned_loss=0.07115, over 13217.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2446, pruned_loss=0.07599, over 2576858.35 frames. ], batch size: 59, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:04:26,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=268675.0, ans=0.125 2024-06-20 21:04:26,730 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0 2024-06-20 21:04:28,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=268675.0, ans=0.0 2024-06-20 21:04:41,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=268711.6666666667, ans=0.125 2024-06-20 21:04:49,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=268730.0, ans=0.125 2024-06-20 21:04:56,856 INFO [train.py:1028] (0/2) Epoch 15, batch 4950, loss[loss=0.2152, simple_loss=0.2499, pruned_loss=0.09028, over 10988.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2452, pruned_loss=0.0766, over 2570458.82 frames. ], batch size: 304, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:05:10,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268766.6666666667, ans=0.1 2024-06-20 21:05:20,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=268803.3333333333, ans=0.125 2024-06-20 21:05:26,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=268803.3333333333, ans=0.0 2024-06-20 21:05:28,242 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 1.964e+02 2.122e+02 2.352e+02 3.045e+02, threshold=4.244e+02, percent-clipped=0.0 2024-06-20 21:05:28,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=268821.6666666667, ans=0.0 2024-06-20 21:05:34,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.96 vs. 
limit=10.0 2024-06-20 21:05:35,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=268821.6666666667, ans=0.125 2024-06-20 21:05:40,922 INFO [train.py:1028] (0/2) Epoch 15, batch 5000, loss[loss=0.2018, simple_loss=0.2382, pruned_loss=0.0827, over 13170.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2453, pruned_loss=0.07651, over 2574779.74 frames. ], batch size: 95, lr: 3.90e-03, grad_scale: 64.0 2024-06-20 21:05:42,875 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.54 vs. limit=15.0 2024-06-20 21:05:53,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=268858.3333333333, ans=0.125 2024-06-20 21:06:04,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=268876.6666666667, ans=0.125 2024-06-20 21:06:09,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=268895.0, ans=0.05 2024-06-20 21:06:12,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=268895.0, ans=10.0 2024-06-20 21:06:14,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=268913.3333333333, ans=0.1 2024-06-20 21:06:15,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=268913.3333333333, ans=0.0 2024-06-20 21:06:22,595 INFO [train.py:1028] (0/2) Epoch 15, batch 5050, loss[loss=0.192, simple_loss=0.2415, pruned_loss=0.07123, over 12878.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2448, pruned_loss=0.07594, over 2572515.45 frames. ], batch size: 36, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:06:42,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=268968.3333333333, ans=0.0 2024-06-20 21:06:42,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=268968.3333333333, ans=0.125 2024-06-20 21:06:43,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=268968.3333333333, ans=0.125 2024-06-20 21:06:44,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.43 vs. limit=15.0 2024-06-20 21:06:48,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=268968.3333333333, ans=0.125 2024-06-20 21:06:48,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=268968.3333333333, ans=0.07 2024-06-20 21:06:58,291 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.813e+02 1.942e+02 2.089e+02 2.974e+02, threshold=3.885e+02, percent-clipped=0.0 2024-06-20 21:06:59,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=17.10 vs. 
limit=15.0 2024-06-20 21:06:59,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=269005.0, ans=0.025 2024-06-20 21:07:05,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=269005.0, ans=0.0 2024-06-20 21:07:07,980 INFO [train.py:1028] (0/2) Epoch 15, batch 5100, loss[loss=0.2076, simple_loss=0.2596, pruned_loss=0.07778, over 12899.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2453, pruned_loss=0.07658, over 2568298.20 frames. ], batch size: 39, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:07:09,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.90 vs. limit=22.5 2024-06-20 21:07:21,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.03 vs. limit=22.5 2024-06-20 21:07:22,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=269041.6666666667, ans=0.125 2024-06-20 21:07:22,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.08 vs. limit=15.0 2024-06-20 21:07:26,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.37 vs. limit=22.5 2024-06-20 21:07:31,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=269060.0, ans=0.125 2024-06-20 21:07:37,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=269078.3333333333, ans=0.125 2024-06-20 21:07:42,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=269096.6666666667, ans=0.0 2024-06-20 21:07:48,934 INFO [train.py:1028] (0/2) Epoch 15, batch 5150, loss[loss=0.1985, simple_loss=0.2378, pruned_loss=0.07963, over 13060.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2457, pruned_loss=0.07714, over 2570637.68 frames. 
2024-06-20 21:07:52,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=269115.0, ans=0.0 2024-06-20 21:08:07,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=269133.3333333333, ans=0.1 2024-06-20 21:08:09,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=269151.6666666667, ans=0.09899494936611666 2024-06-20 21:08:18,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=269170.0, ans=0.125 2024-06-20 21:08:22,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=269170.0, ans=0.2 2024-06-20 21:08:24,101 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.819e+02 1.957e+02 2.123e+02 3.044e+02, threshold=3.914e+02, percent-clipped=0.0 2024-06-20 21:08:30,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=269188.3333333333, ans=0.0 2024-06-20 21:08:32,720 INFO [train.py:1028] (0/2) Epoch 15, batch 5200, loss[loss=0.2113, simple_loss=0.2456, pruned_loss=0.08855, over 13154.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2452, pruned_loss=0.07688, over 2573549.96 frames. ], batch size: 95, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:08:36,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=269206.6666666667, ans=0.07 2024-06-20 21:08:37,375 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.74 vs. limit=15.0 2024-06-20 21:08:41,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=269225.0, ans=0.2 2024-06-20 21:09:14,372 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.18 vs. limit=6.0 2024-06-20 21:09:16,473 INFO [train.py:1028] (0/2) Epoch 15, batch 5250, loss[loss=0.198, simple_loss=0.2481, pruned_loss=0.07394, over 13284.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2448, pruned_loss=0.07664, over 2570513.30 frames. ], batch size: 52, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:09:40,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs.
limit=15.0 2024-06-20 21:09:45,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=269353.3333333333, ans=0.04949747468305833 2024-06-20 21:09:47,749 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 1.849e+02 1.975e+02 2.214e+02 2.998e+02, threshold=3.951e+02, percent-clipped=0.0 2024-06-20 21:09:49,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=269371.6666666667, ans=0.0 2024-06-20 21:09:56,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=269390.0, ans=0.0 2024-06-20 21:09:56,681 INFO [train.py:1028] (0/2) Epoch 15, batch 5300, loss[loss=0.2024, simple_loss=0.248, pruned_loss=0.07837, over 13015.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2454, pruned_loss=0.07677, over 2566878.23 frames. ], batch size: 144, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:10:14,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2024-06-20 21:10:26,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=269445.0, ans=0.125 2024-06-20 21:10:29,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=269445.0, ans=0.125 2024-06-20 21:10:34,232 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:10:40,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.00 vs. limit=6.0 2024-06-20 21:10:41,921 INFO [train.py:1028] (0/2) Epoch 15, batch 5350, loss[loss=0.2002, simple_loss=0.2596, pruned_loss=0.07045, over 11539.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2448, pruned_loss=0.07637, over 2573962.25 frames. ], batch size: 16, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:10:49,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=269500.0, ans=0.125 2024-06-20 21:10:58,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2024-06-20 21:11:05,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=269518.3333333333, ans=0.025 2024-06-20 21:11:15,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=269536.6666666667, ans=0.125 2024-06-20 21:11:16,948 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.608e+02 1.823e+02 1.901e+02 2.035e+02 3.037e+02, threshold=3.802e+02, percent-clipped=0.0 2024-06-20 21:11:20,516 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2024-06-20 21:11:21,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=269555.0, ans=0.125 2024-06-20 21:11:21,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.21 vs. 
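limit=15.0

Note on the optim.py:487 WARNING entries: each reports the quartiles (apparently min, 25%, 50%, 75%, max) of recently observed gradient norms together with the clip threshold, and throughout this section the threshold equals Clipping_scale times the median (for the entry above, 2.0 * 1.975e+02 = 3.950e+02 against the logged threshold=3.951e+02). A hedged sketch of that relationship; the helper below is illustrative, not the ScaledAdam internals:

```python
import torch

# Illustrative: derive the clip threshold from recent gradient norms
# the way the logged numbers suggest (threshold = clipping_scale * median).
def clip_threshold(recent_grad_norms: torch.Tensor,
                   clipping_scale: float = 2.0) -> float:
    return clipping_scale * recent_grad_norms.median().item()

# percent-clipped then reports the fraction of recent batches whose
# gradient norm exceeded this threshold (0.0 in the entries above).
```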
2024-06-20 21:11:25,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=269573.3333333333, ans=0.0 2024-06-20 21:11:25,743 INFO [train.py:1028] (0/2) Epoch 15, batch 5400, loss[loss=0.21, simple_loss=0.2471, pruned_loss=0.08643, over 12275.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2447, pruned_loss=0.07658, over 2567502.93 frames. ], batch size: 241, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:11:26,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=269573.3333333333, ans=0.0 2024-06-20 21:11:28,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=269573.3333333333, ans=0.0 2024-06-20 21:11:32,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.76 vs. limit=22.5 2024-06-20 21:11:43,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=269610.0, ans=0.0 2024-06-20 21:11:47,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=269610.0, ans=0.125 2024-06-20 21:12:02,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=269646.6666666667, ans=0.0 2024-06-20 21:12:05,809 INFO [train.py:1028] (0/2) Epoch 15, batch 5450, loss[loss=0.185, simple_loss=0.2328, pruned_loss=0.06863, over 12444.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2451, pruned_loss=0.07672, over 2571378.41 frames. ], batch size: 25, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:12:16,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0 2024-06-20 21:12:16,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=269683.3333333333, ans=0.125 2024-06-20 21:12:23,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=269701.6666666667, ans=0.0 2024-06-20 21:12:23,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.85 vs. limit=15.0 2024-06-20 21:12:24,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269701.6666666667, ans=0.1 2024-06-20 21:12:30,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=269720.0, ans=0.0 2024-06-20 21:12:41,297 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 1.821e+02 1.941e+02 2.083e+02 2.803e+02, threshold=3.881e+02, percent-clipped=0.0 2024-06-20 21:12:43,595 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.80 vs. limit=22.5 2024-06-20 21:12:47,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2024-06-20 21:12:50,244 INFO [train.py:1028] (0/2) Epoch 15, batch 5500, loss[loss=0.2151, simple_loss=0.2443, pruned_loss=0.09297, over 12151.00 frames.
], tot_loss[loss=0.1989, simple_loss=0.2446, pruned_loss=0.07657, over 2565446.95 frames. ], batch size: 240, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:12:55,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=269756.6666666667, ans=15.0 2024-06-20 21:13:00,435 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.416e+00 2024-06-20 21:13:03,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.91 vs. limit=22.5 2024-06-20 21:13:26,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=269830.0, ans=0.125 2024-06-20 21:13:29,735 INFO [train.py:1028] (0/2) Epoch 15, batch 5550, loss[loss=0.2022, simple_loss=0.2577, pruned_loss=0.07336, over 13244.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2438, pruned_loss=0.0759, over 2569710.91 frames. ], batch size: 43, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:13:29,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269848.3333333333, ans=0.1 2024-06-20 21:13:36,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=269848.3333333333, ans=0.0 2024-06-20 21:13:41,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0 2024-06-20 21:13:42,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.14 vs. limit=15.0 2024-06-20 21:13:42,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=269866.6666666667, ans=0.0 2024-06-20 21:13:47,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0 2024-06-20 21:14:04,334 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.838e+02 1.948e+02 2.151e+02 3.081e+02, threshold=3.897e+02, percent-clipped=0.0 2024-06-20 21:14:12,992 INFO [train.py:1028] (0/2) Epoch 15, batch 5600, loss[loss=0.2061, simple_loss=0.248, pruned_loss=0.08204, over 13223.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2438, pruned_loss=0.07611, over 2571584.43 frames. 
], batch size: 89, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:14:17,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=269940.0, ans=0.125 2024-06-20 21:14:18,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=269940.0, ans=0.125 2024-06-20 21:14:22,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=269958.3333333333, ans=0.0 2024-06-20 21:14:22,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=269958.3333333333, ans=0.2 2024-06-20 21:14:26,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=269958.3333333333, ans=0.0 2024-06-20 21:14:39,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=269995.0, ans=0.125 2024-06-20 21:14:40,599 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:14:40,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=269995.0, ans=0.0 2024-06-20 21:14:45,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269995.0, ans=0.1 2024-06-20 21:14:48,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=270013.3333333333, ans=0.025 2024-06-20 21:14:54,182 INFO [train.py:1028] (0/2) Epoch 15, batch 5650, loss[loss=0.2197, simple_loss=0.2629, pruned_loss=0.0883, over 12529.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2437, pruned_loss=0.07567, over 2576152.64 frames. ], batch size: 202, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:15:08,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.40 vs. limit=15.0 2024-06-20 21:15:23,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=270086.6666666667, ans=0.125 2024-06-20 21:15:24,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=270086.6666666667, ans=0.0 2024-06-20 21:15:30,244 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.800e+02 1.928e+02 2.106e+02 2.711e+02, threshold=3.856e+02, percent-clipped=0.0 2024-06-20 21:15:31,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=270105.0, ans=0.0 2024-06-20 21:15:33,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270105.0, ans=0.1 2024-06-20 21:15:39,492 INFO [train.py:1028] (0/2) Epoch 15, batch 5700, loss[loss=0.1802, simple_loss=0.2331, pruned_loss=0.06365, over 13233.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2437, pruned_loss=0.07574, over 2579492.89 frames. 
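], batch size: 63, lr: 3.89e-03, grad_scale: 64.0

Note on the scaling.py:214 entries: each ScheduledFloat is a hyper-parameter defined as a piecewise-linear function of the owning module's batch_count, and ans is its current value. A minimal sketch of that interpolation, with made-up breakpoints (the real schedules are set per module in scaling.py):

```python
from typing import Sequence, Tuple

# Breakpoints here are invented for illustration.
def scheduled_float(batch_count: float,
                    schedule: Sequence[Tuple[float, float]] = (
                        (0.0, 0.3), (20000.0, 0.125))) -> float:
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            # linear interpolation between neighbouring breakpoints
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
    return schedule[-1][1]

# Past the last breakpoint the value stays constant, which is why the
# balancer prob entries all read ans=0.125 at batch_count around 2.7e5.
```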
2024-06-20 21:15:42,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=270123.3333333333, ans=0.125 2024-06-20 21:15:53,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270141.6666666667, ans=0.1 2024-06-20 21:15:55,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=270160.0, ans=0.125 2024-06-20 21:16:06,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.13 vs. limit=15.0 2024-06-20 21:16:08,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.18 vs. limit=10.0 2024-06-20 21:16:23,809 INFO [train.py:1028] (0/2) Epoch 15, batch 5750, loss[loss=0.2108, simple_loss=0.2486, pruned_loss=0.08652, over 12792.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2441, pruned_loss=0.07575, over 2578616.35 frames. ], batch size: 176, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:16:24,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=270215.0, ans=0.0 2024-06-20 21:16:24,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=270215.0, ans=0.025 2024-06-20 21:16:26,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270215.0, ans=0.1 2024-06-20 21:16:28,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=270215.0, ans=0.125 2024-06-20 21:16:52,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=270270.0, ans=0.0 2024-06-20 21:16:53,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=270270.0, ans=0.125 2024-06-20 21:16:55,922 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 1.842e+02 1.984e+02 2.208e+02 3.218e+02, threshold=3.967e+02, percent-clipped=0.0 2024-06-20 21:16:58,087 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.01 vs. limit=15.0 2024-06-20 21:17:05,066 INFO [train.py:1028] (0/2) Epoch 15, batch 5800, loss[loss=0.2137, simple_loss=0.262, pruned_loss=0.08264, over 12740.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2459, pruned_loss=0.07676, over 2579045.27 frames. ], batch size: 176, lr: 3.89e-03, grad_scale: 64.0 2024-06-20 21:17:06,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs.
limit=15.0 2024-06-20 21:17:08,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=270306.6666666667, ans=0.0 2024-06-20 21:17:12,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=270325.0, ans=0.025 2024-06-20 21:17:19,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=270325.0, ans=0.125 2024-06-20 21:17:22,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270343.3333333333, ans=0.1 2024-06-20 21:17:23,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=12.0 2024-06-20 21:17:38,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.91 vs. limit=12.0 2024-06-20 21:17:39,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=270361.6666666667, ans=0.125 2024-06-20 21:17:44,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=270380.0, ans=0.0 2024-06-20 21:17:49,391 INFO [train.py:1028] (0/2) Epoch 15, batch 5850, loss[loss=0.2142, simple_loss=0.2569, pruned_loss=0.0858, over 12605.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2478, pruned_loss=0.07756, over 2577627.47 frames. ], batch size: 202, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:18:07,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=270435.0, ans=0.0 2024-06-20 21:18:12,644 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.86 vs. limit=10.0 2024-06-20 21:18:24,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2024-06-20 21:18:25,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=270453.3333333333, ans=0.2 2024-06-20 21:18:25,656 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 1.849e+02 1.988e+02 2.143e+02 3.516e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 21:18:34,550 INFO [train.py:1028] (0/2) Epoch 15, batch 5900, loss[loss=0.2007, simple_loss=0.2439, pruned_loss=0.07878, over 13076.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2495, pruned_loss=0.07807, over 2578012.43 frames. ], batch size: 121, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:18:36,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=270490.0, ans=0.0 2024-06-20 21:18:41,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=270490.0, ans=0.0 2024-06-20 21:18:53,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=270526.6666666667, ans=0.125 2024-06-20 21:19:01,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.78 vs. 
limit=15.0 2024-06-20 21:19:04,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=270545.0, ans=0.125 2024-06-20 21:19:06,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=270563.3333333333, ans=0.0 2024-06-20 21:19:08,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=270563.3333333333, ans=0.125 2024-06-20 21:19:10,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=9.84 vs. limit=12.0 2024-06-20 21:19:15,240 INFO [train.py:1028] (0/2) Epoch 15, batch 5950, loss[loss=0.2235, simple_loss=0.2699, pruned_loss=0.08855, over 13109.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2509, pruned_loss=0.07882, over 2583075.76 frames. ], batch size: 121, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:19:15,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=8.41 vs. limit=12.0 2024-06-20 21:19:24,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=270600.0, ans=0.125 2024-06-20 21:19:45,167 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.948e+02 2.080e+02 2.302e+02 2.912e+02, threshold=4.160e+02, percent-clipped=0.0 2024-06-20 21:19:54,022 INFO [train.py:1028] (0/2) Epoch 15, batch 6000, loss[loss=0.2658, simple_loss=0.2969, pruned_loss=0.1174, over 12278.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2532, pruned_loss=0.07996, over 2575424.21 frames. ], batch size: 241, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:19:54,023 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 21:20:04,534 INFO [train.py:1060] (0/2) Epoch 15, validation: loss=0.1895, simple_loss=0.2543, pruned_loss=0.06236, over 351949.00 frames. 2024-06-20 21:20:04,534 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 21:20:10,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=270673.3333333333, ans=0.1 2024-06-20 21:20:15,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2024-06-20 21:20:21,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.03 vs. limit=15.0 2024-06-20 21:20:35,599 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.238e-01 2024-06-20 21:20:41,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=270746.6666666667, ans=0.0 2024-06-20 21:20:45,895 INFO [train.py:1028] (0/2) Epoch 15, batch 6050, loss[loss=0.2068, simple_loss=0.2548, pruned_loss=0.07935, over 12906.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2545, pruned_loss=0.08017, over 2578239.53 frames. 
], batch size: 39, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:20:48,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=270765.0, ans=0.0 2024-06-20 21:21:13,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=270820.0, ans=0.09899494936611666 2024-06-20 21:21:21,646 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.909e+02 2.013e+02 2.202e+02 2.856e+02, threshold=4.025e+02, percent-clipped=0.0 2024-06-20 21:21:21,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=270838.3333333333, ans=0.0 2024-06-20 21:21:30,673 INFO [train.py:1028] (0/2) Epoch 15, batch 6100, loss[loss=0.2009, simple_loss=0.2515, pruned_loss=0.07509, over 13099.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2556, pruned_loss=0.08052, over 2579864.77 frames. ], batch size: 121, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:21:31,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2024-06-20 21:21:32,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=270856.6666666667, ans=0.0 2024-06-20 21:21:49,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=270893.3333333333, ans=0.125 2024-06-20 21:21:51,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=270893.3333333333, ans=0.125 2024-06-20 21:21:57,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2024-06-20 21:22:10,988 INFO [train.py:1028] (0/2) Epoch 15, batch 6150, loss[loss=0.2094, simple_loss=0.2452, pruned_loss=0.08678, over 10953.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2572, pruned_loss=0.08138, over 2577819.80 frames. ], batch size: 304, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:22:14,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.66 vs. limit=22.5 2024-06-20 21:22:20,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.94 vs. limit=22.5 2024-06-20 21:22:20,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.97 vs. 
limit=10.0 2024-06-20 21:22:24,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=270966.6666666667, ans=0.1 2024-06-20 21:22:38,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=271003.3333333333, ans=0.125 2024-06-20 21:22:45,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=271003.3333333333, ans=0.125 2024-06-20 21:22:46,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.925e+02 2.140e+02 2.560e+02 4.089e+02, threshold=4.279e+02, percent-clipped=1.0 2024-06-20 21:22:55,151 INFO [train.py:1028] (0/2) Epoch 15, batch 6200, loss[loss=0.2418, simple_loss=0.2824, pruned_loss=0.1006, over 13228.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.259, pruned_loss=0.08204, over 2575712.00 frames. ], batch size: 89, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:23:25,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=271095.0, ans=0.025 2024-06-20 21:23:30,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271095.0, ans=0.1 2024-06-20 21:23:37,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=271113.3333333333, ans=0.0 2024-06-20 21:23:39,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=271131.6666666667, ans=0.0 2024-06-20 21:23:40,267 INFO [train.py:1028] (0/2) Epoch 15, batch 6250, loss[loss=0.2231, simple_loss=0.2621, pruned_loss=0.09209, over 13245.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.26, pruned_loss=0.08243, over 2569590.84 frames. ], batch size: 83, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:23:42,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.33 vs. limit=22.5 2024-06-20 21:23:53,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=271150.0, ans=0.125 2024-06-20 21:24:08,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=271186.6666666667, ans=0.0 2024-06-20 21:24:10,610 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 1.994e+02 2.192e+02 2.660e+02 4.379e+02, threshold=4.384e+02, percent-clipped=1.0 2024-06-20 21:24:13,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=271205.0, ans=0.2 2024-06-20 21:24:17,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=12.0 2024-06-20 21:24:19,106 INFO [train.py:1028] (0/2) Epoch 15, batch 6300, loss[loss=0.1744, simple_loss=0.2218, pruned_loss=0.06347, over 11956.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2607, pruned_loss=0.0827, over 2564784.54 frames. ], batch size: 17, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:24:20,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.92 vs. 
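limit=22.5

Note on the scaling.py:1023 Whitening entries: the logged metric measures how far a module's feature covariance is from a multiple of the identity; it is 1.0 for perfectly "white" activations and grows with the eigenvalue spread, so a reading like 21.92 vs. limit=22.5 sits close to the bound. A sketch that mirrors the idea of the whitening metric rather than reproducing the scaling.py code:

```python
import torch

# Illustrative: 1.0 when cov(x) is proportional to the identity,
# larger as the covariance eigenvalues spread apart.
def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

# When the metric exceeds the module's limit, the Whiten module applies
# a penalty that pushes the covariance back toward isotropy.
```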
2024-06-20 21:24:23,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=271223.3333333333, ans=0.0 2024-06-20 21:24:40,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=271260.0, ans=0.0 2024-06-20 21:24:52,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=271278.3333333333, ans=0.0 2024-06-20 21:24:56,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=271296.6666666667, ans=0.0 2024-06-20 21:25:01,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=271296.6666666667, ans=0.2 2024-06-20 21:25:02,613 INFO [train.py:1028] (0/2) Epoch 15, batch 6350, loss[loss=0.2392, simple_loss=0.2851, pruned_loss=0.09663, over 12465.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2618, pruned_loss=0.08272, over 2573840.71 frames. ], batch size: 202, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:25:07,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=271315.0, ans=0.2 2024-06-20 21:25:07,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=12.0 2024-06-20 21:25:09,242 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-148000.pt 2024-06-20 21:25:32,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=271370.0, ans=0.125 2024-06-20 21:25:32,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=271370.0, ans=0.05 2024-06-20 21:25:38,077 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.903e+02 2.037e+02 2.257e+02 3.021e+02, threshold=4.074e+02, percent-clipped=0.0 2024-06-20 21:25:44,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.81 vs. limit=15.0 2024-06-20 21:25:46,691 INFO [train.py:1028] (0/2) Epoch 15, batch 6400, loss[loss=0.2312, simple_loss=0.2843, pruned_loss=0.08905, over 13249.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2636, pruned_loss=0.08348, over 2574664.61 frames. ], batch size: 67, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:26:00,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=271425.0, ans=0.0 2024-06-20 21:26:28,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=271480.0, ans=0.2 2024-06-20 21:26:29,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=271498.3333333333, ans=0.125 2024-06-20 21:26:29,844 INFO [train.py:1028] (0/2) Epoch 15, batch 6450, loss[loss=0.2377, simple_loss=0.279, pruned_loss=0.09825, over 12544.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2651, pruned_loss=0.08412, over 2579741.12 frames.
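], batch size: 202, lr: 3.88e-03, grad_scale: 64.0

Note on the checkpoint.py:75 entry above (checkpoint-148000.pt): with periodic saving every 4000 batches, a batch-indexed checkpoint is written whenever the global batch counter reaches a multiple of 4000 (148000 = 37 * 4000). A small illustrative helper, assuming that naming rule rather than quoting the icefall checkpoint.py API:

```python
from pathlib import Path
from typing import Optional

# Illustrative naming rule for the periodic checkpoints seen above.
def maybe_checkpoint_path(exp_dir: Path, batch_idx_train: int,
                          save_every_n: int = 4000) -> Optional[Path]:
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        return exp_dir / f"checkpoint-{batch_idx_train}.pt"
    return None

assert maybe_checkpoint_path(Path("zipformer/exp"), 148000) == Path(
    "zipformer/exp/checkpoint-148000.pt")
```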
2024-06-20 21:26:47,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=271535.0, ans=0.125 2024-06-20 21:26:54,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=271553.3333333333, ans=0.125 2024-06-20 21:26:55,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=271553.3333333333, ans=0.2 2024-06-20 21:27:01,187 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.012e+02 2.165e+02 2.335e+02 3.505e+02, threshold=4.330e+02, percent-clipped=0.0 2024-06-20 21:27:10,228 INFO [train.py:1028] (0/2) Epoch 15, batch 6500, loss[loss=0.2229, simple_loss=0.2627, pruned_loss=0.09159, over 10720.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2668, pruned_loss=0.08459, over 2582818.61 frames. ], batch size: 303, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:27:15,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=271590.0, ans=0.2 2024-06-20 21:27:34,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=271626.6666666667, ans=0.125 2024-06-20 21:27:41,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=271645.0, ans=0.0 2024-06-20 21:27:45,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=271663.3333333333, ans=0.02 2024-06-20 21:27:46,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=271663.3333333333, ans=0.125 2024-06-20 21:27:47,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=271663.3333333333, ans=0.0 2024-06-20 21:27:48,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=271663.3333333333, ans=0.125 2024-06-20 21:27:52,998 INFO [train.py:1028] (0/2) Epoch 15, batch 6550, loss[loss=0.2086, simple_loss=0.267, pruned_loss=0.07505, over 12733.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2684, pruned_loss=0.08525, over 2586628.09 frames. ], batch size: 22, lr: 3.88e-03, grad_scale: 64.0 2024-06-20 21:27:59,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.11 vs.
limit=22.5 2024-06-20 21:28:03,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=271700.0, ans=0.1 2024-06-20 21:28:06,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=271700.0, ans=0.125 2024-06-20 21:28:16,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=271736.6666666667, ans=0.125 2024-06-20 21:28:20,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=271736.6666666667, ans=0.125 2024-06-20 21:28:24,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0 2024-06-20 21:28:24,383 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.969e+02 2.084e+02 2.276e+02 3.903e+02, threshold=4.168e+02, percent-clipped=0.0 2024-06-20 21:28:28,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.10 vs. limit=15.0 2024-06-20 21:28:36,615 INFO [train.py:1028] (0/2) Epoch 15, batch 6600, loss[loss=0.2327, simple_loss=0.2789, pruned_loss=0.09328, over 13211.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2695, pruned_loss=0.08589, over 2589068.64 frames. ], batch size: 72, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:28:43,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=271773.3333333333, ans=0.125 2024-06-20 21:29:04,298 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:29:05,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=271828.3333333333, ans=0.0 2024-06-20 21:29:18,240 INFO [train.py:1028] (0/2) Epoch 15, batch 6650, loss[loss=0.2749, simple_loss=0.3172, pruned_loss=0.1163, over 12890.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2705, pruned_loss=0.08606, over 2585118.93 frames. ], batch size: 158, lr: 3.87e-03, grad_scale: 128.0 2024-06-20 21:29:19,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=271865.0, ans=0.125 2024-06-20 21:29:20,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.31 vs. limit=15.0 2024-06-20 21:29:20,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=271865.0, ans=0.125 2024-06-20 21:29:29,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=271883.3333333333, ans=0.125 2024-06-20 21:29:36,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=271901.6666666667, ans=0.1 2024-06-20 21:29:37,230 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.41 vs. 
limit=15.0 2024-06-20 21:29:40,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271901.6666666667, ans=0.1 2024-06-20 21:29:42,182 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:29:45,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=271920.0, ans=0.0 2024-06-20 21:29:48,870 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.678e+02 2.001e+02 2.220e+02 2.440e+02 3.486e+02, threshold=4.441e+02, percent-clipped=0.0 2024-06-20 21:30:01,836 INFO [train.py:1028] (0/2) Epoch 15, batch 6700, loss[loss=0.2468, simple_loss=0.2847, pruned_loss=0.1045, over 12776.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2716, pruned_loss=0.08659, over 2584419.61 frames. ], batch size: 176, lr: 3.87e-03, grad_scale: 128.0 2024-06-20 21:30:01,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=271956.6666666667, ans=10.0 2024-06-20 21:30:05,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271956.6666666667, ans=0.125 2024-06-20 21:30:09,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=271975.0, ans=0.125 2024-06-20 21:30:11,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=271975.0, ans=0.5 2024-06-20 21:30:12,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=271975.0, ans=0.125 2024-06-20 21:30:22,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=271993.3333333333, ans=0.125 2024-06-20 21:30:27,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=8.92 vs. limit=12.0 2024-06-20 21:30:37,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=272030.0, ans=0.125 2024-06-20 21:30:39,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272030.0, ans=0.1 2024-06-20 21:30:42,995 INFO [train.py:1028] (0/2) Epoch 15, batch 6750, loss[loss=0.2739, simple_loss=0.3067, pruned_loss=0.1206, over 12282.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2725, pruned_loss=0.08702, over 2578902.21 frames. ], batch size: 241, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:30:56,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=272066.6666666667, ans=0.2 2024-06-20 21:30:58,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.72 vs. 
limit=22.5 2024-06-20 21:31:18,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=272121.6666666667, ans=0.125 2024-06-20 21:31:18,996 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.979e+02 2.127e+02 2.240e+02 3.296e+02, threshold=4.254e+02, percent-clipped=0.0 2024-06-20 21:31:27,001 INFO [train.py:1028] (0/2) Epoch 15, batch 6800, loss[loss=0.2386, simple_loss=0.282, pruned_loss=0.0976, over 13239.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2739, pruned_loss=0.08728, over 2580676.40 frames. ], batch size: 67, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:31:33,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.98 vs. limit=12.0 2024-06-20 21:31:34,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=272158.3333333333, ans=0.125 2024-06-20 21:31:46,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2024-06-20 21:32:06,743 INFO [train.py:1028] (0/2) Epoch 15, batch 6850, loss[loss=0.2433, simple_loss=0.3036, pruned_loss=0.09151, over 13234.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2745, pruned_loss=0.08707, over 2585183.43 frames. ], batch size: 63, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:32:07,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.49 vs. limit=15.0 2024-06-20 21:32:08,270 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0 2024-06-20 21:32:10,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=272231.6666666667, ans=0.0 2024-06-20 21:32:13,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=272231.6666666667, ans=0.125 2024-06-20 21:32:14,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=272250.0, ans=0.125 2024-06-20 21:32:16,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=272250.0, ans=0.0 2024-06-20 21:32:17,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=272250.0, ans=0.0 2024-06-20 21:32:19,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=272250.0, ans=0.0 2024-06-20 21:32:21,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.28 vs. 
limit=15.0 2024-06-20 21:32:29,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=272286.6666666667, ans=0.0 2024-06-20 21:32:30,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=272286.6666666667, ans=0.025 2024-06-20 21:32:42,203 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 1.945e+02 2.117e+02 2.305e+02 3.142e+02, threshold=4.235e+02, percent-clipped=0.0 2024-06-20 21:32:50,646 INFO [train.py:1028] (0/2) Epoch 15, batch 6900, loss[loss=0.2308, simple_loss=0.2843, pruned_loss=0.08859, over 13243.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2752, pruned_loss=0.08714, over 2586551.72 frames. ], batch size: 49, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:32:56,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=272323.3333333333, ans=0.5 2024-06-20 21:33:03,713 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2024-06-20 21:33:11,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=272360.0, ans=0.0 2024-06-20 21:33:12,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272360.0, ans=0.1 2024-06-20 21:33:13,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.20 vs. limit=15.0 2024-06-20 21:33:18,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=272378.3333333333, ans=0.125 2024-06-20 21:33:23,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=272396.6666666667, ans=0.125 2024-06-20 21:33:27,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=272396.6666666667, ans=0.125 2024-06-20 21:33:35,307 INFO [train.py:1028] (0/2) Epoch 15, batch 6950, loss[loss=0.1926, simple_loss=0.2497, pruned_loss=0.06777, over 11321.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2756, pruned_loss=0.08705, over 2581153.25 frames. ], batch size: 16, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:33:35,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=272415.0, ans=0.125 2024-06-20 21:33:55,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=272451.6666666667, ans=0.2 2024-06-20 21:33:59,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.72 vs. 
limit=15.0 2024-06-20 21:34:03,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=272470.0, ans=0.125 2024-06-20 21:34:08,473 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 1.996e+02 2.117e+02 2.330e+02 3.249e+02, threshold=4.233e+02, percent-clipped=0.0 2024-06-20 21:34:16,134 INFO [train.py:1028] (0/2) Epoch 15, batch 7000, loss[loss=0.2263, simple_loss=0.2731, pruned_loss=0.08975, over 12940.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2752, pruned_loss=0.0869, over 2576528.54 frames. ], batch size: 158, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:34:19,299 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:34:31,833 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 21:34:31,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=272543.3333333333, ans=0.125 2024-06-20 21:34:41,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=15.0 2024-06-20 21:34:49,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.81 vs. limit=15.0 2024-06-20 21:34:54,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.31 vs. limit=22.5 2024-06-20 21:34:57,126 INFO [train.py:1028] (0/2) Epoch 15, batch 7050, loss[loss=0.236, simple_loss=0.2808, pruned_loss=0.09561, over 12672.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2766, pruned_loss=0.08757, over 2583482.14 frames. ], batch size: 176, lr: 3.87e-03, grad_scale: 64.0 2024-06-20 21:35:04,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=272598.3333333333, ans=0.125 2024-06-20 21:35:05,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=272598.3333333333, ans=0.0 2024-06-20 21:35:17,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=272635.0, ans=0.125 2024-06-20 21:35:18,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=272635.0, ans=0.2 2024-06-20 21:35:32,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.992e+02 2.131e+02 2.309e+02 2.932e+02, threshold=4.261e+02, percent-clipped=0.0 2024-06-20 21:35:37,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=15.0 2024-06-20 21:35:40,716 INFO [train.py:1028] (0/2) Epoch 15, batch 7100, loss[loss=0.2419, simple_loss=0.2949, pruned_loss=0.09449, over 13182.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2776, pruned_loss=0.08817, over 2575976.00 frames. 
], batch size: 112, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:35:40,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=272690.0, ans=0.0
2024-06-20 21:35:43,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.31 vs. limit=22.5
2024-06-20 21:35:45,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=272690.0, ans=0.125
2024-06-20 21:35:47,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=272690.0, ans=0.05
2024-06-20 21:35:48,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=272708.3333333333, ans=0.0
2024-06-20 21:35:51,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=272708.3333333333, ans=0.125
2024-06-20 21:36:02,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272726.6666666667, ans=0.1
2024-06-20 21:36:07,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272745.0, ans=0.1
2024-06-20 21:36:11,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272745.0, ans=0.0
2024-06-20 21:36:12,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=272745.0, ans=0.0
2024-06-20 21:36:13,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=272745.0, ans=0.0
2024-06-20 21:36:14,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=272745.0, ans=0.025
2024-06-20 21:36:14,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.41 vs. limit=22.5
2024-06-20 21:36:19,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=272763.3333333333, ans=0.2
2024-06-20 21:36:23,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272781.6666666667, ans=0.0
2024-06-20 21:36:24,198 INFO [train.py:1028] (0/2) Epoch 15, batch 7150, loss[loss=0.2677, simple_loss=0.3109, pruned_loss=0.1122, over 12531.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.278, pruned_loss=0.08816, over 2574550.99 frames. ], batch size: 202, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:36:24,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=272781.6666666667, ans=0.5
2024-06-20 21:36:36,060 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.024e+01
2024-06-20 21:36:56,617 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.969e+02 2.138e+02 2.340e+02 3.254e+02, threshold=4.275e+02, percent-clipped=0.0
2024-06-20 21:36:57,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=272855.0, ans=0.1
2024-06-20 21:37:04,542 INFO [train.py:1028] (0/2) Epoch 15, batch 7200, loss[loss=0.2335, simple_loss=0.2842, pruned_loss=0.09142, over 13194.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2796, pruned_loss=0.08874, over 2579595.51 frames. ], batch size: 112, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:37:07,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=272873.3333333333, ans=0.2
2024-06-20 21:37:34,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0
2024-06-20 21:37:34,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.35 vs. limit=15.0
2024-06-20 21:37:48,810 INFO [train.py:1028] (0/2) Epoch 15, batch 7250, loss[loss=0.2166, simple_loss=0.2818, pruned_loss=0.0757, over 12930.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2801, pruned_loss=0.08874, over 2581369.27 frames. ], batch size: 36, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:37:50,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.04 vs. limit=15.0
2024-06-20 21:37:56,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272983.3333333333, ans=0.1
2024-06-20 21:38:04,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=273001.6666666667, ans=0.025
2024-06-20 21:38:06,100 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:38:08,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=273001.6666666667, ans=0.025
2024-06-20 21:38:09,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=273001.6666666667, ans=0.5
2024-06-20 21:38:20,449 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=12.0
2024-06-20 21:38:20,693 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 2.004e+02 2.165e+02 2.430e+02 2.934e+02, threshold=4.330e+02, percent-clipped=0.0
2024-06-20 21:38:20,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273038.3333333333, ans=0.1
2024-06-20 21:38:26,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=273038.3333333333, ans=0.125
2024-06-20 21:38:28,231 INFO [train.py:1028] (0/2) Epoch 15, batch 7300, loss[loss=0.197, simple_loss=0.2603, pruned_loss=0.06686, over 12954.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.281, pruned_loss=0.08903, over 2580854.66 frames. ], batch size: 36, lr: 3.87e-03, grad_scale: 64.0
2024-06-20 21:38:33,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=273056.6666666667, ans=0.0
2024-06-20 21:38:54,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=273093.3333333333, ans=0.125
2024-06-20 21:39:03,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=273111.6666666667, ans=0.0
2024-06-20 21:39:09,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=273130.0, ans=0.025
2024-06-20 21:39:09,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.59 vs. limit=15.0
2024-06-20 21:39:10,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.35 vs. limit=22.5
2024-06-20 21:39:12,288 INFO [train.py:1028] (0/2) Epoch 15, batch 7350, loss[loss=0.225, simple_loss=0.2875, pruned_loss=0.08122, over 13274.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2812, pruned_loss=0.08907, over 2581336.07 frames. ], batch size: 46, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:39:25,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=273166.6666666667, ans=0.125
2024-06-20 21:39:32,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=273185.0, ans=0.0
2024-06-20 21:39:43,949 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.976e+02 2.081e+02 2.234e+02 3.296e+02, threshold=4.163e+02, percent-clipped=0.0
2024-06-20 21:39:51,899 INFO [train.py:1028] (0/2) Epoch 15, batch 7400, loss[loss=0.2676, simple_loss=0.3128, pruned_loss=0.1112, over 13259.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2813, pruned_loss=0.08927, over 2586492.03 frames. ], batch size: 63, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:39:56,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=273240.0, ans=0.0
2024-06-20 21:39:57,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=273240.0, ans=0.125
2024-06-20 21:39:59,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=12.0
2024-06-20 21:40:00,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273258.3333333333, ans=0.0
2024-06-20 21:40:02,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.76 vs. limit=22.5
2024-06-20 21:40:04,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=273258.3333333333, ans=0.2
2024-06-20 21:40:15,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=273276.6666666667, ans=0.125
2024-06-20 21:40:21,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=273295.0, ans=0.07
2024-06-20 21:40:23,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=273295.0, ans=0.125
2024-06-20 21:40:32,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=273313.3333333333, ans=0.0
2024-06-20 21:40:36,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.68 vs. limit=15.0
2024-06-20 21:40:36,604 INFO [train.py:1028] (0/2) Epoch 15, batch 7450, loss[loss=0.217, simple_loss=0.2828, pruned_loss=0.07564, over 12608.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2814, pruned_loss=0.08908, over 2580583.42 frames. ], batch size: 29, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:40:42,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=273331.6666666667, ans=0.125
2024-06-20 21:40:42,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.21 vs. limit=12.0
2024-06-20 21:40:45,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.98 vs. limit=10.0
2024-06-20 21:40:46,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=273350.0, ans=0.125
2024-06-20 21:40:46,933 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:40:48,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273350.0, ans=0.0
2024-06-20 21:41:06,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=273386.6666666667, ans=0.0
2024-06-20 21:41:13,585 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.136e+02 2.454e+02 2.818e+02 4.281e+02, threshold=4.907e+02, percent-clipped=1.0
2024-06-20 21:41:20,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=273405.0, ans=0.0
2024-06-20 21:41:21,780 INFO [train.py:1028] (0/2) Epoch 15, batch 7500, loss[loss=0.23, simple_loss=0.271, pruned_loss=0.09445, over 10792.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2824, pruned_loss=0.08975, over 2577459.19 frames. ], batch size: 303, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:41:27,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=273423.3333333333, ans=0.125
2024-06-20 21:41:33,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=273441.6666666667, ans=0.125
2024-06-20 21:41:40,325 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:41:43,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=273460.0, ans=0.025
2024-06-20 21:41:48,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=273478.3333333333, ans=0.125
2024-06-20 21:41:52,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=273496.6666666667, ans=0.0
2024-06-20 21:41:52,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.73 vs. limit=15.0
2024-06-20 21:42:01,026 INFO [train.py:1028] (0/2) Epoch 15, batch 7550, loss[loss=0.2279, simple_loss=0.2725, pruned_loss=0.09163, over 12948.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2833, pruned_loss=0.0904, over 2577085.97 frames. ], batch size: 158, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:42:27,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=273551.6666666667, ans=0.125
2024-06-20 21:42:34,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=273570.0, ans=0.2
2024-06-20 21:42:37,476 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 1.971e+02 2.074e+02 2.300e+02 2.935e+02, threshold=4.147e+02, percent-clipped=0.0
2024-06-20 21:42:40,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=273588.3333333333, ans=0.0
2024-06-20 21:42:45,430 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5
2024-06-20 21:42:45,769 INFO [train.py:1028] (0/2) Epoch 15, batch 7600, loss[loss=0.2411, simple_loss=0.2926, pruned_loss=0.09485, over 13276.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2838, pruned_loss=0.09062, over 2576135.93 frames. ], batch size: 83, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:42:45,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273606.6666666667, ans=0.1
2024-06-20 21:42:54,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.70 vs. limit=22.5
2024-06-20 21:42:56,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=273625.0, ans=22.5
2024-06-20 21:43:04,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=273643.3333333333, ans=0.125
2024-06-20 21:43:04,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.94 vs. limit=15.0
2024-06-20 21:43:13,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=273661.6666666667, ans=0.2
2024-06-20 21:43:30,487 INFO [train.py:1028] (0/2) Epoch 15, batch 7650, loss[loss=0.2224, simple_loss=0.2772, pruned_loss=0.08386, over 12911.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2837, pruned_loss=0.09061, over 2571932.99 frames. ], batch size: 33, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:43:50,459 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:43:53,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=273735.0, ans=0.125
2024-06-20 21:44:00,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.63 vs. limit=22.5
2024-06-20 21:44:04,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.018e+02 2.299e+02 2.611e+02 3.832e+02, threshold=4.598e+02, percent-clipped=0.0
2024-06-20 21:44:05,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273771.6666666667, ans=0.1
2024-06-20 21:44:11,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=273790.0, ans=0.1
2024-06-20 21:44:12,012 INFO [train.py:1028] (0/2) Epoch 15, batch 7700, loss[loss=0.2375, simple_loss=0.2978, pruned_loss=0.08863, over 13211.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2842, pruned_loss=0.09066, over 2568814.33 frames. ], batch size: 63, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:44:13,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=273790.0, ans=0.0
2024-06-20 21:44:15,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.87 vs. limit=15.0
2024-06-20 21:44:20,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=273808.3333333333, ans=0.2
2024-06-20 21:44:25,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=273808.3333333333, ans=0.025
2024-06-20 21:44:29,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=273826.6666666667, ans=0.0
2024-06-20 21:44:35,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=273845.0, ans=0.125
2024-06-20 21:44:54,749 INFO [train.py:1028] (0/2) Epoch 15, batch 7750, loss[loss=0.2406, simple_loss=0.2996, pruned_loss=0.09081, over 13183.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.285, pruned_loss=0.09094, over 2573768.04 frames. ], batch size: 72, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:44:55,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=273881.6666666667, ans=0.125
2024-06-20 21:45:01,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=273881.6666666667, ans=0.04949747468305833
2024-06-20 21:45:21,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=273936.6666666667, ans=0.0
2024-06-20 21:45:22,228 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 21:45:26,858 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.023e+02 2.214e+02 2.393e+02 2.988e+02, threshold=4.429e+02, percent-clipped=0.0
2024-06-20 21:45:34,898 INFO [train.py:1028] (0/2) Epoch 15, batch 7800, loss[loss=0.2421, simple_loss=0.2922, pruned_loss=0.09596, over 13143.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.285, pruned_loss=0.09106, over 2578582.88 frames. ], batch size: 95, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:45:36,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0
2024-06-20 21:45:36,968 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0
2024-06-20 21:45:37,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=273973.3333333333, ans=0.125
2024-06-20 21:45:44,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=273973.3333333333, ans=0.0
2024-06-20 21:45:45,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=273973.3333333333, ans=0.125
2024-06-20 21:45:52,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=273991.6666666667, ans=0.125
2024-06-20 21:45:58,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=274010.0, ans=0.125
2024-06-20 21:45:59,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=274010.0, ans=0.0
2024-06-20 21:46:16,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274046.6666666667, ans=0.1
2024-06-20 21:46:16,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=274046.6666666667, ans=0.2
2024-06-20 21:46:20,029 INFO [train.py:1028] (0/2) Epoch 15, batch 7850, loss[loss=0.2214, simple_loss=0.2718, pruned_loss=0.08551, over 11874.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2856, pruned_loss=0.09128, over 2572809.19 frames. ], batch size: 17, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:46:28,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=274083.3333333333, ans=0.125
2024-06-20 21:46:33,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.85 vs. limit=15.0
2024-06-20 21:46:41,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=274101.6666666667, ans=0.125
2024-06-20 21:46:49,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=274120.0, ans=0.05
2024-06-20 21:46:51,510 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.018e+02 2.217e+02 2.411e+02 3.643e+02, threshold=4.434e+02, percent-clipped=0.0
2024-06-20 21:47:03,323 INFO [train.py:1028] (0/2) Epoch 15, batch 7900, loss[loss=0.2319, simple_loss=0.2887, pruned_loss=0.08756, over 13197.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.286, pruned_loss=0.09166, over 2572715.13 frames. ], batch size: 77, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:47:15,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5
2024-06-20 21:47:18,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=274193.3333333333, ans=0.0
2024-06-20 21:47:32,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0
2024-06-20 21:47:37,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=274230.0, ans=0.125
2024-06-20 21:47:39,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=274230.0, ans=0.125
2024-06-20 21:47:42,858 INFO [train.py:1028] (0/2) Epoch 15, batch 7950, loss[loss=0.2643, simple_loss=0.2914, pruned_loss=0.1186, over 10687.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.287, pruned_loss=0.09221, over 2575798.98 frames. ], batch size: 304, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:47:45,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=274248.3333333333, ans=0.125
2024-06-20 21:47:45,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=274248.3333333333, ans=0.0
2024-06-20 21:48:06,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=274285.0, ans=0.125
2024-06-20 21:48:07,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=274285.0, ans=0.125
2024-06-20 21:48:07,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=274285.0, ans=0.125
2024-06-20 21:48:14,042 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.15 vs. limit=15.0
2024-06-20 21:48:16,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=274303.3333333333, ans=0.2
2024-06-20 21:48:19,237 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 1.992e+02 2.217e+02 2.442e+02 3.835e+02, threshold=4.433e+02, percent-clipped=0.0
2024-06-20 21:48:24,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=274321.6666666667, ans=0.125
2024-06-20 21:48:27,233 INFO [train.py:1028] (0/2) Epoch 15, batch 8000, loss[loss=0.2136, simple_loss=0.2747, pruned_loss=0.07625, over 12647.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2873, pruned_loss=0.09179, over 2572795.81 frames. ], batch size: 29, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:48:28,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=274340.0, ans=0.125
2024-06-20 21:48:32,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=274340.0, ans=0.04949747468305833
2024-06-20 21:48:41,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=274376.6666666667, ans=0.025
2024-06-20 21:48:44,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=274376.6666666667, ans=0.0
2024-06-20 21:48:45,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=274376.6666666667, ans=0.07
2024-06-20 21:49:07,651 INFO [train.py:1028] (0/2) Epoch 15, batch 8050, loss[loss=0.253, simple_loss=0.3022, pruned_loss=0.1019, over 13151.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2873, pruned_loss=0.09185, over 2572506.80 frames. ], batch size: 83, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:49:18,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=274450.0, ans=0.0
2024-06-20 21:49:36,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=274486.6666666667, ans=0.125
2024-06-20 21:49:37,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=274486.6666666667, ans=0.0
2024-06-20 21:49:42,909 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.137e+02 2.396e+02 2.645e+02 4.063e+02, threshold=4.791e+02, percent-clipped=0.0
2024-06-20 21:49:50,763 INFO [train.py:1028] (0/2) Epoch 15, batch 8100, loss[loss=0.2286, simple_loss=0.2843, pruned_loss=0.08648, over 13158.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2874, pruned_loss=0.09168, over 2576934.60 frames. ], batch size: 112, lr: 3.86e-03, grad_scale: 64.0
2024-06-20 21:50:11,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=274560.0, ans=0.1
2024-06-20 21:50:18,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274578.3333333333, ans=0.125
2024-06-20 21:50:31,415 INFO [train.py:1028] (0/2) Epoch 15, batch 8150, loss[loss=0.2143, simple_loss=0.2658, pruned_loss=0.08141, over 13088.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2878, pruned_loss=0.09162, over 2581058.08 frames. ], batch size: 121, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:50:41,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=274615.0, ans=0.2
2024-06-20 21:50:44,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.79 vs. limit=22.5
2024-06-20 21:50:46,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=274633.3333333333, ans=0.125
2024-06-20 21:50:49,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=274633.3333333333, ans=0.125
2024-06-20 21:51:07,145 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.042e+02 2.184e+02 2.355e+02 3.075e+02, threshold=4.368e+02, percent-clipped=0.0
2024-06-20 21:51:09,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274688.3333333333, ans=0.1
2024-06-20 21:51:13,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274688.3333333333, ans=0.1
2024-06-20 21:51:15,225 INFO [train.py:1028] (0/2) Epoch 15, batch 8200, loss[loss=0.2429, simple_loss=0.2905, pruned_loss=0.09766, over 13154.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2878, pruned_loss=0.09157, over 2583918.31 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:51:15,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274706.6666666667, ans=0.1
2024-06-20 21:51:17,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=274706.6666666667, ans=0.0
2024-06-20 21:51:23,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274725.0, ans=0.125
2024-06-20 21:51:34,704 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=8.860e-01
2024-06-20 21:51:56,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=274780.0, ans=0.0
2024-06-20 21:51:57,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=274780.0, ans=0.125
2024-06-20 21:51:59,881 INFO [train.py:1028] (0/2) Epoch 15, batch 8250, loss[loss=0.2435, simple_loss=0.2989, pruned_loss=0.09409, over 13280.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2887, pruned_loss=0.09186, over 2583866.93 frames. ], batch size: 52, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:52:14,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=274835.0, ans=0.0
2024-06-20 21:52:14,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=274835.0, ans=0.0
2024-06-20 21:52:15,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.20 vs. limit=22.5
2024-06-20 21:52:29,826 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.012e+02 2.138e+02 2.291e+02 3.037e+02, threshold=4.275e+02, percent-clipped=0.0
2024-06-20 21:52:37,809 INFO [train.py:1028] (0/2) Epoch 15, batch 8300, loss[loss=0.2347, simple_loss=0.2818, pruned_loss=0.09378, over 13044.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.2883, pruned_loss=0.09191, over 2580530.87 frames. ], batch size: 102, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:52:39,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.24 vs. limit=15.0
2024-06-20 21:52:43,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=274890.0, ans=0.09899494936611666
2024-06-20 21:52:54,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=274926.6666666667, ans=0.125
2024-06-20 21:52:55,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=274926.6666666667, ans=0.025
2024-06-20 21:52:56,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=274926.6666666667, ans=0.0
2024-06-20 21:53:07,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=274945.0, ans=0.2
2024-06-20 21:53:12,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=274945.0, ans=0.125
2024-06-20 21:53:21,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274981.6666666667, ans=0.125
2024-06-20 21:53:22,069 INFO [train.py:1028] (0/2) Epoch 15, batch 8350, loss[loss=0.2158, simple_loss=0.2718, pruned_loss=0.07991, over 13186.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2885, pruned_loss=0.09153, over 2580487.99 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:53:29,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=275000.0, ans=0.125
2024-06-20 21:53:33,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=275000.0, ans=0.125
2024-06-20 21:53:44,450 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.27 vs. limit=6.0
2024-06-20 21:53:47,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=275036.6666666667, ans=0.125
2024-06-20 21:53:52,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=275036.6666666667, ans=0.125
2024-06-20 21:53:54,241 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.031e+02 2.182e+02 2.380e+02 2.962e+02, threshold=4.364e+02, percent-clipped=0.0
2024-06-20 21:53:55,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=275055.0, ans=0.0
2024-06-20 21:54:01,988 INFO [train.py:1028] (0/2) Epoch 15, batch 8400, loss[loss=0.2331, simple_loss=0.2879, pruned_loss=0.0891, over 12905.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.2883, pruned_loss=0.09141, over 2577406.82 frames. ], batch size: 39, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:54:13,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.64 vs. limit=12.0
2024-06-20 21:54:26,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=275110.0, ans=0.0
2024-06-20 21:54:38,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=15.0
2024-06-20 21:54:45,636 INFO [train.py:1028] (0/2) Epoch 15, batch 8450, loss[loss=0.2469, simple_loss=0.293, pruned_loss=0.1004, over 13190.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.289, pruned_loss=0.09161, over 2579194.45 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:54:53,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275183.3333333333, ans=0.1
2024-06-20 21:55:06,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=275201.6666666667, ans=0.0
2024-06-20 21:55:13,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=275220.0, ans=0.125
2024-06-20 21:55:17,671 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.059e+02 2.169e+02 2.348e+02 3.167e+02, threshold=4.338e+02, percent-clipped=0.0
2024-06-20 21:55:28,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=275238.3333333333, ans=0.0
2024-06-20 21:55:29,534 INFO [train.py:1028] (0/2) Epoch 15, batch 8500, loss[loss=0.2204, simple_loss=0.2756, pruned_loss=0.0826, over 12787.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.2896, pruned_loss=0.09175, over 2579119.20 frames. ], batch size: 29, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:55:30,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0
2024-06-20 21:55:41,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=275275.0, ans=0.125
2024-06-20 21:55:43,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=275275.0, ans=0.125
2024-06-20 21:55:45,974 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0
2024-06-20 21:55:51,340 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0
2024-06-20 21:55:53,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=275311.6666666667, ans=0.125
2024-06-20 21:55:55,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275311.6666666667, ans=0.1
2024-06-20 21:56:03,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=275330.0, ans=0.0
2024-06-20 21:56:04,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=275330.0, ans=0.0
2024-06-20 21:56:09,968 INFO [train.py:1028] (0/2) Epoch 15, batch 8550, loss[loss=0.2576, simple_loss=0.2994, pruned_loss=0.1079, over 12609.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.2893, pruned_loss=0.09169, over 2576443.19 frames. ], batch size: 22, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:56:26,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=15.0
2024-06-20 21:56:31,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.87 vs. limit=15.0
2024-06-20 21:56:42,280 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 2.005e+02 2.149e+02 2.347e+02 2.864e+02, threshold=4.298e+02, percent-clipped=0.0
2024-06-20 21:56:49,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275421.6666666667, ans=0.125
2024-06-20 21:56:53,996 INFO [train.py:1028] (0/2) Epoch 15, batch 8600, loss[loss=0.2249, simple_loss=0.2797, pruned_loss=0.08501, over 13136.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.2898, pruned_loss=0.09194, over 2573043.03 frames. ], batch size: 112, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:56:54,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=275440.0, ans=0.1
2024-06-20 21:56:58,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.67 vs. limit=12.0
2024-06-20 21:57:01,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.71 vs. limit=15.0
2024-06-20 21:57:06,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=275458.3333333333, ans=0.125
2024-06-20 21:57:10,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=275476.6666666667, ans=0.2
2024-06-20 21:57:31,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=275513.3333333333, ans=0.0
2024-06-20 21:57:34,771 INFO [train.py:1028] (0/2) Epoch 15, batch 8650, loss[loss=0.2319, simple_loss=0.2872, pruned_loss=0.08828, over 13005.00 frames. ], tot_loss[loss=0.237, simple_loss=0.2899, pruned_loss=0.092, over 2576422.48 frames. ], batch size: 102, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:57:37,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275531.6666666667, ans=0.1
2024-06-20 21:58:09,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=275586.6666666667, ans=10.0
2024-06-20 21:58:09,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=275605.0, ans=0.2
2024-06-20 21:58:10,340 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.011e+02 2.128e+02 2.348e+02 2.972e+02, threshold=4.256e+02, percent-clipped=0.0
2024-06-20 21:58:12,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=275605.0, ans=0.0
2024-06-20 21:58:15,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=275605.0, ans=0.125
2024-06-20 21:58:18,205 INFO [train.py:1028] (0/2) Epoch 15, batch 8700, loss[loss=0.2243, simple_loss=0.2807, pruned_loss=0.08399, over 13183.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.2904, pruned_loss=0.09257, over 2572322.35 frames. ], batch size: 59, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:58:20,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=275623.3333333333, ans=0.125
2024-06-20 21:58:24,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=275623.3333333333, ans=0.07
2024-06-20 21:58:30,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.79 vs. limit=15.0
2024-06-20 21:58:43,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.04 vs. limit=22.5
2024-06-20 21:58:51,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=275696.6666666667, ans=0.0
2024-06-20 21:58:58,424 INFO [train.py:1028] (0/2) Epoch 15, batch 8750, loss[loss=0.2332, simple_loss=0.2833, pruned_loss=0.09149, over 13084.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2895, pruned_loss=0.0921, over 2566814.02 frames. ], batch size: 121, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:59:10,986 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5
2024-06-20 21:59:11,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=275733.3333333333, ans=0.0
2024-06-20 21:59:34,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=275788.3333333333, ans=0.1
2024-06-20 21:59:35,493 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.974e+02 2.145e+02 2.289e+02 3.457e+02, threshold=4.290e+02, percent-clipped=0.0
2024-06-20 21:59:42,780 INFO [train.py:1028] (0/2) Epoch 15, batch 8800, loss[loss=0.2284, simple_loss=0.2849, pruned_loss=0.08594, over 13261.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.29, pruned_loss=0.09216, over 2571528.27 frames. ], batch size: 72, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 21:59:42,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=275806.6666666667, ans=0.125
2024-06-20 22:00:09,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=275843.3333333333, ans=0.125
2024-06-20 22:00:19,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=275880.0, ans=0.0
2024-06-20 22:00:27,941 INFO [train.py:1028] (0/2) Epoch 15, batch 8850, loss[loss=0.2821, simple_loss=0.32, pruned_loss=0.1221, over 12556.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.2903, pruned_loss=0.09251, over 2560108.38 frames. ], batch size: 202, lr: 3.85e-03, grad_scale: 64.0
2024-06-20 22:00:28,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275898.3333333333, ans=0.1
2024-06-20 22:00:34,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=275898.3333333333, ans=0.2
2024-06-20 22:00:41,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=275916.6666666667, ans=0.125
2024-06-20 22:00:46,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275935.0, ans=0.125
2024-06-20 22:00:55,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.50 vs. limit=10.0
2024-06-20 22:01:01,004 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.060e+02 2.235e+02 2.392e+02 3.421e+02, threshold=4.470e+02, percent-clipped=0.0
2024-06-20 22:01:08,509 INFO [train.py:1028] (0/2) Epoch 15, batch 8900, loss[loss=0.2211, simple_loss=0.28, pruned_loss=0.08115, over 12926.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.2907, pruned_loss=0.09281, over 2559143.07 frames. ], batch size: 33, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:01:19,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=276008.3333333333, ans=0.0
2024-06-20 22:01:24,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=276026.6666666667, ans=0.0
2024-06-20 22:01:42,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=276045.0, ans=0.0
2024-06-20 22:01:50,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0
2024-06-20 22:01:52,499 INFO [train.py:1028] (0/2) Epoch 15, batch 8950, loss[loss=0.2538, simple_loss=0.3049, pruned_loss=0.1013, over 12559.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.2914, pruned_loss=0.09272, over 2559358.06 frames. ], batch size: 202, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:02:09,237 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.356e+00
2024-06-20 22:02:11,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=276118.3333333333, ans=0.0
2024-06-20 22:02:14,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=276118.3333333333, ans=0.125
2024-06-20 22:02:17,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=276136.6666666667, ans=0.0
2024-06-20 22:02:20,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=276136.6666666667, ans=0.125
2024-06-20 22:02:25,919 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 2.033e+02 2.229e+02 2.538e+02 3.662e+02, threshold=4.459e+02, percent-clipped=0.0
2024-06-20 22:02:33,188 INFO [train.py:1028] (0/2) Epoch 15, batch 9000, loss[loss=0.2286, simple_loss=0.288, pruned_loss=0.08456, over 13314.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.291, pruned_loss=0.09208, over 2566456.73 frames. ], batch size: 46, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:02:33,189 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-20 22:02:46,187 INFO [train.py:1060] (0/2) Epoch 15, validation: loss=0.1894, simple_loss=0.2539, pruned_loss=0.06241, over 351949.00 frames.
2024-06-20 22:02:46,188 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-20 22:03:05,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=276210.0, ans=0.125
2024-06-20 22:03:08,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=276210.0, ans=0.0
2024-06-20 22:03:10,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276228.3333333333, ans=0.1
2024-06-20 22:03:16,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=276228.3333333333, ans=0.025
2024-06-20 22:03:17,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=22.5
2024-06-20 22:03:21,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=276246.6666666667, ans=0.1
2024-06-20 22:03:25,681 INFO [train.py:1028] (0/2) Epoch 15, batch 9050, loss[loss=0.238, simple_loss=0.2859, pruned_loss=0.09508, over 11786.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2924, pruned_loss=0.09283, over 2566542.64 frames. ], batch size: 17, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:03:40,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=276301.6666666667, ans=0.125
2024-06-20 22:03:45,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=276301.6666666667, ans=0.2
2024-06-20 22:03:58,019 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.067e+02 2.184e+02 2.432e+02 3.135e+02, threshold=4.369e+02, percent-clipped=0.0
2024-06-20 22:04:04,934 INFO [train.py:1028] (0/2) Epoch 15, batch 9100, loss[loss=0.2207, simple_loss=0.2836, pruned_loss=0.0789, over 13263.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.2918, pruned_loss=0.09256, over 2567583.96 frames. ], batch size: 72, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:04:09,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=276356.6666666667, ans=0.2
2024-06-20 22:04:17,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.08 vs. limit=10.0
2024-06-20 22:04:18,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=22.5
2024-06-20 22:04:29,751 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 22:04:43,078 INFO [train.py:1028] (0/2) Epoch 15, batch 9150, loss[loss=0.2302, simple_loss=0.2817, pruned_loss=0.08931, over 13157.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2925, pruned_loss=0.093, over 2568985.45 frames. ], batch size: 77, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:04:46,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=276448.3333333333, ans=0.125
2024-06-20 22:04:59,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=276466.6666666667, ans=0.125
2024-06-20 22:05:12,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=276503.3333333333, ans=0.2
2024-06-20 22:05:18,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=276521.6666666667, ans=0.05
2024-06-20 22:05:20,070 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.057e+02 2.212e+02 2.470e+02 3.489e+02, threshold=4.423e+02, percent-clipped=0.0
2024-06-20 22:05:20,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=276521.6666666667, ans=0.125
2024-06-20 22:05:22,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=276521.6666666667, ans=0.125
2024-06-20 22:05:25,986 INFO [train.py:1028] (0/2) Epoch 15, batch 9200, loss[loss=0.2344, simple_loss=0.2912, pruned_loss=0.08877, over 12926.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.2923, pruned_loss=0.09269, over 2571843.45 frames. ], batch size: 36, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:05:30,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.96 vs. limit=22.5
2024-06-20 22:05:32,977 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-20 22:05:35,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=276558.3333333333, ans=0.07
2024-06-20 22:05:48,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0
2024-06-20 22:05:58,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=22.5
2024-06-20 22:06:02,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=276631.6666666667, ans=0.2
2024-06-20 22:06:03,611 INFO [train.py:1028] (0/2) Epoch 15, batch 9250, loss[loss=0.2279, simple_loss=0.2845, pruned_loss=0.08561, over 13171.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.292, pruned_loss=0.09241, over 2574161.54 frames. ], batch size: 67, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:06:05,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.69 vs. limit=22.5
2024-06-20 22:06:09,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=276631.6666666667, ans=0.2
2024-06-20 22:06:09,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276631.6666666667, ans=0.1
2024-06-20 22:06:10,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=276631.6666666667, ans=0.2
2024-06-20 22:06:14,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5
2024-06-20 22:06:18,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=276668.3333333333, ans=0.125
2024-06-20 22:06:30,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=276686.6666666667, ans=0.125
2024-06-20 22:06:30,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=276686.6666666667, ans=0.125
2024-06-20 22:06:33,675 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 1.992e+02 2.150e+02 2.291e+02 3.421e+02, threshold=4.299e+02, percent-clipped=0.0
2024-06-20 22:06:38,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=276705.0, ans=0.0
2024-06-20 22:06:39,332 INFO [train.py:1028] (0/2) Epoch 15, batch 9300, loss[loss=0.2288, simple_loss=0.2871, pruned_loss=0.08522, over 12972.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.2921, pruned_loss=0.0923, over 2571503.00 frames. ], batch size: 39, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:06:43,966 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-20 22:06:51,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=276741.6666666667, ans=0.125
2024-06-20 22:06:58,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=276760.0, ans=0.0
2024-06-20 22:07:12,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.77 vs. limit=12.0
2024-06-20 22:07:19,730 INFO [train.py:1028] (0/2) Epoch 15, batch 9350, loss[loss=0.2501, simple_loss=0.302, pruned_loss=0.09909, over 12542.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2925, pruned_loss=0.09278, over 2569118.57 frames. ], batch size: 22, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:07:25,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=276815.0, ans=0.125
2024-06-20 22:07:25,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=276815.0, ans=6.0
2024-06-20 22:07:27,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=276833.3333333333, ans=0.125
2024-06-20 22:07:30,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276833.3333333333, ans=0.125
2024-06-20 22:07:30,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=276833.3333333333, ans=0.125
2024-06-20 22:07:32,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=276833.3333333333, ans=0.2
2024-06-20 22:07:45,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=276870.0, ans=0.0
2024-06-20 22:07:48,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=276888.3333333333, ans=0.0
2024-06-20 22:07:50,218 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.059e+02 2.200e+02 2.428e+02 5.384e+02, threshold=4.400e+02, percent-clipped=1.0
2024-06-20 22:07:56,868 INFO [train.py:1028] (0/2) Epoch 15, batch 9400, loss[loss=0.2384, simple_loss=0.2936, pruned_loss=0.09155, over 13203.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2919, pruned_loss=0.09263, over 2568031.41 frames. ], batch size: 52, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:08:01,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.74 vs. limit=15.0
2024-06-20 22:08:06,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0
2024-06-20 22:08:12,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=276943.3333333333, ans=0.0
2024-06-20 22:08:33,793 INFO [train.py:1028] (0/2) Epoch 15, batch 9450, loss[loss=0.2275, simple_loss=0.2839, pruned_loss=0.0855, over 12560.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.2923, pruned_loss=0.09347, over 2568045.37 frames. ], batch size: 22, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:08:42,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=277016.6666666667, ans=0.025
2024-06-20 22:08:52,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0
2024-06-20 22:08:52,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.56 vs. limit=15.0
2024-06-20 22:08:52,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0
2024-06-20 22:09:06,390 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.041e+02 2.152e+02 2.361e+02 3.165e+02, threshold=4.304e+02, percent-clipped=0.0
2024-06-20 22:09:13,011 INFO [train.py:1028] (0/2) Epoch 15, batch 9500, loss[loss=0.2331, simple_loss=0.2904, pruned_loss=0.08787, over 13228.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2916, pruned_loss=0.0928, over 2576883.57 frames. ], batch size: 43, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:09:18,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=277090.0, ans=0.0
2024-06-20 22:09:18,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=277090.0, ans=0.0
2024-06-20 22:09:21,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=277108.3333333333, ans=0.125
2024-06-20 22:09:29,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=277126.6666666667, ans=0.125
2024-06-20 22:09:45,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=277163.3333333333, ans=0.125
2024-06-20 22:09:49,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=277163.3333333333, ans=0.125
2024-06-20 22:09:50,617 INFO [train.py:1028] (0/2) Epoch 15, batch 9550, loss[loss=0.2101, simple_loss=0.2673, pruned_loss=0.07647, over 12880.00 frames. ], tot_loss[loss=0.239, simple_loss=0.2918, pruned_loss=0.09308, over 2571082.12 frames. ], batch size: 39, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:09:55,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=277181.6666666667, ans=0.0
2024-06-20 22:10:08,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=277218.3333333333, ans=0.0
2024-06-20 22:10:08,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.87 vs. limit=15.0
2024-06-20 22:10:09,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=277218.3333333333, ans=0.0
2024-06-20 22:10:11,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277236.6666666667, ans=0.1
2024-06-20 22:10:19,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=277255.0, ans=0.025
2024-06-20 22:10:20,230 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.040e+02 2.170e+02 2.424e+02 3.926e+02, threshold=4.339e+02, percent-clipped=0.0
2024-06-20 22:10:20,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=277255.0, ans=0.0
2024-06-20 22:10:26,619 INFO [train.py:1028] (0/2) Epoch 15, batch 9600, loss[loss=0.2546, simple_loss=0.3012, pruned_loss=0.104, over 10541.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.2912, pruned_loss=0.09252, over 2570697.89 frames. ], batch size: 304, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:10:31,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=277273.3333333333, ans=0.125
2024-06-20 22:10:41,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277291.6666666667, ans=0.1
2024-06-20 22:10:50,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.37 vs. limit=10.0
2024-06-20 22:10:55,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.90 vs. limit=15.0
2024-06-20 22:11:03,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=277346.6666666667, ans=0.2
2024-06-20 22:11:04,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277346.6666666667, ans=0.125
2024-06-20 22:11:06,130 INFO [train.py:1028] (0/2) Epoch 15, batch 9650, loss[loss=0.2298, simple_loss=0.2829, pruned_loss=0.08834, over 13093.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.292, pruned_loss=0.09354, over 2560852.60 frames. ], batch size: 132, lr: 3.84e-03, grad_scale: 64.0
2024-06-20 22:11:17,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=277383.3333333333, ans=0.2
2024-06-20 22:11:28,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=277420.0, ans=0.0
2024-06-20 22:11:29,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=277420.0, ans=0.125
2024-06-20 22:11:34,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=277438.3333333333, ans=0.125
2024-06-20 22:11:36,278 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.040e+02 2.243e+02 2.511e+02 3.188e+02, threshold=4.485e+02, percent-clipped=0.0
2024-06-20 22:11:36,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=277438.3333333333, ans=0.1
2024-06-20 22:11:45,013 INFO [train.py:1028] (0/2) Epoch 15, batch 9700, loss[loss=0.2578, simple_loss=0.2985, pruned_loss=0.1085, over 13059.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.2913, pruned_loss=0.09308, over 2556121.42 frames. ], batch size: 144, lr: 3.83e-03, grad_scale: 64.0
2024-06-20 22:11:49,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=277456.6666666667, ans=0.0
2024-06-20 22:12:03,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=277493.3333333333, ans=0.05
2024-06-20 22:12:21,563 INFO [train.py:1028] (0/2) Epoch 15, batch 9750, loss[loss=0.2198, simple_loss=0.273, pruned_loss=0.08332, over 13153.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.29, pruned_loss=0.0921, over 2553544.47 frames.
], batch size: 132, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:12:27,825 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:12:39,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=277585.0, ans=22.5 2024-06-20 22:12:39,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=277585.0, ans=0.125 2024-06-20 22:12:41,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=277585.0, ans=0.1 2024-06-20 22:12:43,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=277603.3333333333, ans=0.125 2024-06-20 22:12:51,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 2.042e+02 2.175e+02 2.429e+02 3.761e+02, threshold=4.351e+02, percent-clipped=0.0 2024-06-20 22:12:58,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.65 vs. limit=22.5 2024-06-20 22:12:59,796 INFO [train.py:1028] (0/2) Epoch 15, batch 9800, loss[loss=0.2334, simple_loss=0.2884, pruned_loss=0.08919, over 12983.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.2897, pruned_loss=0.09185, over 2546986.28 frames. ], batch size: 39, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:13:06,239 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:13:17,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=277676.6666666667, ans=0.125 2024-06-20 22:13:22,834 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2024-06-20 22:13:24,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=277695.0, ans=0.1 2024-06-20 22:13:35,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=277731.6666666667, ans=0.1 2024-06-20 22:13:36,456 INFO [train.py:1028] (0/2) Epoch 15, batch 9850, loss[loss=0.2191, simple_loss=0.2687, pruned_loss=0.08471, over 13163.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2885, pruned_loss=0.09121, over 2538248.95 frames. 
], batch size: 103, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:13:44,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=277750.0, ans=0.125 2024-06-20 22:13:56,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277768.3333333333, ans=0.1 2024-06-20 22:13:56,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=277768.3333333333, ans=0.125 2024-06-20 22:14:01,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=277786.6666666667, ans=0.125 2024-06-20 22:14:06,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=277805.0, ans=0.2 2024-06-20 22:14:07,228 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.038e+02 2.147e+02 2.275e+02 2.687e+02, threshold=4.295e+02, percent-clipped=0.0 2024-06-20 22:14:10,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=277805.0, ans=0.015 2024-06-20 22:14:13,738 INFO [train.py:1028] (0/2) Epoch 15, batch 9900, loss[loss=0.255, simple_loss=0.308, pruned_loss=0.1009, over 12930.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2889, pruned_loss=0.09178, over 2531644.25 frames. ], batch size: 39, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:14:14,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277823.3333333333, ans=0.125 2024-06-20 22:14:16,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=277823.3333333333, ans=0.125 2024-06-20 22:14:21,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=277841.6666666667, ans=0.2 2024-06-20 22:14:26,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2024-06-20 22:14:34,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=277860.0, ans=0.035 2024-06-20 22:14:35,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277860.0, ans=0.125 2024-06-20 22:14:38,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=277878.3333333333, ans=0.2 2024-06-20 22:14:42,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=277878.3333333333, ans=0.05 2024-06-20 22:14:43,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=277896.6666666667, ans=0.95 2024-06-20 22:14:51,302 INFO [train.py:1028] (0/2) Epoch 15, batch 9950, loss[loss=0.2498, simple_loss=0.2972, pruned_loss=0.1011, over 12632.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2875, pruned_loss=0.09144, over 2526849.99 frames. 
], batch size: 29, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:14:53,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=15.0 2024-06-20 22:14:57,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=12.0 2024-06-20 22:15:06,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2024-06-20 22:15:08,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277951.6666666667, ans=0.1 2024-06-20 22:15:19,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=277970.0, ans=0.025 2024-06-20 22:15:22,678 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.122e+02 2.241e+02 2.471e+02 3.287e+02, threshold=4.483e+02, percent-clipped=0.0 2024-06-20 22:15:26,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.63 vs. limit=10.0 2024-06-20 22:15:26,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=277988.3333333333, ans=0.125 2024-06-20 22:15:29,387 INFO [train.py:1028] (0/2) Epoch 15, batch 10000, loss[loss=0.2454, simple_loss=0.3041, pruned_loss=0.09335, over 12443.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2888, pruned_loss=0.09247, over 2488594.36 frames. ], batch size: 22, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:15:50,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=278043.3333333333, ans=0.1 2024-06-20 22:15:54,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=278061.6666666667, ans=0.125 2024-06-20 22:15:56,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=278061.6666666667, ans=0.2 2024-06-20 22:16:06,609 INFO [train.py:1028] (0/2) Epoch 15, batch 10050, loss[loss=0.2401, simple_loss=0.2979, pruned_loss=0.09115, over 12875.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.289, pruned_loss=0.09338, over 2445824.49 frames. ], batch size: 22, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:16:11,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.90 vs. limit=6.0 2024-06-20 22:16:17,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.42 vs. 
limit=22.5 2024-06-20 22:16:24,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=278135.0, ans=0.0 2024-06-20 22:16:25,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278135.0, ans=0.1 2024-06-20 22:16:35,189 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:16:35,833 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 2.061e+02 2.240e+02 2.510e+02 4.001e+02, threshold=4.479e+02, percent-clipped=0.0 2024-06-20 22:16:42,727 INFO [train.py:1028] (0/2) Epoch 15, batch 10100, loss[loss=0.2491, simple_loss=0.3008, pruned_loss=0.0987, over 10982.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.2889, pruned_loss=0.09324, over 2424867.21 frames. ], batch size: 16, lr: 3.83e-03, grad_scale: 64.0 2024-06-20 22:16:53,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278208.3333333333, ans=0.1 2024-06-20 22:16:53,418 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.35 vs. limit=6.0 2024-06-20 22:16:54,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=278208.3333333333, ans=0.0 2024-06-20 22:16:58,916 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-15.pt 2024-06-20 22:19:19,177 INFO [train.py:1028] (0/2) Epoch 16, batch 0, loss[loss=0.1899, simple_loss=0.2483, pruned_loss=0.06578, over 12959.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2483, pruned_loss=0.06578, over 12959.00 frames. ], batch size: 36, lr: 3.71e-03, grad_scale: 64.0 2024-06-20 22:19:19,178 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 22:19:27,085 INFO [train.py:1060] (0/2) Epoch 16, validation: loss=0.1901, simple_loss=0.255, pruned_loss=0.06255, over 351949.00 frames. 2024-06-20 22:19:27,086 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 22:19:27,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=278221.1666666667, ans=0.125 2024-06-20 22:19:29,103 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:19:35,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=278221.1666666667, ans=0.125 2024-06-20 22:19:37,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=278221.1666666667, ans=0.2 2024-06-20 22:19:39,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=278239.5, ans=0.0 2024-06-20 22:19:42,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2024-06-20 22:19:47,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. 
limit=6.0 2024-06-20 22:20:10,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=278294.5, ans=0.125 2024-06-20 22:20:11,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278294.5, ans=0.1 2024-06-20 22:20:14,517 INFO [train.py:1028] (0/2) Epoch 16, batch 50, loss[loss=0.2087, simple_loss=0.2673, pruned_loss=0.07501, over 12591.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2691, pruned_loss=0.08429, over 574679.23 frames. ], batch size: 29, lr: 3.71e-03, grad_scale: 64.0 2024-06-20 22:20:14,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=278312.8333333333, ans=0.1 2024-06-20 22:20:32,614 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.619e+02 1.944e+02 2.090e+02 2.257e+02 3.301e+02, threshold=4.180e+02, percent-clipped=0.0 2024-06-20 22:20:34,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=278349.5, ans=0.0 2024-06-20 22:20:44,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=278386.1666666667, ans=0.125 2024-06-20 22:20:44,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=278386.1666666667, ans=0.0 2024-06-20 22:20:52,099 INFO [train.py:1028] (0/2) Epoch 16, batch 100, loss[loss=0.1984, simple_loss=0.2554, pruned_loss=0.07067, over 13236.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2678, pruned_loss=0.08275, over 1017901.84 frames. ], batch size: 46, lr: 3.71e-03, grad_scale: 64.0 2024-06-20 22:21:04,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=278422.8333333333, ans=0.125 2024-06-20 22:21:05,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=278422.8333333333, ans=0.0 2024-06-20 22:21:13,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=278441.1666666667, ans=0.0 2024-06-20 22:21:16,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=22.5 2024-06-20 22:21:27,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=278477.8333333333, ans=0.1 2024-06-20 22:21:29,967 INFO [train.py:1028] (0/2) Epoch 16, batch 150, loss[loss=0.2131, simple_loss=0.2814, pruned_loss=0.07244, over 12720.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2691, pruned_loss=0.08321, over 1365101.53 frames. ], batch size: 29, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:21:41,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2024-06-20 22:21:42,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.63 vs. 
limit=15.0 2024-06-20 22:21:48,555 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.892e+02 2.012e+02 2.178e+02 2.587e+02, threshold=4.024e+02, percent-clipped=0.0 2024-06-20 22:21:49,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=278532.8333333333, ans=0.2 2024-06-20 22:22:11,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=278569.5, ans=0.0 2024-06-20 22:22:15,715 INFO [train.py:1028] (0/2) Epoch 16, batch 200, loss[loss=0.2451, simple_loss=0.2925, pruned_loss=0.09886, over 12556.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2691, pruned_loss=0.08348, over 1634706.62 frames. ], batch size: 202, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:22:16,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=15.0 2024-06-20 22:22:31,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2024-06-20 22:22:38,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2024-06-20 22:22:46,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=278661.1666666667, ans=0.125 2024-06-20 22:22:48,458 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-152000.pt 2024-06-20 22:22:57,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=278661.1666666667, ans=0.125 2024-06-20 22:22:59,938 INFO [train.py:1028] (0/2) Epoch 16, batch 250, loss[loss=0.2142, simple_loss=0.2576, pruned_loss=0.0854, over 13047.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2685, pruned_loss=0.08333, over 1846354.88 frames. ], batch size: 144, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:23:04,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=278679.5, ans=0.2 2024-06-20 22:23:04,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2024-06-20 22:23:07,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=278697.8333333333, ans=0.125 2024-06-20 22:23:10,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=278697.8333333333, ans=0.125 2024-06-20 22:23:17,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=278716.1666666667, ans=0.1 2024-06-20 22:23:18,967 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.911e+02 2.040e+02 2.212e+02 2.942e+02, threshold=4.079e+02, percent-clipped=0.0 2024-06-20 22:23:23,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=278734.5, ans=0.0 2024-06-20 22:23:23,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=278734.5, ans=0.125 2024-06-20 22:23:29,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=278734.5, ans=0.2 2024-06-20 22:23:33,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=278752.8333333333, ans=0.125 2024-06-20 22:23:35,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=278752.8333333333, ans=0.125 2024-06-20 22:23:37,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2024-06-20 22:23:37,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=278752.8333333333, ans=0.125 2024-06-20 22:23:38,964 INFO [train.py:1028] (0/2) Epoch 16, batch 300, loss[loss=0.2333, simple_loss=0.2797, pruned_loss=0.09343, over 13145.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2687, pruned_loss=0.08364, over 2009845.74 frames. ], batch size: 112, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:23:41,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.48 vs. limit=15.0 2024-06-20 22:24:02,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=22.5 2024-06-20 22:24:17,095 INFO [train.py:1028] (0/2) Epoch 16, batch 350, loss[loss=0.2036, simple_loss=0.2617, pruned_loss=0.07268, over 12881.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2686, pruned_loss=0.08351, over 2138476.66 frames. ], batch size: 33, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:24:17,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=278862.8333333333, ans=0.0 2024-06-20 22:24:42,195 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.915e+02 2.036e+02 2.237e+02 2.649e+02, threshold=4.072e+02, percent-clipped=0.0 2024-06-20 22:25:02,374 INFO [train.py:1028] (0/2) Epoch 16, batch 400, loss[loss=0.2111, simple_loss=0.2655, pruned_loss=0.07833, over 13256.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2684, pruned_loss=0.08299, over 2237748.78 frames. 
], batch size: 63, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:25:02,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=278954.5, ans=0.025 2024-06-20 22:25:19,487 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:25:21,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=278991.1666666667, ans=0.125 2024-06-20 22:25:26,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=279009.5, ans=0.125 2024-06-20 22:25:34,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=279027.8333333333, ans=0.125 2024-06-20 22:25:34,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=279027.8333333333, ans=0.125 2024-06-20 22:25:37,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279027.8333333333, ans=0.1 2024-06-20 22:25:40,976 INFO [train.py:1028] (0/2) Epoch 16, batch 450, loss[loss=0.2295, simple_loss=0.2827, pruned_loss=0.08811, over 13198.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2684, pruned_loss=0.08314, over 2312783.63 frames. ], batch size: 67, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:25:59,654 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.899e+02 2.010e+02 2.162e+02 3.318e+02, threshold=4.019e+02, percent-clipped=0.0 2024-06-20 22:26:01,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=279082.8333333333, ans=0.125 2024-06-20 22:26:09,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=279101.1666666667, ans=0.125 2024-06-20 22:26:20,060 INFO [train.py:1028] (0/2) Epoch 16, batch 500, loss[loss=0.2309, simple_loss=0.273, pruned_loss=0.09444, over 13104.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2684, pruned_loss=0.08295, over 2375367.42 frames. ], batch size: 121, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:26:34,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=279156.1666666667, ans=0.1 2024-06-20 22:26:52,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=279211.1666666667, ans=0.125 2024-06-20 22:27:02,116 INFO [train.py:1028] (0/2) Epoch 16, batch 550, loss[loss=0.2089, simple_loss=0.2581, pruned_loss=0.07983, over 12960.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2672, pruned_loss=0.08255, over 2421324.45 frames. ], batch size: 158, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:27:04,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.14 vs. 
limit=12.0 2024-06-20 22:27:18,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=279247.8333333333, ans=0.1 2024-06-20 22:27:23,655 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.945e+02 2.048e+02 2.295e+02 3.139e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-20 22:27:23,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=279266.1666666667, ans=0.125 2024-06-20 22:27:30,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=279284.5, ans=0.0 2024-06-20 22:27:37,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=279302.8333333333, ans=0.035 2024-06-20 22:27:39,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=279302.8333333333, ans=0.125 2024-06-20 22:27:42,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=279321.1666666667, ans=0.0 2024-06-20 22:27:42,997 INFO [train.py:1028] (0/2) Epoch 16, batch 600, loss[loss=0.1998, simple_loss=0.2418, pruned_loss=0.07889, over 13017.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2669, pruned_loss=0.08224, over 2458798.19 frames. ], batch size: 144, lr: 3.70e-03, grad_scale: 64.0 2024-06-20 22:27:43,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2024-06-20 22:27:46,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=279321.1666666667, ans=0.0 2024-06-20 22:27:49,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=279321.1666666667, ans=0.0 2024-06-20 22:27:54,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.97 vs. limit=22.5 2024-06-20 22:27:59,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=279357.8333333333, ans=0.09899494936611666 2024-06-20 22:28:14,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279394.5, ans=0.1 2024-06-20 22:28:19,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.54 vs. limit=6.0 2024-06-20 22:28:20,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=279394.5, ans=0.1 2024-06-20 22:28:21,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=279412.8333333333, ans=0.125 2024-06-20 22:28:22,278 INFO [train.py:1028] (0/2) Epoch 16, batch 650, loss[loss=0.2147, simple_loss=0.2754, pruned_loss=0.07704, over 13180.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2667, pruned_loss=0.08155, over 2489782.09 frames. 
], batch size: 59, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:28:29,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=279412.8333333333, ans=0.2 2024-06-20 22:28:36,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=279431.1666666667, ans=0.125 2024-06-20 22:28:41,348 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.922e+02 2.044e+02 2.209e+02 3.211e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-20 22:28:47,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279467.8333333333, ans=0.1 2024-06-20 22:28:48,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=279467.8333333333, ans=0.0 2024-06-20 22:28:52,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.62 vs. limit=22.5 2024-06-20 22:29:00,885 INFO [train.py:1028] (0/2) Epoch 16, batch 700, loss[loss=0.2058, simple_loss=0.2651, pruned_loss=0.07325, over 13269.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.266, pruned_loss=0.08132, over 2512900.73 frames. ], batch size: 46, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:29:01,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=279504.5, ans=0.0 2024-06-20 22:29:04,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=279504.5, ans=0.125 2024-06-20 22:29:04,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=279504.5, ans=0.025 2024-06-20 22:29:19,889 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2024-06-20 22:29:33,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=279559.5, ans=0.125 2024-06-20 22:29:41,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=279577.8333333333, ans=0.125 2024-06-20 22:29:42,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.29 vs. limit=10.0 2024-06-20 22:29:46,208 INFO [train.py:1028] (0/2) Epoch 16, batch 750, loss[loss=0.2137, simple_loss=0.2676, pruned_loss=0.07991, over 13215.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2663, pruned_loss=0.08144, over 2527799.66 frames. 
], batch size: 63, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:29:52,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=279596.1666666667, ans=0.02 2024-06-20 22:30:01,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279632.8333333333, ans=0.1 2024-06-20 22:30:04,669 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 1.923e+02 2.020e+02 2.162e+02 2.795e+02, threshold=4.041e+02, percent-clipped=0.0 2024-06-20 22:30:12,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=279651.1666666667, ans=0.0 2024-06-20 22:30:25,630 INFO [train.py:1028] (0/2) Epoch 16, batch 800, loss[loss=0.1967, simple_loss=0.2561, pruned_loss=0.06862, over 12833.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2658, pruned_loss=0.08113, over 2541452.31 frames. ], batch size: 36, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:30:33,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=279706.1666666667, ans=0.0 2024-06-20 22:30:36,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=279706.1666666667, ans=0.125 2024-06-20 22:30:42,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=279724.5, ans=0.0 2024-06-20 22:30:42,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-20 22:30:43,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=279724.5, ans=0.125 2024-06-20 22:30:47,030 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.358e-02 2024-06-20 22:30:53,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=8.0 2024-06-20 22:30:59,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=279761.1666666667, ans=0.0 2024-06-20 22:31:05,176 INFO [train.py:1028] (0/2) Epoch 16, batch 850, loss[loss=0.2185, simple_loss=0.2681, pruned_loss=0.08446, over 13098.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.266, pruned_loss=0.08138, over 2551564.36 frames. ], batch size: 95, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:31:05,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=279779.5, ans=0.125 2024-06-20 22:31:17,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.05 vs. 
limit=15.0 2024-06-20 22:31:23,629 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.923e+02 2.128e+02 2.317e+02 2.954e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-20 22:31:37,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=279852.8333333333, ans=0.125 2024-06-20 22:31:43,610 INFO [train.py:1028] (0/2) Epoch 16, batch 900, loss[loss=0.2209, simple_loss=0.2765, pruned_loss=0.08265, over 12814.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.266, pruned_loss=0.08161, over 2556806.29 frames. ], batch size: 36, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:32:09,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=279907.8333333333, ans=0.2 2024-06-20 22:32:22,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.31 vs. limit=22.5 2024-06-20 22:32:24,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=279944.5, ans=0.125 2024-06-20 22:32:24,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=279944.5, ans=0.125 2024-06-20 22:32:25,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=279944.5, ans=0.125 2024-06-20 22:32:30,209 INFO [train.py:1028] (0/2) Epoch 16, batch 950, loss[loss=0.2052, simple_loss=0.2585, pruned_loss=0.07591, over 12901.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2664, pruned_loss=0.08173, over 2559654.23 frames. ], batch size: 39, lr: 3.70e-03, grad_scale: 128.0 2024-06-20 22:32:48,426 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.971e+02 2.089e+02 2.285e+02 3.304e+02, threshold=4.178e+02, percent-clipped=0.0 2024-06-20 22:33:08,332 INFO [train.py:1028] (0/2) Epoch 16, batch 1000, loss[loss=0.212, simple_loss=0.2626, pruned_loss=0.08072, over 13311.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2661, pruned_loss=0.08195, over 2561799.49 frames. ], batch size: 49, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:33:18,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=280072.8333333333, ans=0.125 2024-06-20 22:33:18,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=280072.8333333333, ans=10.0 2024-06-20 22:33:34,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=280109.5, ans=0.025 2024-06-20 22:33:47,142 INFO [train.py:1028] (0/2) Epoch 16, batch 1050, loss[loss=0.1735, simple_loss=0.2342, pruned_loss=0.05643, over 13185.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2671, pruned_loss=0.08225, over 2565061.10 frames. 
], batch size: 77, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:33:48,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=280146.1666666667, ans=0.5 2024-06-20 22:33:48,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=280146.1666666667, ans=0.0 2024-06-20 22:33:49,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=280146.1666666667, ans=0.125 2024-06-20 22:33:52,591 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:33:56,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=280164.5, ans=0.09899494936611666 2024-06-20 22:34:01,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=280182.8333333333, ans=0.125 2024-06-20 22:34:05,392 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.925e+02 2.055e+02 2.218e+02 2.787e+02, threshold=4.110e+02, percent-clipped=0.0 2024-06-20 22:34:19,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=280201.1666666667, ans=0.125 2024-06-20 22:34:32,134 INFO [train.py:1028] (0/2) Epoch 16, batch 1100, loss[loss=0.2056, simple_loss=0.2565, pruned_loss=0.07737, over 13260.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2677, pruned_loss=0.08266, over 2569567.94 frames. ], batch size: 52, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:34:32,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.72 vs. limit=15.0 2024-06-20 22:34:39,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2024-06-20 22:34:41,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=280256.1666666667, ans=0.125 2024-06-20 22:34:55,899 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:35:03,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=280311.1666666667, ans=0.0 2024-06-20 22:35:03,230 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.15 vs. limit=15.0 2024-06-20 22:35:11,240 INFO [train.py:1028] (0/2) Epoch 16, batch 1150, loss[loss=0.2388, simple_loss=0.2848, pruned_loss=0.09633, over 13219.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2676, pruned_loss=0.08275, over 2570187.10 frames. 
], batch size: 52, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:35:16,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=280329.5, ans=0.0 2024-06-20 22:35:25,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=280366.1666666667, ans=0.125 2024-06-20 22:35:29,261 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 1.924e+02 2.047e+02 2.211e+02 3.089e+02, threshold=4.094e+02, percent-clipped=0.0 2024-06-20 22:35:35,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=280384.5, ans=0.125 2024-06-20 22:35:37,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=280384.5, ans=0.1 2024-06-20 22:35:42,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=280402.8333333333, ans=0.125 2024-06-20 22:35:45,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=280402.8333333333, ans=0.125 2024-06-20 22:35:48,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=280421.1666666667, ans=0.125 2024-06-20 22:35:49,133 INFO [train.py:1028] (0/2) Epoch 16, batch 1200, loss[loss=0.2013, simple_loss=0.2545, pruned_loss=0.07405, over 13151.00 frames. ], tot_loss[loss=0.217, simple_loss=0.268, pruned_loss=0.08301, over 2572965.83 frames. ], batch size: 77, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:36:22,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=280494.5, ans=0.1 2024-06-20 22:36:27,342 INFO [train.py:1028] (0/2) Epoch 16, batch 1250, loss[loss=0.2121, simple_loss=0.2608, pruned_loss=0.08172, over 13184.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2682, pruned_loss=0.08297, over 2584355.62 frames. ], batch size: 112, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:36:28,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=280512.8333333333, ans=0.125 2024-06-20 22:36:46,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. 
limit=15.0 2024-06-20 22:36:53,052 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.978e+02 2.161e+02 2.436e+02 3.055e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-20 22:36:55,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=280549.5, ans=0.2 2024-06-20 22:37:00,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=280567.8333333333, ans=0.125 2024-06-20 22:37:02,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=280567.8333333333, ans=0.0 2024-06-20 22:37:03,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280567.8333333333, ans=0.1 2024-06-20 22:37:06,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=280586.1666666667, ans=0.125 2024-06-20 22:37:12,253 INFO [train.py:1028] (0/2) Epoch 16, batch 1300, loss[loss=0.2391, simple_loss=0.2785, pruned_loss=0.09986, over 12711.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2684, pruned_loss=0.08303, over 2584974.94 frames. ], batch size: 176, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:37:13,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=280604.5, ans=0.125 2024-06-20 22:37:28,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=280641.1666666667, ans=0.125 2024-06-20 22:37:36,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280659.5, ans=0.0 2024-06-20 22:37:39,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=280659.5, ans=0.125 2024-06-20 22:37:41,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2024-06-20 22:37:51,605 INFO [train.py:1028] (0/2) Epoch 16, batch 1350, loss[loss=0.2139, simple_loss=0.2678, pruned_loss=0.07999, over 13173.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2682, pruned_loss=0.08257, over 2588004.41 frames. ], batch size: 59, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:37:59,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.75 vs. 
limit=22.5 2024-06-20 22:38:10,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.836e+02 2.013e+02 2.171e+02 2.458e+02 3.108e+02, threshold=4.342e+02, percent-clipped=0.0 2024-06-20 22:38:13,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=280732.8333333333, ans=0.125 2024-06-20 22:38:20,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=280751.1666666667, ans=0.125 2024-06-20 22:38:22,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=280769.5, ans=0.09899494936611666 2024-06-20 22:38:29,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=280769.5, ans=0.0 2024-06-20 22:38:30,755 INFO [train.py:1028] (0/2) Epoch 16, batch 1400, loss[loss=0.2069, simple_loss=0.2622, pruned_loss=0.07575, over 12902.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2681, pruned_loss=0.08276, over 2588681.31 frames. ], batch size: 26, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:38:41,857 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.94 vs. limit=15.0 2024-06-20 22:38:42,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.16 vs. limit=10.0 2024-06-20 22:38:45,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=280824.5, ans=0.04949747468305833 2024-06-20 22:39:15,806 INFO [train.py:1028] (0/2) Epoch 16, batch 1450, loss[loss=0.2048, simple_loss=0.2563, pruned_loss=0.0767, over 13087.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2675, pruned_loss=0.08247, over 2588734.29 frames. ], batch size: 121, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:39:21,806 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.36 vs. limit=22.5 2024-06-20 22:39:25,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280897.8333333333, ans=0.1 2024-06-20 22:39:27,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.34 vs. limit=15.0 2024-06-20 22:39:34,123 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 1.951e+02 2.095e+02 2.280e+02 3.323e+02, threshold=4.190e+02, percent-clipped=0.0 2024-06-20 22:39:43,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=280934.5, ans=0.125 2024-06-20 22:39:53,905 INFO [train.py:1028] (0/2) Epoch 16, batch 1500, loss[loss=0.2128, simple_loss=0.2555, pruned_loss=0.08504, over 13231.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2672, pruned_loss=0.08265, over 2590568.63 frames. 
], batch size: 83, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:40:00,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=280971.1666666667, ans=0.0 2024-06-20 22:40:03,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280989.5, ans=0.0 2024-06-20 22:40:25,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=15.0 2024-06-20 22:40:31,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=281044.5, ans=0.2 2024-06-20 22:40:31,338 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=12.0 2024-06-20 22:40:32,560 INFO [train.py:1028] (0/2) Epoch 16, batch 1550, loss[loss=0.22, simple_loss=0.2671, pruned_loss=0.08644, over 13131.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2679, pruned_loss=0.08297, over 2585766.64 frames. ], batch size: 103, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:40:38,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=281062.8333333333, ans=0.125 2024-06-20 22:40:51,157 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.003e+02 2.141e+02 2.380e+02 3.248e+02, threshold=4.282e+02, percent-clipped=0.0 2024-06-20 22:41:02,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=281136.1666666667, ans=0.1 2024-06-20 22:41:10,915 INFO [train.py:1028] (0/2) Epoch 16, batch 1600, loss[loss=0.2087, simple_loss=0.2582, pruned_loss=0.07956, over 13131.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2677, pruned_loss=0.08259, over 2580635.14 frames. ], batch size: 77, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:41:12,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.58 vs. limit=15.0 2024-06-20 22:41:23,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=281154.5, ans=0.125 2024-06-20 22:41:24,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=15.0 2024-06-20 22:41:40,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=281209.5, ans=0.04949747468305833 2024-06-20 22:41:56,734 INFO [train.py:1028] (0/2) Epoch 16, batch 1650, loss[loss=0.2107, simple_loss=0.2628, pruned_loss=0.07926, over 13153.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2681, pruned_loss=0.08286, over 2577272.03 frames. ], batch size: 95, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:42:02,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.36 vs. 
2024-06-20 22:42:03,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=281246.1666666667, ans=0.2 2024-06-20 22:42:09,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=281264.5, ans=0.04949747468305833 2024-06-20 22:42:15,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 1.962e+02 2.088e+02 2.289e+02 2.758e+02, threshold=4.175e+02, percent-clipped=0.0 2024-06-20 22:42:20,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=281301.1666666667, ans=0.125 2024-06-20 22:42:21,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=281301.1666666667, ans=0.125 2024-06-20 22:42:29,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281319.5, ans=0.1 2024-06-20 22:42:33,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=281319.5, ans=0.125 2024-06-20 22:42:35,816 INFO [train.py:1028] (0/2) Epoch 16, batch 1700, loss[loss=0.2285, simple_loss=0.2811, pruned_loss=0.08797, over 12402.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2681, pruned_loss=0.0828, over 2581790.49 frames. ], batch size: 25, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:43:02,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=281392.8333333333, ans=0.125 2024-06-20 22:43:03,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2024-06-20 22:43:10,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=281411.1666666667, ans=0.0 2024-06-20 22:43:12,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=281411.1666666667, ans=0.0 2024-06-20 22:43:13,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=281411.1666666667, ans=0.0 2024-06-20 22:43:14,364 INFO [train.py:1028] (0/2) Epoch 16, batch 1750, loss[loss=0.2147, simple_loss=0.2727, pruned_loss=0.07833, over 12676.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2683, pruned_loss=0.08278, over 2583162.76 frames. ], batch size: 22, lr: 3.69e-03, grad_scale: 128.0 2024-06-20 22:43:20,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=281429.5, ans=0.2 2024-06-20 22:43:21,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.92 vs. limit=15.0 2024-06-20 22:43:23,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=281447.8333333333, ans=0.0 2024-06-20 22:43:25,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.11 vs. 
limit=12.0 2024-06-20 22:43:33,185 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.936e+02 2.059e+02 2.202e+02 2.965e+02, threshold=4.119e+02, percent-clipped=0.0 2024-06-20 22:43:34,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=281466.1666666667, ans=0.0 2024-06-20 22:43:35,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=281466.1666666667, ans=0.2 2024-06-20 22:43:37,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=281484.5, ans=0.125 2024-06-20 22:43:39,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=281484.5, ans=0.2 2024-06-20 22:43:39,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.87 vs. limit=22.5 2024-06-20 22:43:41,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=281484.5, ans=0.025 2024-06-20 22:43:59,478 INFO [train.py:1028] (0/2) Epoch 16, batch 1800, loss[loss=0.21, simple_loss=0.2691, pruned_loss=0.07546, over 13248.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2694, pruned_loss=0.08317, over 2583255.09 frames. ], batch size: 67, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:44:08,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2024-06-20 22:44:31,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=281594.5, ans=0.125 2024-06-20 22:44:31,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=281594.5, ans=0.125 2024-06-20 22:44:35,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=281594.5, ans=0.125 2024-06-20 22:44:38,651 INFO [train.py:1028] (0/2) Epoch 16, batch 1850, loss[loss=0.213, simple_loss=0.2654, pruned_loss=0.08026, over 13195.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2694, pruned_loss=0.0833, over 2584087.35 frames. ], batch size: 83, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:44:44,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281612.8333333333, ans=0.1 2024-06-20 22:44:52,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=281631.1666666667, ans=0.0 2024-06-20 22:44:57,158 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.958e+02 2.098e+02 2.289e+02 2.903e+02, threshold=4.195e+02, percent-clipped=0.0 2024-06-20 22:44:59,816 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:45:01,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281667.8333333333, ans=0.1 2024-06-20 22:45:13,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.37 vs. 
limit=22.5 2024-06-20 22:45:17,174 INFO [train.py:1028] (0/2) Epoch 16, batch 1900, loss[loss=0.2158, simple_loss=0.2648, pruned_loss=0.08338, over 13110.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2687, pruned_loss=0.08313, over 2586387.89 frames. ], batch size: 95, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:45:22,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=281704.5, ans=0.125 2024-06-20 22:45:37,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=281741.1666666667, ans=0.125 2024-06-20 22:45:56,233 INFO [train.py:1028] (0/2) Epoch 16, batch 1950, loss[loss=0.2088, simple_loss=0.2666, pruned_loss=0.07543, over 13246.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2681, pruned_loss=0.08292, over 2591419.37 frames. ], batch size: 52, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:46:07,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=281814.5, ans=0.0 2024-06-20 22:46:10,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=281814.5, ans=0.0 2024-06-20 22:46:18,292 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 1.887e+02 1.993e+02 2.148e+02 2.965e+02, threshold=3.985e+02, percent-clipped=0.0 2024-06-20 22:46:18,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=281832.8333333333, ans=0.125 2024-06-20 22:46:20,566 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:46:21,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=281832.8333333333, ans=0.025 2024-06-20 22:46:24,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=281851.1666666667, ans=0.125 2024-06-20 22:46:26,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281851.1666666667, ans=0.1 2024-06-20 22:46:28,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=281851.1666666667, ans=0.125 2024-06-20 22:46:30,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=281869.5, ans=0.125 2024-06-20 22:46:34,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281869.5, ans=0.1 2024-06-20 22:46:34,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=281869.5, ans=0.125 2024-06-20 22:46:38,321 INFO [train.py:1028] (0/2) Epoch 16, batch 2000, loss[loss=0.2213, simple_loss=0.2762, pruned_loss=0.08323, over 12498.00 frames. ], tot_loss[loss=0.217, simple_loss=0.268, pruned_loss=0.08298, over 2587271.17 frames. ], batch size: 22, lr: 3.68e-03, grad_scale: 128.0
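The optim.py:487 WARNING lines summarize recent gradient norms as five quantiles (min, 25%, median, 75%, max) together with the clipping threshold in effect; note the logged threshold is approximately Clipping_scale=2.0 times the logged median (e.g. threshold=3.985e+02 against a median of 1.993e+02 just above), and percent-clipped reports how often clipping actually fired, 0.0 throughout this span. A rough sketch of such median-based clipping, assuming a simple sliding window of per-step norms; this is a placeholder class, not icefall's actual optimizer code:

    import torch

    class MedianGradClipper:
        # Sketch: clip the total grad norm to clipping_scale * median of
        # the most recent `window` norms, and report the quantiles.
        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms = []

        def clip_(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            self.norms = (self.norms + [norm.item()])[-self.window:]
            hist = torch.tensor(self.norms)
            quantiles = torch.quantile(
                hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * quantiles[2].item()  # 2 * median
            if norm.item() > threshold:
                for p in params:
                    p.grad.mul_(threshold / norm.item())
            return quantiles, threshold

Deriving the threshold from the running median, rather than from a fixed constant, keeps the clipping criterion meaningful as the typical gradient magnitude drifts over training.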
2024-06-20 22:46:48,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=281906.1666666667, ans=0.0 2024-06-20 22:47:06,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=281942.8333333333, ans=0.125 2024-06-20 22:47:18,161 INFO [train.py:1028] (0/2) Epoch 16, batch 2050, loss[loss=0.2442, simple_loss=0.2883, pruned_loss=0.1, over 12559.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2684, pruned_loss=0.08337, over 2582734.60 frames. ], batch size: 29, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:47:30,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=281997.8333333333, ans=10.0 2024-06-20 22:47:31,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=281997.8333333333, ans=0.0 2024-06-20 22:47:32,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=282016.1666666667, ans=0.125 2024-06-20 22:47:36,513 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.927e+02 2.045e+02 2.194e+02 2.964e+02, threshold=4.090e+02, percent-clipped=0.0 2024-06-20 22:47:39,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.41 vs. limit=22.5 2024-06-20 22:47:40,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.38 vs. limit=15.0 2024-06-20 22:47:56,416 INFO [train.py:1028] (0/2) Epoch 16, batch 2100, loss[loss=0.2087, simple_loss=0.2585, pruned_loss=0.07949, over 13195.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2689, pruned_loss=0.08358, over 2584587.39 frames. ], batch size: 59, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:48:08,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=282089.5, ans=0.0 2024-06-20 22:48:12,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=282107.8333333333, ans=0.125 2024-06-20 22:48:17,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=282107.8333333333, ans=0.0 2024-06-20 22:48:18,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=282107.8333333333, ans=0.125 2024-06-20 22:48:21,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2024-06-20 22:48:41,644 INFO [train.py:1028] (0/2) Epoch 16, batch 2150, loss[loss=0.2178, simple_loss=0.2738, pruned_loss=0.08089, over 13206.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2685, pruned_loss=0.08304, over 2587375.73 frames. 
], batch size: 52, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:48:48,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=282162.8333333333, ans=0.0 2024-06-20 22:48:58,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=282199.5, ans=0.0 2024-06-20 22:49:01,676 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.980e+02 2.103e+02 2.307e+02 2.793e+02, threshold=4.207e+02, percent-clipped=0.0 2024-06-20 22:49:03,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=282199.5, ans=0.0 2024-06-20 22:49:13,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=282236.1666666667, ans=0.125 2024-06-20 22:49:18,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=282236.1666666667, ans=0.125 2024-06-20 22:49:18,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=282236.1666666667, ans=0.125 2024-06-20 22:49:21,661 INFO [train.py:1028] (0/2) Epoch 16, batch 2200, loss[loss=0.2077, simple_loss=0.2573, pruned_loss=0.07904, over 13140.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2689, pruned_loss=0.08317, over 2587503.73 frames. ], batch size: 83, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:49:32,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=282272.8333333333, ans=0.07 2024-06-20 22:49:36,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=282272.8333333333, ans=0.125 2024-06-20 22:49:43,331 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:49:46,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=282309.5, ans=0.125 2024-06-20 22:49:51,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=282309.5, ans=0.2 2024-06-20 22:49:56,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=282327.8333333333, ans=0.2 2024-06-20 22:50:00,409 INFO [train.py:1028] (0/2) Epoch 16, batch 2250, loss[loss=0.2182, simple_loss=0.279, pruned_loss=0.07871, over 13289.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2687, pruned_loss=0.08324, over 2585814.37 frames. 
], batch size: 63, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:50:06,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=282346.1666666667, ans=0.125 2024-06-20 22:50:18,844 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.940e+02 2.044e+02 2.193e+02 2.971e+02, threshold=4.089e+02, percent-clipped=0.0 2024-06-20 22:50:23,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=282401.1666666667, ans=0.2 2024-06-20 22:50:31,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=282419.5, ans=0.125 2024-06-20 22:50:39,141 INFO [train.py:1028] (0/2) Epoch 16, batch 2300, loss[loss=0.2106, simple_loss=0.2655, pruned_loss=0.07785, over 12936.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.269, pruned_loss=0.08325, over 2580449.04 frames. ], batch size: 33, lr: 3.68e-03, grad_scale: 128.0 2024-06-20 22:50:52,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=282456.1666666667, ans=0.5 2024-06-20 22:50:56,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.59 vs. limit=15.0 2024-06-20 22:51:05,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=282474.5, ans=0.1 2024-06-20 22:51:17,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2024-06-20 22:51:24,596 INFO [train.py:1028] (0/2) Epoch 16, batch 2350, loss[loss=0.1965, simple_loss=0.2551, pruned_loss=0.06893, over 13194.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2684, pruned_loss=0.08282, over 2585236.48 frames. ], batch size: 67, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:51:27,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=282529.5, ans=0.2 2024-06-20 22:51:29,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=282529.5, ans=0.1 2024-06-20 22:51:43,511 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 1.956e+02 2.107e+02 2.315e+02 2.717e+02, threshold=4.214e+02, percent-clipped=0.0 2024-06-20 22:51:49,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=282584.5, ans=0.125 2024-06-20 22:51:59,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.06 vs. limit=15.0 2024-06-20 22:52:01,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=282602.8333333333, ans=0.125 2024-06-20 22:52:02,623 INFO [train.py:1028] (0/2) Epoch 16, batch 2400, loss[loss=0.2173, simple_loss=0.267, pruned_loss=0.08378, over 13298.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2678, pruned_loss=0.08276, over 2588041.31 frames. ], batch size: 46, lr: 3.68e-03, grad_scale: 64.0
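grad_scale in the batch lines is the dynamic loss-scaling factor for the mixed-precision (fp16) training used in this run; above it drops from 128.0 to 64.0 at batch 2350, which is the standard reaction of a dynamic scaler to a batch whose scaled gradients overflowed. A minimal sketch with PyTorch's stock GradScaler; the tiny model and random batches below are stand-ins (requires a CUDA device, as in this run), and icefall's real loop in train.py is considerably more involved:

    import torch

    model = torch.nn.Linear(10, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(init_scale=128.0)

    for _ in range(10):                      # stand-in for the real dataloader
        x = torch.randn(8, 10, device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(x).pow(2).mean()    # stand-in for the transducer loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # the step is skipped if grads contain inf/NaN
        scaler.update()          # halves the scale after an overflow and
                                 # slowly grows it back after clean steps
    print(scaler.get_scale())

The later drop to 32.0 visible further down in this log is the same mechanism firing again.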
2024-06-20 22:52:09,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=282639.5, ans=0.125 2024-06-20 22:52:14,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=282639.5, ans=0.2 2024-06-20 22:52:20,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=282657.8333333333, ans=0.125 2024-06-20 22:52:26,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=282676.1666666667, ans=0.125 2024-06-20 22:52:27,986 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0 2024-06-20 22:52:33,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=282694.5, ans=0.0 2024-06-20 22:52:40,517 INFO [train.py:1028] (0/2) Epoch 16, batch 2450, loss[loss=0.1979, simple_loss=0.2499, pruned_loss=0.07299, over 13302.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2668, pruned_loss=0.08278, over 2585245.07 frames. ], batch size: 63, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:52:40,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=282712.8333333333, ans=0.125 2024-06-20 22:52:52,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282731.1666666667, ans=0.125 2024-06-20 22:52:55,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=282749.5, ans=0.0 2024-06-20 22:52:56,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=282749.5, ans=0.125 2024-06-20 22:53:00,126 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.941e+02 2.054e+02 2.219e+02 2.866e+02, threshold=4.109e+02, percent-clipped=0.0 2024-06-20 22:53:21,679 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 22:53:25,627 INFO [train.py:1028] (0/2) Epoch 16, batch 2500, loss[loss=0.192, simple_loss=0.2458, pruned_loss=0.06914, over 13255.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2664, pruned_loss=0.0827, over 2587927.14 frames. ], batch size: 83, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:53:30,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=282804.5, ans=0.0 2024-06-20 22:53:36,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=15.0 2024-06-20 22:53:36,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=282822.8333333333, ans=0.125 2024-06-20 22:53:49,030 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.54 vs. 
limit=15.0 2024-06-20 22:53:49,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=282859.5, ans=0.0 2024-06-20 22:53:53,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=282859.5, ans=0.025 2024-06-20 22:54:01,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.03 vs. limit=10.0 2024-06-20 22:54:03,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.06 vs. limit=15.0 2024-06-20 22:54:04,691 INFO [train.py:1028] (0/2) Epoch 16, batch 2550, loss[loss=0.2125, simple_loss=0.2658, pruned_loss=0.07957, over 12614.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2652, pruned_loss=0.08243, over 2586713.68 frames. ], batch size: 22, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:54:04,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=282896.1666666667, ans=0.125 2024-06-20 22:54:23,084 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 1.880e+02 2.034e+02 2.239e+02 2.956e+02, threshold=4.068e+02, percent-clipped=0.0 2024-06-20 22:54:36,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282969.5, ans=0.125 2024-06-20 22:54:37,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.53 vs. limit=22.5 2024-06-20 22:54:38,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=282969.5, ans=0.0 2024-06-20 22:54:42,532 INFO [train.py:1028] (0/2) Epoch 16, batch 2600, loss[loss=0.2059, simple_loss=0.2525, pruned_loss=0.07965, over 13348.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2635, pruned_loss=0.08205, over 2586775.51 frames. ], batch size: 52, lr: 3.68e-03, grad_scale: 64.0 2024-06-20 22:55:14,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.81 vs. limit=22.5 2024-06-20 22:55:21,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=283061.1666666667, ans=0.125 2024-06-20 22:55:27,562 INFO [train.py:1028] (0/2) Epoch 16, batch 2650, loss[loss=0.2045, simple_loss=0.2503, pruned_loss=0.07935, over 13011.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2622, pruned_loss=0.08155, over 2586402.83 frames. ], batch size: 144, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:55:43,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=283116.1666666667, ans=0.5 2024-06-20 22:55:46,653 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.909e+02 2.053e+02 2.279e+02 2.896e+02, threshold=4.106e+02, percent-clipped=0.0 2024-06-20 22:56:06,509 INFO [train.py:1028] (0/2) Epoch 16, batch 2700, loss[loss=0.2139, simple_loss=0.2655, pruned_loss=0.08114, over 13253.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2605, pruned_loss=0.08108, over 2584194.76 frames. 
], batch size: 89, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:56:12,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=283171.1666666667, ans=0.1 2024-06-20 22:56:31,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0 2024-06-20 22:56:38,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=283244.5, ans=0.0 2024-06-20 22:56:43,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=283244.5, ans=0.1 2024-06-20 22:56:46,047 INFO [train.py:1028] (0/2) Epoch 16, batch 2750, loss[loss=0.2185, simple_loss=0.2693, pruned_loss=0.08387, over 13261.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2593, pruned_loss=0.08021, over 2580534.66 frames. ], batch size: 43, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:56:57,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=283281.1666666667, ans=0.125 2024-06-20 22:56:58,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=283281.1666666667, ans=0.0 2024-06-20 22:57:02,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.41 vs. limit=15.0 2024-06-20 22:57:05,725 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.888e+02 1.989e+02 2.105e+02 3.553e+02, threshold=3.979e+02, percent-clipped=0.0 2024-06-20 22:57:06,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=283299.5, ans=0.1 2024-06-20 22:57:10,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=283317.8333333333, ans=0.0 2024-06-20 22:57:31,829 INFO [train.py:1028] (0/2) Epoch 16, batch 2800, loss[loss=0.2149, simple_loss=0.2546, pruned_loss=0.08764, over 10819.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2589, pruned_loss=0.08022, over 2578876.71 frames. ], batch size: 304, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:57:39,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283372.8333333333, ans=0.125 2024-06-20 22:57:43,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=283372.8333333333, ans=0.0 2024-06-20 22:57:44,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=283372.8333333333, ans=0.125 2024-06-20 22:57:51,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. 
limit=6.0 2024-06-20 22:57:59,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=283409.5, ans=0.0 2024-06-20 22:58:05,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=283427.8333333333, ans=0.125 2024-06-20 22:58:09,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=283446.1666666667, ans=0.125 2024-06-20 22:58:10,056 INFO [train.py:1028] (0/2) Epoch 16, batch 2850, loss[loss=0.2196, simple_loss=0.2722, pruned_loss=0.08354, over 13326.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2584, pruned_loss=0.08023, over 2576619.81 frames. ], batch size: 49, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:58:16,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=283464.5, ans=0.0 2024-06-20 22:58:28,778 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 1.881e+02 2.008e+02 2.172e+02 2.782e+02, threshold=4.016e+02, percent-clipped=0.0 2024-06-20 22:58:35,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=283501.1666666667, ans=0.0 2024-06-20 22:58:41,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=283519.5, ans=0.0 2024-06-20 22:58:47,686 INFO [train.py:1028] (0/2) Epoch 16, batch 2900, loss[loss=0.2152, simple_loss=0.2636, pruned_loss=0.08341, over 13141.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2567, pruned_loss=0.07962, over 2584889.39 frames. ], batch size: 55, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:58:52,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=283537.8333333333, ans=0.0 2024-06-20 22:59:27,256 INFO [train.py:1028] (0/2) Epoch 16, batch 2950, loss[loss=0.1694, simple_loss=0.2206, pruned_loss=0.05906, over 13279.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2568, pruned_loss=0.07953, over 2578935.38 frames. ], batch size: 43, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 22:59:35,956 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2024-06-20 22:59:37,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=283647.8333333333, ans=0.0 2024-06-20 22:59:54,351 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.856e+02 1.969e+02 2.113e+02 2.922e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-20 22:59:54,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283666.1666666667, ans=0.0 2024-06-20 23:00:02,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=283684.5, ans=0.0 2024-06-20 23:00:04,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=283684.5, ans=0.09899494936611666 2024-06-20 23:00:06,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.51 vs. 
limit=22.5 2024-06-20 23:00:14,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=283721.1666666667, ans=10.0 2024-06-20 23:00:15,018 INFO [train.py:1028] (0/2) Epoch 16, batch 3000, loss[loss=0.2226, simple_loss=0.2742, pruned_loss=0.08554, over 13139.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2556, pruned_loss=0.07899, over 2577561.27 frames. ], batch size: 59, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:00:15,019 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 23:00:24,218 INFO [train.py:1060] (0/2) Epoch 16, validation: loss=0.1882, simple_loss=0.2529, pruned_loss=0.06175, over 351949.00 frames. 2024-06-20 23:00:24,219 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 23:00:30,965 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:00:32,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=283739.5, ans=0.125 2024-06-20 23:00:34,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=283739.5, ans=0.0 2024-06-20 23:00:55,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2024-06-20 23:00:57,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=283794.5, ans=0.2 2024-06-20 23:00:58,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=283794.5, ans=0.125 2024-06-20 23:01:01,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=283794.5, ans=0.125 2024-06-20 23:01:02,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=283794.5, ans=0.125 2024-06-20 23:01:05,810 INFO [train.py:1028] (0/2) Epoch 16, batch 3050, loss[loss=0.2211, simple_loss=0.2735, pruned_loss=0.08434, over 13229.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2551, pruned_loss=0.0792, over 2578594.61 frames. ], batch size: 46, lr: 3.67e-03, grad_scale: 64.0
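The train.py:1051/1060/1061 entries above (at batch 3000) interleave a validation pass with training: the reported dev loss is an average over the full dev set, weighted by frame count (hence the constant "over 351949.00 frames" for every validation line in this log), computed with gradients disabled. A bare-bones sketch of that reduction; compute_loss is a placeholder returning a summed loss and a frame count per batch, not icefall's actual function:

    import torch

    def validate(model, valid_loader, compute_loss):
        # Frame-weighted average of the loss over the whole dev set.
        was_training = model.training
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss_sum, num_frames = compute_loss(model, batch)
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        if was_training:
            model.train()
        return tot_loss / tot_frames

Weighting by frames rather than by batch keeps the dev loss comparable across validation runs even though batch sizes vary, as they visibly do throughout this log.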
2024-06-20 23:01:11,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=283812.8333333333, ans=0.125 2024-06-20 23:01:13,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=283831.1666666667, ans=0.025 2024-06-20 23:01:22,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283849.5, ans=0.1 2024-06-20 23:01:22,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=283849.5, ans=0.0 2024-06-20 23:01:25,568 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.668e+02 1.874e+02 1.952e+02 2.178e+02 2.885e+02, threshold=3.905e+02, percent-clipped=0.0 2024-06-20 23:01:30,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=283867.8333333333, ans=0.0 2024-06-20 23:01:34,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=283867.8333333333, ans=0.125 2024-06-20 23:01:42,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=283886.1666666667, ans=0.09899494936611666 2024-06-20 23:01:42,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=283886.1666666667, ans=0.0 2024-06-20 23:01:45,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.67 vs. limit=15.0 2024-06-20 23:01:45,412 INFO [train.py:1028] (0/2) Epoch 16, batch 3100, loss[loss=0.2043, simple_loss=0.2499, pruned_loss=0.07929, over 13015.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2537, pruned_loss=0.07851, over 2579460.15 frames. ], batch size: 144, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:01:49,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.70 vs. limit=12.0 2024-06-20 23:01:59,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=283922.8333333333, ans=0.0 2024-06-20 23:02:06,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=283941.1666666667, ans=0.0 2024-06-20 23:02:11,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=283941.1666666667, ans=0.125 2024-06-20 23:02:14,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=283959.5, ans=0.125 2024-06-20 23:02:22,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=283959.5, ans=0.125 2024-06-20 23:02:32,493 INFO [train.py:1028] (0/2) Epoch 16, batch 3150, loss[loss=0.2063, simple_loss=0.2519, pruned_loss=0.08039, over 12969.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2538, pruned_loss=0.07866, over 2581590.37 frames. 
], batch size: 158, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:02:34,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=283996.1666666667, ans=0.025 2024-06-20 23:02:36,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=283996.1666666667, ans=0.125 2024-06-20 23:02:53,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 1.862e+02 1.971e+02 2.114e+02 2.636e+02, threshold=3.942e+02, percent-clipped=0.0 2024-06-20 23:02:59,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=284051.1666666667, ans=0.125 2024-06-20 23:02:59,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=284051.1666666667, ans=0.2 2024-06-20 23:03:13,378 INFO [train.py:1028] (0/2) Epoch 16, batch 3200, loss[loss=0.1931, simple_loss=0.2417, pruned_loss=0.07224, over 13102.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2538, pruned_loss=0.07845, over 2582150.12 frames. ], batch size: 55, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:03:24,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=284106.1666666667, ans=0.2 2024-06-20 23:03:44,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=284142.8333333333, ans=0.5 2024-06-20 23:03:47,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284161.1666666667, ans=0.1 2024-06-20 23:03:53,501 INFO [train.py:1028] (0/2) Epoch 16, batch 3250, loss[loss=0.1893, simple_loss=0.2448, pruned_loss=0.06696, over 13218.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2525, pruned_loss=0.07796, over 2586672.76 frames. ], batch size: 72, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:04:13,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=284216.1666666667, ans=0.0 2024-06-20 23:04:14,227 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.914e+02 2.049e+02 2.274e+02 2.848e+02, threshold=4.099e+02, percent-clipped=0.0 2024-06-20 23:04:18,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=284234.5, ans=0.025 2024-06-20 23:04:25,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=15.0 2024-06-20 23:04:36,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=284252.8333333333, ans=0.125 2024-06-20 23:04:37,843 INFO [train.py:1028] (0/2) Epoch 16, batch 3300, loss[loss=0.2197, simple_loss=0.2613, pruned_loss=0.08908, over 12771.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2521, pruned_loss=0.07765, over 2583652.86 frames. 
], batch size: 176, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:05:02,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=284307.8333333333, ans=0.125 2024-06-20 23:05:07,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=284326.1666666667, ans=0.1 2024-06-20 23:05:11,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=284326.1666666667, ans=0.125 2024-06-20 23:05:13,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=284344.5, ans=0.07 2024-06-20 23:05:20,854 INFO [train.py:1028] (0/2) Epoch 16, batch 3350, loss[loss=0.1931, simple_loss=0.2403, pruned_loss=0.07291, over 12948.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2514, pruned_loss=0.07776, over 2578185.95 frames. ], batch size: 158, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:05:29,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.63 vs. limit=15.0 2024-06-20 23:05:40,702 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.851e+02 1.984e+02 2.117e+02 2.810e+02, threshold=3.969e+02, percent-clipped=0.0 2024-06-20 23:05:46,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=284417.8333333333, ans=0.2 2024-06-20 23:05:48,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=284417.8333333333, ans=0.125 2024-06-20 23:05:50,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=284417.8333333333, ans=0.0 2024-06-20 23:05:57,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-20 23:06:00,850 INFO [train.py:1028] (0/2) Epoch 16, batch 3400, loss[loss=0.2016, simple_loss=0.2572, pruned_loss=0.07302, over 12707.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2509, pruned_loss=0.0776, over 2576224.98 frames. ], batch size: 22, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:06:02,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=284454.5, ans=0.0 2024-06-20 23:06:09,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=17.87 vs. 
limit=15.0 2024-06-20 23:06:11,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=284472.8333333333, ans=0.025 2024-06-20 23:06:24,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=284491.1666666667, ans=0.07 2024-06-20 23:06:27,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=284509.5, ans=0.125 2024-06-20 23:06:37,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=284527.8333333333, ans=0.0 2024-06-20 23:06:39,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=284527.8333333333, ans=0.0 2024-06-20 23:06:40,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=284527.8333333333, ans=0.0 2024-06-20 23:06:41,594 INFO [train.py:1028] (0/2) Epoch 16, batch 3450, loss[loss=0.2112, simple_loss=0.2572, pruned_loss=0.0826, over 12744.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2503, pruned_loss=0.07726, over 2577240.04 frames. ], batch size: 176, lr: 3.67e-03, grad_scale: 64.0 2024-06-20 23:06:52,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284564.5, ans=0.1 2024-06-20 23:07:08,979 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 1.847e+02 1.964e+02 2.134e+02 2.665e+02, threshold=3.928e+02, percent-clipped=0.0 2024-06-20 23:07:12,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=284601.1666666667, ans=0.0 2024-06-20 23:07:13,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.16 vs. limit=15.0 2024-06-20 23:07:18,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=284601.1666666667, ans=0.125 2024-06-20 23:07:18,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=284601.1666666667, ans=0.0 2024-06-20 23:07:20,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=284619.5, ans=0.125 2024-06-20 23:07:29,353 INFO [train.py:1028] (0/2) Epoch 16, batch 3500, loss[loss=0.2244, simple_loss=0.2726, pruned_loss=0.08811, over 12870.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2502, pruned_loss=0.07703, over 2574382.77 frames. ], batch size: 33, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:07:41,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.66 vs. limit=15.0 2024-06-20 23:07:46,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=284674.5, ans=0.2 2024-06-20 23:07:46,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=284674.5, ans=0.1 2024-06-20 23:08:10,066 INFO [train.py:1028] (0/2) Epoch 16, batch 3550, loss[loss=0.1935, simple_loss=0.2465, pruned_loss=0.07026, over 13149.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2497, pruned_loss=0.0769, over 2575565.12 frames. ], batch size: 95, lr: 3.66e-03, grad_scale: 64.0
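The scaling.py:1023 Whitening entries fire when a module's feature-covariance statistic exceeds its configured limit (e.g. metric=16.66 vs. limit=15.0 above); the module then nudges its activations back toward a "white", isotropic-covariance distribution. One plausible form of the logged metric, offered as a sketch rather than icefall's exact code, is the mean squared eigenvalue of the per-group covariance divided by the squared mean eigenvalue: it equals 1.0 for perfectly white features and grows as the spectrum becomes lopsided, consistent with the values logged here:

    import torch

    def whitening_metric(x, num_groups=1):
        # x: (num_frames, num_channels). Returns a scalar >= 1 measuring how
        # far each group's feature covariance is from a multiple of identity.
        n, c = x.shape
        d = c // num_groups
        xg = x.reshape(n, num_groups, d).transpose(0, 1)           # (groups, n, d)
        cov = xg.transpose(1, 2) @ xg / n                          # (groups, d, d)
        mean_eig = torch.diagonal(cov, dim1=1, dim2=2).mean(dim=1)  # trace/d
        mean_sq_eig = (cov * cov).sum(dim=(1, 2)) / d               # trace(cov^2)/d
        return (mean_sq_eig / mean_eig.pow(2).clamp(min=1e-20)).mean()

    # e.g. whitening_metric(torch.randn(1000, 512)) is about 1.5 for random
    # features; strongly correlated features push it far higher.

Only overshoots are logged, which is why every Whitening line in this span shows a metric at or above its limit.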
2024-06-20 23:08:27,969 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.842e+02 1.928e+02 2.077e+02 2.629e+02, threshold=3.856e+02, percent-clipped=0.0 2024-06-20 23:08:31,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=284784.5, ans=0.125 2024-06-20 23:08:38,888 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.53 vs. limit=22.5 2024-06-20 23:08:51,055 INFO [train.py:1028] (0/2) Epoch 16, batch 3600, loss[loss=0.183, simple_loss=0.2412, pruned_loss=0.06235, over 13024.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.249, pruned_loss=0.07676, over 2578485.78 frames. ], batch size: 48, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:08:59,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=284839.5, ans=0.125 2024-06-20 23:09:08,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.98 vs. limit=10.0 2024-06-20 23:09:30,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=284876.1666666667, ans=0.1 2024-06-20 23:09:30,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=284876.1666666667, ans=0.2 2024-06-20 23:09:36,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2024-06-20 23:09:39,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=284894.5, ans=0.125 2024-06-20 23:09:43,563 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:09:46,863 INFO [train.py:1028] (0/2) Epoch 16, batch 3650, loss[loss=0.1849, simple_loss=0.2337, pruned_loss=0.06801, over 12983.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2487, pruned_loss=0.07642, over 2577574.55 frames. ], batch size: 102, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:09:47,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.08 vs. 
limit=15.0 2024-06-20 23:09:58,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=284912.8333333333, ans=0.025 2024-06-20 23:10:07,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=284931.1666666667, ans=0.125 2024-06-20 23:10:11,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=284949.5, ans=0.125 2024-06-20 23:10:12,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=284949.5, ans=0.125 2024-06-20 23:10:15,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.552e+02 1.908e+02 2.060e+02 2.298e+02 3.353e+02, threshold=4.120e+02, percent-clipped=0.0 2024-06-20 23:10:28,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=284967.8333333333, ans=0.125 2024-06-20 23:10:31,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=284986.1666666667, ans=0.0 2024-06-20 23:10:40,577 INFO [train.py:1028] (0/2) Epoch 16, batch 3700, loss[loss=0.1737, simple_loss=0.2279, pruned_loss=0.05976, over 13243.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2476, pruned_loss=0.07579, over 2583677.94 frames. ], batch size: 72, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:10:43,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=285004.5, ans=0.125 2024-06-20 23:10:50,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=285022.8333333333, ans=0.125 2024-06-20 23:10:57,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=285022.8333333333, ans=0.0 2024-06-20 23:11:05,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=285041.1666666667, ans=0.125 2024-06-20 23:11:14,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=285059.5, ans=0.125 2024-06-20 23:11:14,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=285059.5, ans=0.125 2024-06-20 23:11:22,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=285077.8333333333, ans=0.025 2024-06-20 23:11:28,383 INFO [train.py:1028] (0/2) Epoch 16, batch 3750, loss[loss=0.2437, simple_loss=0.2836, pruned_loss=0.1019, over 12559.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2479, pruned_loss=0.07572, over 2586093.95 frames. ], batch size: 22, lr: 3.66e-03, grad_scale: 64.0 2024-06-20 23:11:50,457 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.98 vs. 
limit=6.0 2024-06-20 23:11:50,706 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.623e+02 1.804e+02 1.922e+02 2.061e+02 3.374e+02, threshold=3.845e+02, percent-clipped=0.0 2024-06-20 23:12:14,339 INFO [train.py:1028] (0/2) Epoch 16, batch 3800, loss[loss=0.2023, simple_loss=0.2455, pruned_loss=0.07961, over 13208.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2477, pruned_loss=0.07565, over 2583648.51 frames. ], batch size: 83, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:12:20,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285187.8333333333, ans=0.1 2024-06-20 23:12:20,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0 2024-06-20 23:12:38,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.17 vs. limit=22.5 2024-06-20 23:12:46,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=285224.5, ans=0.07 2024-06-20 23:12:57,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=285242.8333333333, ans=0.125 2024-06-20 23:13:01,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=285242.8333333333, ans=0.125 2024-06-20 23:13:02,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=285242.8333333333, ans=0.2 2024-06-20 23:13:14,200 INFO [train.py:1028] (0/2) Epoch 16, batch 3850, loss[loss=0.2137, simple_loss=0.26, pruned_loss=0.08375, over 13060.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2467, pruned_loss=0.07512, over 2582863.17 frames. ], batch size: 144, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:13:15,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=285279.5, ans=0.125 2024-06-20 23:13:26,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=285297.8333333333, ans=0.125 2024-06-20 23:13:27,897 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.99 vs. limit=10.0 2024-06-20 23:13:34,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.51 vs. 
limit=15.0 2024-06-20 23:13:38,057 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.537e+02 1.841e+02 1.938e+02 2.078e+02 2.531e+02, threshold=3.875e+02, percent-clipped=0.0 2024-06-20 23:13:43,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=285334.5, ans=0.025 2024-06-20 23:13:46,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=285334.5, ans=0.0 2024-06-20 23:13:59,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285371.1666666667, ans=0.1 2024-06-20 23:14:00,646 INFO [train.py:1028] (0/2) Epoch 16, batch 3900, loss[loss=0.1994, simple_loss=0.2417, pruned_loss=0.0785, over 13166.00 frames. ], tot_loss[loss=0.199, simple_loss=0.247, pruned_loss=0.07549, over 2586176.60 frames. ], batch size: 83, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:14:12,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=285389.5, ans=0.0 2024-06-20 23:14:18,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285407.8333333333, ans=0.1 2024-06-20 23:14:27,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=285407.8333333333, ans=0.125 2024-06-20 23:14:32,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=285426.1666666667, ans=0.015 2024-06-20 23:14:40,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=285444.5, ans=0.025 2024-06-20 23:14:41,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=285444.5, ans=0.0 2024-06-20 23:14:48,913 INFO [train.py:1028] (0/2) Epoch 16, batch 3950, loss[loss=0.196, simple_loss=0.2366, pruned_loss=0.07767, over 13114.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2461, pruned_loss=0.07483, over 2589030.69 frames. ], batch size: 132, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:15:08,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.72 vs. limit=10.0 2024-06-20 23:15:16,038 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 1.861e+02 1.991e+02 2.171e+02 2.653e+02, threshold=3.982e+02, percent-clipped=0.0 2024-06-20 23:15:20,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=15.0 2024-06-20 23:15:25,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285517.8333333333, ans=0.1 2024-06-20 23:15:33,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285536.1666666667, ans=0.1 2024-06-20 23:15:46,861 INFO [train.py:1028] (0/2) Epoch 16, batch 4000, loss[loss=0.1833, simple_loss=0.2452, pruned_loss=0.06073, over 12911.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2459, pruned_loss=0.07502, over 2583136.58 frames. 
], batch size: 39, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:16:12,226 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-20 23:16:31,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=285627.8333333333, ans=0.0 2024-06-20 23:16:42,909 INFO [train.py:1028] (0/2) Epoch 16, batch 4050, loss[loss=0.1989, simple_loss=0.2342, pruned_loss=0.08183, over 11019.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2456, pruned_loss=0.07528, over 2580802.51 frames. ], batch size: 304, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:16:48,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=285646.1666666667, ans=10.0 2024-06-20 23:16:51,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=285646.1666666667, ans=0.125 2024-06-20 23:16:54,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0 2024-06-20 23:17:06,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=285682.8333333333, ans=0.125 2024-06-20 23:17:06,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=285682.8333333333, ans=0.2 2024-06-20 23:17:07,327 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.888e+02 1.987e+02 2.147e+02 2.586e+02, threshold=3.975e+02, percent-clipped=0.0 2024-06-20 23:17:09,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=285682.8333333333, ans=0.0 2024-06-20 23:17:30,637 INFO [train.py:1028] (0/2) Epoch 16, batch 4100, loss[loss=0.1902, simple_loss=0.233, pruned_loss=0.07374, over 13072.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2451, pruned_loss=0.07525, over 2577245.33 frames. ], batch size: 102, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:17:33,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.21 vs. 
limit=15.0 2024-06-20 23:17:38,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=285737.8333333333, ans=0.125 2024-06-20 23:17:42,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=285756.1666666667, ans=0.125 2024-06-20 23:17:43,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=285756.1666666667, ans=15.0 2024-06-20 23:17:45,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=285756.1666666667, ans=15.0 2024-06-20 23:17:48,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=285774.5, ans=0.125 2024-06-20 23:17:50,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=285774.5, ans=0.125 2024-06-20 23:17:59,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=285792.8333333333, ans=0.0 2024-06-20 23:18:02,122 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.19 vs. limit=10.0 2024-06-20 23:18:14,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=285811.1666666667, ans=0.2 2024-06-20 23:18:17,517 INFO [train.py:1028] (0/2) Epoch 16, batch 4150, loss[loss=0.2158, simple_loss=0.2592, pruned_loss=0.08622, over 13130.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2453, pruned_loss=0.07529, over 2577288.07 frames. ], batch size: 55, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:18:20,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285829.5, ans=0.1 2024-06-20 23:18:31,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=285847.8333333333, ans=0.125 2024-06-20 23:18:48,347 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.854e+02 1.960e+02 2.155e+02 3.018e+02, threshold=3.919e+02, percent-clipped=0.0 2024-06-20 23:19:13,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=285902.8333333333, ans=0.125 2024-06-20 23:19:17,823 INFO [train.py:1028] (0/2) Epoch 16, batch 4200, loss[loss=0.2024, simple_loss=0.2411, pruned_loss=0.08184, over 13139.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2452, pruned_loss=0.07534, over 2578873.78 frames. ], batch size: 103, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:19:23,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=285921.1666666667, ans=0.125 2024-06-20 23:19:34,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.96 vs. 
limit=15.0 2024-06-20 23:19:49,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=285976.1666666667, ans=10.0 2024-06-20 23:19:58,770 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-156000.pt 2024-06-20 23:20:05,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=285994.5, ans=0.2 2024-06-20 23:20:12,897 INFO [train.py:1028] (0/2) Epoch 16, batch 4250, loss[loss=0.2037, simple_loss=0.2554, pruned_loss=0.07595, over 13312.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2444, pruned_loss=0.07465, over 2581486.67 frames. ], batch size: 46, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:20:16,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=286012.8333333333, ans=0.125 2024-06-20 23:20:37,903 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.867e+02 1.996e+02 2.216e+02 2.847e+02, threshold=3.992e+02, percent-clipped=0.0 2024-06-20 23:20:38,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=286049.5, ans=0.0 2024-06-20 23:20:44,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.46 vs. limit=10.0 2024-06-20 23:20:45,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=286067.8333333333, ans=0.025 2024-06-20 23:21:00,993 INFO [train.py:1028] (0/2) Epoch 16, batch 4300, loss[loss=0.1894, simple_loss=0.2385, pruned_loss=0.0701, over 13182.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2443, pruned_loss=0.07453, over 2581506.73 frames. ], batch size: 59, lr: 3.66e-03, grad_scale: 32.0 2024-06-20 23:21:04,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=286104.5, ans=0.125 2024-06-20 23:21:17,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2024-06-20 23:21:23,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=286141.1666666667, ans=0.1 2024-06-20 23:21:24,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=286141.1666666667, ans=0.2 2024-06-20 23:21:28,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=286159.5, ans=0.2 2024-06-20 23:21:44,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=286177.8333333333, ans=0.0 2024-06-20 23:21:49,707 INFO [train.py:1028] (0/2) Epoch 16, batch 4350, loss[loss=0.1968, simple_loss=0.2438, pruned_loss=0.07485, over 13224.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2436, pruned_loss=0.07432, over 2585745.54 frames. 
], batch size: 59, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:21:52,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=286196.1666666667, ans=0.125 2024-06-20 23:22:12,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=286214.5, ans=0.025 2024-06-20 23:22:18,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=286232.8333333333, ans=0.125 2024-06-20 23:22:18,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=286232.8333333333, ans=0.125 2024-06-20 23:22:22,714 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.615e+02 1.864e+02 1.997e+02 2.170e+02 3.078e+02, threshold=3.995e+02, percent-clipped=0.0 2024-06-20 23:22:24,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=286232.8333333333, ans=6.0 2024-06-20 23:22:24,905 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:22:33,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286251.1666666667, ans=0.1 2024-06-20 23:22:40,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=286269.5, ans=0.125 2024-06-20 23:22:46,625 INFO [train.py:1028] (0/2) Epoch 16, batch 4400, loss[loss=0.2092, simple_loss=0.2571, pruned_loss=0.08068, over 13204.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2433, pruned_loss=0.07436, over 2586674.67 frames. ], batch size: 83, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:22:56,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=286306.1666666667, ans=0.2 2024-06-20 23:22:58,016 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.55 vs. limit=15.0 2024-06-20 23:23:11,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=286324.5, ans=0.125 2024-06-20 23:23:16,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286342.8333333333, ans=0.1 2024-06-20 23:23:22,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286342.8333333333, ans=0.1 2024-06-20 23:23:24,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=286361.1666666667, ans=0.125 2024-06-20 23:23:31,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=286361.1666666667, ans=0.0 2024-06-20 23:23:35,963 INFO [train.py:1028] (0/2) Epoch 16, batch 4450, loss[loss=0.1932, simple_loss=0.2498, pruned_loss=0.06827, over 12829.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2442, pruned_loss=0.07479, over 2581328.25 frames. 
], batch size: 33, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:23:37,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=286379.5, ans=0.07 2024-06-20 23:23:58,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286416.1666666667, ans=0.1 2024-06-20 23:24:00,557 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.644e+02 1.863e+02 2.029e+02 2.205e+02 2.881e+02, threshold=4.057e+02, percent-clipped=0.0 2024-06-20 23:24:11,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=286434.5, ans=0.0 2024-06-20 23:24:13,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286452.8333333333, ans=0.1 2024-06-20 23:24:20,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=286452.8333333333, ans=0.125 2024-06-20 23:24:23,592 INFO [train.py:1028] (0/2) Epoch 16, batch 4500, loss[loss=0.1841, simple_loss=0.2301, pruned_loss=0.06908, over 13228.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2435, pruned_loss=0.07445, over 2585234.87 frames. ], batch size: 89, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:24:45,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2024-06-20 23:25:07,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=286526.1666666667, ans=0.125 2024-06-20 23:25:23,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=286562.8333333333, ans=0.0 2024-06-20 23:25:24,230 INFO [train.py:1028] (0/2) Epoch 16, batch 4550, loss[loss=0.1944, simple_loss=0.2414, pruned_loss=0.07369, over 13283.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2438, pruned_loss=0.07475, over 2588949.08 frames. ], batch size: 52, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:25:36,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=286581.1666666667, ans=0.2 2024-06-20 23:25:39,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286581.1666666667, ans=0.1 2024-06-20 23:25:49,047 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 1.808e+02 1.933e+02 2.089e+02 2.767e+02, threshold=3.867e+02, percent-clipped=0.0 2024-06-20 23:26:00,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.05 vs. limit=22.5 2024-06-20 23:26:07,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=286636.1666666667, ans=0.125 2024-06-20 23:26:12,689 INFO [train.py:1028] (0/2) Epoch 16, batch 4600, loss[loss=0.2234, simple_loss=0.2665, pruned_loss=0.09019, over 12531.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2441, pruned_loss=0.0749, over 2584372.10 frames. 
], batch size: 202, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:26:16,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=286654.5, ans=0.0 2024-06-20 23:26:39,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=16.79 vs. limit=15.0 2024-06-20 23:26:46,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0 2024-06-20 23:26:50,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.18 vs. limit=15.0 2024-06-20 23:26:52,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=286727.8333333333, ans=0.0 2024-06-20 23:26:58,926 INFO [train.py:1028] (0/2) Epoch 16, batch 4650, loss[loss=0.1815, simple_loss=0.2241, pruned_loss=0.06939, over 13108.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2436, pruned_loss=0.07473, over 2587255.14 frames. ], batch size: 132, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:26:59,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=12.0 2024-06-20 23:27:06,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286746.1666666667, ans=0.1 2024-06-20 23:27:17,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=286782.8333333333, ans=0.0 2024-06-20 23:27:22,206 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.829e+02 1.958e+02 2.171e+02 3.937e+02, threshold=3.917e+02, percent-clipped=1.0 2024-06-20 23:27:26,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=286801.1666666667, ans=0.0 2024-06-20 23:27:50,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=286819.5, ans=0.125 2024-06-20 23:27:52,285 INFO [train.py:1028] (0/2) Epoch 16, batch 4700, loss[loss=0.1936, simple_loss=0.2387, pruned_loss=0.07422, over 12853.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2434, pruned_loss=0.07463, over 2582867.28 frames. ], batch size: 26, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:27:52,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=286837.8333333333, ans=0.09899494936611666 2024-06-20 23:28:15,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=286874.5, ans=0.0 2024-06-20 23:28:15,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=286874.5, ans=0.04949747468305833 2024-06-20 23:28:27,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.69 vs. 
limit=22.5 2024-06-20 23:28:34,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=286911.1666666667, ans=0.125 2024-06-20 23:28:35,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286911.1666666667, ans=0.1 2024-06-20 23:28:38,678 INFO [train.py:1028] (0/2) Epoch 16, batch 4750, loss[loss=0.2219, simple_loss=0.2622, pruned_loss=0.09081, over 12454.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2428, pruned_loss=0.07461, over 2579221.33 frames. ], batch size: 202, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:28:50,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2024-06-20 23:28:56,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.77 vs. limit=22.5 2024-06-20 23:29:01,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2024-06-20 23:29:04,424 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.674e+02 1.938e+02 2.108e+02 2.370e+02 3.291e+02, threshold=4.216e+02, percent-clipped=0.0 2024-06-20 23:29:09,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=286984.5, ans=0.05 2024-06-20 23:29:27,729 INFO [train.py:1028] (0/2) Epoch 16, batch 4800, loss[loss=0.1766, simple_loss=0.2274, pruned_loss=0.06288, over 13261.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2433, pruned_loss=0.07459, over 2576527.18 frames. ], batch size: 63, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:29:31,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=287021.1666666667, ans=0.95 2024-06-20 23:29:31,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.36 vs. limit=15.0 2024-06-20 23:29:32,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=287021.1666666667, ans=0.125 2024-06-20 23:29:39,455 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:29:42,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.63 vs. limit=15.0 2024-06-20 23:29:53,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=287076.1666666667, ans=0.125 2024-06-20 23:29:58,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=287076.1666666667, ans=0.125 2024-06-20 23:30:02,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287094.5, ans=0.125 2024-06-20 23:30:13,959 INFO [train.py:1028] (0/2) Epoch 16, batch 4850, loss[loss=0.1753, simple_loss=0.2277, pruned_loss=0.0614, over 13228.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2428, pruned_loss=0.07409, over 2574166.50 frames. 
], batch size: 89, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:30:24,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2024-06-20 23:30:27,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287112.8333333333, ans=0.1 2024-06-20 23:30:29,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=287112.8333333333, ans=0.0 2024-06-20 23:30:38,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2024-06-20 23:30:51,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0 2024-06-20 23:30:52,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=287149.5, ans=0.1 2024-06-20 23:30:53,063 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.829e+02 1.967e+02 2.103e+02 2.788e+02, threshold=3.934e+02, percent-clipped=0.0 2024-06-20 23:31:13,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=287186.1666666667, ans=0.0 2024-06-20 23:31:15,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=287186.1666666667, ans=0.125 2024-06-20 23:31:15,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=287186.1666666667, ans=0.2 2024-06-20 23:31:19,308 INFO [train.py:1028] (0/2) Epoch 16, batch 4900, loss[loss=0.1874, simple_loss=0.2414, pruned_loss=0.06673, over 13208.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2427, pruned_loss=0.07411, over 2575388.87 frames. ], batch size: 59, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:31:41,712 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=15.0 2024-06-20 23:31:57,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=287277.8333333333, ans=0.125 2024-06-20 23:32:01,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=287277.8333333333, ans=0.125 2024-06-20 23:32:03,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=287277.8333333333, ans=0.0 2024-06-20 23:32:06,107 INFO [train.py:1028] (0/2) Epoch 16, batch 4950, loss[loss=0.2136, simple_loss=0.2436, pruned_loss=0.0918, over 10873.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2427, pruned_loss=0.07443, over 2569059.76 frames. 
], batch size: 303, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:32:30,419 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.607e+02 1.836e+02 1.982e+02 2.125e+02 2.832e+02, threshold=3.964e+02, percent-clipped=0.0 2024-06-20 23:32:30,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=287332.8333333333, ans=0.025 2024-06-20 23:32:36,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=287351.1666666667, ans=0.125 2024-06-20 23:32:53,541 INFO [train.py:1028] (0/2) Epoch 16, batch 5000, loss[loss=0.2181, simple_loss=0.2595, pruned_loss=0.08835, over 13189.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2429, pruned_loss=0.07436, over 2573954.11 frames. ], batch size: 95, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:33:48,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=15.0 2024-06-20 23:33:51,723 INFO [train.py:1028] (0/2) Epoch 16, batch 5050, loss[loss=0.1927, simple_loss=0.2517, pruned_loss=0.06681, over 12893.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2425, pruned_loss=0.07404, over 2572585.51 frames. ], batch size: 36, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:34:12,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287516.1666666667, ans=0.1 2024-06-20 23:34:16,923 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.882e+02 1.981e+02 2.255e+02 3.441e+02, threshold=3.962e+02, percent-clipped=0.0 2024-06-20 23:34:35,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287552.8333333333, ans=0.1 2024-06-20 23:34:41,715 INFO [train.py:1028] (0/2) Epoch 16, batch 5100, loss[loss=0.217, simple_loss=0.2649, pruned_loss=0.08459, over 12991.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.243, pruned_loss=0.07441, over 2568818.49 frames. ], batch size: 39, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:34:46,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287571.1666666667, ans=0.1 2024-06-20 23:34:48,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=287571.1666666667, ans=0.125 2024-06-20 23:34:54,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=287589.5, ans=0.125 2024-06-20 23:35:08,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=287607.8333333333, ans=0.2 2024-06-20 23:35:08,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=287626.1666666667, ans=0.2 2024-06-20 23:35:16,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287626.1666666667, ans=0.1 2024-06-20 23:35:26,029 INFO [train.py:1028] (0/2) Epoch 16, batch 5150, loss[loss=0.1863, simple_loss=0.2242, pruned_loss=0.07415, over 13100.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2428, pruned_loss=0.07469, over 2570798.68 frames. 
], batch size: 132, lr: 3.65e-03, grad_scale: 32.0 2024-06-20 23:35:31,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=287662.8333333333, ans=0.1 2024-06-20 23:35:32,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2024-06-20 23:35:42,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=287681.1666666667, ans=0.125 2024-06-20 23:35:51,608 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.613e+02 1.805e+02 1.910e+02 2.086e+02 2.814e+02, threshold=3.821e+02, percent-clipped=0.0 2024-06-20 23:35:56,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=287717.8333333333, ans=0.0 2024-06-20 23:35:57,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=15.0 2024-06-20 23:36:09,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.43 vs. limit=22.5 2024-06-20 23:36:15,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287736.1666666667, ans=0.1 2024-06-20 23:36:27,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=287754.5, ans=0.125 2024-06-20 23:36:28,189 INFO [train.py:1028] (0/2) Epoch 16, batch 5200, loss[loss=0.2087, simple_loss=0.2452, pruned_loss=0.08607, over 13159.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2421, pruned_loss=0.0744, over 2574894.86 frames. ], batch size: 95, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:36:29,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287754.5, ans=0.125 2024-06-20 23:36:38,815 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2024-06-20 23:36:56,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287809.5, ans=0.1 2024-06-20 23:37:02,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2024-06-20 23:37:09,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=287827.8333333333, ans=0.125 2024-06-20 23:37:10,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.52 vs. limit=22.5 2024-06-20 23:37:11,187 INFO [train.py:1028] (0/2) Epoch 16, batch 5250, loss[loss=0.2141, simple_loss=0.2612, pruned_loss=0.08346, over 13299.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2422, pruned_loss=0.0746, over 2569355.43 frames. 
], batch size: 52, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:37:14,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287846.1666666667, ans=0.1 2024-06-20 23:37:19,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=287864.5, ans=0.2 2024-06-20 23:37:28,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.67 vs. limit=15.0 2024-06-20 23:37:33,215 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.860e+02 2.011e+02 2.207e+02 3.279e+02, threshold=4.021e+02, percent-clipped=0.0 2024-06-20 23:37:41,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=287901.1666666667, ans=0.05 2024-06-20 23:37:43,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=287901.1666666667, ans=0.0 2024-06-20 23:37:56,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=287937.8333333333, ans=0.025 2024-06-20 23:37:56,941 INFO [train.py:1028] (0/2) Epoch 16, batch 5300, loss[loss=0.2162, simple_loss=0.2625, pruned_loss=0.08497, over 13053.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2419, pruned_loss=0.0741, over 2565827.85 frames. ], batch size: 144, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:38:05,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=287937.8333333333, ans=0.025 2024-06-20 23:38:06,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=287956.1666666667, ans=0.125 2024-06-20 23:38:22,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=287974.5, ans=0.125 2024-06-20 23:38:28,651 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0 2024-06-20 23:38:35,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=288011.1666666667, ans=0.025 2024-06-20 23:38:50,045 INFO [train.py:1028] (0/2) Epoch 16, batch 5350, loss[loss=0.1876, simple_loss=0.2462, pruned_loss=0.06453, over 11737.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2415, pruned_loss=0.07375, over 2572814.00 frames. ], batch size: 17, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:38:50,648 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.16 vs. 
limit=15.0 2024-06-20 23:38:52,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=288029.5, ans=0.0 2024-06-20 23:38:52,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288029.5, ans=0.1 2024-06-20 23:39:06,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=288047.8333333333, ans=0.0 2024-06-20 23:39:24,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=288066.1666666667, ans=0.2 2024-06-20 23:39:25,374 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 1.852e+02 1.936e+02 2.092e+02 2.983e+02, threshold=3.872e+02, percent-clipped=0.0 2024-06-20 23:39:31,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=288084.5, ans=0.125 2024-06-20 23:39:39,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.64 vs. limit=15.0 2024-06-20 23:39:47,882 INFO [train.py:1028] (0/2) Epoch 16, batch 5400, loss[loss=0.2133, simple_loss=0.2473, pruned_loss=0.08963, over 12250.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2419, pruned_loss=0.07427, over 2565448.09 frames. ], batch size: 240, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:40:06,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=288139.5, ans=0.0 2024-06-20 23:40:15,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=288157.8333333333, ans=0.0 2024-06-20 23:40:20,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=288176.1666666667, ans=0.125 2024-06-20 23:40:24,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=288176.1666666667, ans=0.125 2024-06-20 23:40:29,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.16 vs. limit=10.0 2024-06-20 23:40:31,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2024-06-20 23:40:35,627 INFO [train.py:1028] (0/2) Epoch 16, batch 5450, loss[loss=0.1909, simple_loss=0.2383, pruned_loss=0.07176, over 12775.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2419, pruned_loss=0.0742, over 2570668.50 frames. 
], batch size: 26, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:40:37,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=288212.8333333333, ans=0.125 2024-06-20 23:40:47,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288231.1666666667, ans=0.125 2024-06-20 23:40:49,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=288231.1666666667, ans=0.125 2024-06-20 23:40:50,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=288231.1666666667, ans=0.0 2024-06-20 23:40:52,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=288231.1666666667, ans=0.0 2024-06-20 23:41:00,529 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 1.839e+02 1.953e+02 2.091e+02 3.274e+02, threshold=3.906e+02, percent-clipped=0.0 2024-06-20 23:41:20,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=288286.1666666667, ans=0.0 2024-06-20 23:41:22,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=288286.1666666667, ans=0.0 2024-06-20 23:41:24,221 INFO [train.py:1028] (0/2) Epoch 16, batch 5500, loss[loss=0.2247, simple_loss=0.2559, pruned_loss=0.09676, over 12134.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2416, pruned_loss=0.07379, over 2565077.29 frames. ], batch size: 240, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:42:00,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=288359.5, ans=0.04949747468305833 2024-06-20 23:42:24,228 INFO [train.py:1028] (0/2) Epoch 16, batch 5550, loss[loss=0.2104, simple_loss=0.2561, pruned_loss=0.08239, over 13280.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2413, pruned_loss=0.07357, over 2568867.25 frames. ], batch size: 43, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:42:24,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2024-06-20 23:42:48,512 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.833e+02 1.956e+02 2.099e+02 2.857e+02, threshold=3.913e+02, percent-clipped=0.0 2024-06-20 23:42:53,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2024-06-20 23:42:54,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0 2024-06-20 23:43:11,301 INFO [train.py:1028] (0/2) Epoch 16, batch 5600, loss[loss=0.2008, simple_loss=0.2466, pruned_loss=0.07748, over 13220.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2408, pruned_loss=0.0734, over 2570795.38 frames. 
], batch size: 89, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:43:22,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=288506.1666666667, ans=0.2 2024-06-20 23:43:48,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.40 vs. limit=15.0 2024-06-20 23:43:53,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=288561.1666666667, ans=0.125 2024-06-20 23:43:55,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=288561.1666666667, ans=0.025 2024-06-20 23:43:58,172 INFO [train.py:1028] (0/2) Epoch 16, batch 5650, loss[loss=0.2127, simple_loss=0.2534, pruned_loss=0.08604, over 12517.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2413, pruned_loss=0.0733, over 2576128.17 frames. ], batch size: 202, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:44:19,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=288616.1666666667, ans=0.125 2024-06-20 23:44:20,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=288616.1666666667, ans=6.0 2024-06-20 23:44:23,345 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 1.830e+02 1.952e+02 2.157e+02 2.981e+02, threshold=3.905e+02, percent-clipped=0.0 2024-06-20 23:44:24,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=288616.1666666667, ans=0.125 2024-06-20 23:44:52,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=288652.8333333333, ans=0.0 2024-06-20 23:44:52,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=288671.1666666667, ans=0.0 2024-06-20 23:44:53,690 INFO [train.py:1028] (0/2) Epoch 16, batch 5700, loss[loss=0.1904, simple_loss=0.2446, pruned_loss=0.06806, over 13238.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2409, pruned_loss=0.0732, over 2579519.91 frames. ], batch size: 63, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:44:58,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=288671.1666666667, ans=0.0 2024-06-20 23:45:00,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=288671.1666666667, ans=0.125 2024-06-20 23:45:03,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=288689.5, ans=0.125 2024-06-20 23:45:04,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2024-06-20 23:45:04,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=288689.5, ans=0.07 2024-06-20 23:45:20,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.92 vs. 
limit=15.0 2024-06-20 23:45:21,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=288707.8333333333, ans=0.0 2024-06-20 23:45:22,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-20 23:45:25,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=288726.1666666667, ans=0.2 2024-06-20 23:45:30,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=288726.1666666667, ans=0.07 2024-06-20 23:45:35,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.96 vs. limit=15.0 2024-06-20 23:45:38,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=288744.5, ans=0.125 2024-06-20 23:45:39,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=288744.5, ans=0.125 2024-06-20 23:45:40,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=288744.5, ans=0.0 2024-06-20 23:45:42,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=288762.8333333333, ans=0.0 2024-06-20 23:45:43,144 INFO [train.py:1028] (0/2) Epoch 16, batch 5750, loss[loss=0.2155, simple_loss=0.2558, pruned_loss=0.08763, over 12786.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2413, pruned_loss=0.0733, over 2580051.75 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 32.0 2024-06-20 23:45:45,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=288762.8333333333, ans=0.2 2024-06-20 23:45:58,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=288781.1666666667, ans=0.0 2024-06-20 23:46:07,135 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.855e+02 2.000e+02 2.153e+02 2.672e+02, threshold=4.001e+02, percent-clipped=0.0 2024-06-20 23:46:08,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=288799.5, ans=0.2 2024-06-20 23:46:18,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=288817.8333333333, ans=0.0 2024-06-20 23:46:22,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=288836.1666666667, ans=0.125 2024-06-20 23:46:26,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=288836.1666666667, ans=0.125 2024-06-20 23:46:29,081 INFO [train.py:1028] (0/2) Epoch 16, batch 5800, loss[loss=0.1996, simple_loss=0.2421, pruned_loss=0.07856, over 12729.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2423, pruned_loss=0.07426, over 2579491.40 frames. 
], batch size: 176, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:46:33,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=288854.5, ans=0.125 2024-06-20 23:46:45,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288872.8333333333, ans=0.125 2024-06-20 23:46:54,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=288891.1666666667, ans=0.125 2024-06-20 23:47:07,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.88 vs. limit=15.0 2024-06-20 23:47:15,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=288927.8333333333, ans=0.125 2024-06-20 23:47:18,846 INFO [train.py:1028] (0/2) Epoch 16, batch 5850, loss[loss=0.2262, simple_loss=0.2676, pruned_loss=0.09242, over 12590.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2445, pruned_loss=0.07529, over 2577449.00 frames. ], batch size: 202, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:47:39,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=288982.8333333333, ans=0.1 2024-06-20 23:47:41,857 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 1.910e+02 2.099e+02 2.263e+02 2.938e+02, threshold=4.198e+02, percent-clipped=0.0 2024-06-20 23:47:53,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=289001.1666666667, ans=0.125 2024-06-20 23:47:56,355 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:48:00,523 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2024-06-20 23:48:00,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=289019.5, ans=0.125 2024-06-20 23:48:01,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2024-06-20 23:48:02,611 INFO [train.py:1028] (0/2) Epoch 16, batch 5900, loss[loss=0.1995, simple_loss=0.2404, pruned_loss=0.07931, over 13089.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2468, pruned_loss=0.07609, over 2576097.31 frames. ], batch size: 121, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:48:04,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=289037.8333333333, ans=0.09899494936611666 2024-06-20 23:48:11,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=289056.1666666667, ans=0.025 2024-06-20 23:48:28,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-20 23:48:32,489 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.11 vs. 
limit=22.5 2024-06-20 23:48:43,444 INFO [train.py:1028] (0/2) Epoch 16, batch 5950, loss[loss=0.1957, simple_loss=0.2392, pruned_loss=0.07607, over 13165.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2488, pruned_loss=0.0769, over 2581279.01 frames. ], batch size: 121, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:48:44,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=289129.5, ans=0.125 2024-06-20 23:49:09,086 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.540e+02 1.895e+02 2.011e+02 2.184e+02 3.157e+02, threshold=4.023e+02, percent-clipped=0.0 2024-06-20 23:49:15,101 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-20 23:49:16,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=289184.5, ans=0.025 2024-06-20 23:49:27,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=289202.8333333333, ans=0.125 2024-06-20 23:49:32,800 INFO [train.py:1028] (0/2) Epoch 16, batch 6000, loss[loss=0.2435, simple_loss=0.2785, pruned_loss=0.1042, over 12215.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2503, pruned_loss=0.07745, over 2574754.40 frames. ], batch size: 241, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:49:32,803 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-20 23:49:43,880 INFO [train.py:1060] (0/2) Epoch 16, validation: loss=0.1885, simple_loss=0.2532, pruned_loss=0.0619, over 351949.00 frames. 2024-06-20 23:49:43,882 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-20 23:49:55,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.03 vs. limit=15.0 2024-06-20 23:49:56,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289239.5, ans=0.1 2024-06-20 23:50:33,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.33 vs. limit=10.0 2024-06-20 23:50:35,117 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.368e-01 2024-06-20 23:50:41,535 INFO [train.py:1028] (0/2) Epoch 16, batch 6050, loss[loss=0.2066, simple_loss=0.2546, pruned_loss=0.07933, over 12919.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2521, pruned_loss=0.07795, over 2576867.95 frames. ], batch size: 39, lr: 3.64e-03, grad_scale: 64.0 2024-06-20 23:50:43,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289312.8333333333, ans=0.1 2024-06-20 23:51:12,653 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.944e+02 2.087e+02 2.324e+02 3.316e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-20 23:51:14,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=9.17 vs. limit=12.0 2024-06-20 23:51:36,276 INFO [train.py:1028] (0/2) Epoch 16, batch 6100, loss[loss=0.2045, simple_loss=0.2455, pruned_loss=0.08171, over 13151.00 frames. 
], tot_loss[loss=0.205, simple_loss=0.2534, pruned_loss=0.07832, over 2578834.82 frames. ], batch size: 121, lr: 3.63e-03, grad_scale: 64.0 2024-06-20 23:51:50,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2024-06-20 23:51:53,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=22.5 2024-06-20 23:52:02,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.65 vs. limit=22.5 2024-06-20 23:52:10,122 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.07 vs. limit=22.5 2024-06-20 23:52:26,498 INFO [train.py:1028] (0/2) Epoch 16, batch 6150, loss[loss=0.2009, simple_loss=0.2482, pruned_loss=0.07685, over 10938.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.255, pruned_loss=0.07921, over 2577954.48 frames. ], batch size: 304, lr: 3.63e-03, grad_scale: 64.0 2024-06-20 23:52:30,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=289496.1666666667, ans=0.0 2024-06-20 23:52:50,715 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.959e+02 2.236e+02 2.570e+02 4.175e+02, threshold=4.473e+02, percent-clipped=0.0 2024-06-20 23:52:51,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=289532.8333333333, ans=0.2 2024-06-20 23:53:05,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=289551.1666666667, ans=0.025 2024-06-20 23:53:07,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=289551.1666666667, ans=0.125 2024-06-20 23:53:08,825 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:53:14,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=289569.5, ans=0.2 2024-06-20 23:53:19,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=289569.5, ans=0.125 2024-06-20 23:53:20,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=289587.8333333333, ans=0.0 2024-06-20 23:53:21,605 INFO [train.py:1028] (0/2) Epoch 16, batch 6200, loss[loss=0.2399, simple_loss=0.2849, pruned_loss=0.09744, over 13264.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2566, pruned_loss=0.07991, over 2575605.16 frames. ], batch size: 89, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:53:23,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.14 vs. limit=22.5 2024-06-20 23:53:23,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=289587.8333333333, ans=0.125 2024-06-20 23:53:28,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.14 vs. 
limit=15.0 2024-06-20 23:53:29,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=289587.8333333333, ans=0.0 2024-06-20 23:53:30,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=289606.1666666667, ans=0.125 2024-06-20 23:53:59,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=289642.8333333333, ans=0.1 2024-06-20 23:54:03,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=289642.8333333333, ans=0.125 2024-06-20 23:54:12,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=289661.1666666667, ans=0.0 2024-06-20 23:54:17,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289661.1666666667, ans=0.1 2024-06-20 23:54:19,003 INFO [train.py:1028] (0/2) Epoch 16, batch 6250, loss[loss=0.2095, simple_loss=0.2632, pruned_loss=0.07787, over 13216.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.258, pruned_loss=0.08046, over 2569076.21 frames. ], batch size: 83, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:54:22,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2024-06-20 23:54:32,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=289697.8333333333, ans=0.125 2024-06-20 23:54:37,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=289716.1666666667, ans=0.125 2024-06-20 23:54:44,865 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.008e+02 2.208e+02 2.592e+02 3.461e+02, threshold=4.417e+02, percent-clipped=0.0 2024-06-20 23:54:56,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=289752.8333333333, ans=0.0 2024-06-20 23:54:58,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289752.8333333333, ans=0.1 2024-06-20 23:55:00,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2024-06-20 23:55:03,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=289752.8333333333, ans=0.125 2024-06-20 23:55:05,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=289771.1666666667, ans=10.0 2024-06-20 23:55:06,232 INFO [train.py:1028] (0/2) Epoch 16, batch 6300, loss[loss=0.1941, simple_loss=0.2484, pruned_loss=0.06991, over 11376.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2593, pruned_loss=0.08096, over 2564390.64 frames. 
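Each optim.py warning above prints a five-number summary (min, 25%, median, 75%, max) of recent gradient norms plus the clipping threshold in force, and in every instance here the threshold is 2.0 times the printed median (2.0 * 2.208e+02 = 4.416e+02 against the reported 4.417e+02 just above, 2.0 * 2.011e+02 = 4.022e+02 against 4.023e+02 earlier), i.e. Clipping_scale=2.0 scales a median-based norm estimate. A hedged sketch of that bookkeeping, as an illustrative helper rather than the optimizer's actual code:

import numpy as np

def clipping_report(grad_norms, clipping_scale=2.0):
    """Quartiles of recent gradient norms and a median-based clip threshold."""
    norms = np.asarray(grad_norms, dtype=float)
    q = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * q[2]  # 2.0 * median, as in the logged warnings
    percent_clipped = 100.0 * float(np.mean(norms > threshold))
    return q, threshold, percent_clipped

With the quartiles logged at 23:51:12 the maximum norm (3.316e+02) sits below the threshold of 4.174e+02, hence percent-clipped=0.0; the one warning further below with max 8.674e+02 over threshold 4.327e+02 correspondingly reports percent-clipped=1.0.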
], batch size: 16, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:55:11,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=289771.1666666667, ans=0.125 2024-06-20 23:55:12,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.68 vs. limit=12.0 2024-06-20 23:55:15,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=289789.5, ans=0.0 2024-06-20 23:55:19,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.90 vs. limit=22.5 2024-06-20 23:55:42,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289826.1666666667, ans=0.1 2024-06-20 23:55:52,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=289844.5, ans=0.125 2024-06-20 23:55:54,200 INFO [train.py:1028] (0/2) Epoch 16, batch 6350, loss[loss=0.2347, simple_loss=0.2844, pruned_loss=0.09251, over 12560.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.261, pruned_loss=0.08098, over 2574641.85 frames. ], batch size: 202, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:56:05,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=289862.8333333333, ans=0.125 2024-06-20 23:56:24,083 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.023e+02 2.243e+02 2.516e+02 3.825e+02, threshold=4.486e+02, percent-clipped=0.0 2024-06-20 23:56:31,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289917.8333333333, ans=0.0 2024-06-20 23:56:38,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289936.1666666667, ans=0.0 2024-06-20 23:56:39,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289936.1666666667, ans=0.0 2024-06-20 23:56:46,660 INFO [train.py:1028] (0/2) Epoch 16, batch 6400, loss[loss=0.2053, simple_loss=0.2553, pruned_loss=0.07767, over 13228.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2625, pruned_loss=0.08156, over 2575838.01 frames. 
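The grad_scale field in the batch summaries halved from 64.0 to 32.0 at batch 6200 and, 2000 batches later, doubles back to 64.0 at batch 8200 further below. With fp16 training that is exactly the signature of a dynamic loss scaler: back off by 0.5 on a step with non-finite gradients, grow by 2.0 after 2000 consecutive clean steps (the defaults of torch.cuda.amp.GradScaler). A minimal sketch of the standard AMP step; the model, optimizer and criterion here are placeholders:

import torch

scaler = torch.cuda.amp.GradScaler()  # backoff 0.5, growth 2.0 every 2000 steps

def fp16_step(model, optimizer, criterion, features, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward in mixed precision
        loss = criterion(model(features), targets)
    scaler.scale(loss).backward()       # scale up so fp16 grads don't underflow
    scaler.step(optimizer)              # unscales grads; skips step on inf/nan
    scaler.update()                     # adjusts the scale, e.g. 64.0 -> 32.0
    return loss.detach()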
], batch size: 67, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:56:56,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=289954.5, ans=0.025 2024-06-20 23:56:57,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=289954.5, ans=0.125 2024-06-20 23:57:03,523 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-20 23:57:30,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=290009.5, ans=0.2 2024-06-20 23:57:34,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=290027.8333333333, ans=0.0 2024-06-20 23:57:39,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290027.8333333333, ans=0.1 2024-06-20 23:57:41,387 INFO [train.py:1028] (0/2) Epoch 16, batch 6450, loss[loss=0.2508, simple_loss=0.2889, pruned_loss=0.1064, over 12528.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2644, pruned_loss=0.0826, over 2581089.47 frames. ], batch size: 202, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:57:42,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=290046.1666666667, ans=0.125 2024-06-20 23:57:56,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=290064.5, ans=0.0 2024-06-20 23:58:03,775 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.008e+02 2.198e+02 2.547e+02 3.640e+02, threshold=4.396e+02, percent-clipped=0.0 2024-06-20 23:58:25,307 INFO [train.py:1028] (0/2) Epoch 16, batch 6500, loss[loss=0.233, simple_loss=0.2754, pruned_loss=0.09532, over 10704.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2653, pruned_loss=0.08273, over 2585460.81 frames. ], batch size: 303, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:58:26,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=290137.8333333333, ans=0.95 2024-06-20 23:58:33,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=290137.8333333333, ans=0.125 2024-06-20 23:58:39,347 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=5.921e-01 2024-06-20 23:58:39,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=290156.1666666667, ans=0.0 2024-06-20 23:59:18,954 INFO [train.py:1028] (0/2) Epoch 16, batch 6550, loss[loss=0.2135, simple_loss=0.2702, pruned_loss=0.07838, over 12586.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2662, pruned_loss=0.08281, over 2589248.04 frames. ], batch size: 22, lr: 3.63e-03, grad_scale: 32.0 2024-06-20 23:59:20,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=290229.5, ans=0.07 2024-06-20 23:59:22,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. 
limit=6.0 2024-06-20 23:59:25,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=290229.5, ans=0.125 2024-06-20 23:59:36,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.27 vs. limit=15.0 2024-06-20 23:59:43,668 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 2.054e+02 2.209e+02 2.430e+02 3.221e+02, threshold=4.418e+02, percent-clipped=0.0 2024-06-20 23:59:45,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=290284.5, ans=0.125 2024-06-21 00:00:08,703 INFO [train.py:1028] (0/2) Epoch 16, batch 6600, loss[loss=0.1969, simple_loss=0.2556, pruned_loss=0.06912, over 13201.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2666, pruned_loss=0.08279, over 2591818.21 frames. ], batch size: 72, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:00:12,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=290321.1666666667, ans=0.125 2024-06-21 00:00:21,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=290339.5, ans=0.0 2024-06-21 00:00:29,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=290357.8333333333, ans=0.125 2024-06-21 00:00:42,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290376.1666666667, ans=0.125 2024-06-21 00:00:50,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=290394.5, ans=0.1 2024-06-21 00:00:53,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=290394.5, ans=0.025 2024-06-21 00:00:54,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290394.5, ans=0.1 2024-06-21 00:00:54,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.19 vs. limit=15.0 2024-06-21 00:00:58,841 INFO [train.py:1028] (0/2) Epoch 16, batch 6650, loss[loss=0.2513, simple_loss=0.2906, pruned_loss=0.106, over 12948.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2681, pruned_loss=0.08358, over 2585490.24 frames. ], batch size: 158, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:00:58,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=290412.8333333333, ans=0.0 2024-06-21 00:01:06,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.21 vs. limit=15.0 2024-06-21 00:01:07,125 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:01:16,120 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.48 vs. 
limit=15.0 2024-06-21 00:01:25,731 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.042e+02 2.235e+02 2.498e+02 3.205e+02, threshold=4.470e+02, percent-clipped=0.0 2024-06-21 00:01:25,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=290449.5, ans=0.125 2024-06-21 00:01:37,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.53 vs. limit=15.0 2024-06-21 00:01:39,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.42 vs. limit=15.0 2024-06-21 00:01:48,359 INFO [train.py:1028] (0/2) Epoch 16, batch 6700, loss[loss=0.2383, simple_loss=0.2869, pruned_loss=0.09481, over 12764.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2686, pruned_loss=0.08373, over 2584177.84 frames. ], batch size: 177, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:01:53,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=290504.5, ans=0.0 2024-06-21 00:02:14,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=290541.1666666667, ans=0.0 2024-06-21 00:02:38,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=290577.8333333333, ans=0.0 2024-06-21 00:02:43,106 INFO [train.py:1028] (0/2) Epoch 16, batch 6750, loss[loss=0.2744, simple_loss=0.3142, pruned_loss=0.1173, over 12291.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2696, pruned_loss=0.08444, over 2579456.91 frames. ], batch size: 241, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:03:14,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 1.950e+02 2.113e+02 2.278e+02 2.957e+02, threshold=4.226e+02, percent-clipped=0.0 2024-06-21 00:03:16,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=290651.1666666667, ans=0.0 2024-06-21 00:03:20,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=290651.1666666667, ans=0.0 2024-06-21 00:03:32,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=290669.5, ans=0.0 2024-06-21 00:03:34,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=290669.5, ans=0.125 2024-06-21 00:03:36,836 INFO [train.py:1028] (0/2) Epoch 16, batch 6800, loss[loss=0.1935, simple_loss=0.2556, pruned_loss=0.06569, over 13163.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2712, pruned_loss=0.08494, over 2580490.45 frames. ], batch size: 67, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:03:59,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=15.0 2024-06-21 00:04:00,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=290724.5, ans=0.125 2024-06-21 00:04:22,669 INFO [train.py:1028] (0/2) Epoch 16, batch 6850, loss[loss=0.2271, simple_loss=0.2879, pruned_loss=0.0832, over 13321.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2717, pruned_loss=0.08468, over 2584218.52 frames. 
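The Whitening lines in this stretch fire when a module's activation covariance drifts away from white: each prints a scale-invariant metric against a per-module limit (metric=14.21 vs. limit=15.0 for feed_forward3.out_whiten just above), with larger values meaning energy concentrated in fewer directions. One common formulation of such a metric, where 1.0 corresponds to a covariance proportional to the identity, is sketched below as an assumption about what the number measures, not as the module's exact code:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels). Returns mean(eig^2) / mean(eig)^2 of
    the channel covariance, which is 1.0 iff all eigenvalues are equal."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]             # (C, C) channel covariance
    c = cov.shape[0]
    mean_eig = torch.diagonal(cov).sum() / c   # trace(cov) / C
    mean_eig_sq = (cov * cov).sum() / c        # trace(cov @ cov) / C
    return (mean_eig_sq / mean_eig ** 2).item()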
], batch size: 63, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:04:30,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.59 vs. limit=15.0 2024-06-21 00:04:39,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=290797.8333333333, ans=0.125 2024-06-21 00:04:49,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.990e+02 2.193e+02 2.477e+02 3.705e+02, threshold=4.387e+02, percent-clipped=0.0 2024-06-21 00:04:58,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=290834.5, ans=0.0 2024-06-21 00:05:02,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=290852.8333333333, ans=0.125 2024-06-21 00:05:15,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=290852.8333333333, ans=0.0 2024-06-21 00:05:16,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=290852.8333333333, ans=0.125 2024-06-21 00:05:18,516 INFO [train.py:1028] (0/2) Epoch 16, batch 6900, loss[loss=0.218, simple_loss=0.2728, pruned_loss=0.08163, over 13287.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2727, pruned_loss=0.08507, over 2585769.45 frames. ], batch size: 49, lr: 3.63e-03, grad_scale: 32.0 2024-06-21 00:05:28,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.09 vs. limit=12.0 2024-06-21 00:05:35,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=290889.5, ans=0.125 2024-06-21 00:05:39,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290907.8333333333, ans=0.125 2024-06-21 00:05:49,432 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.17 vs. limit=22.5 2024-06-21 00:05:49,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=290926.1666666667, ans=0.0 2024-06-21 00:05:51,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=290926.1666666667, ans=0.2 2024-06-21 00:06:02,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290944.5, ans=0.1 2024-06-21 00:06:06,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=290944.5, ans=0.025 2024-06-21 00:06:13,575 INFO [train.py:1028] (0/2) Epoch 16, batch 6950, loss[loss=0.2053, simple_loss=0.2498, pruned_loss=0.08034, over 12138.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2725, pruned_loss=0.08505, over 2579077.03 frames. 
], batch size: 18, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:06:18,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=290962.8333333333, ans=0.2 2024-06-21 00:06:30,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=290981.1666666667, ans=0.0 2024-06-21 00:06:33,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=290999.5, ans=0.125 2024-06-21 00:06:39,240 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 1.996e+02 2.159e+02 2.274e+02 2.965e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-21 00:06:41,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=291017.8333333333, ans=0.0 2024-06-21 00:06:50,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=291036.1666666667, ans=0.125 2024-06-21 00:07:00,094 INFO [train.py:1028] (0/2) Epoch 16, batch 7000, loss[loss=0.2234, simple_loss=0.2692, pruned_loss=0.08887, over 12967.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2724, pruned_loss=0.08465, over 2576334.72 frames. ], batch size: 158, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:07:09,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=291072.8333333333, ans=0.125 2024-06-21 00:07:09,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=291072.8333333333, ans=0.09899494936611666 2024-06-21 00:07:31,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=291109.5, ans=0.2 2024-06-21 00:07:33,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=291109.5, ans=0.125 2024-06-21 00:07:36,306 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=15.0 2024-06-21 00:07:47,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=291146.1666666667, ans=0.125 2024-06-21 00:07:47,826 INFO [train.py:1028] (0/2) Epoch 16, batch 7050, loss[loss=0.2254, simple_loss=0.2767, pruned_loss=0.08703, over 12803.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2741, pruned_loss=0.08538, over 2582780.15 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:07:51,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.67 vs. limit=22.5 2024-06-21 00:08:20,430 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.068e+02 2.289e+02 2.568e+02 3.625e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 00:08:37,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=291219.5, ans=0.09899494936611666 2024-06-21 00:08:39,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. 
limit=10.0 2024-06-21 00:08:41,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=291219.5, ans=0.125 2024-06-21 00:08:43,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2024-06-21 00:08:43,310 INFO [train.py:1028] (0/2) Epoch 16, batch 7100, loss[loss=0.223, simple_loss=0.2781, pruned_loss=0.08388, over 13207.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.275, pruned_loss=0.08603, over 2575250.52 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:08:46,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=291237.8333333333, ans=0.025 2024-06-21 00:08:59,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.60 vs. limit=15.0 2024-06-21 00:09:05,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0 2024-06-21 00:09:05,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=291274.5, ans=0.125 2024-06-21 00:09:06,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=291274.5, ans=0.125 2024-06-21 00:09:24,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=291311.1666666667, ans=0.2 2024-06-21 00:09:34,684 INFO [train.py:1028] (0/2) Epoch 16, batch 7150, loss[loss=0.2491, simple_loss=0.2954, pruned_loss=0.1013, over 12536.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2752, pruned_loss=0.08567, over 2572887.45 frames. ], batch size: 202, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:09:38,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=291329.5, ans=0.125 2024-06-21 00:09:51,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.65 vs. limit=15.0 2024-06-21 00:09:53,278 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0 2024-06-21 00:09:56,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=291366.1666666667, ans=15.0 2024-06-21 00:09:56,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=291366.1666666667, ans=0.125 2024-06-21 00:10:00,866 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 1.990e+02 2.134e+02 2.356e+02 3.295e+02, threshold=4.268e+02, percent-clipped=0.0 2024-06-21 00:10:12,712 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=15.0 2024-06-21 00:10:22,857 INFO [train.py:1028] (0/2) Epoch 16, batch 7200, loss[loss=0.2504, simple_loss=0.2981, pruned_loss=0.1014, over 13183.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2764, pruned_loss=0.08609, over 2578286.53 frames. 
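The lr column decays smoothly across this stretch: 3.64e-03 at batch 5950, 3.63e-03 by batch 6100, 3.62e-03 here, 3.61e-03 from roughly batch 7850 below. That pattern is consistent with a schedule that multiplies slowly-decaying batch and epoch factors, such as icefall's Eden scheduler; a sketch of the commonly written Eden formula follows, offered as an assumption about the shape of the decay (versions differ in warmup and duration-normalisation terms, so these constants are not meant to reproduce the logged values exactly):

def eden_lr(base_lr, batch, epoch, lr_batches, lr_epochs):
    # Two inverse-quartic factors: each is ~1.0 early on and decays like
    # x ** -0.5 once batch >> lr_batches (resp. epoch >> lr_epochs).
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor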
], batch size: 112, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:10:26,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=291421.1666666667, ans=0.2 2024-06-21 00:10:32,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=291439.5, ans=0.1 2024-06-21 00:10:43,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=291476.1666666667, ans=0.0 2024-06-21 00:10:47,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.59 vs. limit=10.0 2024-06-21 00:10:53,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.63 vs. limit=15.0 2024-06-21 00:10:59,145 INFO [train.py:1028] (0/2) Epoch 16, batch 7250, loss[loss=0.2348, simple_loss=0.2852, pruned_loss=0.09222, over 12906.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2771, pruned_loss=0.0863, over 2580664.50 frames. ], batch size: 36, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:11:24,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=291549.5, ans=0.0 2024-06-21 00:11:29,712 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 1.997e+02 2.143e+02 2.395e+02 3.371e+02, threshold=4.287e+02, percent-clipped=0.0 2024-06-21 00:11:34,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=291567.8333333333, ans=0.0 2024-06-21 00:11:47,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=291586.1666666667, ans=0.1 2024-06-21 00:11:58,685 INFO [train.py:1028] (0/2) Epoch 16, batch 7300, loss[loss=0.2047, simple_loss=0.2718, pruned_loss=0.0688, over 13042.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2783, pruned_loss=0.08676, over 2581265.58 frames. ], batch size: 36, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:12:01,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=291604.5, ans=0.0 2024-06-21 00:12:02,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.53 vs. limit=15.0 2024-06-21 00:12:04,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=291604.5, ans=0.125 2024-06-21 00:12:09,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.00 vs. limit=22.5 2024-06-21 00:12:10,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=291622.8333333333, ans=0.5 2024-06-21 00:12:30,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=291659.5, ans=0.125 2024-06-21 00:12:33,680 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:12:45,695 INFO [train.py:1028] (0/2) Epoch 16, batch 7350, loss[loss=0.2519, simple_loss=0.3017, pruned_loss=0.1011, over 13235.00 frames. 
], tot_loss[loss=0.2267, simple_loss=0.279, pruned_loss=0.08715, over 2582594.27 frames. ], batch size: 46, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:12:47,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=291696.1666666667, ans=0.0 2024-06-21 00:12:48,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=291696.1666666667, ans=0.0 2024-06-21 00:12:49,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=291696.1666666667, ans=0.2 2024-06-21 00:12:54,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291714.5, ans=0.1 2024-06-21 00:13:03,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2024-06-21 00:13:03,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=291714.5, ans=0.1 2024-06-21 00:13:11,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=291732.8333333333, ans=0.2 2024-06-21 00:13:12,615 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.770e+02 1.960e+02 2.120e+02 2.309e+02 2.996e+02, threshold=4.241e+02, percent-clipped=0.0 2024-06-21 00:13:12,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=291732.8333333333, ans=0.0 2024-06-21 00:13:19,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=291751.1666666667, ans=0.125 2024-06-21 00:13:34,332 INFO [train.py:1028] (0/2) Epoch 16, batch 7400, loss[loss=0.2317, simple_loss=0.2946, pruned_loss=0.08435, over 13257.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2795, pruned_loss=0.08723, over 2587633.30 frames. ], batch size: 63, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:13:43,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=291806.1666666667, ans=0.035 2024-06-21 00:13:48,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.92 vs. limit=6.0 2024-06-21 00:13:58,953 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.82 vs. limit=12.0 2024-06-21 00:13:59,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=15.0 2024-06-21 00:14:18,419 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:14:26,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=291861.1666666667, ans=0.125 2024-06-21 00:14:30,400 INFO [train.py:1028] (0/2) Epoch 16, batch 7450, loss[loss=0.2087, simple_loss=0.2625, pruned_loss=0.07747, over 13042.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2795, pruned_loss=0.08712, over 2581876.81 frames. 
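The ScheduledFloat lines running through all of this show that many regularisation knobs (the *.dropout_p, *_skip_rate, balancer prob and bypass scale_min entries) are functions of batch_count rather than constants; each line reports the value currently in effect as ans=... . Schedules of this kind are typically piecewise-linear in batch_count, clamped at the endpoints; a self-contained sketch under that assumption, with illustrative breakpoints:

import bisect

class PiecewiseLinear:
    """value(batch_count): linear interpolation between (x, y) breakpoints,
    clamped to the first/last y outside the breakpoint range."""
    def __init__(self, *points):
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def __call__(self, x):
        if x <= self.xs[0]:
            return self.ys[0]
        if x >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, x)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# e.g. a skip rate annealed from 0.5 to 0.0 over the first 20000 batches:
skip_rate = PiecewiseLinear((0.0, 0.5), (20000.0, 0.0))
assert skip_rate(10000.0) == 0.25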
], batch size: 30, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:14:59,711 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.009e+02 2.161e+02 2.336e+02 2.937e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-21 00:15:15,085 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.59 vs. limit=15.0 2024-06-21 00:15:15,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=291934.5, ans=0.125 2024-06-21 00:15:19,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=291952.8333333333, ans=0.125 2024-06-21 00:15:20,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=291952.8333333333, ans=0.95 2024-06-21 00:15:21,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=291952.8333333333, ans=10.0 2024-06-21 00:15:27,198 INFO [train.py:1028] (0/2) Epoch 16, batch 7500, loss[loss=0.2351, simple_loss=0.2805, pruned_loss=0.09483, over 10512.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2799, pruned_loss=0.08762, over 2578308.46 frames. ], batch size: 303, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:15:33,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=291971.1666666667, ans=0.0 2024-06-21 00:15:37,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=291989.5, ans=0.0 2024-06-21 00:15:39,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2024-06-21 00:15:43,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=291989.5, ans=0.125 2024-06-21 00:15:43,822 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.54 vs. limit=22.5 2024-06-21 00:15:51,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=292007.8333333333, ans=0.125 2024-06-21 00:15:59,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=15.0 2024-06-21 00:16:03,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=292026.1666666667, ans=0.0 2024-06-21 00:16:04,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.79 vs. limit=15.0 2024-06-21 00:16:12,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=292044.5, ans=0.125 2024-06-21 00:16:15,124 INFO [train.py:1028] (0/2) Epoch 16, batch 7550, loss[loss=0.2252, simple_loss=0.2724, pruned_loss=0.08904, over 12940.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2803, pruned_loss=0.0881, over 2577642.29 frames. 
], batch size: 158, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:16:23,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=292081.1666666667, ans=0.0 2024-06-21 00:16:39,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=292099.5, ans=0.0 2024-06-21 00:16:42,350 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.045e+02 2.229e+02 2.488e+02 3.356e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-21 00:16:46,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292117.8333333333, ans=0.1 2024-06-21 00:16:53,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=292136.1666666667, ans=0.0 2024-06-21 00:17:04,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.03 vs. limit=22.5 2024-06-21 00:17:05,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.43 vs. limit=15.0 2024-06-21 00:17:09,785 INFO [train.py:1028] (0/2) Epoch 16, batch 7600, loss[loss=0.2131, simple_loss=0.2656, pruned_loss=0.08027, over 13235.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2807, pruned_loss=0.0881, over 2576215.98 frames. ], batch size: 83, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:17:30,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=15.27 vs. limit=15.0 2024-06-21 00:17:31,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=292191.1666666667, ans=0.0 2024-06-21 00:17:35,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=292191.1666666667, ans=0.125 2024-06-21 00:17:36,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=292191.1666666667, ans=0.125 2024-06-21 00:17:40,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=292209.5, ans=0.1 2024-06-21 00:18:06,447 INFO [train.py:1028] (0/2) Epoch 16, batch 7650, loss[loss=0.2182, simple_loss=0.2782, pruned_loss=0.07915, over 12968.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2819, pruned_loss=0.08864, over 2573228.43 frames. ], batch size: 33, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:18:09,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.20 vs. limit=10.0 2024-06-21 00:18:17,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. 
limit=22.5 2024-06-21 00:18:24,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=292264.5, ans=0.07 2024-06-21 00:18:28,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=292282.8333333333, ans=10.0 2024-06-21 00:18:33,496 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.031e+02 2.163e+02 2.370e+02 8.674e+02, threshold=4.327e+02, percent-clipped=1.0 2024-06-21 00:18:41,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=292301.1666666667, ans=0.125 2024-06-21 00:18:42,618 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:18:42,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=292301.1666666667, ans=0.125 2024-06-21 00:18:50,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=292319.5, ans=0.025 2024-06-21 00:18:54,624 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:18:55,138 INFO [train.py:1028] (0/2) Epoch 16, batch 7700, loss[loss=0.2288, simple_loss=0.2859, pruned_loss=0.08586, over 13248.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2823, pruned_loss=0.08886, over 2569984.55 frames. ], batch size: 63, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:19:02,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292337.8333333333, ans=0.1 2024-06-21 00:19:04,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292356.1666666667, ans=0.1 2024-06-21 00:19:10,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=292356.1666666667, ans=0.125 2024-06-21 00:19:12,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=292374.5, ans=0.125 2024-06-21 00:19:18,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292374.5, ans=0.1 2024-06-21 00:19:31,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=292392.8333333333, ans=0.025 2024-06-21 00:19:34,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=292411.1666666667, ans=0.125 2024-06-21 00:19:41,966 INFO [train.py:1028] (0/2) Epoch 16, batch 7750, loss[loss=0.2509, simple_loss=0.3117, pruned_loss=0.09506, over 13216.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2827, pruned_loss=0.08922, over 2574117.67 frames. 
], batch size: 72, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:19:43,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=292429.5, ans=0.1 2024-06-21 00:20:03,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=292447.8333333333, ans=0.125 2024-06-21 00:20:15,053 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.111e+02 2.280e+02 2.502e+02 3.219e+02, threshold=4.560e+02, percent-clipped=0.0 2024-06-21 00:20:34,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.05 vs. limit=12.0 2024-06-21 00:20:36,012 INFO [train.py:1028] (0/2) Epoch 16, batch 7800, loss[loss=0.206, simple_loss=0.2649, pruned_loss=0.07351, over 13164.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2831, pruned_loss=0.08889, over 2578817.24 frames. ], batch size: 95, lr: 3.62e-03, grad_scale: 32.0 2024-06-21 00:20:44,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.47 vs. limit=15.0 2024-06-21 00:20:48,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=292539.5, ans=0.125 2024-06-21 00:20:51,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=292539.5, ans=0.125 2024-06-21 00:20:51,984 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.669e+01 2024-06-21 00:21:04,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.94 vs. limit=12.0 2024-06-21 00:21:10,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=292576.1666666667, ans=0.125 2024-06-21 00:21:11,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=292576.1666666667, ans=0.2 2024-06-21 00:21:18,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=292594.5, ans=0.125 2024-06-21 00:21:24,878 INFO [train.py:1028] (0/2) Epoch 16, batch 7850, loss[loss=0.2127, simple_loss=0.2676, pruned_loss=0.07895, over 11324.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2837, pruned_loss=0.08919, over 2573477.59 frames. 
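Note how "batch size" swings across this stretch while the frame count per batch barely moves: the batch 7850 sample just below spans 11324 frames from only 16 cuts, whereas batch 7950 packs 303 cuts into 10643 frames. That inverse relation is what a duration-constrained sampler produces, since every batch is filled up to a roughly constant total duration. A greedy sketch of the idea (the real sampler additionally buckets cuts by length and shuffles; cut.duration is an assumed interface):

def duration_batches(cuts, max_duration):
    """Pack cuts into batches of at most `max_duration` total seconds, so
    batch size varies inversely with average cut length, as in the log."""
    batch, total = [], 0.0
    for cut in cuts:                  # each cut exposes .duration in seconds
        if batch and total + cut.duration > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut)
        total += cut.duration
    if batch:
        yield batch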
], batch size: 16, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:21:48,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=292649.5, ans=0.125 2024-06-21 00:21:50,069 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.016e+02 2.161e+02 2.474e+02 3.343e+02, threshold=4.322e+02, percent-clipped=0.0 2024-06-21 00:21:51,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=292649.5, ans=0.0 2024-06-21 00:22:08,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=292686.1666666667, ans=0.125 2024-06-21 00:22:11,640 INFO [train.py:1028] (0/2) Epoch 16, batch 7900, loss[loss=0.2231, simple_loss=0.2785, pruned_loss=0.08386, over 13232.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2844, pruned_loss=0.08977, over 2572622.18 frames. ], batch size: 77, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:22:38,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=292741.1666666667, ans=0.125 2024-06-21 00:22:39,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2024-06-21 00:22:56,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=292777.8333333333, ans=0.2 2024-06-21 00:23:02,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=292777.8333333333, ans=0.125 2024-06-21 00:23:05,462 INFO [train.py:1028] (0/2) Epoch 16, batch 7950, loss[loss=0.2584, simple_loss=0.2948, pruned_loss=0.111, over 10643.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2848, pruned_loss=0.08965, over 2576018.01 frames. ], batch size: 303, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:23:08,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=292796.1666666667, ans=0.2 2024-06-21 00:23:11,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=292796.1666666667, ans=0.125 2024-06-21 00:23:26,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=292832.8333333333, ans=0.125 2024-06-21 00:23:29,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.35 vs. limit=15.0 2024-06-21 00:23:29,935 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.063e+02 2.247e+02 2.453e+02 3.551e+02, threshold=4.494e+02, percent-clipped=0.0 2024-06-21 00:23:31,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.15 vs. 
limit=15.0 2024-06-21 00:23:45,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=292851.1666666667, ans=0.125 2024-06-21 00:23:58,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=292869.5, ans=0.0 2024-06-21 00:23:59,879 INFO [train.py:1028] (0/2) Epoch 16, batch 8000, loss[loss=0.2236, simple_loss=0.2776, pruned_loss=0.0848, over 12622.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2859, pruned_loss=0.08998, over 2572025.84 frames. ], batch size: 29, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:24:10,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=292906.1666666667, ans=0.0 2024-06-21 00:24:26,736 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:24:36,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.26 vs. limit=22.5 2024-06-21 00:24:45,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=292961.1666666667, ans=0.125 2024-06-21 00:24:48,192 INFO [train.py:1028] (0/2) Epoch 16, batch 8050, loss[loss=0.2312, simple_loss=0.2802, pruned_loss=0.09113, over 13200.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2853, pruned_loss=0.08943, over 2571684.15 frames. ], batch size: 83, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:24:48,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=292979.5, ans=0.025 2024-06-21 00:24:53,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=292979.5, ans=0.125 2024-06-21 00:24:54,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=292979.5, ans=0.0 2024-06-21 00:24:59,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.33 vs. limit=12.0 2024-06-21 00:25:06,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=293016.1666666667, ans=0.2 2024-06-21 00:25:12,715 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.034e+02 2.253e+02 2.433e+02 3.066e+02, threshold=4.506e+02, percent-clipped=0.0 2024-06-21 00:25:14,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=293034.5, ans=0.2 2024-06-21 00:25:20,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=293034.5, ans=0.125 2024-06-21 00:25:22,910 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2024-06-21 00:25:41,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=293071.1666666667, ans=0.125 2024-06-21 00:25:41,829 INFO [train.py:1028] (0/2) Epoch 16, batch 8100, loss[loss=0.2328, simple_loss=0.2905, pruned_loss=0.0876, over 13160.00 frames. 
], tot_loss[loss=0.232, simple_loss=0.285, pruned_loss=0.08949, over 2576408.16 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:25:46,789 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:25:47,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=293071.1666666667, ans=0.125 2024-06-21 00:26:11,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=293126.1666666667, ans=0.0 2024-06-21 00:26:34,802 INFO [train.py:1028] (0/2) Epoch 16, batch 8150, loss[loss=0.235, simple_loss=0.2845, pruned_loss=0.09271, over 13117.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2851, pruned_loss=0.08941, over 2579673.56 frames. ], batch size: 121, lr: 3.61e-03, grad_scale: 32.0 2024-06-21 00:26:44,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=293181.1666666667, ans=0.0 2024-06-21 00:26:44,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=293181.1666666667, ans=0.125 2024-06-21 00:26:56,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=293199.5, ans=0.125 2024-06-21 00:27:00,888 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.070e+02 2.211e+02 2.375e+02 3.623e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-21 00:27:04,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=293217.8333333333, ans=0.025 2024-06-21 00:27:04,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=293217.8333333333, ans=0.0 2024-06-21 00:27:11,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=293217.8333333333, ans=0.5 2024-06-21 00:27:12,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=293217.8333333333, ans=0.125 2024-06-21 00:27:17,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.35 vs. limit=10.0 2024-06-21 00:27:18,291 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2024-06-21 00:27:23,497 INFO [train.py:1028] (0/2) Epoch 16, batch 8200, loss[loss=0.2285, simple_loss=0.2783, pruned_loss=0.0894, over 13170.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2849, pruned_loss=0.08909, over 2583107.31 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:27:33,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.09 vs. 
limit=15.0 2024-06-21 00:27:39,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=293272.8333333333, ans=0.125 2024-06-21 00:27:48,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=293291.1666666667, ans=0.2 2024-06-21 00:27:51,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=293291.1666666667, ans=0.95 2024-06-21 00:27:51,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293291.1666666667, ans=0.1 2024-06-21 00:27:58,649 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 00:28:05,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293327.8333333333, ans=0.1 2024-06-21 00:28:06,290 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-160000.pt 2024-06-21 00:28:17,999 INFO [train.py:1028] (0/2) Epoch 16, batch 8250, loss[loss=0.2441, simple_loss=0.3067, pruned_loss=0.09077, over 13274.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2854, pruned_loss=0.08926, over 2583110.78 frames. ], batch size: 52, lr: 3.61e-03, grad_scale: 64.0 2024-06-21 00:28:20,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=293346.1666666667, ans=0.0 2024-06-21 00:28:25,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=293346.1666666667, ans=0.125 2024-06-21 00:28:27,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=293364.5, ans=0.125 2024-06-21 00:28:39,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=293364.5, ans=0.125 2024-06-21 00:28:50,890 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.131e+02 2.293e+02 2.526e+02 3.565e+02, threshold=4.587e+02, percent-clipped=0.0 2024-06-21 00:28:53,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=293401.1666666667, ans=0.0 2024-06-21 00:28:54,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=293401.1666666667, ans=0.125 2024-06-21 00:29:01,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=293419.5, ans=0.2 2024-06-21 00:29:04,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=293419.5, ans=0.125 2024-06-21 00:29:12,427 INFO [train.py:1028] (0/2) Epoch 16, batch 8300, loss[loss=0.2061, simple_loss=0.2536, pruned_loss=0.07924, over 13020.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2848, pruned_loss=0.08858, over 2580096.22 frames. 
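The checkpoint.py line above writes zipformer/exp/checkpoint-160000.pt in the middle of epoch 16: besides per-epoch saves, checkpoints keyed by what is presumably the global training-batch index are emitted at a fixed batch interval. A minimal sketch of that pattern; the helper name and checkpoint contents are illustrative, not icefall's actual API:

from pathlib import Path
import torch

def save_batch_checkpoint(model, optimizer, batch_idx_train, exp_dir, every_n):
    """Write exp_dir/checkpoint-<batch_idx_train>.pt every `every_n` batches,
    e.g. zipformer/exp/checkpoint-160000.pt as in the log."""
    if batch_idx_train == 0 or batch_idx_train % every_n != 0:
        return
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    torch.save(state, Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt")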
2024-06-21 00:29:16,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=293437.8333333333, ans=0.125
2024-06-21 00:29:39,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=293474.5, ans=0.0
2024-06-21 00:29:43,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=293474.5, ans=0.2
2024-06-21 00:29:52,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=293492.8333333333, ans=0.0
2024-06-21 00:29:53,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.90 vs. limit=6.0
2024-06-21 00:29:55,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=293511.1666666667, ans=10.0
2024-06-21 00:30:04,096 INFO [train.py:1028] (0/2) Epoch 16, batch 8350, loss[loss=0.2269, simple_loss=0.284, pruned_loss=0.08488, over 13209.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.285, pruned_loss=0.08861, over 2579489.12 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:30:25,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=293566.1666666667, ans=0.125
2024-06-21 00:30:30,506 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.134e+02 2.298e+02 2.669e+02 4.226e+02, threshold=4.596e+02, percent-clipped=0.0
2024-06-21 00:30:30,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.36 vs. limit=12.0
2024-06-21 00:30:32,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=293584.5, ans=0.125
2024-06-21 00:30:36,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=293584.5, ans=0.125
2024-06-21 00:30:37,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0
2024-06-21 00:30:53,648 INFO [train.py:1028] (0/2) Epoch 16, batch 8400, loss[loss=0.2104, simple_loss=0.2584, pruned_loss=0.08116, over 12966.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2853, pruned_loss=0.08884, over 2577020.30 frames. ], batch size: 39, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:31:18,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=293657.8333333333, ans=0.125
2024-06-21 00:31:19,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=293657.8333333333, ans=0.0
2024-06-21 00:31:44,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=293694.5, ans=0.0
2024-06-21 00:31:46,253 INFO [train.py:1028] (0/2) Epoch 16, batch 8450, loss[loss=0.2423, simple_loss=0.3013, pruned_loss=0.09169, over 13111.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2866, pruned_loss=0.08916, over 2579070.34 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:31:51,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0
2024-06-21 00:31:53,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=293712.8333333333, ans=0.125
2024-06-21 00:32:02,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=293731.1666666667, ans=0.0
2024-06-21 00:32:11,851 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.091e+02 2.234e+02 2.390e+02 3.677e+02, threshold=4.469e+02, percent-clipped=0.0
2024-06-21 00:32:13,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=293749.5, ans=0.0
2024-06-21 00:32:17,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=293767.8333333333, ans=0.125
2024-06-21 00:32:31,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=293786.1666666667, ans=0.125
2024-06-21 00:32:36,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=293804.5, ans=0.125
2024-06-21 00:32:36,871 INFO [train.py:1028] (0/2) Epoch 16, batch 8500, loss[loss=0.236, simple_loss=0.2919, pruned_loss=0.09012, over 12717.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2879, pruned_loss=0.0898, over 2577850.07 frames. ], batch size: 29, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:32:43,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=293804.5, ans=0.0
2024-06-21 00:32:54,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=293822.8333333333, ans=0.0
2024-06-21 00:32:54,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=293822.8333333333, ans=0.025
2024-06-21 00:33:05,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=293859.5, ans=0.07
2024-06-21 00:33:14,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=293859.5, ans=0.09899494936611666
2024-06-21 00:33:25,741 INFO [train.py:1028] (0/2) Epoch 16, batch 8550, loss[loss=0.2205, simple_loss=0.2838, pruned_loss=0.07859, over 12633.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.2879, pruned_loss=0.0898, over 2576138.81 frames. ], batch size: 22, lr: 3.61e-03, grad_scale: 64.0
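In every optim.py:487 warning in this section, the reported threshold equals Clipping_scale times the median of the grad-norm quartiles (e.g. 2.0 x 2.234e+02 = 4.469e+02 just above). A minimal sketch of that statistic over a window of recent gradient norms; this mirrors the logged numbers, not icefall's exact implementation:

import torch

def grad_norm_report(norms: torch.Tensor, clipping_scale: float = 2.0):
    # min / Q1 / median / Q3 / max of the recent grad norms
    q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = (clipping_scale * q[2]).item()       # scale x median
    pct = (norms > threshold).float().mean().item() * 100.0
    quart = " ".join(f"{v:.3e}" for v in q.tolist())
    print(f"Clipping_scale={clipping_scale}, grad-norm quartiles {quart}, "
          f"threshold={threshold:.3e}, percent-clipped={pct:.1f}")

grad_norm_report(180.0 + 60.0 * torch.rand(128))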
2024-06-21 00:33:29,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=293896.1666666667, ans=0.125
2024-06-21 00:33:35,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=293914.5, ans=0.0
2024-06-21 00:33:39,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=293914.5, ans=0.2
2024-06-21 00:33:48,643 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.121e+02 2.254e+02 2.627e+02 3.742e+02, threshold=4.508e+02, percent-clipped=0.0
2024-06-21 00:33:51,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=293951.1666666667, ans=10.0
2024-06-21 00:33:54,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=293951.1666666667, ans=0.0
2024-06-21 00:34:04,226 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.14 vs. limit=12.0
2024-06-21 00:34:11,521 INFO [train.py:1028] (0/2) Epoch 16, batch 8600, loss[loss=0.229, simple_loss=0.281, pruned_loss=0.08854, over 13149.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2878, pruned_loss=0.08949, over 2572836.55 frames. ], batch size: 112, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:34:11,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=293987.8333333333, ans=0.125
2024-06-21 00:34:18,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=293987.8333333333, ans=0.125
2024-06-21 00:34:34,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=294006.1666666667, ans=15.0
2024-06-21 00:34:41,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=294024.5, ans=0.2
2024-06-21 00:34:51,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.44 vs. limit=5.0
2024-06-21 00:35:07,560 INFO [train.py:1028] (0/2) Epoch 16, batch 8650, loss[loss=0.2261, simple_loss=0.286, pruned_loss=0.0831, over 12987.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2881, pruned_loss=0.08958, over 2574972.59 frames. ], batch size: 102, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:35:08,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=294079.5, ans=0.95
2024-06-21 00:35:31,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=294116.1666666667, ans=0.2
2024-06-21 00:35:35,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.96 vs. limit=22.5
2024-06-21 00:35:36,260 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.057e+02 2.180e+02 2.452e+02 3.730e+02, threshold=4.360e+02, percent-clipped=0.0
2024-06-21 00:35:53,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=294152.8333333333, ans=0.125
2024-06-21 00:35:54,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0
2024-06-21 00:35:55,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. limit=10.0
2024-06-21 00:35:56,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=12.0
2024-06-21 00:35:57,344 INFO [train.py:1028] (0/2) Epoch 16, batch 8700, loss[loss=0.2446, simple_loss=0.3037, pruned_loss=0.09275, over 13254.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.2887, pruned_loss=0.09002, over 2572018.21 frames. ], batch size: 59, lr: 3.61e-03, grad_scale: 64.0
2024-06-21 00:36:20,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=294207.8333333333, ans=0.125
2024-06-21 00:36:33,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=294226.1666666667, ans=0.015
2024-06-21 00:36:35,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294244.5, ans=0.1
2024-06-21 00:36:36,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=294244.5, ans=0.0
2024-06-21 00:36:46,051 INFO [train.py:1028] (0/2) Epoch 16, batch 8750, loss[loss=0.2261, simple_loss=0.2781, pruned_loss=0.0871, over 13095.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.2891, pruned_loss=0.09032, over 2567426.10 frames. ], batch size: 121, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:36:55,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=294281.1666666667, ans=0.125
2024-06-21 00:36:57,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=294281.1666666667, ans=0.125
2024-06-21 00:37:04,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0
2024-06-21 00:37:16,327 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.002e+02 2.159e+02 2.431e+02 2.997e+02, threshold=4.318e+02, percent-clipped=0.0
2024-06-21 00:37:40,168 INFO [train.py:1028] (0/2) Epoch 16, batch 8800, loss[loss=0.2221, simple_loss=0.2783, pruned_loss=0.08298, over 13237.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2896, pruned_loss=0.09046, over 2573150.77 frames. ], batch size: 72, lr: 3.60e-03, grad_scale: 64.0
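The scaling.py:214 records print module parameters (dropout probabilities, skip rates, balancer bounds) as a function of batch_count. A minimal sketch of a piecewise-linear scheduled value of that kind; the breakpoints below are made up for illustration, and icefall's ScheduledFloat may differ in detail:

import bisect

class ScheduledFloat:
    def __init__(self, *points):               # points: (batch_count, value) pairs
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]                  # before the first breakpoint
        if i == len(self.xs):
            return self.ys[-1]                 # past the last breakpoint
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

prob = ScheduledFloat((0.0, 0.3), (8000.0, 0.125))
print(prob.value(294006.17))   # long past the last breakpoint -> 0.125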
2024-06-21 00:37:47,300 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 00:37:50,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294372.8333333333, ans=0.1
2024-06-21 00:37:50,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=294372.8333333333, ans=0.0
2024-06-21 00:38:16,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=294409.5, ans=0.125
2024-06-21 00:38:17,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=294409.5, ans=0.2
2024-06-21 00:38:26,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=294427.8333333333, ans=0.125
2024-06-21 00:38:29,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=294427.8333333333, ans=0.125
2024-06-21 00:38:33,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=294427.8333333333, ans=0.125
2024-06-21 00:38:35,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=294446.1666666667, ans=0.125
2024-06-21 00:38:35,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=294446.1666666667, ans=0.1
2024-06-21 00:38:36,371 INFO [train.py:1028] (0/2) Epoch 16, batch 8850, loss[loss=0.2548, simple_loss=0.3038, pruned_loss=0.1029, over 12544.00 frames. ], tot_loss[loss=0.235, simple_loss=0.2889, pruned_loss=0.09054, over 2561909.58 frames. ], batch size: 202, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:38:42,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=294446.1666666667, ans=0.0
2024-06-21 00:38:44,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.73 vs. limit=22.5
2024-06-21 00:38:50,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=294464.5, ans=0.125
2024-06-21 00:39:03,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.629e+02 2.060e+02 2.205e+02 2.342e+02 3.169e+02, threshold=4.411e+02, percent-clipped=0.0
2024-06-21 00:39:18,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=294519.5, ans=0.0
2024-06-21 00:39:18,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=294519.5, ans=0.0
2024-06-21 00:39:25,466 INFO [train.py:1028] (0/2) Epoch 16, batch 8900, loss[loss=0.2442, simple_loss=0.3024, pruned_loss=0.09302, over 12878.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2901, pruned_loss=0.09113, over 2559071.92 frames. ], batch size: 33, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:39:33,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.49 vs. limit=22.5
2024-06-21 00:39:57,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=15.0
2024-06-21 00:40:16,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=294611.1666666667, ans=0.125
2024-06-21 00:40:20,919 INFO [train.py:1028] (0/2) Epoch 16, batch 8950, loss[loss=0.2541, simple_loss=0.3007, pruned_loss=0.1038, over 12539.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.29, pruned_loss=0.09081, over 2559249.72 frames. ], batch size: 202, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:40:22,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=294629.5, ans=0.125
2024-06-21 00:40:47,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=294666.1666666667, ans=0.125
2024-06-21 00:40:48,072 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 2.060e+02 2.227e+02 2.403e+02 3.562e+02, threshold=4.453e+02, percent-clipped=0.0
2024-06-21 00:40:50,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=294684.5, ans=0.125
2024-06-21 00:40:54,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0
2024-06-21 00:41:04,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=294702.8333333333, ans=0.0
2024-06-21 00:41:10,616 INFO [train.py:1028] (0/2) Epoch 16, batch 9000, loss[loss=0.2424, simple_loss=0.2979, pruned_loss=0.09342, over 13194.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.29, pruned_loss=0.09051, over 2566782.00 frames. ], batch size: 46, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:41:10,620 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 00:41:23,705 INFO [train.py:1060] (0/2) Epoch 16, validation: loss=0.1882, simple_loss=0.2528, pruned_loss=0.06174, over 351949.00 frames.
2024-06-21 00:41:23,707 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-21 00:42:10,601 INFO [train.py:1028] (0/2) Epoch 16, batch 9050, loss[loss=0.2087, simple_loss=0.2695, pruned_loss=0.07395, over 11629.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.2907, pruned_loss=0.0909, over 2566255.71 frames. ], batch size: 17, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:42:12,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=294812.8333333333, ans=0.125
2024-06-21 00:42:24,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=294831.1666666667, ans=0.0
2024-06-21 00:42:24,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=294831.1666666667, ans=0.025
2024-06-21 00:42:25,236 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.53 vs. limit=22.5
2024-06-21 00:42:31,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.49 vs. limit=15.0
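At batch 9000 the loop pauses training, computes a frame-weighted validation loss over the dev set, and reports peak GPU memory, as in the train.py:1051/1060/1061 records above. A minimal sketch of such a pass; the model interface here (returning loss and frame count per batch) is an assumption, not icefall's signature:

import torch

def compute_validation_loss(model, dev_loader, device="cuda"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = model(batch)          # assumed interface
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {mb}MB")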
2024-06-21 00:42:36,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.067e+02 2.207e+02 2.370e+02 2.974e+02, threshold=4.414e+02, percent-clipped=0.0
2024-06-21 00:42:47,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.21 vs. limit=15.0
2024-06-21 00:42:58,909 INFO [train.py:1028] (0/2) Epoch 16, batch 9100, loss[loss=0.2184, simple_loss=0.2806, pruned_loss=0.0781, over 13257.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2896, pruned_loss=0.09047, over 2567133.03 frames. ], batch size: 72, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:42:59,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=294904.5, ans=0.1
2024-06-21 00:43:02,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=294904.5, ans=0.125
2024-06-21 00:43:03,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=294904.5, ans=0.125
2024-06-21 00:43:36,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=294977.8333333333, ans=0.125
2024-06-21 00:43:37,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=294977.8333333333, ans=0.125
2024-06-21 00:43:38,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.41 vs. limit=15.0
2024-06-21 00:43:41,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0
2024-06-21 00:43:42,260 INFO [train.py:1028] (0/2) Epoch 16, batch 9150, loss[loss=0.226, simple_loss=0.291, pruned_loss=0.08047, over 13173.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2897, pruned_loss=0.09096, over 2567274.66 frames. ], batch size: 77, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:43:46,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=294996.1666666667, ans=0.125
2024-06-21 00:43:54,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=295014.5, ans=0.125
2024-06-21 00:44:07,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=295032.8333333333, ans=0.0
2024-06-21 00:44:07,840 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.083e+02 2.233e+02 2.458e+02 3.018e+02, threshold=4.467e+02, percent-clipped=0.0
2024-06-21 00:44:17,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=295051.1666666667, ans=0.0
2024-06-21 00:44:28,724 INFO [train.py:1028] (0/2) Epoch 16, batch 9200, loss[loss=0.2376, simple_loss=0.2967, pruned_loss=0.08928, over 12936.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2897, pruned_loss=0.09053, over 2569820.42 frames. ], batch size: 36, lr: 3.60e-03, grad_scale: 64.0
2024-06-21 00:44:34,383 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0
2024-06-21 00:44:48,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=295106.1666666667, ans=0.2
2024-06-21 00:45:15,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=295161.1666666667, ans=0.04949747468305833
2024-06-21 00:45:16,711 INFO [train.py:1028] (0/2) Epoch 16, batch 9250, loss[loss=0.2119, simple_loss=0.2703, pruned_loss=0.07675, over 13271.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.2894, pruned_loss=0.09019, over 2573434.51 frames. ], batch size: 67, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:45:28,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=295197.8333333333, ans=0.2
2024-06-21 00:45:40,574 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.038e+02 2.172e+02 2.341e+02 3.671e+02, threshold=4.344e+02, percent-clipped=0.0
2024-06-21 00:45:42,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=295234.5, ans=0.125
2024-06-21 00:45:44,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295234.5, ans=0.0
2024-06-21 00:45:50,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=295252.8333333333, ans=0.125
2024-06-21 00:46:01,052 INFO [train.py:1028] (0/2) Epoch 16, batch 9300, loss[loss=0.2084, simple_loss=0.2705, pruned_loss=0.07313, over 12998.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2893, pruned_loss=0.08984, over 2571361.05 frames. ], batch size: 39, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:46:01,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0
2024-06-21 00:46:02,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=295271.1666666667, ans=0.2
2024-06-21 00:46:06,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=295271.1666666667, ans=0.0
2024-06-21 00:46:11,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=295289.5, ans=0.125
2024-06-21 00:46:13,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=295289.5, ans=0.0
2024-06-21 00:46:17,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=295307.8333333333, ans=0.1
2024-06-21 00:46:18,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=295307.8333333333, ans=0.1
2024-06-21 00:46:28,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=295326.1666666667, ans=0.125
2024-06-21 00:46:32,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=15.0
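Each train.py:1028 record pairs the current batch's loss (loss[... over N frames]) with a running figure (tot_loss[... over N frames]). A minimal sketch of a frame-weighted accumulator that reproduces the shape of that output; whether the real tracker decays old batches or uses a plain sum is not visible from the log, so this is only an approximation:

class LossTracker:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float):
        # weight each batch by its number of acoustic frames
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    def average(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

t = LossTracker()
t.update(0.2376, 12936.0)
print(f"tot_loss[loss={t.average():.4f}, over {t.frames:.2f} frames.]")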
2024-06-21 00:46:39,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=295344.5, ans=0.125
2024-06-21 00:46:40,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=295344.5, ans=0.125
2024-06-21 00:46:44,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.89 vs. limit=15.0
2024-06-21 00:46:46,400 INFO [train.py:1028] (0/2) Epoch 16, batch 9350, loss[loss=0.216, simple_loss=0.2736, pruned_loss=0.07918, over 12757.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.2886, pruned_loss=0.0893, over 2569309.51 frames. ], batch size: 22, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:46:49,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.50 vs. limit=10.0
2024-06-21 00:46:52,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=295362.8333333333, ans=0.1
2024-06-21 00:46:58,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=295381.1666666667, ans=0.125
2024-06-21 00:47:04,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=295399.5, ans=0.1
2024-06-21 00:47:08,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=295399.5, ans=0.125
2024-06-21 00:47:10,196 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.054e+02 2.151e+02 2.286e+02 2.863e+02, threshold=4.301e+02, percent-clipped=0.0
2024-06-21 00:47:12,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=295417.8333333333, ans=0.5
2024-06-21 00:47:14,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0
2024-06-21 00:47:18,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=295417.8333333333, ans=0.5
2024-06-21 00:47:19,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=295417.8333333333, ans=0.2
2024-06-21 00:47:30,063 INFO [train.py:1028] (0/2) Epoch 16, batch 9400, loss[loss=0.2187, simple_loss=0.2811, pruned_loss=0.07816, over 13310.00 frames. ], tot_loss[loss=0.234, simple_loss=0.289, pruned_loss=0.08952, over 2569257.68 frames. ], batch size: 52, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:47:34,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=295454.5, ans=0.0
2024-06-21 00:47:40,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.74 vs. limit=10.0
2024-06-21 00:47:40,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=295472.8333333333, ans=0.0
2024-06-21 00:47:40,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=295472.8333333333, ans=0.2
2024-06-21 00:47:48,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=295491.1666666667, ans=0.04949747468305833
2024-06-21 00:47:48,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=295491.1666666667, ans=0.125
2024-06-21 00:48:08,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=295527.8333333333, ans=0.125
2024-06-21 00:48:14,162 INFO [train.py:1028] (0/2) Epoch 16, batch 9450, loss[loss=0.2779, simple_loss=0.3307, pruned_loss=0.1125, over 12597.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.2902, pruned_loss=0.09025, over 2569450.91 frames. ], batch size: 22, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:48:14,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=295546.1666666667, ans=0.0
2024-06-21 00:48:19,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=295546.1666666667, ans=0.2
2024-06-21 00:48:32,250 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0
2024-06-21 00:48:37,788 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.031e+02 2.129e+02 2.339e+02 3.092e+02, threshold=4.257e+02, percent-clipped=0.0
2024-06-21 00:48:57,177 INFO [train.py:1028] (0/2) Epoch 16, batch 9500, loss[loss=0.2276, simple_loss=0.2875, pruned_loss=0.08387, over 13274.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.2892, pruned_loss=0.08976, over 2578300.65 frames. ], batch size: 43, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:48:58,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=295637.8333333333, ans=0.125
2024-06-21 00:49:14,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=295656.1666666667, ans=0.2
2024-06-21 00:49:20,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.65 vs. limit=15.0
2024-06-21 00:49:25,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=295674.5, ans=10.0
2024-06-21 00:49:31,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=295692.8333333333, ans=0.125
2024-06-21 00:49:38,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=295711.1666666667, ans=0.125
2024-06-21 00:49:45,395 INFO [train.py:1028] (0/2) Epoch 16, batch 9550, loss[loss=0.2077, simple_loss=0.2603, pruned_loss=0.07751, over 12838.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2889, pruned_loss=0.08975, over 2572798.15 frames. ], batch size: 39, lr: 3.60e-03, grad_scale: 32.0
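The scaling.py:1023 records compare a per-module whitening "metric" against a limit: a measure of how far the activations' covariance is from a multiple of the identity. The formula below (mean squared eigenvalue over squared mean eigenvalue, which is 1.0 for perfectly white features and grows as variance concentrates in few directions) is my assumption about what is measured, not a quote of icefall's scaling.py:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    d = cov.shape[0]
    # equals 1.0 when cov is a multiple of I; larger means less white
    return (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

white = torch.randn(4096, 384)
print(whitening_metric(white))                                   # close to 1
print(whitening_metric(white * torch.linspace(0.1, 3.0, 384)))   # much larger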
2024-06-21 00:49:48,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=295729.5, ans=6.0
2024-06-21 00:50:03,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=295766.1666666667, ans=0.125
2024-06-21 00:50:08,377 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.091e+02 2.304e+02 2.615e+02 3.563e+02, threshold=4.608e+02, percent-clipped=0.0
2024-06-21 00:50:27,203 INFO [train.py:1028] (0/2) Epoch 16, batch 9600, loss[loss=0.2527, simple_loss=0.2916, pruned_loss=0.1069, over 10627.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.2891, pruned_loss=0.08986, over 2571787.96 frames. ], batch size: 304, lr: 3.60e-03, grad_scale: 32.0
2024-06-21 00:51:05,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0
2024-06-21 00:51:13,472 INFO [train.py:1028] (0/2) Epoch 16, batch 9650, loss[loss=0.2383, simple_loss=0.2831, pruned_loss=0.09671, over 13061.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2898, pruned_loss=0.09093, over 2561052.24 frames. ], batch size: 132, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:51:16,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=295912.8333333333, ans=0.125
2024-06-21 00:51:24,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=295931.1666666667, ans=0.025
2024-06-21 00:51:26,670 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0
2024-06-21 00:51:30,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=295949.5, ans=0.125
2024-06-21 00:51:36,850 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.111e+02 2.278e+02 2.540e+02 3.381e+02, threshold=4.555e+02, percent-clipped=0.0
2024-06-21 00:51:40,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0
2024-06-21 00:51:43,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=15.0
2024-06-21 00:51:56,231 INFO [train.py:1028] (0/2) Epoch 16, batch 9700, loss[loss=0.2286, simple_loss=0.2726, pruned_loss=0.09232, over 13055.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.2892, pruned_loss=0.09077, over 2555957.94 frames. ], batch size: 145, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:52:05,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=296022.8333333333, ans=0.125
2024-06-21 00:52:08,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=296022.8333333333, ans=0.2
2024-06-21 00:52:30,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=296059.5, ans=0.07
2024-06-21 00:52:37,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=296077.8333333333, ans=0.125
2024-06-21 00:52:42,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=296096.1666666667, ans=0.0
2024-06-21 00:52:43,317 INFO [train.py:1028] (0/2) Epoch 16, batch 9750, loss[loss=0.2295, simple_loss=0.2758, pruned_loss=0.09165, over 13080.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2884, pruned_loss=0.09026, over 2552529.34 frames. ], batch size: 132, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:52:44,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=296096.1666666667, ans=0.1
2024-06-21 00:52:44,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296096.1666666667, ans=0.1
2024-06-21 00:52:58,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=296114.5, ans=0.125
2024-06-21 00:52:59,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=296132.8333333333, ans=0.125
2024-06-21 00:53:04,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0
2024-06-21 00:53:06,411 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.052e+02 2.160e+02 2.428e+02 3.667e+02, threshold=4.320e+02, percent-clipped=0.0
2024-06-21 00:53:06,945 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0
2024-06-21 00:53:08,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=296151.1666666667, ans=0.0
2024-06-21 00:53:17,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296169.5, ans=0.1
2024-06-21 00:53:20,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296169.5, ans=0.1
2024-06-21 00:53:24,543 INFO [train.py:1028] (0/2) Epoch 16, batch 9800, loss[loss=0.2131, simple_loss=0.2737, pruned_loss=0.07625, over 12923.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2873, pruned_loss=0.08955, over 2544303.44 frames. ], batch size: 39, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:53:26,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=296187.8333333333, ans=15.0
2024-06-21 00:53:32,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=296206.1666666667, ans=0.125
2024-06-21 00:53:37,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=296206.1666666667, ans=0.0
2024-06-21 00:53:45,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0
2024-06-21 00:54:02,293 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.96 vs. limit=22.5
2024-06-21 00:54:04,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.51 vs. limit=22.5
2024-06-21 00:54:07,034 INFO [train.py:1028] (0/2) Epoch 16, batch 9850, loss[loss=0.2273, simple_loss=0.2776, pruned_loss=0.08852, over 12964.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2858, pruned_loss=0.08873, over 2538027.04 frames. ], batch size: 102, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:54:08,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=296279.5, ans=0.0
2024-06-21 00:54:15,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=296297.8333333333, ans=0.125
2024-06-21 00:54:31,718 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.050e+02 2.180e+02 2.433e+02 3.497e+02, threshold=4.359e+02, percent-clipped=0.0
2024-06-21 00:54:44,270 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.52 vs. limit=22.5
2024-06-21 00:54:47,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0
2024-06-21 00:54:50,353 INFO [train.py:1028] (0/2) Epoch 16, batch 9900, loss[loss=0.1993, simple_loss=0.2555, pruned_loss=0.07155, over 12989.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2853, pruned_loss=0.08889, over 2531641.85 frames. ], batch size: 39, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:54:54,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.48 vs. limit=15.0
2024-06-21 00:54:58,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=296389.5, ans=0.0
2024-06-21 00:54:59,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=296389.5, ans=0.125
2024-06-21 00:55:03,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=296389.5, ans=0.125
2024-06-21 00:55:05,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0
2024-06-21 00:55:07,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0
2024-06-21 00:55:25,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296444.5, ans=0.1
2024-06-21 00:55:29,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.67 vs. limit=22.5
2024-06-21 00:55:31,504 INFO [train.py:1028] (0/2) Epoch 16, batch 9950, loss[loss=0.2379, simple_loss=0.2914, pruned_loss=0.09215, over 12673.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2848, pruned_loss=0.08891, over 2527007.28 frames. ], batch size: 29, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:55:34,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=15.0
2024-06-21 00:55:42,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=296481.1666666667, ans=0.0
2024-06-21 00:55:44,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=296481.1666666667, ans=0.125
2024-06-21 00:55:48,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. limit=15.0
2024-06-21 00:55:57,109 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.028e+02 2.184e+02 2.354e+02 3.237e+02, threshold=4.367e+02, percent-clipped=0.0
2024-06-21 00:55:57,328 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 00:56:00,011 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.90 vs. limit=15.0
2024-06-21 00:56:06,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.68 vs. limit=15.0
2024-06-21 00:56:06,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.95 vs. limit=15.0
2024-06-21 00:56:11,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=296536.1666666667, ans=0.125
2024-06-21 00:56:12,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=296536.1666666667, ans=0.125
2024-06-21 00:56:13,351 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 00:56:16,300 INFO [train.py:1028] (0/2) Epoch 16, batch 10000, loss[loss=0.2119, simple_loss=0.2762, pruned_loss=0.07378, over 12674.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.285, pruned_loss=0.08924, over 2486939.32 frames. ], batch size: 22, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:56:18,087 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=8.94 vs. limit=12.0
2024-06-21 00:56:20,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0
2024-06-21 00:56:27,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=296572.8333333333, ans=0.125
2024-06-21 00:56:28,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296572.8333333333, ans=0.1
2024-06-21 00:56:32,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296572.8333333333, ans=0.1
2024-06-21 00:56:47,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=296609.5, ans=0.125
2024-06-21 00:56:48,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.20 vs. limit=15.0
2024-06-21 00:57:00,198 INFO [train.py:1028] (0/2) Epoch 16, batch 10050, loss[loss=0.2327, simple_loss=0.2904, pruned_loss=0.08756, over 12434.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2853, pruned_loss=0.08982, over 2446395.63 frames. ], batch size: 22, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:57:05,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296646.1666666667, ans=0.1
2024-06-21 00:57:13,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=296664.5, ans=0.07
2024-06-21 00:57:17,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=296682.8333333333, ans=0.09899494936611666
2024-06-21 00:57:18,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=296682.8333333333, ans=0.125
2024-06-21 00:57:19,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.32 vs. limit=15.0
2024-06-21 00:57:23,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.106e+02 2.262e+02 2.542e+02 4.440e+02, threshold=4.524e+02, percent-clipped=1.0
2024-06-21 00:57:24,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.24 vs. limit=22.5
2024-06-21 00:57:31,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=296701.1666666667, ans=0.125
2024-06-21 00:57:32,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.83 vs. limit=15.0
2024-06-21 00:57:42,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=15.0
2024-06-21 00:57:44,040 INFO [train.py:1028] (0/2) Epoch 16, batch 10100, loss[loss=0.2123, simple_loss=0.2637, pruned_loss=0.08045, over 11165.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2843, pruned_loss=0.08928, over 2426986.06 frames. ], batch size: 16, lr: 3.59e-03, grad_scale: 32.0
2024-06-21 00:57:59,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2024-06-21 00:58:04,554 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-16.pt
2024-06-21 01:00:56,089 INFO [train.py:1028] (0/2) Epoch 17, batch 0, loss[loss=0.2088, simple_loss=0.2619, pruned_loss=0.07783, over 12938.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2619, pruned_loss=0.07783, over 12938.00 frames. ], batch size: 36, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:00:56,091 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 01:01:02,319 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([6.3868, 5.4603, 5.9954, 5.7552], device='cuda:0')
2024-06-21 01:01:04,976 INFO [train.py:1060] (0/2) Epoch 17, validation: loss=0.1896, simple_loss=0.255, pruned_loss=0.06204, over 351949.00 frames.
2024-06-21 01:01:04,977 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-21 01:01:14,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=296789.1666666667, ans=0.125
2024-06-21 01:01:18,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=296789.1666666667, ans=0.125
2024-06-21 01:01:24,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=296807.5, ans=0.2
2024-06-21 01:01:25,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=296807.5, ans=0.5
2024-06-21 01:01:26,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=296807.5, ans=0.0
2024-06-21 01:01:37,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=296825.8333333333, ans=0.07
2024-06-21 01:01:40,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296825.8333333333, ans=0.1
2024-06-21 01:01:53,373 INFO [train.py:1028] (0/2) Epoch 17, batch 50, loss[loss=0.2141, simple_loss=0.2678, pruned_loss=0.08023, over 12470.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2687, pruned_loss=0.08328, over 575353.63 frames. ], batch size: 29, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:01:53,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=296862.5, ans=0.2
2024-06-21 01:01:56,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0
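Across the epoch boundary above, the learning rate steps from 3.59e-03 (late epoch 16) to 3.48e-03 (start of epoch 17), consistent with a schedule that decays with both batch count and epoch count. A minimal Eden-style sketch of such a rule; both the formula and the constants below are assumptions for illustration, not a readout of this run's configuration:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # decay smoothly with the number of batches seen...
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    # ...and drop a little further each epoch
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

for epoch in (16, 17):
    print(epoch, eden_lr(0.035, batch=296700, epoch=epoch))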
2024-06-21 01:02:01,899 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 1.942e+02 2.093e+02 2.276e+02 2.812e+02, threshold=4.187e+02, percent-clipped=0.0
2024-06-21 01:02:05,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=296880.8333333333, ans=0.125
2024-06-21 01:02:08,344 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:02:10,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=296899.1666666667, ans=0.2
2024-06-21 01:02:14,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=296899.1666666667, ans=0.1
2024-06-21 01:02:21,159 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:02:28,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=296935.8333333333, ans=0.0
2024-06-21 01:02:36,429 INFO [train.py:1028] (0/2) Epoch 17, batch 100, loss[loss=0.2142, simple_loss=0.2704, pruned_loss=0.07905, over 13328.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2665, pruned_loss=0.08236, over 1018253.84 frames. ], batch size: 46, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:02:43,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=296954.1666666667, ans=0.0
2024-06-21 01:02:43,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=296954.1666666667, ans=0.125
2024-06-21 01:02:45,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5
2024-06-21 01:02:49,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.68 vs. limit=10.0
2024-06-21 01:03:03,828 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:03:08,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=297009.1666666667, ans=12.0
2024-06-21 01:03:29,004 INFO [train.py:1028] (0/2) Epoch 17, batch 150, loss[loss=0.2159, simple_loss=0.265, pruned_loss=0.08338, over 12746.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2645, pruned_loss=0.08053, over 1366056.63 frames. ], batch size: 29, lr: 3.48e-03, grad_scale: 32.0
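The attn_weights_entropy tensor printed during the validation pass at the start of epoch 17 above gives one value per attention head. A minimal sketch of that diagnostic, computing the entropy of each head's attention distribution averaged over query positions; the shapes and the reduction are assumptions:

import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_heads, num_queries, num_keys), each row summing to 1
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)   # (heads, queries)
    return ent.mean(dim=-1)                            # one value per head

attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(attn))   # near log(50) ~ 3.9 for near-uniform heads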
], batch size: 29, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:03:30,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=297045.8333333333, ans=0.2 2024-06-21 01:03:38,114 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.949e+02 2.067e+02 2.180e+02 2.825e+02, threshold=4.135e+02, percent-clipped=0.0 2024-06-21 01:03:57,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297100.8333333333, ans=0.125 2024-06-21 01:04:00,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=297100.8333333333, ans=0.025 2024-06-21 01:04:02,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2024-06-21 01:04:07,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297119.1666666667, ans=0.1 2024-06-21 01:04:07,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=297119.1666666667, ans=0.02 2024-06-21 01:04:11,343 INFO [train.py:1028] (0/2) Epoch 17, batch 200, loss[loss=0.2456, simple_loss=0.2916, pruned_loss=0.09977, over 12529.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2645, pruned_loss=0.0809, over 1635744.71 frames. ], batch size: 202, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:04:14,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=297137.5, ans=0.125 2024-06-21 01:04:23,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=297155.8333333333, ans=0.2 2024-06-21 01:04:34,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.62 vs. limit=15.0 2024-06-21 01:04:53,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=297210.8333333333, ans=0.125 2024-06-21 01:05:00,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=297210.8333333333, ans=0.2 2024-06-21 01:05:02,558 INFO [train.py:1028] (0/2) Epoch 17, batch 250, loss[loss=0.2023, simple_loss=0.2433, pruned_loss=0.08062, over 13009.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2651, pruned_loss=0.08109, over 1848043.01 frames. ], batch size: 144, lr: 3.48e-03, grad_scale: 32.0 2024-06-21 01:05:11,891 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.632e+02 2.034e+02 2.201e+02 2.360e+02 3.228e+02, threshold=4.403e+02, percent-clipped=0.0 2024-06-21 01:05:13,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.21 vs. limit=15.0 2024-06-21 01:05:31,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=297284.1666666667, ans=0.125 2024-06-21 01:05:45,082 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.71 vs. 
2024-06-21 01:05:46,192 INFO [train.py:1028] (0/2) Epoch 17, batch 300, loss[loss=0.2148, simple_loss=0.259, pruned_loss=0.08532, over 13179.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2649, pruned_loss=0.08071, over 2010796.81 frames. ], batch size: 112, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:05:52,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=297320.8333333333, ans=0.0
2024-06-21 01:06:07,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=297357.5, ans=0.125
2024-06-21 01:06:30,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=297394.1666666667, ans=0.125
2024-06-21 01:06:37,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=297412.5, ans=0.07
2024-06-21 01:06:38,558 INFO [train.py:1028] (0/2) Epoch 17, batch 350, loss[loss=0.2031, simple_loss=0.2584, pruned_loss=0.07387, over 12970.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2647, pruned_loss=0.08077, over 2139476.02 frames. ], batch size: 33, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:06:44,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=297412.5, ans=0.0
2024-06-21 01:06:47,659 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 1.967e+02 2.096e+02 2.286e+02 3.034e+02, threshold=4.193e+02, percent-clipped=0.0
2024-06-21 01:07:03,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=297467.5, ans=0.125
2024-06-21 01:07:09,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=297467.5, ans=0.125
2024-06-21 01:07:16,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=297485.8333333333, ans=0.125
2024-06-21 01:07:30,089 INFO [train.py:1028] (0/2) Epoch 17, batch 400, loss[loss=0.2036, simple_loss=0.2582, pruned_loss=0.07454, over 13265.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.265, pruned_loss=0.08073, over 2239671.48 frames. ], batch size: 63, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:07:44,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=297522.5, ans=0.2
2024-06-21 01:07:56,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=297559.1666666667, ans=0.125
2024-06-21 01:08:00,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=297559.1666666667, ans=0.2
2024-06-21 01:08:01,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=22.5
2024-06-21 01:08:04,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0
2024-06-21 01:08:16,026 INFO [train.py:1028] (0/2) Epoch 17, batch 450, loss[loss=0.2073, simple_loss=0.2645, pruned_loss=0.07507, over 13241.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2651, pruned_loss=0.08066, over 2313430.72 frames. ], batch size: 67, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:08:24,586 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 1.955e+02 2.055e+02 2.174e+02 2.775e+02, threshold=4.109e+02, percent-clipped=0.0
2024-06-21 01:08:31,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. limit=10.0
2024-06-21 01:08:42,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0
2024-06-21 01:08:50,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=297669.1666666667, ans=0.0
2024-06-21 01:08:57,552 INFO [train.py:1028] (0/2) Epoch 17, batch 500, loss[loss=0.2041, simple_loss=0.2535, pruned_loss=0.07736, over 13100.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2659, pruned_loss=0.08063, over 2375301.44 frames. ], batch size: 121, lr: 3.48e-03, grad_scale: 32.0
2024-06-21 01:08:57,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=297687.5, ans=0.025
2024-06-21 01:09:00,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=297687.5, ans=0.2
2024-06-21 01:09:21,140 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0
2024-06-21 01:09:24,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=297742.5, ans=0.125
2024-06-21 01:09:31,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=297760.8333333333, ans=0.04949747468305833
2024-06-21 01:09:38,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=297760.8333333333, ans=0.125
2024-06-21 01:09:41,843 INFO [train.py:1028] (0/2) Epoch 17, batch 550, loss[loss=0.224, simple_loss=0.2696, pruned_loss=0.08921, over 12945.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2661, pruned_loss=0.08078, over 2420115.86 frames. ], batch size: 158, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:09:44,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=297779.1666666667, ans=0.0
2024-06-21 01:09:45,672 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0
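The ScheduledFloat entries print hyperparameters (dropout probabilities, skip rates, balancer bounds) that Zipformer anneals as a function of the training batch count; by batch_count near 297k they have long settled at their final values. A sketch of piecewise-linear scheduling with made-up breakpoints, not the scaling.py implementation:

```python
class PiecewiseLinear:
    """A value that varies piecewise-linearly with the training batch count,
    e.g. PiecewiseLinear((0.0, 0.3), (20000.0, 0.1)) decays 0.3 -> 0.1."""
    def __init__(self, *points):
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

dropout = PiecewiseLinear((0.0, 0.3), (20000.0, 0.1))
print(dropout(296899.0))  # fully annealed by this point in training: 0.1
```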
2024-06-21 01:09:50,766 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.928e+02 2.028e+02 2.198e+02 3.160e+02, threshold=4.057e+02, percent-clipped=0.0
2024-06-21 01:09:58,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=297815.8333333333, ans=0.125
2024-06-21 01:10:18,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297852.5, ans=0.1
2024-06-21 01:10:19,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=297852.5, ans=0.0
2024-06-21 01:10:19,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=15.0
2024-06-21 01:10:23,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=297852.5, ans=0.125
2024-06-21 01:10:25,854 INFO [train.py:1028] (0/2) Epoch 17, batch 600, loss[loss=0.1933, simple_loss=0.2375, pruned_loss=0.0745, over 13034.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2659, pruned_loss=0.08076, over 2457903.70 frames. ], batch size: 144, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:11:07,358 INFO [train.py:1028] (0/2) Epoch 17, batch 650, loss[loss=0.2062, simple_loss=0.2623, pruned_loss=0.07507, over 13208.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2659, pruned_loss=0.08044, over 2489406.49 frames. ], batch size: 59, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:11:17,247 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.020e+02 2.118e+02 2.265e+02 2.937e+02, threshold=4.235e+02, percent-clipped=0.0
2024-06-21 01:11:17,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=297980.8333333333, ans=0.125
2024-06-21 01:11:28,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.24 vs. limit=15.0
2024-06-21 01:11:28,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=297999.1666666667, ans=0.0
2024-06-21 01:11:30,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=297999.1666666667, ans=15.0
2024-06-21 01:11:35,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=298017.5, ans=0.125
2024-06-21 01:11:52,910 INFO [train.py:1028] (0/2) Epoch 17, batch 700, loss[loss=0.2201, simple_loss=0.27, pruned_loss=0.0851, over 13340.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2657, pruned_loss=0.08076, over 2512589.43 frames. ], batch size: 46, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:12:12,227 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:12:15,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=298090.8333333333, ans=0.125
2024-06-21 01:12:15,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0
2024-06-21 01:12:17,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. limit=5.0
2024-06-21 01:12:24,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=298109.1666666667, ans=0.0
2024-06-21 01:12:40,387 INFO [train.py:1028] (0/2) Epoch 17, batch 750, loss[loss=0.2241, simple_loss=0.2852, pruned_loss=0.08148, over 13267.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2653, pruned_loss=0.08041, over 2528587.37 frames. ], batch size: 63, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:12:40,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=298145.8333333333, ans=0.125
2024-06-21 01:12:49,381 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 1.954e+02 2.080e+02 2.263e+02 2.989e+02, threshold=4.161e+02, percent-clipped=0.0
2024-06-21 01:13:32,275 INFO [train.py:1028] (0/2) Epoch 17, batch 800, loss[loss=0.2153, simple_loss=0.2717, pruned_loss=0.0794, over 13219.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2657, pruned_loss=0.08062, over 2540895.17 frames. ], batch size: 37, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:13:39,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298237.5, ans=0.1
2024-06-21 01:13:54,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=298274.1666666667, ans=0.125
2024-06-21 01:14:10,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=298310.8333333333, ans=0.0
2024-06-21 01:14:14,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=298329.1666666667, ans=0.2
2024-06-21 01:14:15,453 INFO [train.py:1028] (0/2) Epoch 17, batch 850, loss[loss=0.2111, simple_loss=0.2655, pruned_loss=0.07838, over 13159.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2653, pruned_loss=0.08012, over 2551286.05 frames. ], batch size: 95, lr: 3.47e-03, grad_scale: 32.0
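The Whitening lines track how close a module's activations are to an isotropic ("white") covariance; a penalty is applied only while the metric exceeds its limit. A simplified reimplementation of the statistic, under the assumption that it is the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue (exactly 1.0 when perfectly white):

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns 1.0 iff the covariance of x is
    a multiple of the identity; larger values mean less 'white'."""
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]          # (C, C) sample covariance
    dim = cov.shape[0]
    # mean(eigenvalue^2) / mean(eigenvalue)^2, computed without an eigendecomposition:
    return (cov * cov).sum() * dim / cov.diag().sum() ** 2

x = torch.randn(1000, 256)                   # nearly white input
print(float(whitening_metric(x)))            # modestly above 1.0 (sampling noise)
```

This would explain entries like metric=2.84 vs. limit=6.0 (well inside the limit, no penalty) versus metric=15.21 vs. limit=15.0 (just over, so a small corrective gradient is applied).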
2024-06-21 01:14:17,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=298329.1666666667, ans=0.035
2024-06-21 01:14:23,045 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.966e+02 2.081e+02 2.235e+02 2.672e+02, threshold=4.162e+02, percent-clipped=0.0
2024-06-21 01:14:27,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=298347.5, ans=0.0
2024-06-21 01:14:34,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=298365.8333333333, ans=0.025
2024-06-21 01:14:37,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=298365.8333333333, ans=0.2
2024-06-21 01:14:37,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298365.8333333333, ans=0.1
2024-06-21 01:14:42,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=298384.1666666667, ans=0.125
2024-06-21 01:14:53,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=298384.1666666667, ans=0.2
2024-06-21 01:15:04,985 INFO [train.py:1028] (0/2) Epoch 17, batch 900, loss[loss=0.221, simple_loss=0.2736, pruned_loss=0.08424, over 12959.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2648, pruned_loss=0.08022, over 2555706.57 frames. ], batch size: 36, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:15:12,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=298420.8333333333, ans=0.0
2024-06-21 01:15:25,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=298457.5, ans=0.125
2024-06-21 01:15:26,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=298457.5, ans=0.125
2024-06-21 01:15:35,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=298475.8333333333, ans=0.125
2024-06-21 01:15:36,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=298475.8333333333, ans=0.2
2024-06-21 01:15:56,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=298512.5, ans=0.0
2024-06-21 01:15:56,838 INFO [train.py:1028] (0/2) Epoch 17, batch 950, loss[loss=0.185, simple_loss=0.2479, pruned_loss=0.06101, over 13045.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2645, pruned_loss=0.07995, over 2559322.34 frames. ], batch size: 39, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:16:05,997 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.908e+02 2.011e+02 2.199e+02 3.008e+02, threshold=4.023e+02, percent-clipped=0.0
2024-06-21 01:16:18,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=298549.1666666667, ans=0.1
2024-06-21 01:16:18,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=298549.1666666667, ans=0.5
2024-06-21 01:16:28,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=298567.5, ans=0.1
2024-06-21 01:16:35,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=298585.8333333333, ans=0.0
2024-06-21 01:16:36,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298585.8333333333, ans=0.1
2024-06-21 01:16:43,759 INFO [train.py:1028] (0/2) Epoch 17, batch 1000, loss[loss=0.2185, simple_loss=0.2765, pruned_loss=0.08021, over 13163.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2651, pruned_loss=0.08045, over 2561148.89 frames. ], batch size: 48, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:16:51,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.28 vs. limit=15.0
2024-06-21 01:16:52,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=298622.5, ans=0.125
2024-06-21 01:16:56,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=298622.5, ans=0.2
2024-06-21 01:16:56,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=298622.5, ans=0.0
2024-06-21 01:17:00,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=298622.5, ans=0.125
2024-06-21 01:17:01,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=298640.8333333333, ans=0.2
2024-06-21 01:17:14,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=298659.1666666667, ans=0.2
2024-06-21 01:17:18,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=298677.5, ans=0.04949747468305833
2024-06-21 01:17:23,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0
2024-06-21 01:17:26,225 INFO [train.py:1028] (0/2) Epoch 17, batch 1050, loss[loss=0.2076, simple_loss=0.2642, pruned_loss=0.07548, over 13165.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.266, pruned_loss=0.0806, over 2565069.98 frames. ], batch size: 77, lr: 3.47e-03, grad_scale: 32.0
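In the per-batch lines, loss is consistent with combining the two transducer terms as 0.5 * simple_loss + pruned_loss: for batch 950 above, 0.5 * 0.2479 + 0.06101 = 0.185, matching the logged value. A hedged sketch of that combination; the real recipe computes the two terms with k2's rnnt_loss_simple and rnnt_loss_pruned, where the simple loss also supplies the pruning bounds:

```python
def combine_transducer_losses(simple_loss: float, pruned_loss: float,
                              simple_loss_scale: float = 0.5) -> float:
    """Total objective reported as 'loss'; the simple term is down-weighted
    because it only exists to provide pruning bounds for the exact loss."""
    return simple_loss_scale * simple_loss + pruned_loss

# Reproduces the batch-950 entry: loss=0.185, simple_loss=0.2479, pruned_loss=0.06101
print(combine_transducer_losses(0.2479, 0.06101))  # 0.18496 -> logged as 0.185
```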
2024-06-21 01:17:37,444 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.973e+02 2.061e+02 2.394e+02 3.486e+02, threshold=4.123e+02, percent-clipped=0.0
2024-06-21 01:17:38,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=298714.1666666667, ans=0.125
2024-06-21 01:17:39,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.29 vs. limit=22.5
2024-06-21 01:17:43,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=298714.1666666667, ans=0.125
2024-06-21 01:17:43,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.63 vs. limit=12.0
2024-06-21 01:17:51,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=298750.8333333333, ans=0.125
2024-06-21 01:17:58,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=298750.8333333333, ans=0.2
2024-06-21 01:17:59,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=298769.1666666667, ans=0.035
2024-06-21 01:18:06,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=298769.1666666667, ans=0.125
2024-06-21 01:18:06,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.10 vs. limit=15.0
2024-06-21 01:18:08,700 INFO [train.py:1028] (0/2) Epoch 17, batch 1100, loss[loss=0.2411, simple_loss=0.2901, pruned_loss=0.09604, over 13268.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2662, pruned_loss=0.0805, over 2569738.13 frames. ], batch size: 52, lr: 3.47e-03, grad_scale: 32.0
2024-06-21 01:18:23,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=298805.8333333333, ans=0.0
2024-06-21 01:18:28,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=298824.1666666667, ans=0.125
2024-06-21 01:18:29,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=298824.1666666667, ans=0.125
2024-06-21 01:18:37,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=298842.5, ans=0.2
2024-06-21 01:18:39,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=298842.5, ans=0.025
2024-06-21 01:18:53,921 INFO [train.py:1028] (0/2) Epoch 17, batch 1150, loss[loss=0.2242, simple_loss=0.2764, pruned_loss=0.086, over 13321.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2669, pruned_loss=0.08108, over 2570937.97 frames. ], batch size: 52, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:18:58,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=298879.1666666667, ans=0.125
2024-06-21 01:18:59,701 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:19:01,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0
2024-06-21 01:19:02,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=298897.5, ans=0.025
2024-06-21 01:19:02,844 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.979e+02 2.080e+02 2.298e+02 3.092e+02, threshold=4.160e+02, percent-clipped=0.0
2024-06-21 01:19:11,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=298915.8333333333, ans=0.125
2024-06-21 01:19:13,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=298915.8333333333, ans=0.125
2024-06-21 01:19:21,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=298934.1666666667, ans=0.0
2024-06-21 01:19:34,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=298952.5, ans=0.125
2024-06-21 01:19:39,481 INFO [train.py:1028] (0/2) Epoch 17, batch 1200, loss[loss=0.2012, simple_loss=0.2618, pruned_loss=0.07034, over 13155.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2667, pruned_loss=0.08106, over 2573429.98 frames. ], batch size: 77, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:20:31,200 INFO [train.py:1028] (0/2) Epoch 17, batch 1250, loss[loss=0.2065, simple_loss=0.2561, pruned_loss=0.07841, over 13153.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2666, pruned_loss=0.0812, over 2582814.14 frames. ], batch size: 112, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:20:38,711 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.924e+02 2.117e+02 2.280e+02 3.067e+02, threshold=4.233e+02, percent-clipped=0.0
2024-06-21 01:20:42,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.19 vs. limit=22.5
2024-06-21 01:20:49,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=299099.1666666667, ans=0.0
2024-06-21 01:21:13,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=299135.8333333333, ans=0.125
2024-06-21 01:21:19,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=299154.1666666667, ans=0.125
2024-06-21 01:21:20,271 INFO [train.py:1028] (0/2) Epoch 17, batch 1300, loss[loss=0.2157, simple_loss=0.2612, pruned_loss=0.08507, over 12758.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.267, pruned_loss=0.08134, over 2583069.82 frames. ], batch size: 176, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:21:53,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0
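grad_scale is the dynamic loss scale of fp16 mixed-precision training; it doubled from 32.0 to 64.0 around batch 1150 above after a long stretch without overflow. The behaviour matches PyTorch's stock GradScaler, sketched here with illustrative hyperparameters:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0,
                                   growth_factor=2.0,    # double after a clean stretch
                                   backoff_factor=0.5,   # halve on inf/nan gradients
                                   growth_interval=2000)

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model(batch))
    scaler.scale(loss).backward()   # backprop the scaled loss
    scaler.step(optimizer)          # unscales grads; skips the step on overflow
    scaler.update()                 # grows or backs off the scale
    return scaler.get_scale()       # the value that appears as grad_scale
```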
2024-06-21 01:22:04,619 INFO [train.py:1028] (0/2) Epoch 17, batch 1350, loss[loss=0.2166, simple_loss=0.2716, pruned_loss=0.08082, over 13246.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.267, pruned_loss=0.08103, over 2585345.74 frames. ], batch size: 59, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:22:05,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=299245.8333333333, ans=0.0
2024-06-21 01:22:07,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=299245.8333333333, ans=0.2
2024-06-21 01:22:12,956 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 1.951e+02 2.065e+02 2.211e+02 2.588e+02, threshold=4.131e+02, percent-clipped=0.0
2024-06-21 01:22:37,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=299300.8333333333, ans=0.05
2024-06-21 01:22:40,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=299300.8333333333, ans=0.0
2024-06-21 01:22:42,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=299319.1666666667, ans=0.0
2024-06-21 01:22:50,966 INFO [train.py:1028] (0/2) Epoch 17, batch 1400, loss[loss=0.2312, simple_loss=0.2866, pruned_loss=0.08792, over 12479.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2672, pruned_loss=0.08125, over 2586155.63 frames. ], batch size: 25, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:22:58,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=299337.5, ans=0.125
2024-06-21 01:22:59,906 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.55 vs. limit=15.0
2024-06-21 01:23:12,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=299374.1666666667, ans=0.125
2024-06-21 01:23:34,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=299410.8333333333, ans=0.125
2024-06-21 01:23:35,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0
2024-06-21 01:23:39,002 INFO [train.py:1028] (0/2) Epoch 17, batch 1450, loss[loss=0.207, simple_loss=0.2546, pruned_loss=0.07977, over 13090.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2664, pruned_loss=0.08109, over 2586838.07 frames. ], batch size: 121, lr: 3.47e-03, grad_scale: 64.0
2024-06-21 01:23:48,522 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 1.943e+02 2.028e+02 2.165e+02 3.264e+02, threshold=4.056e+02, percent-clipped=0.0
2024-06-21 01:24:05,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=299465.8333333333, ans=0.1
2024-06-21 01:24:30,133 INFO [train.py:1028] (0/2) Epoch 17, batch 1500, loss[loss=0.2075, simple_loss=0.259, pruned_loss=0.07802, over 13194.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2665, pruned_loss=0.0812, over 2589578.12 frames. ], batch size: 83, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:24:36,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=299520.8333333333, ans=0.2
2024-06-21 01:24:38,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5
2024-06-21 01:24:39,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=299539.1666666667, ans=0.125
2024-06-21 01:24:52,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=299557.5, ans=0.0
2024-06-21 01:24:53,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=299557.5, ans=0.125
2024-06-21 01:25:06,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=299594.1666666667, ans=0.025
2024-06-21 01:25:11,271 INFO [train.py:1028] (0/2) Epoch 17, batch 1550, loss[loss=0.2242, simple_loss=0.2763, pruned_loss=0.08602, over 13150.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2666, pruned_loss=0.0812, over 2585078.04 frames. ], batch size: 103, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:25:17,092 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.692e+01
2024-06-21 01:25:18,956 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 1.934e+02 2.071e+02 2.258e+02 2.984e+02, threshold=4.143e+02, percent-clipped=0.0
2024-06-21 01:25:23,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=299630.8333333333, ans=0.125
2024-06-21 01:25:42,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=299667.5, ans=0.125
2024-06-21 01:25:47,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=299685.8333333333, ans=0.0
2024-06-21 01:25:50,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0
2024-06-21 01:25:53,956 INFO [train.py:1028] (0/2) Epoch 17, batch 1600, loss[loss=0.1955, simple_loss=0.2543, pruned_loss=0.0684, over 13198.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2661, pruned_loss=0.08099, over 2579654.29 frames. ], batch size: 77, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:25:54,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=299704.1666666667, ans=0.125
2024-06-21 01:25:59,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=299704.1666666667, ans=0.0
2024-06-21 01:26:19,372 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. limit=10.0
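The lr column decays slowly within the epoch (3.47e-03 dropping to 3.46e-03 across these batches). This is consistent with icefall's Eden schedule, which decays in both batch index and epoch. The sketch below uses the usual recipe defaults for the two time constants and omits warm-up and reference-duration rescaling, so treat it as approximate rather than a reproduction of this run's exact values:

```python
def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Approximate Eden schedule: smooth inverse-quartic decay in both
    the batch count and the (fractional) epoch count."""
    return (base_lr
            * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)

for epoch in (16.0, 17.0, 18.0):          # lr keeps shrinking across epochs
    print(epoch, eden_lr(0.035, batch=300000.0, epoch=epoch))
```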
2024-06-21 01:26:28,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=299759.1666666667, ans=0.125
2024-06-21 01:26:33,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=299777.5, ans=0.025
2024-06-21 01:26:38,282 INFO [train.py:1028] (0/2) Epoch 17, batch 1650, loss[loss=0.2144, simple_loss=0.2628, pruned_loss=0.08298, over 13181.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2664, pruned_loss=0.0812, over 2575881.66 frames. ], batch size: 95, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:26:40,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.86 vs. limit=15.0
2024-06-21 01:26:47,040 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.984e+02 2.112e+02 2.272e+02 2.758e+02, threshold=4.224e+02, percent-clipped=0.0
2024-06-21 01:26:54,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.86 vs. limit=15.0
2024-06-21 01:27:06,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299850.8333333333, ans=0.1
2024-06-21 01:27:09,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=299850.8333333333, ans=0.125
2024-06-21 01:27:17,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=299869.1666666667, ans=0.125
2024-06-21 01:27:26,082 INFO [train.py:1028] (0/2) Epoch 17, batch 1700, loss[loss=0.2178, simple_loss=0.2707, pruned_loss=0.08245, over 12812.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2661, pruned_loss=0.08099, over 2581813.92 frames. ], batch size: 26, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:27:51,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=299924.1666666667, ans=0.0
2024-06-21 01:27:56,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=299942.5, ans=0.125
2024-06-21 01:28:02,405 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=15.0
2024-06-21 01:28:14,626 INFO [train.py:1028] (0/2) Epoch 17, batch 1750, loss[loss=0.2408, simple_loss=0.2953, pruned_loss=0.0931, over 12395.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2667, pruned_loss=0.08127, over 2583380.18 frames. ], batch size: 22, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:28:20,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=299979.1666666667, ans=0.1
2024-06-21 01:28:22,438 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.956e+02 2.039e+02 2.211e+02 2.782e+02, threshold=4.079e+02, percent-clipped=0.0
2024-06-21 01:28:38,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=300015.8333333333, ans=0.125
2024-06-21 01:28:52,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=300034.1666666667, ans=0.125
2024-06-21 01:28:53,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=300034.1666666667, ans=0.95
2024-06-21 01:29:01,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=300052.5, ans=0.015
2024-06-21 01:29:05,061 INFO [train.py:1028] (0/2) Epoch 17, batch 1800, loss[loss=0.205, simple_loss=0.258, pruned_loss=0.07602, over 13207.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.267, pruned_loss=0.08183, over 2583007.60 frames. ], batch size: 67, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:29:26,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=300089.1666666667, ans=0.1
2024-06-21 01:29:32,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=300107.5, ans=0.125
2024-06-21 01:29:57,099 INFO [train.py:1028] (0/2) Epoch 17, batch 1850, loss[loss=0.1976, simple_loss=0.248, pruned_loss=0.07362, over 13237.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.267, pruned_loss=0.08168, over 2584050.15 frames. ], batch size: 83, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:29:59,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=300162.5, ans=0.2
2024-06-21 01:30:06,675 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.968e+02 2.072e+02 2.207e+02 3.153e+02, threshold=4.145e+02, percent-clipped=0.0
2024-06-21 01:30:07,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=300180.8333333333, ans=0.125
2024-06-21 01:30:21,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=300199.1666666667, ans=0.2
2024-06-21 01:30:25,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=300217.5, ans=0.2
2024-06-21 01:30:44,538 INFO [train.py:1028] (0/2) Epoch 17, batch 1900, loss[loss=0.2003, simple_loss=0.2453, pruned_loss=0.07762, over 13142.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2657, pruned_loss=0.08123, over 2585542.22 frames. ], batch size: 95, lr: 3.46e-03, grad_scale: 64.0
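tot_loss[... over N frames.] is not a whole-epoch average: the frame count climbs early in the epoch and then plateaus around 2.6 million, which suggests a frame-weighted aggregate over an exponentially decaying window of roughly the last couple hundred batches. A sketch of that bookkeeping; the window size and exact decay rule are assumptions:

```python
class RunningLoss:
    """Frame-weighted, exponentially windowed aggregate resembling tot_loss.
    decay = 1 - 1/window keeps roughly `window` recent batches in the sums."""
    def __init__(self, window: int = 200):
        self.decay = 1.0 - 1.0 / window
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float):
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames

    @property
    def value(self) -> float:            # printed as tot_loss[loss=...]
        return self.loss_sum / self.frames

tracker = RunningLoss()
for _ in range(2000):
    tracker.update(0.21, 13000.0)
print(tracker.value, tracker.frames)     # frames plateaus near 200 * 13000
```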
2024-06-21 01:30:56,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300272.5, ans=0.1
2024-06-21 01:31:05,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=300290.8333333333, ans=0.125
2024-06-21 01:31:28,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=300309.1666666667, ans=0.125
2024-06-21 01:31:33,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0
2024-06-21 01:31:38,454 INFO [train.py:1028] (0/2) Epoch 17, batch 1950, loss[loss=0.2002, simple_loss=0.2596, pruned_loss=0.07033, over 13314.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2657, pruned_loss=0.08147, over 2592090.47 frames. ], batch size: 52, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:31:39,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=300345.8333333333, ans=0.125
2024-06-21 01:31:48,011 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 1.949e+02 2.046e+02 2.167e+02 2.912e+02, threshold=4.092e+02, percent-clipped=0.0
2024-06-21 01:32:01,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=300382.5, ans=0.05
2024-06-21 01:32:04,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=300400.8333333333, ans=0.0
2024-06-21 01:32:20,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.45 vs. limit=10.0
2024-06-21 01:32:29,902 INFO [train.py:1028] (0/2) Epoch 17, batch 2000, loss[loss=0.2199, simple_loss=0.2742, pruned_loss=0.08276, over 12296.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2652, pruned_loss=0.08105, over 2587940.33 frames. ], batch size: 22, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:32:42,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=300455.8333333333, ans=0.125
2024-06-21 01:32:47,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=300455.8333333333, ans=0.125
2024-06-21 01:32:56,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.05 vs. limit=15.0
2024-06-21 01:32:57,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=300492.5, ans=0.0
2024-06-21 01:33:04,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=300492.5, ans=0.125
2024-06-21 01:33:16,639 INFO [train.py:1028] (0/2) Epoch 17, batch 2050, loss[loss=0.2214, simple_loss=0.2791, pruned_loss=0.08179, over 12754.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2656, pruned_loss=0.08118, over 2583480.09 frames. ], batch size: 29, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:33:17,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=300529.1666666667, ans=0.07
2024-06-21 01:33:18,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=300529.1666666667, ans=0.2
2024-06-21 01:33:20,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=300529.1666666667, ans=0.2
2024-06-21 01:33:21,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=300529.1666666667, ans=0.0
2024-06-21 01:33:25,911 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.992e+02 2.131e+02 2.419e+02 3.758e+02, threshold=4.262e+02, percent-clipped=0.0
2024-06-21 01:33:37,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=300565.8333333333, ans=0.125
2024-06-21 01:33:47,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=300584.1666666667, ans=0.05
2024-06-21 01:33:54,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=12.0
2024-06-21 01:33:57,843 INFO [train.py:1028] (0/2) Epoch 17, batch 2100, loss[loss=0.2188, simple_loss=0.2698, pruned_loss=0.08388, over 13185.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2657, pruned_loss=0.08073, over 2585747.49 frames. ], batch size: 59, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:33:59,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=300620.8333333333, ans=0.1
2024-06-21 01:34:06,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.30 vs. limit=10.0
2024-06-21 01:34:12,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300639.1666666667, ans=0.1
2024-06-21 01:34:13,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.04 vs. limit=10.0
2024-06-21 01:34:18,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=300657.5, ans=0.125
2024-06-21 01:34:20,134 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-164000.pt
2024-06-21 01:34:27,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=300657.5, ans=0.125
2024-06-21 01:34:27,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0
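The checkpoint.py line above saves a batch-indexed snapshot (checkpoint-164000.pt), independent of the per-epoch checkpoints, so long runs can resume mid-epoch. A minimal sketch of such periodic saving; the interval and the exact state-dict fields here are illustrative:

```python
import torch

def maybe_save_checkpoint(model, optimizer, scheduler, batch_idx,
                          exp_dir="zipformer/exp", save_every_n=4000):
    """Write a batch-indexed checkpoint like checkpoint-164000.pt."""
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict() if scheduler is not None else None,
        "batch_idx_train": batch_idx,   # lets a resumed run continue counting
    }
    torch.save(state, f"{exp_dir}/checkpoint-{batch_idx}.pt")
```

Note that 164000 is the global batch counter across all epochs, which is why it is far larger than the within-epoch batch numbers (around 2100 at this point).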
2024-06-21 01:34:31,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=300675.8333333333, ans=0.125
2024-06-21 01:34:32,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=300675.8333333333, ans=0.0
2024-06-21 01:34:35,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=300675.8333333333, ans=10.0
2024-06-21 01:34:37,355 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:34:39,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.99 vs. limit=22.5
2024-06-21 01:34:50,539 INFO [train.py:1028] (0/2) Epoch 17, batch 2150, loss[loss=0.2043, simple_loss=0.2611, pruned_loss=0.07376, over 13263.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2658, pruned_loss=0.08061, over 2587849.84 frames. ], batch size: 52, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:34:51,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=300712.5, ans=0.07
2024-06-21 01:34:59,803 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.986e+02 2.133e+02 2.320e+02 2.856e+02, threshold=4.266e+02, percent-clipped=0.0
2024-06-21 01:35:36,625 INFO [train.py:1028] (0/2) Epoch 17, batch 2200, loss[loss=0.2359, simple_loss=0.2802, pruned_loss=0.0958, over 13184.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2663, pruned_loss=0.08074, over 2588108.56 frames. ], batch size: 83, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:35:41,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.06 vs. limit=15.0
2024-06-21 01:35:44,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=300804.1666666667, ans=0.0
2024-06-21 01:35:49,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0
2024-06-21 01:36:08,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=300859.1666666667, ans=0.0
2024-06-21 01:36:18,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=300877.5, ans=0.0
2024-06-21 01:36:20,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=300877.5, ans=0.125
2024-06-21 01:36:22,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=300877.5, ans=0.2
2024-06-21 01:36:23,716 INFO [train.py:1028] (0/2) Epoch 17, batch 2250, loss[loss=0.2298, simple_loss=0.28, pruned_loss=0.08984, over 13256.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2659, pruned_loss=0.08067, over 2587583.43 frames. ], batch size: 63, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:36:32,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=300914.1666666667, ans=0.09899494936611666
2024-06-21 01:36:33,062 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.974e+02 2.116e+02 2.311e+02 3.847e+02, threshold=4.232e+02, percent-clipped=0.0
2024-06-21 01:36:40,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5
2024-06-21 01:36:49,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=300932.5, ans=0.125
2024-06-21 01:37:01,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300969.1666666667, ans=0.1
2024-06-21 01:37:03,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=300969.1666666667, ans=0.2
2024-06-21 01:37:13,489 INFO [train.py:1028] (0/2) Epoch 17, batch 2300, loss[loss=0.2143, simple_loss=0.2746, pruned_loss=0.07702, over 13007.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2662, pruned_loss=0.08083, over 2582047.74 frames. ], batch size: 33, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:37:16,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=300987.5, ans=0.125
2024-06-21 01:37:24,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=301005.8333333333, ans=0.0
2024-06-21 01:37:26,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=301005.8333333333, ans=0.09899494936611666
2024-06-21 01:37:27,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=301005.8333333333, ans=0.2
2024-06-21 01:37:34,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0
2024-06-21 01:37:43,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=301042.5, ans=0.0
2024-06-21 01:37:45,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=301042.5, ans=0.125
2024-06-21 01:37:50,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=301060.8333333333, ans=0.1
2024-06-21 01:38:02,711 INFO [train.py:1028] (0/2) Epoch 17, batch 2350, loss[loss=0.2064, simple_loss=0.2573, pruned_loss=0.07775, over 13231.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2664, pruned_loss=0.08104, over 2585189.81 frames. ], batch size: 67, lr: 3.46e-03, grad_scale: 64.0
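The WithLoss entries report an auxiliary penalty accumulated on attention-weight tensors; it is 0.000e+00 for most modules most of the time and only occasionally fires (e.g. loss-sum=1.692e+01 at 01:25:17 above). One way to attach a loss to an intermediate activation without threading it through the model's return values is an identity autograd function that injects the penalty's gradient during backward. This is a simplified guess at the mechanism, with a made-up penalty, not the scaling.py implementation:

```python
import torch

class WithLoss(torch.autograd.Function):
    """Identity in forward; backward adds the gradient of an auxiliary
    penalty on the activation (here: squared excess above 1.0)."""
    @staticmethod
    def forward(ctx, x, scale: float):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            xd = x.detach().requires_grad_()
            penalty = ctx.scale * torch.relu(xd.abs() - 1.0).pow(2).sum()
            (aux_grad,) = torch.autograd.grad(penalty, xd)
        return grad_output + aux_grad, None

x = torch.randn(4, 8, requires_grad=True)
y = WithLoss.apply(x, 0.01)
y.sum().backward()   # x.grad now includes the auxiliary penalty's gradient
```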
2024-06-21 01:38:03,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=301079.1666666667, ans=0.035
2024-06-21 01:38:05,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=301079.1666666667, ans=0.125
2024-06-21 01:38:05,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=301079.1666666667, ans=0.0
2024-06-21 01:38:07,404 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.61 vs. limit=22.5
2024-06-21 01:38:12,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.937e+02 2.077e+02 2.304e+02 2.921e+02, threshold=4.154e+02, percent-clipped=0.0
2024-06-21 01:38:42,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=8.0
2024-06-21 01:38:49,429 INFO [train.py:1028] (0/2) Epoch 17, batch 2400, loss[loss=0.2014, simple_loss=0.2571, pruned_loss=0.07285, over 13342.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2648, pruned_loss=0.08031, over 2587330.84 frames. ], batch size: 46, lr: 3.46e-03, grad_scale: 64.0
2024-06-21 01:38:50,609 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 01:39:00,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=301189.1666666667, ans=0.025
2024-06-21 01:39:02,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=301189.1666666667, ans=0.125
2024-06-21 01:39:03,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=301189.1666666667, ans=0.125
2024-06-21 01:39:04,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0
2024-06-21 01:39:24,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.72 vs. limit=12.0
2024-06-21 01:39:35,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.28 vs. limit=10.0
2024-06-21 01:39:35,508 INFO [train.py:1028] (0/2) Epoch 17, batch 2450, loss[loss=0.2171, simple_loss=0.2646, pruned_loss=0.08478, over 13254.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.264, pruned_loss=0.08055, over 2583590.55 frames. ], batch size: 63, lr: 3.45e-03, grad_scale: 64.0
2024-06-21 01:39:35,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=301262.5, ans=0.1
2024-06-21 01:39:50,217 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 1.950e+02 2.067e+02 2.233e+02 3.113e+02, threshold=4.134e+02, percent-clipped=0.0
2024-06-21 01:39:51,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=301280.8333333333, ans=0.125
2024-06-21 01:39:57,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=301280.8333333333, ans=0.035
2024-06-21 01:40:03,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=301299.1666666667, ans=0.125
2024-06-21 01:40:13,368 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0
2024-06-21 01:40:15,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=301317.5, ans=0.05
2024-06-21 01:40:15,432 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=15.0
2024-06-21 01:40:26,777 INFO [train.py:1028] (0/2) Epoch 17, batch 2500, loss[loss=0.2034, simple_loss=0.2525, pruned_loss=0.0771, over 13211.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2628, pruned_loss=0.08007, over 2587350.92 frames. ], batch size: 83, lr: 3.45e-03, grad_scale: 64.0
2024-06-21 01:40:42,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=301372.5, ans=0.125
2024-06-21 01:40:51,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=301390.8333333333, ans=0.125
2024-06-21 01:41:20,336 INFO [train.py:1028] (0/2) Epoch 17, batch 2550, loss[loss=0.2107, simple_loss=0.2694, pruned_loss=0.07604, over 12668.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2624, pruned_loss=0.07994, over 2587718.45 frames. ], batch size: 22, lr: 3.45e-03, grad_scale: 64.0
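Batch sizes in these lines swing from 22 to 202 because batches are assembled to a fixed total duration rather than a fixed sentence count: a bucketing sampler groups utterances of similar length and fills each batch up to a duration budget, so buckets of short cuts produce large batches and long-utterance buckets produce small ones. A toy version of the idea (the real sampler is lhotse's DynamicBucketingSampler; the budget below is illustrative):

```python
def duration_batches(cuts, max_duration=550.0):
    """cuts: list of (cut_id, seconds). Yields batches whose total duration
    stays under max_duration, after crudely 'bucketing' by sorting on length."""
    cuts = sorted(cuts, key=lambda c: c[1])
    batch, total = [], 0.0
    for cut_id, dur in cuts:
        if batch and total + dur > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut_id)
        total += dur
    if batch:
        yield batch

batches = list(duration_batches([(f"cut{i}", 2.0 + (i % 25)) for i in range(500)]))
print(len(batches[0]), len(batches[-1]))  # short-cut batches are much larger
```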
2024-06-21 01:41:30,213 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.955e+02 2.086e+02 2.265e+02 2.886e+02, threshold=4.173e+02, percent-clipped=0.0
2024-06-21 01:41:31,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=301464.1666666667, ans=0.125
2024-06-21 01:41:33,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=301464.1666666667, ans=0.125
2024-06-21 01:41:33,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=301464.1666666667, ans=0.125
2024-06-21 01:41:40,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=301482.5, ans=0.04949747468305833
2024-06-21 01:41:42,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301482.5, ans=0.1
2024-06-21 01:41:48,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=301500.8333333333, ans=0.125
2024-06-21 01:42:06,239 INFO [train.py:1028] (0/2) Epoch 17, batch 2600, loss[loss=0.2031, simple_loss=0.2565, pruned_loss=0.07488, over 13195.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2609, pruned_loss=0.0793, over 2587762.26 frames. ], batch size: 52, lr: 3.45e-03, grad_scale: 64.0
2024-06-21 01:42:12,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=301537.5, ans=0.2
2024-06-21 01:42:22,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=301555.8333333333, ans=0.125
2024-06-21 01:42:24,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=301574.1666666667, ans=0.125
2024-06-21 01:42:29,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.30 vs. limit=15.0
2024-06-21 01:42:32,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=301592.5, ans=0.0
2024-06-21 01:42:37,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.08 vs. limit=15.0
2024-06-21 01:42:38,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=301592.5, ans=0.2
2024-06-21 01:42:48,855 INFO [train.py:1028] (0/2) Epoch 17, batch 2650, loss[loss=0.1984, simple_loss=0.2439, pruned_loss=0.07645, over 13038.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2593, pruned_loss=0.0788, over 2588303.71 frames. ], batch size: 144, lr: 3.45e-03, grad_scale: 64.0
2024-06-21 01:42:56,639 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.917e+02 2.101e+02 2.280e+02 3.150e+02, threshold=4.202e+02, percent-clipped=0.0
2024-06-21 01:43:08,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=301647.5, ans=0.07
2024-06-21 01:43:24,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=301684.1666666667, ans=0.125
2024-06-21 01:43:35,143 INFO [train.py:1028] (0/2) Epoch 17, batch 2700, loss[loss=0.1993, simple_loss=0.2518, pruned_loss=0.07338, over 13206.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2577, pruned_loss=0.07853, over 2586273.34 frames. ], batch size: 89, lr: 3.45e-03, grad_scale: 64.0
2024-06-21 01:43:39,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=301720.8333333333, ans=10.0
2024-06-21 01:43:47,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.42 vs. limit=15.0
2024-06-21 01:43:48,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=301739.1666666667, ans=0.125
2024-06-21 01:43:50,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=301739.1666666667, ans=0.125
2024-06-21 01:43:56,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=301757.5, ans=0.125
2024-06-21 01:43:57,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.37 vs. limit=15.0
2024-06-21 01:44:03,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=301775.8333333333, ans=0.125
2024-06-21 01:44:13,806 INFO [train.py:1028] (0/2) Epoch 17, batch 2750, loss[loss=0.2194, simple_loss=0.2707, pruned_loss=0.08408, over 13249.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2574, pruned_loss=0.07818, over 2583413.10 frames. ], batch size: 43, lr: 3.45e-03, grad_scale: 64.0
2024-06-21 01:44:15,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=301812.5, ans=0.1
2024-06-21 01:44:20,952 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 1.922e+02 2.080e+02 2.322e+02 3.208e+02, threshold=4.160e+02, percent-clipped=0.0
2024-06-21 01:44:30,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=301849.1666666667, ans=0.0
2024-06-21 01:44:50,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=301885.8333333333, ans=0.0
2024-06-21 01:44:58,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=301885.8333333333, ans=0.125
2024-06-21 01:45:00,740 INFO [train.py:1028] (0/2) Epoch 17, batch 2800, loss[loss=0.2, simple_loss=0.239, pruned_loss=0.08047, over 10892.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2572, pruned_loss=0.07833, over 2580056.19 frames.
], batch size: 304, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:45:07,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0 2024-06-21 01:45:11,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.50 vs. limit=12.0 2024-06-21 01:45:31,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.60 vs. limit=15.0 2024-06-21 01:45:41,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=301959.1666666667, ans=0.125 2024-06-21 01:45:53,483 INFO [train.py:1028] (0/2) Epoch 17, batch 2850, loss[loss=0.1839, simple_loss=0.2357, pruned_loss=0.06602, over 13247.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.256, pruned_loss=0.07783, over 2578368.20 frames. ], batch size: 49, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:46:02,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=302014.1666666667, ans=0.125 2024-06-21 01:46:02,613 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 1.935e+02 2.086e+02 2.271e+02 3.501e+02, threshold=4.172e+02, percent-clipped=0.0 2024-06-21 01:46:06,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=302014.1666666667, ans=0.125 2024-06-21 01:46:15,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.81 vs. limit=15.0 2024-06-21 01:46:26,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=302050.8333333333, ans=0.09899494936611666 2024-06-21 01:46:40,452 INFO [train.py:1028] (0/2) Epoch 17, batch 2900, loss[loss=0.1726, simple_loss=0.2256, pruned_loss=0.05979, over 13181.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2539, pruned_loss=0.07704, over 2586083.79 frames. ], batch size: 55, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:46:47,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=302087.5, ans=0.125 2024-06-21 01:47:17,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=302160.8333333333, ans=0.04949747468305833 2024-06-21 01:47:25,785 INFO [train.py:1028] (0/2) Epoch 17, batch 2950, loss[loss=0.1937, simple_loss=0.249, pruned_loss=0.06917, over 13229.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.254, pruned_loss=0.07722, over 2580531.38 frames. ], batch size: 43, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:47:39,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=302179.1666666667, ans=0.0 2024-06-21 01:47:42,033 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.853e+02 1.970e+02 2.093e+02 3.192e+02, threshold=3.940e+02, percent-clipped=0.0 2024-06-21 01:47:51,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. 
limit=6.0 2024-06-21 01:48:16,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.34 vs. limit=15.0 2024-06-21 01:48:20,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=302252.5, ans=10.0 2024-06-21 01:48:25,577 INFO [train.py:1028] (0/2) Epoch 17, batch 3000, loss[loss=0.2105, simple_loss=0.2624, pruned_loss=0.07932, over 13176.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2528, pruned_loss=0.07668, over 2578693.54 frames. ], batch size: 59, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:48:25,580 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 01:48:35,897 INFO [train.py:1060] (0/2) Epoch 17, validation: loss=0.1874, simple_loss=0.2525, pruned_loss=0.0611, over 351949.00 frames. 2024-06-21 01:48:35,898 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 01:48:52,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=302289.1666666667, ans=0.125 2024-06-21 01:49:04,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=302325.8333333333, ans=0.2 2024-06-21 01:49:18,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=302344.1666666667, ans=0.0 2024-06-21 01:49:20,255 INFO [train.py:1028] (0/2) Epoch 17, batch 3050, loss[loss=0.1923, simple_loss=0.2427, pruned_loss=0.07096, over 13284.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2529, pruned_loss=0.07711, over 2578675.15 frames. ], batch size: 46, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:49:22,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=302362.5, ans=0.125 2024-06-21 01:49:27,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=302362.5, ans=0.125 2024-06-21 01:49:28,874 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 1.947e+02 2.089e+02 2.325e+02 3.780e+02, threshold=4.179e+02, percent-clipped=0.0 2024-06-21 01:49:29,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=302380.8333333333, ans=0.125 2024-06-21 01:49:36,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=302380.8333333333, ans=0.125 2024-06-21 01:49:46,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=302417.5, ans=0.125 2024-06-21 01:49:54,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=302435.8333333333, ans=0.125 2024-06-21 01:49:56,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=302435.8333333333, ans=0.125 2024-06-21 01:50:00,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=302435.8333333333, ans=0.0 2024-06-21 01:50:05,388 INFO [train.py:1028] (0/2) Epoch 17, batch 3100, loss[loss=0.1863, simple_loss=0.2356, pruned_loss=0.06846, over 13090.00 frames. 
], tot_loss[loss=0.2025, simple_loss=0.2521, pruned_loss=0.07646, over 2579667.79 frames. ], batch size: 144, lr: 3.45e-03, grad_scale: 64.0 2024-06-21 01:50:39,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.42 vs. limit=10.0 2024-06-21 01:50:43,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.69 vs. limit=15.0 2024-06-21 01:51:01,695 INFO [train.py:1028] (0/2) Epoch 17, batch 3150, loss[loss=0.2061, simple_loss=0.2588, pruned_loss=0.07666, over 12922.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2512, pruned_loss=0.07616, over 2581676.06 frames. ], batch size: 158, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:51:05,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=302545.8333333333, ans=0.125 2024-06-21 01:51:07,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.78 vs. limit=10.0 2024-06-21 01:51:07,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=12.0 2024-06-21 01:51:10,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=302564.1666666667, ans=0.125 2024-06-21 01:51:11,440 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.861e+02 1.969e+02 2.146e+02 2.870e+02, threshold=3.938e+02, percent-clipped=0.0 2024-06-21 01:51:13,986 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 01:51:15,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=302564.1666666667, ans=0.0 2024-06-21 01:51:20,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=302582.5, ans=0.125 2024-06-21 01:51:40,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=302619.1666666667, ans=0.125 2024-06-21 01:51:49,672 INFO [train.py:1028] (0/2) Epoch 17, batch 3200, loss[loss=0.184, simple_loss=0.2377, pruned_loss=0.06513, over 13123.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2505, pruned_loss=0.07575, over 2582063.58 frames. ], batch size: 55, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:51:53,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.37 vs. limit=22.5 2024-06-21 01:51:53,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.25 vs. 
limit=22.5 2024-06-21 01:51:58,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=302655.8333333333, ans=0.2 2024-06-21 01:52:00,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=302655.8333333333, ans=0.0 2024-06-21 01:52:05,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=302655.8333333333, ans=0.125 2024-06-21 01:52:13,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=302674.1666666667, ans=0.125 2024-06-21 01:52:26,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=302710.8333333333, ans=0.125 2024-06-21 01:52:34,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0 2024-06-21 01:52:34,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.71 vs. limit=22.5 2024-06-21 01:52:36,429 INFO [train.py:1028] (0/2) Epoch 17, batch 3250, loss[loss=0.194, simple_loss=0.2509, pruned_loss=0.06855, over 13211.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.25, pruned_loss=0.07578, over 2586744.27 frames. ], batch size: 72, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:52:39,375 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.52 vs. limit=22.5 2024-06-21 01:52:40,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=302729.1666666667, ans=0.0 2024-06-21 01:52:44,815 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 1.912e+02 2.016e+02 2.135e+02 2.872e+02, threshold=4.031e+02, percent-clipped=0.0 2024-06-21 01:53:22,230 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.58 vs. limit=15.0 2024-06-21 01:53:25,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=302802.5, ans=0.125 2024-06-21 01:53:30,802 INFO [train.py:1028] (0/2) Epoch 17, batch 3300, loss[loss=0.2044, simple_loss=0.2517, pruned_loss=0.07854, over 12754.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2498, pruned_loss=0.07556, over 2583297.10 frames. 
], batch size: 176, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:53:30,968 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.597e-01 2024-06-21 01:53:31,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=302820.8333333333, ans=0.125 2024-06-21 01:54:02,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302857.5, ans=0.1 2024-06-21 01:54:10,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=302875.8333333333, ans=0.2 2024-06-21 01:54:20,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2024-06-21 01:54:23,700 INFO [train.py:1028] (0/2) Epoch 17, batch 3350, loss[loss=0.2157, simple_loss=0.2534, pruned_loss=0.08903, over 12931.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2497, pruned_loss=0.07587, over 2577450.93 frames. ], batch size: 158, lr: 3.45e-03, grad_scale: 128.0 2024-06-21 01:54:24,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302912.5, ans=0.1 2024-06-21 01:54:26,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2024-06-21 01:54:31,625 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.993e+02 2.110e+02 2.278e+02 2.978e+02, threshold=4.220e+02, percent-clipped=0.0 2024-06-21 01:54:34,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=302930.8333333333, ans=0.125 2024-06-21 01:54:37,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=302930.8333333333, ans=0.0 2024-06-21 01:54:39,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302949.1666666667, ans=0.1 2024-06-21 01:54:44,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=302949.1666666667, ans=0.0 2024-06-21 01:54:53,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=302967.5, ans=0.0 2024-06-21 01:54:56,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.74 vs. limit=22.5 2024-06-21 01:55:05,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.69 vs. limit=15.0 2024-06-21 01:55:07,441 INFO [train.py:1028] (0/2) Epoch 17, batch 3400, loss[loss=0.2191, simple_loss=0.261, pruned_loss=0.08854, over 12743.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2492, pruned_loss=0.07589, over 2576006.46 frames. 
], batch size: 22, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:55:07,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303004.1666666667, ans=0.125 2024-06-21 01:55:12,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=303004.1666666667, ans=0.125 2024-06-21 01:55:14,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=12.0 2024-06-21 01:55:16,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.73 vs. limit=12.0 2024-06-21 01:55:16,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=303022.5, ans=0.0 2024-06-21 01:55:20,117 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0 2024-06-21 01:55:41,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=303059.1666666667, ans=0.125 2024-06-21 01:55:53,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=303077.5, ans=0.2 2024-06-21 01:56:04,687 INFO [train.py:1028] (0/2) Epoch 17, batch 3450, loss[loss=0.2309, simple_loss=0.2667, pruned_loss=0.09756, over 12707.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2483, pruned_loss=0.07549, over 2577227.14 frames. ], batch size: 176, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:56:14,071 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 1.930e+02 2.049e+02 2.266e+02 2.726e+02, threshold=4.097e+02, percent-clipped=0.0 2024-06-21 01:56:14,794 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2024-06-21 01:56:30,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=8.0 2024-06-21 01:56:42,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.69 vs. limit=22.5 2024-06-21 01:56:46,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=303150.8333333333, ans=0.125 2024-06-21 01:56:47,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=303150.8333333333, ans=0.035 2024-06-21 01:56:59,282 INFO [train.py:1028] (0/2) Epoch 17, batch 3500, loss[loss=0.189, simple_loss=0.24, pruned_loss=0.06902, over 12859.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.248, pruned_loss=0.07525, over 2576747.94 frames. 
], batch size: 33, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:57:00,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=303187.5, ans=0.125 2024-06-21 01:57:34,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=303242.5, ans=10.0 2024-06-21 01:57:43,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=303260.8333333333, ans=0.0 2024-06-21 01:57:45,639 INFO [train.py:1028] (0/2) Epoch 17, batch 3550, loss[loss=0.1753, simple_loss=0.2321, pruned_loss=0.0592, over 13196.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2471, pruned_loss=0.07485, over 2577825.29 frames. ], batch size: 95, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:57:54,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.652e+02 1.883e+02 1.984e+02 2.125e+02 3.280e+02, threshold=3.967e+02, percent-clipped=0.0 2024-06-21 01:57:59,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2024-06-21 01:58:04,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303315.8333333333, ans=0.1 2024-06-21 01:58:28,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=303352.5, ans=0.125 2024-06-21 01:58:30,600 INFO [train.py:1028] (0/2) Epoch 17, batch 3600, loss[loss=0.1847, simple_loss=0.2401, pruned_loss=0.06467, over 13293.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2474, pruned_loss=0.07516, over 2580681.43 frames. ], batch size: 49, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:58:39,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=303389.1666666667, ans=0.125 2024-06-21 01:59:15,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303444.1666666667, ans=0.125 2024-06-21 01:59:17,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=303444.1666666667, ans=0.2 2024-06-21 01:59:21,199 INFO [train.py:1028] (0/2) Epoch 17, batch 3650, loss[loss=0.1878, simple_loss=0.2339, pruned_loss=0.0708, over 13051.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2471, pruned_loss=0.075, over 2578434.74 frames. 
], batch size: 102, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 01:59:23,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=303462.5, ans=0.125 2024-06-21 01:59:23,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=303462.5, ans=0.125 2024-06-21 01:59:36,728 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.681e+02 1.919e+02 2.031e+02 2.156e+02 2.684e+02, threshold=4.062e+02, percent-clipped=0.0 2024-06-21 01:59:36,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=303480.8333333333, ans=0.125 2024-06-21 01:59:44,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=303480.8333333333, ans=0.0 2024-06-21 01:59:51,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=303499.1666666667, ans=0.1 2024-06-21 02:00:01,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=303517.5, ans=0.125 2024-06-21 02:00:10,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=303535.8333333333, ans=0.125 2024-06-21 02:00:12,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=22.5 2024-06-21 02:00:14,270 INFO [train.py:1028] (0/2) Epoch 17, batch 3700, loss[loss=0.1946, simple_loss=0.2515, pruned_loss=0.06884, over 13239.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2462, pruned_loss=0.07439, over 2583476.00 frames. ], batch size: 72, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:00:16,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=303554.1666666667, ans=0.2 2024-06-21 02:00:16,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=303554.1666666667, ans=0.125 2024-06-21 02:00:35,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303590.8333333333, ans=0.125 2024-06-21 02:00:41,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0 2024-06-21 02:00:56,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=15.0 2024-06-21 02:01:01,476 INFO [train.py:1028] (0/2) Epoch 17, batch 3750, loss[loss=0.1987, simple_loss=0.2601, pruned_loss=0.0686, over 12648.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2464, pruned_loss=0.07448, over 2586274.64 frames. ], batch size: 22, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:01:04,324 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.97 vs. 
limit=15.0 2024-06-21 02:01:11,382 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.908e+02 2.025e+02 2.256e+02 3.117e+02, threshold=4.049e+02, percent-clipped=0.0 2024-06-21 02:01:14,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.86 vs. limit=6.0 2024-06-21 02:01:17,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.41 vs. limit=22.5 2024-06-21 02:01:23,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=303682.5, ans=0.2 2024-06-21 02:01:25,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.25 vs. limit=15.0 2024-06-21 02:01:35,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=303700.8333333333, ans=0.125 2024-06-21 02:01:37,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=303700.8333333333, ans=0.0 2024-06-21 02:01:48,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-21 02:01:48,463 INFO [train.py:1028] (0/2) Epoch 17, batch 3800, loss[loss=0.192, simple_loss=0.2407, pruned_loss=0.07162, over 13191.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2465, pruned_loss=0.07435, over 2583572.20 frames. ], batch size: 83, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:01:50,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=303737.5, ans=0.035 2024-06-21 02:02:03,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=303755.8333333333, ans=0.1 2024-06-21 02:02:04,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2024-06-21 02:02:24,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=303792.5, ans=0.2 2024-06-21 02:02:44,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=303810.8333333333, ans=0.125 2024-06-21 02:02:45,979 INFO [train.py:1028] (0/2) Epoch 17, batch 3850, loss[loss=0.1722, simple_loss=0.2141, pruned_loss=0.06513, over 13032.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2456, pruned_loss=0.07366, over 2582738.53 frames. 
], batch size: 144, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:02:52,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=303829.1666666667, ans=0.0 2024-06-21 02:02:55,752 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 1.912e+02 2.110e+02 2.317e+02 3.462e+02, threshold=4.219e+02, percent-clipped=0.0 2024-06-21 02:03:07,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303865.8333333333, ans=0.1 2024-06-21 02:03:19,268 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2024-06-21 02:03:32,969 INFO [train.py:1028] (0/2) Epoch 17, batch 3900, loss[loss=0.1871, simple_loss=0.2291, pruned_loss=0.07255, over 13210.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2456, pruned_loss=0.07369, over 2585531.09 frames. ], batch size: 83, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:03:41,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=303920.8333333333, ans=0.1 2024-06-21 02:03:44,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=303939.1666666667, ans=0.125 2024-06-21 02:03:45,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=303939.1666666667, ans=0.2 2024-06-21 02:03:53,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=303957.5, ans=0.125 2024-06-21 02:04:00,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=303975.8333333333, ans=0.125 2024-06-21 02:04:01,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.26 vs. limit=15.0 2024-06-21 02:04:03,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=303975.8333333333, ans=0.02 2024-06-21 02:04:06,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2024-06-21 02:04:13,364 INFO [train.py:1028] (0/2) Epoch 17, batch 3950, loss[loss=0.1808, simple_loss=0.2227, pruned_loss=0.06944, over 13117.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2443, pruned_loss=0.07272, over 2587937.90 frames. 
], batch size: 132, lr: 3.44e-03, grad_scale: 128.0 2024-06-21 02:04:19,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304012.5, ans=0.1 2024-06-21 02:04:20,333 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.853e+02 2.013e+02 2.170e+02 3.320e+02, threshold=4.026e+02, percent-clipped=0.0 2024-06-21 02:04:37,772 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.163e-01 2024-06-21 02:04:38,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=304067.5, ans=0.125 2024-06-21 02:04:41,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.57 vs. limit=22.5 2024-06-21 02:04:50,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=304085.8333333333, ans=0.125 2024-06-21 02:04:53,991 INFO [train.py:1028] (0/2) Epoch 17, batch 4000, loss[loss=0.2331, simple_loss=0.2872, pruned_loss=0.0895, over 13009.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2447, pruned_loss=0.07316, over 2583995.63 frames. ], batch size: 39, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:05:17,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=304140.8333333333, ans=0.125 2024-06-21 02:05:25,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=304159.1666666667, ans=0.0 2024-06-21 02:05:27,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=304159.1666666667, ans=0.125 2024-06-21 02:05:38,167 INFO [train.py:1028] (0/2) Epoch 17, batch 4050, loss[loss=0.2106, simple_loss=0.2447, pruned_loss=0.08823, over 11063.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2448, pruned_loss=0.07351, over 2581179.61 frames. ], batch size: 304, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:05:46,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.911e+02 2.030e+02 2.140e+02 2.792e+02, threshold=4.061e+02, percent-clipped=0.0 2024-06-21 02:05:55,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=304232.5, ans=0.2 2024-06-21 02:06:01,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=304250.8333333333, ans=0.1 2024-06-21 02:06:03,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=304250.8333333333, ans=0.1 2024-06-21 02:06:08,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=304269.1666666667, ans=0.125 2024-06-21 02:06:14,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=304269.1666666667, ans=0.0 2024-06-21 02:06:18,241 INFO [train.py:1028] (0/2) Epoch 17, batch 4100, loss[loss=0.1767, simple_loss=0.2214, pruned_loss=0.06596, over 13160.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2444, pruned_loss=0.07358, over 2577964.42 frames. 
], batch size: 103, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:06:28,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=304305.8333333333, ans=0.125 2024-06-21 02:06:28,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=304305.8333333333, ans=0.125 2024-06-21 02:06:28,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=304305.8333333333, ans=0.125 2024-06-21 02:06:32,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=304305.8333333333, ans=0.1 2024-06-21 02:06:41,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=304342.5, ans=0.07 2024-06-21 02:06:57,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=304360.8333333333, ans=0.125 2024-06-21 02:06:58,693 INFO [train.py:1028] (0/2) Epoch 17, batch 4150, loss[loss=0.1793, simple_loss=0.2365, pruned_loss=0.06106, over 13168.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2443, pruned_loss=0.07339, over 2576615.65 frames. ], batch size: 55, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:07:01,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=304379.1666666667, ans=0.125 2024-06-21 02:07:01,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=304379.1666666667, ans=0.0 2024-06-21 02:07:10,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=304397.5, ans=0.5 2024-06-21 02:07:11,449 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.859e+02 2.024e+02 2.252e+02 2.858e+02, threshold=4.048e+02, percent-clipped=0.0 2024-06-21 02:07:28,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=304434.1666666667, ans=0.1 2024-06-21 02:07:29,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=304434.1666666667, ans=0.025 2024-06-21 02:07:41,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=304452.5, ans=0.0 2024-06-21 02:07:46,354 INFO [train.py:1028] (0/2) Epoch 17, batch 4200, loss[loss=0.2142, simple_loss=0.2558, pruned_loss=0.08633, over 13137.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2437, pruned_loss=0.07342, over 2579217.48 frames. ], batch size: 103, lr: 3.44e-03, grad_scale: 64.0 2024-06-21 02:07:53,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=304489.1666666667, ans=0.125 2024-06-21 02:07:54,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.53 vs. 
limit=22.5 2024-06-21 02:08:03,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=304507.5, ans=0.125 2024-06-21 02:08:09,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.09 vs. limit=15.0 2024-06-21 02:08:12,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=304525.8333333333, ans=0.125 2024-06-21 02:08:15,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=304525.8333333333, ans=0.05 2024-06-21 02:08:21,764 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=15.0 2024-06-21 02:08:22,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=304544.1666666667, ans=0.125 2024-06-21 02:08:23,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=304544.1666666667, ans=0.125 2024-06-21 02:08:26,082 INFO [train.py:1028] (0/2) Epoch 17, batch 4250, loss[loss=0.1907, simple_loss=0.2425, pruned_loss=0.06948, over 13315.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.243, pruned_loss=0.07301, over 2581275.00 frames. ], batch size: 46, lr: 3.44e-03, grad_scale: 32.0 2024-06-21 02:08:35,242 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 1.909e+02 2.016e+02 2.185e+02 3.151e+02, threshold=4.031e+02, percent-clipped=0.0 2024-06-21 02:08:35,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2024-06-21 02:08:37,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=304580.8333333333, ans=0.0 2024-06-21 02:08:38,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=304580.8333333333, ans=0.125 2024-06-21 02:08:40,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=304599.1666666667, ans=0.125 2024-06-21 02:08:41,205 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.60 vs. 
limit=15.0 2024-06-21 02:08:48,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=304599.1666666667, ans=0.09899494936611666 2024-06-21 02:08:49,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=304617.5, ans=0.125 2024-06-21 02:08:52,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=304617.5, ans=0.125 2024-06-21 02:08:54,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=304617.5, ans=0.2 2024-06-21 02:09:03,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=304635.8333333333, ans=0.125 2024-06-21 02:09:05,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.94 vs. limit=15.0 2024-06-21 02:09:05,468 INFO [train.py:1028] (0/2) Epoch 17, batch 4300, loss[loss=0.2024, simple_loss=0.2427, pruned_loss=0.08105, over 13214.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2426, pruned_loss=0.07286, over 2582069.74 frames. ], batch size: 59, lr: 3.44e-03, grad_scale: 32.0 2024-06-21 02:09:15,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304672.5, ans=0.1 2024-06-21 02:09:18,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=304672.5, ans=0.125 2024-06-21 02:09:35,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=304709.1666666667, ans=0.125 2024-06-21 02:09:38,306 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.15 vs. limit=22.5 2024-06-21 02:09:47,795 INFO [train.py:1028] (0/2) Epoch 17, batch 4350, loss[loss=0.179, simple_loss=0.2333, pruned_loss=0.06231, over 13161.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2414, pruned_loss=0.07248, over 2585936.96 frames. ], batch size: 59, lr: 3.44e-03, grad_scale: 32.0 2024-06-21 02:10:01,026 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.834e+02 1.960e+02 2.109e+02 2.921e+02, threshold=3.920e+02, percent-clipped=0.0 2024-06-21 02:10:18,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=304800.8333333333, ans=0.0 2024-06-21 02:10:24,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.18 vs. limit=22.5 2024-06-21 02:10:26,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=304819.1666666667, ans=0.0 2024-06-21 02:10:31,548 INFO [train.py:1028] (0/2) Epoch 17, batch 4400, loss[loss=0.2014, simple_loss=0.2472, pruned_loss=0.07781, over 13200.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.242, pruned_loss=0.07265, over 2585767.67 frames. 
], batch size: 83, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:10:37,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=304837.5, ans=0.025 2024-06-21 02:10:38,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=304855.8333333333, ans=0.2 2024-06-21 02:10:40,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304855.8333333333, ans=0.1 2024-06-21 02:10:47,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=304874.1666666667, ans=0.125 2024-06-21 02:10:49,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=304874.1666666667, ans=0.0 2024-06-21 02:10:50,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=304874.1666666667, ans=0.0 2024-06-21 02:10:53,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2024-06-21 02:10:55,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304892.5, ans=0.1 2024-06-21 02:10:58,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=304892.5, ans=0.125 2024-06-21 02:10:59,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=304892.5, ans=0.025 2024-06-21 02:11:10,968 INFO [train.py:1028] (0/2) Epoch 17, batch 4450, loss[loss=0.1919, simple_loss=0.2459, pruned_loss=0.06895, over 12880.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2431, pruned_loss=0.07343, over 2580115.87 frames. ], batch size: 33, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:11:11,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=304929.1666666667, ans=0.1 2024-06-21 02:11:20,164 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 1.925e+02 2.122e+02 2.408e+02 3.035e+02, threshold=4.243e+02, percent-clipped=0.0 2024-06-21 02:11:20,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.42 vs. limit=15.0 2024-06-21 02:11:29,512 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:11:35,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=304984.1666666667, ans=0.125 2024-06-21 02:11:48,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=305002.5, ans=0.125 2024-06-21 02:11:50,017 INFO [train.py:1028] (0/2) Epoch 17, batch 4500, loss[loss=0.1812, simple_loss=0.2293, pruned_loss=0.06655, over 13226.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2423, pruned_loss=0.07304, over 2585230.44 frames. 
], batch size: 89, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:11:52,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=305020.8333333333, ans=0.0 2024-06-21 02:12:03,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=305039.1666666667, ans=0.125 2024-06-21 02:12:08,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=305039.1666666667, ans=0.125 2024-06-21 02:12:08,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.30 vs. limit=15.0 2024-06-21 02:12:15,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=305057.5, ans=0.125 2024-06-21 02:12:37,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=16.33 vs. limit=15.0 2024-06-21 02:12:37,364 INFO [train.py:1028] (0/2) Epoch 17, batch 4550, loss[loss=0.209, simple_loss=0.2662, pruned_loss=0.07588, over 13289.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2416, pruned_loss=0.07257, over 2588336.64 frames. ], batch size: 52, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:12:44,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.05 vs. limit=22.5 2024-06-21 02:12:46,556 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 1.819e+02 1.917e+02 2.043e+02 2.609e+02, threshold=3.833e+02, percent-clipped=0.0 2024-06-21 02:12:48,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=305130.8333333333, ans=0.0 2024-06-21 02:12:54,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=305149.1666666667, ans=0.125 2024-06-21 02:13:01,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=305167.5, ans=0.125 2024-06-21 02:13:07,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=15.0 2024-06-21 02:13:07,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2024-06-21 02:13:12,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=305185.8333333333, ans=0.125 2024-06-21 02:13:15,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.58 vs. limit=22.5 2024-06-21 02:13:16,546 INFO [train.py:1028] (0/2) Epoch 17, batch 4600, loss[loss=0.2117, simple_loss=0.2519, pruned_loss=0.08578, over 12482.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2422, pruned_loss=0.0728, over 2585267.19 frames. 
], batch size: 202, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:13:23,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=305222.5, ans=0.015 2024-06-21 02:13:48,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=305277.5, ans=0.125 2024-06-21 02:13:50,427 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.97 vs. limit=15.0 2024-06-21 02:13:55,384 INFO [train.py:1028] (0/2) Epoch 17, batch 4650, loss[loss=0.186, simple_loss=0.224, pruned_loss=0.07395, over 13072.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2419, pruned_loss=0.07294, over 2589033.20 frames. ], batch size: 132, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:14:04,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 1.932e+02 2.073e+02 2.235e+02 3.046e+02, threshold=4.147e+02, percent-clipped=0.0 2024-06-21 02:14:28,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=305350.8333333333, ans=0.0 2024-06-21 02:14:36,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=305369.1666666667, ans=0.125 2024-06-21 02:14:37,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2024-06-21 02:14:39,164 INFO [train.py:1028] (0/2) Epoch 17, batch 4700, loss[loss=0.1827, simple_loss=0.2335, pruned_loss=0.06594, over 12368.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2423, pruned_loss=0.07324, over 2583152.98 frames. ], batch size: 25, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:14:40,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=305387.5, ans=0.2 2024-06-21 02:14:42,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=305387.5, ans=0.0 2024-06-21 02:14:45,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=305387.5, ans=0.125 2024-06-21 02:14:50,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=305405.8333333333, ans=0.125 2024-06-21 02:14:50,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=12.0 2024-06-21 02:14:55,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.46 vs. limit=15.0 2024-06-21 02:15:03,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=305424.1666666667, ans=0.125 2024-06-21 02:15:23,098 INFO [train.py:1028] (0/2) Epoch 17, batch 4750, loss[loss=0.2048, simple_loss=0.2517, pruned_loss=0.07897, over 12576.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2418, pruned_loss=0.07315, over 2579970.18 frames. 
], batch size: 202, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:15:32,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=305497.5, ans=0.125 2024-06-21 02:15:33,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 1.869e+02 2.004e+02 2.200e+02 2.630e+02, threshold=4.007e+02, percent-clipped=0.0 2024-06-21 02:15:36,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=305497.5, ans=0.125 2024-06-21 02:15:48,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=305534.1666666667, ans=0.125 2024-06-21 02:16:01,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=305552.5, ans=6.0 2024-06-21 02:16:03,060 INFO [train.py:1028] (0/2) Epoch 17, batch 4800, loss[loss=0.205, simple_loss=0.2498, pruned_loss=0.08013, over 13302.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2411, pruned_loss=0.0727, over 2577554.99 frames. ], batch size: 63, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:16:15,521 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2024-06-21 02:16:34,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305644.1666666667, ans=0.1 2024-06-21 02:16:37,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305644.1666666667, ans=0.1 2024-06-21 02:16:39,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.39 vs. limit=6.0 2024-06-21 02:16:45,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. limit=6.0 2024-06-21 02:16:45,537 INFO [train.py:1028] (0/2) Epoch 17, batch 4850, loss[loss=0.2119, simple_loss=0.2593, pruned_loss=0.08229, over 13203.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2404, pruned_loss=0.07221, over 2576468.23 frames. ], batch size: 89, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:16:47,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=305662.5, ans=0.0 2024-06-21 02:16:48,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.67 vs. 
limit=15.0 2024-06-21 02:16:55,518 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.817e+02 1.926e+02 2.083e+02 2.693e+02, threshold=3.853e+02, percent-clipped=0.0 2024-06-21 02:16:57,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=305680.8333333333, ans=0.125 2024-06-21 02:16:58,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=305680.8333333333, ans=0.025 2024-06-21 02:17:04,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=305699.1666666667, ans=0.125 2024-06-21 02:17:05,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-21 02:17:12,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=305699.1666666667, ans=0.04949747468305833 2024-06-21 02:17:12,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=305699.1666666667, ans=0.125 2024-06-21 02:17:13,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.52 vs. limit=15.0 2024-06-21 02:17:14,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.21 vs. limit=15.0 2024-06-21 02:17:31,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=305754.1666666667, ans=0.0 2024-06-21 02:17:32,078 INFO [train.py:1028] (0/2) Epoch 17, batch 4900, loss[loss=0.176, simple_loss=0.23, pruned_loss=0.06101, over 13169.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2406, pruned_loss=0.07221, over 2577097.90 frames. ], batch size: 59, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:17:39,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=305772.5, ans=0.125 2024-06-21 02:17:40,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=305772.5, ans=0.125 2024-06-21 02:17:40,551 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.76 vs. 
limit=22.5 2024-06-21 02:17:42,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=305772.5, ans=0.0 2024-06-21 02:17:48,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=305790.8333333333, ans=0.2 2024-06-21 02:17:56,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=305809.1666666667, ans=0.05 2024-06-21 02:18:06,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=305827.5, ans=0.2 2024-06-21 02:18:10,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305827.5, ans=0.1 2024-06-21 02:18:10,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=305827.5, ans=0.5 2024-06-21 02:18:12,467 INFO [train.py:1028] (0/2) Epoch 17, batch 4950, loss[loss=0.2022, simple_loss=0.2473, pruned_loss=0.07856, over 11173.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2408, pruned_loss=0.07263, over 2571033.35 frames. ], batch size: 303, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:18:17,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=305845.8333333333, ans=0.125 2024-06-21 02:18:18,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=305845.8333333333, ans=0.2 2024-06-21 02:18:19,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305864.1666666667, ans=0.1 2024-06-21 02:18:22,125 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.588e+02 1.855e+02 1.973e+02 2.141e+02 2.935e+02, threshold=3.945e+02, percent-clipped=0.0 2024-06-21 02:18:43,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=305900.8333333333, ans=0.125 2024-06-21 02:18:43,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0 2024-06-21 02:18:43,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=305919.1666666667, ans=0.0 2024-06-21 02:18:44,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305919.1666666667, ans=0.1 2024-06-21 02:18:52,370 INFO [train.py:1028] (0/2) Epoch 17, batch 5000, loss[loss=0.1987, simple_loss=0.2415, pruned_loss=0.07795, over 13189.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2404, pruned_loss=0.07237, over 2575180.14 frames. 
], batch size: 95, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:18:58,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=305937.5, ans=0.0 2024-06-21 02:19:11,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305955.8333333333, ans=0.1 2024-06-21 02:19:28,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=306010.8333333333, ans=0.125 2024-06-21 02:19:40,925 INFO [train.py:1028] (0/2) Epoch 17, batch 5050, loss[loss=0.1821, simple_loss=0.2371, pruned_loss=0.0635, over 12947.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2406, pruned_loss=0.07237, over 2573176.02 frames. ], batch size: 36, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:19:42,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2024-06-21 02:19:43,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=306029.1666666667, ans=0.125 2024-06-21 02:19:47,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=306029.1666666667, ans=0.125 2024-06-21 02:19:49,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=306047.5, ans=0.2 2024-06-21 02:19:50,385 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 1.876e+02 1.996e+02 2.240e+02 2.941e+02, threshold=3.993e+02, percent-clipped=0.0 2024-06-21 02:19:51,756 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.86 vs. limit=22.5 2024-06-21 02:19:56,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=306065.8333333333, ans=0.125 2024-06-21 02:20:15,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2024-06-21 02:20:21,923 INFO [train.py:1028] (0/2) Epoch 17, batch 5100, loss[loss=0.1752, simple_loss=0.2248, pruned_loss=0.06284, over 12950.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2403, pruned_loss=0.07257, over 2569204.67 frames. ], batch size: 39, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:20:22,222 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.986e+01 2024-06-21 02:20:24,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=306120.8333333333, ans=0.125 2024-06-21 02:20:24,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=306120.8333333333, ans=0.125 2024-06-21 02:20:29,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.11 vs. 
limit=15.0 2024-06-21 02:20:31,647 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:20:35,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=306139.1666666667, ans=0.125 2024-06-21 02:20:40,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=306157.5, ans=0.125 2024-06-21 02:21:00,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=15.0 2024-06-21 02:21:01,062 INFO [train.py:1028] (0/2) Epoch 17, batch 5150, loss[loss=0.1826, simple_loss=0.2256, pruned_loss=0.06975, over 13078.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2406, pruned_loss=0.07279, over 2571112.02 frames. ], batch size: 132, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:21:07,726 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2024-06-21 02:21:10,228 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.934e+02 2.140e+02 2.309e+02 4.282e+02, threshold=4.279e+02, percent-clipped=1.0 2024-06-21 02:21:11,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=306230.8333333333, ans=0.2 2024-06-21 02:21:18,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306249.1666666667, ans=0.125 2024-06-21 02:21:34,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306285.8333333333, ans=0.1 2024-06-21 02:21:42,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=306304.1666666667, ans=0.2 2024-06-21 02:21:43,259 INFO [train.py:1028] (0/2) Epoch 17, batch 5200, loss[loss=0.182, simple_loss=0.2294, pruned_loss=0.06735, over 13219.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2404, pruned_loss=0.07287, over 2574913.25 frames. ], batch size: 95, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:21:44,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2024-06-21 02:21:54,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=306322.5, ans=0.2 2024-06-21 02:22:00,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=306322.5, ans=0.125 2024-06-21 02:22:27,113 INFO [train.py:1028] (0/2) Epoch 17, batch 5250, loss[loss=0.1984, simple_loss=0.2504, pruned_loss=0.07324, over 13282.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2409, pruned_loss=0.07294, over 2570858.65 frames. 
], batch size: 52, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:22:34,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306414.1666666667, ans=0.1 2024-06-21 02:22:36,736 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 1.994e+02 2.125e+02 2.430e+02 3.341e+02, threshold=4.249e+02, percent-clipped=0.0 2024-06-21 02:22:38,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=306414.1666666667, ans=0.2 2024-06-21 02:22:40,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=306414.1666666667, ans=0.2 2024-06-21 02:22:48,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=306432.5, ans=0.0 2024-06-21 02:22:51,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306450.8333333333, ans=0.1 2024-06-21 02:22:51,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=306450.8333333333, ans=0.125 2024-06-21 02:23:01,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=306469.1666666667, ans=0.2 2024-06-21 02:23:05,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=306469.1666666667, ans=0.125 2024-06-21 02:23:07,502 INFO [train.py:1028] (0/2) Epoch 17, batch 5300, loss[loss=0.1979, simple_loss=0.2421, pruned_loss=0.07689, over 12987.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2408, pruned_loss=0.07312, over 2567858.60 frames. ], batch size: 144, lr: 3.43e-03, grad_scale: 32.0 2024-06-21 02:23:32,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=306542.5, ans=0.125 2024-06-21 02:23:36,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=306542.5, ans=0.2 2024-06-21 02:23:39,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=306560.8333333333, ans=0.0 2024-06-21 02:23:39,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=306560.8333333333, ans=0.125 2024-06-21 02:23:51,972 INFO [train.py:1028] (0/2) Epoch 17, batch 5350, loss[loss=0.1849, simple_loss=0.243, pruned_loss=0.06341, over 12013.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2404, pruned_loss=0.073, over 2574754.86 frames. 
], batch size: 17, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:23:53,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=306579.1666666667, ans=0.0 2024-06-21 02:23:54,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=306579.1666666667, ans=0.09899494936611666 2024-06-21 02:24:01,252 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.903e+02 2.064e+02 2.231e+02 2.904e+02, threshold=4.127e+02, percent-clipped=0.0 2024-06-21 02:24:02,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=306597.5, ans=0.025 2024-06-21 02:24:16,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=306615.8333333333, ans=0.2 2024-06-21 02:24:16,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306615.8333333333, ans=0.1 2024-06-21 02:24:20,987 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.966e-01 2024-06-21 02:24:25,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=306634.1666666667, ans=0.125 2024-06-21 02:24:25,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=306634.1666666667, ans=0.0 2024-06-21 02:24:34,268 INFO [train.py:1028] (0/2) Epoch 17, batch 5400, loss[loss=0.2154, simple_loss=0.2519, pruned_loss=0.0894, over 12180.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2404, pruned_loss=0.07312, over 2567530.19 frames. ], batch size: 240, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:24:39,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=306670.8333333333, ans=0.05 2024-06-21 02:24:40,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=306670.8333333333, ans=0.125 2024-06-21 02:24:46,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=306689.1666666667, ans=0.125 2024-06-21 02:24:48,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.05 vs. limit=15.0 2024-06-21 02:24:57,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=306707.5, ans=0.0 2024-06-21 02:24:59,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=306725.8333333333, ans=0.125 2024-06-21 02:25:07,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=306744.1666666667, ans=0.125 2024-06-21 02:25:07,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.95 vs. 
limit=22.5 2024-06-21 02:25:08,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=306744.1666666667, ans=0.09899494936611666 2024-06-21 02:25:14,994 INFO [train.py:1028] (0/2) Epoch 17, batch 5450, loss[loss=0.1781, simple_loss=0.2356, pruned_loss=0.06032, over 12732.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2408, pruned_loss=0.0732, over 2572090.29 frames. ], batch size: 26, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:25:20,034 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:25:24,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=306780.8333333333, ans=0.125 2024-06-21 02:25:24,556 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.899e+02 2.055e+02 2.242e+02 3.252e+02, threshold=4.111e+02, percent-clipped=0.0 2024-06-21 02:25:28,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=306780.8333333333, ans=0.2 2024-06-21 02:25:30,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=306799.1666666667, ans=0.0 2024-06-21 02:25:30,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=306799.1666666667, ans=0.2 2024-06-21 02:25:33,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=306799.1666666667, ans=0.125 2024-06-21 02:25:51,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=306835.8333333333, ans=0.05 2024-06-21 02:25:51,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=306835.8333333333, ans=0.125 2024-06-21 02:25:54,852 INFO [train.py:1028] (0/2) Epoch 17, batch 5500, loss[loss=0.2142, simple_loss=0.2496, pruned_loss=0.08939, over 12263.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2405, pruned_loss=0.07301, over 2564772.86 frames. ], batch size: 241, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:26:12,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2024-06-21 02:26:35,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=306927.5, ans=0.0 2024-06-21 02:26:36,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=306927.5, ans=0.125 2024-06-21 02:26:38,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306927.5, ans=0.0 2024-06-21 02:26:42,028 INFO [train.py:1028] (0/2) Epoch 17, batch 5550, loss[loss=0.1862, simple_loss=0.2354, pruned_loss=0.06845, over 13273.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2399, pruned_loss=0.07251, over 2568501.42 frames. 
], batch size: 43, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:26:49,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=306964.1666666667, ans=0.125 2024-06-21 02:26:51,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.868e+02 1.995e+02 2.166e+02 3.302e+02, threshold=3.990e+02, percent-clipped=0.0 2024-06-21 02:26:54,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=306964.1666666667, ans=0.2 2024-06-21 02:26:58,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306982.5, ans=0.1 2024-06-21 02:27:05,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.83 vs. limit=15.0 2024-06-21 02:27:06,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=307000.8333333333, ans=0.125 2024-06-21 02:27:16,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=307019.1666666667, ans=0.0 2024-06-21 02:27:21,688 INFO [train.py:1028] (0/2) Epoch 17, batch 5600, loss[loss=0.1876, simple_loss=0.2326, pruned_loss=0.07135, over 13248.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2392, pruned_loss=0.07226, over 2569805.76 frames. ], batch size: 89, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:27:30,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=307055.8333333333, ans=0.0 2024-06-21 02:27:31,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=307055.8333333333, ans=0.025 2024-06-21 02:27:55,639 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=12.0 2024-06-21 02:28:01,616 INFO [train.py:1028] (0/2) Epoch 17, batch 5650, loss[loss=0.1961, simple_loss=0.2448, pruned_loss=0.07374, over 12465.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2399, pruned_loss=0.07248, over 2574747.04 frames. ], batch size: 202, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:28:11,413 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.678e+02 1.916e+02 2.067e+02 2.261e+02 3.879e+02, threshold=4.134e+02, percent-clipped=0.0 2024-06-21 02:28:13,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307147.5, ans=0.1 2024-06-21 02:28:17,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=307165.8333333333, ans=0.09899494936611666 2024-06-21 02:28:19,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=307165.8333333333, ans=0.025 2024-06-21 02:28:44,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=307220.8333333333, ans=0.0 2024-06-21 02:28:44,839 INFO [train.py:1028] (0/2) Epoch 17, batch 5700, loss[loss=0.19, simple_loss=0.2432, pruned_loss=0.06845, over 13298.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2396, pruned_loss=0.07221, over 2577003.94 frames. 
], batch size: 63, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:28:52,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=307220.8333333333, ans=0.2 2024-06-21 02:29:15,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.81 vs. limit=22.5 2024-06-21 02:29:28,283 INFO [train.py:1028] (0/2) Epoch 17, batch 5750, loss[loss=0.2097, simple_loss=0.2507, pruned_loss=0.08435, over 12764.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2403, pruned_loss=0.07244, over 2578467.36 frames. ], batch size: 176, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:29:36,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307330.8333333333, ans=0.1 2024-06-21 02:29:37,945 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.898e+02 2.007e+02 2.225e+02 3.342e+02, threshold=4.014e+02, percent-clipped=0.0 2024-06-21 02:29:39,390 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.86 vs. limit=12.0 2024-06-21 02:29:58,958 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:29:59,338 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.52 vs. limit=22.5 2024-06-21 02:30:03,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=307385.8333333333, ans=0.125 2024-06-21 02:30:07,518 INFO [train.py:1028] (0/2) Epoch 17, batch 5800, loss[loss=0.2073, simple_loss=0.2497, pruned_loss=0.08243, over 12792.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2419, pruned_loss=0.07345, over 2578145.26 frames. ], batch size: 176, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:30:10,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=307404.1666666667, ans=0.1 2024-06-21 02:30:29,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=307440.8333333333, ans=0.125 2024-06-21 02:30:32,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=307459.1666666667, ans=0.2 2024-06-21 02:30:34,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.52 vs. limit=15.0 2024-06-21 02:30:34,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2024-06-21 02:30:40,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.98 vs. limit=15.0 2024-06-21 02:30:42,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307477.5, ans=0.1 2024-06-21 02:30:48,625 INFO [train.py:1028] (0/2) Epoch 17, batch 5850, loss[loss=0.2107, simple_loss=0.2552, pruned_loss=0.08311, over 12509.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2436, pruned_loss=0.07401, over 2577637.28 frames. 
], batch size: 202, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:30:54,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=307495.8333333333, ans=0.125 2024-06-21 02:31:02,685 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 1.955e+02 2.132e+02 2.291e+02 3.003e+02, threshold=4.264e+02, percent-clipped=0.0 2024-06-21 02:31:05,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=307514.1666666667, ans=0.125 2024-06-21 02:31:05,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307514.1666666667, ans=0.125 2024-06-21 02:31:33,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=307569.1666666667, ans=0.07 2024-06-21 02:31:37,972 INFO [train.py:1028] (0/2) Epoch 17, batch 5900, loss[loss=0.2174, simple_loss=0.2538, pruned_loss=0.09056, over 13169.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2452, pruned_loss=0.07445, over 2577571.37 frames. ], batch size: 121, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:31:47,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=307605.8333333333, ans=0.2 2024-06-21 02:31:51,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=12.0 2024-06-21 02:31:52,980 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.65 vs. limit=22.5 2024-06-21 02:32:18,774 INFO [train.py:1028] (0/2) Epoch 17, batch 5950, loss[loss=0.1937, simple_loss=0.2383, pruned_loss=0.0745, over 13080.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2462, pruned_loss=0.07495, over 2581674.79 frames. ], batch size: 121, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:32:20,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=307679.1666666667, ans=0.0 2024-06-21 02:32:20,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=307679.1666666667, ans=0.125 2024-06-21 02:32:26,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=307697.5, ans=0.0 2024-06-21 02:32:28,134 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 1.957e+02 2.125e+02 2.371e+02 3.118e+02, threshold=4.249e+02, percent-clipped=0.0 2024-06-21 02:32:37,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=307715.8333333333, ans=0.0 2024-06-21 02:32:39,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=307715.8333333333, ans=0.025 2024-06-21 02:32:42,263 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:32:58,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=307770.8333333333, ans=0.0 2024-06-21 02:32:58,746 INFO [train.py:1028] (0/2) Epoch 17, batch 6000, loss[loss=0.231, simple_loss=0.274, pruned_loss=0.09396, over 12256.00 frames. 
], tot_loss[loss=0.1998, simple_loss=0.2482, pruned_loss=0.07575, over 2574982.17 frames. ], batch size: 241, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:32:58,747 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 02:33:07,630 INFO [train.py:1060] (0/2) Epoch 17, validation: loss=0.188, simple_loss=0.253, pruned_loss=0.06152, over 351949.00 frames. 2024-06-21 02:33:07,630 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 02:33:09,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=307770.8333333333, ans=0.07 2024-06-21 02:33:18,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307789.1666666667, ans=0.1 2024-06-21 02:33:52,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=307844.1666666667, ans=0.125 2024-06-21 02:33:55,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=307844.1666666667, ans=0.125 2024-06-21 02:33:56,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.07 vs. limit=10.0 2024-06-21 02:33:56,807 INFO [train.py:1028] (0/2) Epoch 17, batch 6050, loss[loss=0.1822, simple_loss=0.24, pruned_loss=0.0622, over 13257.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2497, pruned_loss=0.07605, over 2578402.43 frames. ], batch size: 40, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:33:58,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=307862.5, ans=0.0 2024-06-21 02:34:00,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=307862.5, ans=0.2 2024-06-21 02:34:06,388 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.704e+02 1.954e+02 2.082e+02 2.313e+02 3.295e+02, threshold=4.164e+02, percent-clipped=0.0 2024-06-21 02:34:15,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2024-06-21 02:34:21,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=307917.5, ans=0.125 2024-06-21 02:34:36,963 INFO [train.py:1028] (0/2) Epoch 17, batch 6100, loss[loss=0.1926, simple_loss=0.2394, pruned_loss=0.07293, over 13107.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2515, pruned_loss=0.0767, over 2580046.75 frames. ], batch size: 121, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:34:54,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=307990.8333333333, ans=0.125 2024-06-21 02:34:56,830 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-168000.pt 2024-06-21 02:35:10,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=308009.1666666667, ans=0.5 2024-06-21 02:35:23,587 INFO [train.py:1028] (0/2) Epoch 17, batch 6150, loss[loss=0.1993, simple_loss=0.2413, pruned_loss=0.0786, over 10975.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2527, pruned_loss=0.07739, over 2578115.25 frames. 
], batch size: 304, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:35:23,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308045.8333333333, ans=0.1 2024-06-21 02:35:24,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=308045.8333333333, ans=0.0 2024-06-21 02:35:33,173 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 2.006e+02 2.190e+02 2.618e+02 4.126e+02, threshold=4.380e+02, percent-clipped=0.0 2024-06-21 02:35:34,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=308064.1666666667, ans=0.025 2024-06-21 02:35:36,800 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.06 vs. limit=15.0 2024-06-21 02:35:47,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308100.8333333333, ans=0.125 2024-06-21 02:35:48,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=308100.8333333333, ans=0.125 2024-06-21 02:35:50,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=308100.8333333333, ans=0.2 2024-06-21 02:35:59,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.41 vs. limit=15.0 2024-06-21 02:36:01,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=308119.1666666667, ans=0.2 2024-06-21 02:36:07,207 INFO [train.py:1028] (0/2) Epoch 17, batch 6200, loss[loss=0.2336, simple_loss=0.2895, pruned_loss=0.08889, over 13227.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2544, pruned_loss=0.07816, over 2574404.41 frames. ], batch size: 89, lr: 3.42e-03, grad_scale: 32.0 2024-06-21 02:36:34,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.30 vs. limit=22.5 2024-06-21 02:36:35,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=308192.5, ans=0.05 2024-06-21 02:36:42,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308192.5, ans=0.1 2024-06-21 02:36:43,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=308210.8333333333, ans=0.125 2024-06-21 02:36:51,770 INFO [train.py:1028] (0/2) Epoch 17, batch 6250, loss[loss=0.208, simple_loss=0.2623, pruned_loss=0.07687, over 13262.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2549, pruned_loss=0.07833, over 2568247.46 frames. 
], batch size: 83, lr: 3.42e-03, grad_scale: 64.0 2024-06-21 02:36:54,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=308229.1666666667, ans=0.125 2024-06-21 02:37:01,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 1.960e+02 2.071e+02 2.319e+02 2.862e+02, threshold=4.142e+02, percent-clipped=0.0 2024-06-21 02:37:03,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.21 vs. limit=15.0 2024-06-21 02:37:07,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308265.8333333333, ans=0.125 2024-06-21 02:37:11,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=308265.8333333333, ans=0.0 2024-06-21 02:37:27,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=308302.5, ans=0.125 2024-06-21 02:37:30,458 INFO [train.py:1028] (0/2) Epoch 17, batch 6300, loss[loss=0.1887, simple_loss=0.248, pruned_loss=0.06472, over 11368.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.257, pruned_loss=0.07912, over 2563621.03 frames. ], batch size: 16, lr: 3.42e-03, grad_scale: 64.0 2024-06-21 02:37:32,817 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.60 vs. limit=15.0 2024-06-21 02:37:42,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=308339.1666666667, ans=0.125 2024-06-21 02:37:47,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.10 vs. limit=10.0 2024-06-21 02:37:50,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.72 vs. limit=22.5 2024-06-21 02:37:57,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=308375.8333333333, ans=15.0 2024-06-21 02:37:57,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=308375.8333333333, ans=0.2 2024-06-21 02:37:59,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=308375.8333333333, ans=0.125 2024-06-21 02:38:00,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308375.8333333333, ans=0.125 2024-06-21 02:38:04,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=308394.1666666667, ans=0.125 2024-06-21 02:38:10,094 INFO [train.py:1028] (0/2) Epoch 17, batch 6350, loss[loss=0.2302, simple_loss=0.2784, pruned_loss=0.09105, over 12495.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2587, pruned_loss=0.07942, over 2573049.31 frames. 
], batch size: 202, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:38:14,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=308412.5, ans=0.125 2024-06-21 02:38:14,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=308412.5, ans=0.0 2024-06-21 02:38:19,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 1.994e+02 2.158e+02 2.308e+02 3.507e+02, threshold=4.316e+02, percent-clipped=0.0 2024-06-21 02:38:30,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.49 vs. limit=6.0 2024-06-21 02:38:32,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.35 vs. limit=12.0 2024-06-21 02:38:57,110 INFO [train.py:1028] (0/2) Epoch 17, batch 6400, loss[loss=0.2156, simple_loss=0.2706, pruned_loss=0.08034, over 13246.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2607, pruned_loss=0.08022, over 2573473.78 frames. ], batch size: 67, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:38:59,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=308504.1666666667, ans=0.05 2024-06-21 02:39:00,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=308504.1666666667, ans=0.0 2024-06-21 02:39:08,068 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:39:09,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=308522.5, ans=0.125 2024-06-21 02:39:21,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.39 vs. limit=15.0 2024-06-21 02:39:34,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=308577.5, ans=0.0 2024-06-21 02:39:37,326 INFO [train.py:1028] (0/2) Epoch 17, batch 6450, loss[loss=0.25, simple_loss=0.2834, pruned_loss=0.1083, over 12568.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2617, pruned_loss=0.08046, over 2579932.04 frames. 
], batch size: 202, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:39:46,728 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.033e+02 2.254e+02 2.536e+02 3.302e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-21 02:39:54,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=308632.5, ans=15.0 2024-06-21 02:40:06,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=308650.8333333333, ans=0.1 2024-06-21 02:40:09,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=308669.1666666667, ans=0.0 2024-06-21 02:40:09,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=308669.1666666667, ans=0.125 2024-06-21 02:40:10,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=308669.1666666667, ans=0.0 2024-06-21 02:40:15,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=15.0 2024-06-21 02:40:15,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=308687.5, ans=0.125 2024-06-21 02:40:16,274 INFO [train.py:1028] (0/2) Epoch 17, batch 6500, loss[loss=0.2226, simple_loss=0.2599, pruned_loss=0.09267, over 10846.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2625, pruned_loss=0.08066, over 2582724.91 frames. ], batch size: 304, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:40:18,595 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.37 vs. limit=6.0 2024-06-21 02:40:29,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308705.8333333333, ans=0.1 2024-06-21 02:40:45,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2024-06-21 02:40:47,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=308760.8333333333, ans=0.07 2024-06-21 02:40:48,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=308760.8333333333, ans=0.1 2024-06-21 02:40:49,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2024-06-21 02:40:56,185 INFO [train.py:1028] (0/2) Epoch 17, batch 6550, loss[loss=0.1912, simple_loss=0.259, pruned_loss=0.06165, over 12543.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2638, pruned_loss=0.08079, over 2587359.15 frames. 
], batch size: 22, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:41:05,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=308779.1666666667, ans=0.025 2024-06-21 02:41:09,426 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.016e+02 2.193e+02 2.421e+02 2.919e+02, threshold=4.386e+02, percent-clipped=0.0 2024-06-21 02:41:09,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308797.5, ans=0.125 2024-06-21 02:41:23,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=308815.8333333333, ans=0.0 2024-06-21 02:41:32,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=308834.1666666667, ans=0.2 2024-06-21 02:41:43,228 INFO [train.py:1028] (0/2) Epoch 17, batch 6600, loss[loss=0.1883, simple_loss=0.2484, pruned_loss=0.06405, over 13275.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2637, pruned_loss=0.08062, over 2590253.40 frames. ], batch size: 72, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:41:58,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=308907.5, ans=0.125 2024-06-21 02:42:03,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=308907.5, ans=0.025 2024-06-21 02:42:18,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=308944.1666666667, ans=0.125 2024-06-21 02:42:19,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.24 vs. limit=22.5 2024-06-21 02:42:21,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=308962.5, ans=0.125 2024-06-21 02:42:22,527 INFO [train.py:1028] (0/2) Epoch 17, batch 6650, loss[loss=0.2272, simple_loss=0.2771, pruned_loss=0.08867, over 12999.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2654, pruned_loss=0.08114, over 2583759.21 frames. ], batch size: 158, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:42:32,562 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.053e+02 2.192e+02 2.343e+02 3.136e+02, threshold=4.383e+02, percent-clipped=0.0 2024-06-21 02:43:02,841 INFO [train.py:1028] (0/2) Epoch 17, batch 6700, loss[loss=0.2291, simple_loss=0.2791, pruned_loss=0.08955, over 12773.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2667, pruned_loss=0.08174, over 2584314.19 frames. ], batch size: 176, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:43:09,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=309054.1666666667, ans=0.2 2024-06-21 02:43:22,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=309090.8333333333, ans=0.0 2024-06-21 02:43:28,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=309090.8333333333, ans=0.125 2024-06-21 02:43:49,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.85 vs. 
limit=15.0 2024-06-21 02:43:51,895 INFO [train.py:1028] (0/2) Epoch 17, batch 6750, loss[loss=0.2705, simple_loss=0.313, pruned_loss=0.114, over 12158.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2675, pruned_loss=0.08238, over 2577763.23 frames. ], batch size: 241, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:43:52,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2024-06-21 02:43:52,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=309145.8333333333, ans=0.0 2024-06-21 02:43:54,723 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2024-06-21 02:44:00,849 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.167e+02 2.337e+02 2.607e+02 3.712e+02, threshold=4.674e+02, percent-clipped=0.0 2024-06-21 02:44:05,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=309164.1666666667, ans=0.125 2024-06-21 02:44:13,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=309182.5, ans=0.125 2024-06-21 02:44:19,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=309200.8333333333, ans=0.09899494936611666 2024-06-21 02:44:27,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=309219.1666666667, ans=22.5 2024-06-21 02:44:31,647 INFO [train.py:1028] (0/2) Epoch 17, batch 6800, loss[loss=0.197, simple_loss=0.2499, pruned_loss=0.07202, over 13256.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2693, pruned_loss=0.08308, over 2579605.13 frames. ], batch size: 67, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:44:42,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=309255.8333333333, ans=0.125 2024-06-21 02:44:43,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.55 vs. limit=15.0 2024-06-21 02:44:46,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=309274.1666666667, ans=0.0 2024-06-21 02:44:55,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=309292.5, ans=0.2 2024-06-21 02:44:55,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=12.0 2024-06-21 02:45:05,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=309310.8333333333, ans=0.125 2024-06-21 02:45:10,870 INFO [train.py:1028] (0/2) Epoch 17, batch 6850, loss[loss=0.2067, simple_loss=0.271, pruned_loss=0.07123, over 13266.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2693, pruned_loss=0.08261, over 2583191.21 frames. 
], batch size: 63, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:45:11,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=309329.1666666667, ans=0.0 2024-06-21 02:45:18,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=309347.5, ans=0.125 2024-06-21 02:45:20,181 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.084e+02 2.229e+02 2.468e+02 3.335e+02, threshold=4.459e+02, percent-clipped=0.0 2024-06-21 02:45:30,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=309365.8333333333, ans=0.025 2024-06-21 02:45:33,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=309365.8333333333, ans=0.125 2024-06-21 02:45:34,199 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.93 vs. limit=15.0 2024-06-21 02:45:37,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=309384.1666666667, ans=0.125 2024-06-21 02:45:42,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2024-06-21 02:45:46,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=309402.5, ans=0.125 2024-06-21 02:45:50,060 INFO [train.py:1028] (0/2) Epoch 17, batch 6900, loss[loss=0.2309, simple_loss=0.2855, pruned_loss=0.08814, over 13034.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2705, pruned_loss=0.08318, over 2584506.55 frames. ], batch size: 48, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:45:52,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=309420.8333333333, ans=0.125 2024-06-21 02:46:26,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=309475.8333333333, ans=0.0 2024-06-21 02:46:27,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=309475.8333333333, ans=0.0 2024-06-21 02:46:34,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=309494.1666666667, ans=0.125 2024-06-21 02:46:34,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=309494.1666666667, ans=0.2 2024-06-21 02:46:37,518 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=6.882e+01 2024-06-21 02:46:38,062 INFO [train.py:1028] (0/2) Epoch 17, batch 6950, loss[loss=0.1876, simple_loss=0.2376, pruned_loss=0.06874, over 11235.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2708, pruned_loss=0.08295, over 2578687.32 frames. 
], batch size: 16, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:46:39,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=309512.5, ans=0.07 2024-06-21 02:46:47,308 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.090e+02 2.208e+02 2.461e+02 3.231e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 02:46:48,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=12.0 2024-06-21 02:46:55,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=309549.1666666667, ans=0.2 2024-06-21 02:47:00,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=309549.1666666667, ans=0.125 2024-06-21 02:47:03,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=309567.5, ans=0.125 2024-06-21 02:47:17,300 INFO [train.py:1028] (0/2) Epoch 17, batch 7000, loss[loss=0.2298, simple_loss=0.2761, pruned_loss=0.09174, over 12907.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2703, pruned_loss=0.08244, over 2575329.77 frames. ], batch size: 158, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:47:25,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=309622.5, ans=0.125 2024-06-21 02:47:35,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309640.8333333333, ans=0.1 2024-06-21 02:47:48,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=309659.1666666667, ans=0.125 2024-06-21 02:47:50,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.94 vs. limit=15.0 2024-06-21 02:47:54,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=309677.5, ans=0.025 2024-06-21 02:47:58,349 INFO [train.py:1028] (0/2) Epoch 17, batch 7050, loss[loss=0.2371, simple_loss=0.2806, pruned_loss=0.09685, over 12697.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2713, pruned_loss=0.08278, over 2580867.85 frames. ], batch size: 176, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:48:00,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=309695.8333333333, ans=0.0 2024-06-21 02:48:03,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=309695.8333333333, ans=0.2 2024-06-21 02:48:04,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. 
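limit=12.0

The recurring "Whitening: name=..., metric=M vs. limit=L" lines come from the Whiten modules in scaling.py (the file the log cites), which track how close the covariance of a submodule's activations is to a multiple of the identity and push back when the logged metric sits above its limit. Below is a minimal sketch of such a metric, the ratio E[lambda^2]/E[lambda]^2 over the covariance eigenvalues, which is 1.0 for perfectly white features; it captures the idea but is not the exact icefall implementation:

```python
# Hedged sketch of a whitening metric like the one these log lines report:
# >= 1.0 by Jensen's inequality, == 1.0 iff the per-group covariance of the
# activations is a multiple of the identity. Not copied from scaling.py.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels) activations."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x.transpose(0, 1)                     # (groups, frames, chans_per_group)
    cov = x.transpose(1, 2) @ x / num_frames  # per-group covariance
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean()               # E[lambda]
    mean_eig_sq = ((cov ** 2).sum(dim=(1, 2)) / cov.shape[-1]).mean()  # E[lambda^2]
    return mean_eig_sq / (mean_eig ** 2 + 1e-20)

x = torch.randn(10000, 384)        # near-white features
print(whitening_metric(x).item())  # ~1.04; training penalizes metric > limit
```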
2024-06-21 02:48:07,856 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 2.154e+02 2.339e+02 2.609e+02 3.493e+02, threshold=4.678e+02, percent-clipped=0.0 2024-06-21 02:48:16,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=309732.5, ans=0.125 2024-06-21 02:48:17,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=309732.5, ans=0.0 2024-06-21 02:48:19,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309732.5, ans=0.1 2024-06-21 02:48:37,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=309769.1666666667, ans=0.125 2024-06-21 02:48:37,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0 2024-06-21 02:48:44,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0 2024-06-21 02:48:44,570 INFO [train.py:1028] (0/2) Epoch 17, batch 7100, loss[loss=0.2313, simple_loss=0.288, pruned_loss=0.08732, over 13209.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2724, pruned_loss=0.08351, over 2572828.45 frames. ], batch size: 112, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:48:51,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=309787.5, ans=0.0 2024-06-21 02:49:13,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=309842.5, ans=0.0 2024-06-21 02:49:18,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=309860.8333333333, ans=0.125 2024-06-21 02:49:24,660 INFO [train.py:1028] (0/2) Epoch 17, batch 7150, loss[loss=0.2424, simple_loss=0.2912, pruned_loss=0.09683, over 12468.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2731, pruned_loss=0.08367, over 2571392.56 frames. ], batch size: 202, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:49:26,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=309879.1666666667, ans=0.125 2024-06-21 02:49:32,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=309897.5, ans=0.0 2024-06-21 02:49:34,108 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.058e+02 2.196e+02 2.401e+02 3.381e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-21 02:49:34,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=309897.5, ans=0.0 2024-06-21 02:49:55,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=309952.5, ans=0.0 2024-06-21 02:49:59,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=309952.5, ans=0.09899494936611666 2024-06-21 02:50:03,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs.
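limit=15.0

The WARNING lines from optim.py:487 summarize the distribution of recent gradient norms: the five quartile values read as min/25%/median/75%/max, and in every such line the threshold is exactly Clipping_scale times the reported median (for example 2.0 x 2.339e+02 = 4.678e+02 above), so the clipping threshold tracks a running median of recent gradient norms rather than a fixed constant. A hedged sketch of that bookkeeping, not the actual ScaledAdam code in optim.py:

```python
# Sketch of median-relative gradient clipping consistent with the
# "Clipping_scale=2.0, grad-norm quartiles ... threshold=..." warnings:
# the threshold is clipping_scale * median over a window of recent norms.
import collections
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=window)    # recent grad norms
        self.clipped = collections.deque(maxlen=window)  # recent clip flags

    def clip_(self, params) -> None:
        params = [p for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * sorted(self.norms)[len(self.norms) // 2]
        self.clipped.append(norm > threshold)
        if norm > threshold:
            for p in params:               # scale all gradients down together
                p.grad.mul_(threshold / norm)

    def report(self) -> str:
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        return ("grad-norm quartiles "
                + " ".join(f"{v:.3e}" for v in q)
                + f", threshold={self.clipping_scale * q[2]:.3e}"
                + f", percent-clipped={100.0 * sum(self.clipped) / len(self.clipped):.1f}")
```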
2024-06-21 02:50:03,883 INFO [train.py:1028] (0/2) Epoch 17, batch 7200, loss[loss=0.2381, simple_loss=0.2882, pruned_loss=0.09401, over 13135.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2744, pruned_loss=0.08397, over 2576488.63 frames. ], batch size: 112, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:50:09,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0 2024-06-21 02:50:13,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=309989.1666666667, ans=0.0 2024-06-21 02:50:17,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=309989.1666666667, ans=0.125 2024-06-21 02:50:19,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=310007.5, ans=0.0 2024-06-21 02:50:42,201 INFO [train.py:1028] (0/2) Epoch 17, batch 7250, loss[loss=0.2035, simple_loss=0.2593, pruned_loss=0.07385, over 12923.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2749, pruned_loss=0.08381, over 2577228.18 frames. ], batch size: 36, lr: 3.41e-03, grad_scale: 64.0 2024-06-21 02:50:47,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=310062.5, ans=0.125 2024-06-21 02:50:51,422 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.039e+02 2.176e+02 2.441e+02 2.972e+02, threshold=4.353e+02, percent-clipped=0.0 2024-06-21 02:50:58,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2024-06-21 02:51:20,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=310135.8333333333, ans=0.035 2024-06-21 02:51:24,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=310135.8333333333, ans=10.0 2024-06-21 02:51:29,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=310135.8333333333, ans=0.0 2024-06-21 02:51:32,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=310154.1666666667, ans=0.125 2024-06-21 02:51:34,399 INFO [train.py:1028] (0/2) Epoch 17, batch 7300, loss[loss=0.2345, simple_loss=0.2903, pruned_loss=0.08931, over 12992.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2757, pruned_loss=0.08445, over 2577972.89 frames. ], batch size: 36, lr: 3.40e-03, grad_scale: 64.0 2024-06-21 02:51:38,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=310154.1666666667, ans=0.0 2024-06-21 02:51:43,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=310172.5, ans=0.2 2024-06-21 02:51:49,076 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:52:14,673 INFO [train.py:1028] (0/2) Epoch 17, batch 7350, loss[loss=0.2459, simple_loss=0.3005, pruned_loss=0.09568, over 13318.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2759, pruned_loss=0.08456, over 2580767.68 frames.
], batch size: 46, lr: 3.40e-03, grad_scale: 64.0 2024-06-21 02:52:17,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=310245.8333333333, ans=0.125 2024-06-21 02:52:23,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.74 vs. limit=22.5 2024-06-21 02:52:23,836 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.069e+02 2.292e+02 2.446e+02 3.199e+02, threshold=4.584e+02, percent-clipped=0.0 2024-06-21 02:52:26,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=310264.1666666667, ans=0.0 2024-06-21 02:52:28,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2024-06-21 02:52:31,578 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:52:36,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=310282.5, ans=0.0 2024-06-21 02:52:53,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310319.1666666667, ans=0.125 2024-06-21 02:52:54,405 INFO [train.py:1028] (0/2) Epoch 17, batch 7400, loss[loss=0.2296, simple_loss=0.2918, pruned_loss=0.08368, over 13210.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2761, pruned_loss=0.08455, over 2585991.85 frames. ], batch size: 63, lr: 3.40e-03, grad_scale: 64.0 2024-06-21 02:53:03,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.01 vs. limit=12.0 2024-06-21 02:53:06,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=310355.8333333333, ans=0.95 2024-06-21 02:53:10,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=310374.1666666667, ans=0.2 2024-06-21 02:53:15,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=310374.1666666667, ans=0.125 2024-06-21 02:53:15,669 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0 2024-06-21 02:53:17,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.13 vs. limit=10.0 2024-06-21 02:53:18,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.89 vs. 
limit=22.5 2024-06-21 02:53:20,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=310392.5, ans=0.5 2024-06-21 02:53:21,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=310392.5, ans=0.125 2024-06-21 02:53:23,973 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.372e+01 2024-06-21 02:53:24,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=310392.5, ans=0.0 2024-06-21 02:53:27,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=310410.8333333333, ans=0.0 2024-06-21 02:53:34,242 INFO [train.py:1028] (0/2) Epoch 17, batch 7450, loss[loss=0.2279, simple_loss=0.2804, pruned_loss=0.08777, over 12695.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2763, pruned_loss=0.08455, over 2579893.46 frames. ], batch size: 29, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:53:52,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.071e+02 2.236e+02 2.556e+02 4.142e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 02:54:01,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=310465.8333333333, ans=0.125 2024-06-21 02:54:01,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=310465.8333333333, ans=0.0 2024-06-21 02:54:08,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2024-06-21 02:54:10,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=310484.1666666667, ans=0.025 2024-06-21 02:54:21,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=310502.5, ans=0.0 2024-06-21 02:54:23,182 INFO [train.py:1028] (0/2) Epoch 17, batch 7500, loss[loss=0.2207, simple_loss=0.2636, pruned_loss=0.08891, over 10680.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2773, pruned_loss=0.08523, over 2576778.47 frames. ], batch size: 303, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:54:25,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2024-06-21 02:54:25,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=310520.8333333333, ans=0.125 2024-06-21 02:54:25,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=310520.8333333333, ans=0.125 2024-06-21 02:54:29,825 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.05 vs. 
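limit=15.0

Note that between batch 7400 and batch 7450 above, grad_scale drops from 64.0 to 32.0. With use_fp16=True the loss is multiplied by a dynamic scale before backward, and the scaler halves that scale whenever a step produces inf/NaN gradients, then slowly grows it back. A minimal, self-contained sketch using PyTorch's stock GradScaler (icefall manages its own scaler inside the training loop, and the tiny model and random data below are placeholders, not the recipe):

```python
# Hedged sketch of the fp16 dynamic loss scaling behind the grad_scale
# values in this log (64.0 halving to 32.0 after a step with inf/NaN grads).
import torch

model = torch.nn.Linear(80, 5000).cuda()   # stand-in, not the Zipformer
optimizer = torch.optim.Adam(model.parameters(), lr=3.4e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=64.0)

for step in range(3):
    features = torch.randn(8, 80, device="cuda")
    targets = torch.randint(0, 5000, (8,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(features), targets)
    scaler.scale(loss).backward()   # backward through the scaled loss
    scaler.step(optimizer)          # the step is skipped on inf/NaN grads
    scaler.update()                 # halves the scale after a bad step
    print("grad_scale:", scaler.get_scale())
```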
2024-06-21 02:54:32,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=310539.1666666667, ans=0.015 2024-06-21 02:54:41,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=310557.5, ans=0.0 2024-06-21 02:54:48,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=310575.8333333333, ans=0.025 2024-06-21 02:54:52,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310575.8333333333, ans=0.125 2024-06-21 02:54:55,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.64 vs. limit=15.0 2024-06-21 02:55:02,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=310612.5, ans=0.1 2024-06-21 02:55:02,869 INFO [train.py:1028] (0/2) Epoch 17, batch 7550, loss[loss=0.2329, simple_loss=0.2765, pruned_loss=0.09466, over 12922.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2778, pruned_loss=0.08571, over 2576496.95 frames. ], batch size: 158, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:55:05,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=310612.5, ans=0.0 2024-06-21 02:55:06,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=310612.5, ans=0.125 2024-06-21 02:55:12,830 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.154e+02 2.395e+02 2.681e+02 4.237e+02, threshold=4.790e+02, percent-clipped=0.0 2024-06-21 02:55:12,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=310630.8333333333, ans=0.125 2024-06-21 02:55:26,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.15 vs. limit=10.0 2024-06-21 02:55:29,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=310667.5, ans=0.02 2024-06-21 02:55:30,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=310667.5, ans=0.0 2024-06-21 02:55:38,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=310685.8333333333, ans=0.0 2024-06-21 02:55:42,336 INFO [train.py:1028] (0/2) Epoch 17, batch 7600, loss[loss=0.224, simple_loss=0.2683, pruned_loss=0.08989, over 13217.00 frames. ], tot_loss[loss=0.226, simple_loss=0.279, pruned_loss=0.08644, over 2574554.94 frames. ], batch size: 83, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:55:44,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310704.1666666667, ans=0.125 2024-06-21 02:56:30,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=310795.8333333333, ans=0.025 2024-06-21 02:56:30,854 INFO [train.py:1028] (0/2) Epoch 17, batch 7650, loss[loss=0.2153, simple_loss=0.273, pruned_loss=0.07875, over 13022.00 frames.
], tot_loss[loss=0.2261, simple_loss=0.2794, pruned_loss=0.0864, over 2570470.27 frames. ], batch size: 33, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:56:31,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=310795.8333333333, ans=0.125 2024-06-21 02:56:34,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=310795.8333333333, ans=0.0 2024-06-21 02:56:41,477 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.115e+02 2.289e+02 2.506e+02 3.514e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 02:56:59,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=310850.8333333333, ans=0.125 2024-06-21 02:57:07,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=310869.1666666667, ans=0.125 2024-06-21 02:57:10,719 INFO [train.py:1028] (0/2) Epoch 17, batch 7700, loss[loss=0.2206, simple_loss=0.2849, pruned_loss=0.07809, over 13247.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2806, pruned_loss=0.08692, over 2568329.65 frames. ], batch size: 63, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:57:12,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=310887.5, ans=0.0 2024-06-21 02:57:19,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=310905.8333333333, ans=0.0 2024-06-21 02:57:29,362 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 02:57:48,810 INFO [train.py:1028] (0/2) Epoch 17, batch 7750, loss[loss=0.2275, simple_loss=0.2894, pruned_loss=0.08282, over 13268.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2807, pruned_loss=0.08676, over 2573441.87 frames. ], batch size: 72, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:57:54,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=310979.1666666667, ans=0.0 2024-06-21 02:57:59,128 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.069e+02 2.191e+02 2.394e+02 3.000e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-21 02:58:33,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=311052.5, ans=0.125 2024-06-21 02:58:36,138 INFO [train.py:1028] (0/2) Epoch 17, batch 7800, loss[loss=0.2483, simple_loss=0.2911, pruned_loss=0.1028, over 13138.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2809, pruned_loss=0.08687, over 2578412.86 frames. 
], batch size: 95, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:58:49,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=311089.1666666667, ans=0.125 2024-06-21 02:58:56,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=311107.5, ans=0.125 2024-06-21 02:58:58,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=311107.5, ans=0.125 2024-06-21 02:59:16,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.08 vs. limit=15.0 2024-06-21 02:59:17,255 INFO [train.py:1028] (0/2) Epoch 17, batch 7850, loss[loss=0.2266, simple_loss=0.2836, pruned_loss=0.08483, over 11532.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2821, pruned_loss=0.08736, over 2571675.23 frames. ], batch size: 17, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:59:27,335 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.118e+02 2.274e+02 2.526e+02 3.590e+02, threshold=4.547e+02, percent-clipped=0.0 2024-06-21 02:59:30,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.60 vs. limit=10.0 2024-06-21 02:59:33,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=311199.1666666667, ans=0.05 2024-06-21 02:59:47,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=311217.5, ans=0.2 2024-06-21 02:59:51,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=311235.8333333333, ans=0.0 2024-06-21 02:59:51,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.58 vs. limit=10.0 2024-06-21 02:59:56,342 INFO [train.py:1028] (0/2) Epoch 17, batch 7900, loss[loss=0.2154, simple_loss=0.2813, pruned_loss=0.07474, over 13122.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2816, pruned_loss=0.08696, over 2570989.35 frames. ], batch size: 77, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 02:59:59,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311254.1666666667, ans=0.0 2024-06-21 03:00:05,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=311272.5, ans=0.0 2024-06-21 03:00:08,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=311272.5, ans=0.125 2024-06-21 03:00:10,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311272.5, ans=0.1 2024-06-21 03:00:20,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311309.1666666667, ans=0.1 2024-06-21 03:00:23,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.22 vs. 
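limit=15.0

The bulk of these entries are scaling.py:214 ScheduledFloat lines: regularization knobs such as skip rates, balancer probabilities, and dropout are piecewise-linear functions of batch_count rather than constants, and each line samples the current value as ans=... (by this stage of training most schedules have flattened out at their final values, e.g. dropout_p at 0.1 and the various conv_skip_rate entries at 0.0). A sketch of piecewise-linear scheduling in that spirit; the breakpoints below are illustrative, not taken from the recipe:

```python
# Sketch of the piecewise-linear scheduling behind the ScheduledFloat
# lines: each (name, ans=...) entry is one of these evaluated at the
# current batch_count. Breakpoints here are made up for illustration.
import bisect

class PiecewiseLinear:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a dropout that decays from 0.3 to a floor of 0.1 early in training
dropout_p = PiecewiseLinear((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(310000.0))  # -> 0.1, matching the late-training ans values
```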
2024-06-21 03:00:35,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=311327.5, ans=0.2 2024-06-21 03:00:43,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=311327.5, ans=0.125 2024-06-21 03:00:45,111 INFO [train.py:1028] (0/2) Epoch 17, batch 7950, loss[loss=0.2397, simple_loss=0.2826, pruned_loss=0.09839, over 10645.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2821, pruned_loss=0.08704, over 2574672.45 frames. ], batch size: 305, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:00:53,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311364.1666666667, ans=0.0 2024-06-21 03:00:53,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2024-06-21 03:00:54,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-21 03:00:55,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=311364.1666666667, ans=0.125 2024-06-21 03:00:55,936 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.094e+02 2.268e+02 2.488e+02 3.230e+02, threshold=4.536e+02, percent-clipped=0.0 2024-06-21 03:01:14,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=311400.8333333333, ans=0.125 2024-06-21 03:01:26,372 INFO [train.py:1028] (0/2) Epoch 17, batch 8000, loss[loss=0.1981, simple_loss=0.258, pruned_loss=0.06911, over 12534.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2829, pruned_loss=0.08757, over 2570924.09 frames. ], batch size: 29, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:01:34,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=311455.8333333333, ans=0.125 2024-06-21 03:01:44,765 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:01:48,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=311474.1666666667, ans=0.2 2024-06-21 03:01:51,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=15.0 2024-06-21 03:01:55,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=311492.5, ans=0.2 2024-06-21 03:02:02,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2024-06-21 03:02:04,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=311510.8333333333, ans=0.125 2024-06-21 03:02:07,176 INFO [train.py:1028] (0/2) Epoch 17, batch 8050, loss[loss=0.2236, simple_loss=0.2764, pruned_loss=0.08541, over 13160.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2826, pruned_loss=0.08742, over 2570494.36 frames.
], batch size: 83, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:02:17,186 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.127e+02 2.265e+02 2.528e+02 3.662e+02, threshold=4.531e+02, percent-clipped=0.0 2024-06-21 03:02:22,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=311565.8333333333, ans=0.0 2024-06-21 03:02:24,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=311565.8333333333, ans=0.0 2024-06-21 03:02:29,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=311584.1666666667, ans=0.5 2024-06-21 03:02:31,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=311584.1666666667, ans=0.5 2024-06-21 03:02:35,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=311584.1666666667, ans=0.125 2024-06-21 03:02:42,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=311602.5, ans=0.0 2024-06-21 03:02:42,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311602.5, ans=0.1 2024-06-21 03:02:46,124 INFO [train.py:1028] (0/2) Epoch 17, batch 8100, loss[loss=0.2232, simple_loss=0.2828, pruned_loss=0.08175, over 13153.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2833, pruned_loss=0.08758, over 2575286.02 frames. ], batch size: 112, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:02:46,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=311620.8333333333, ans=0.1 2024-06-21 03:02:47,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=311620.8333333333, ans=0.125 2024-06-21 03:02:57,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=311639.1666666667, ans=6.0 2024-06-21 03:03:28,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2024-06-21 03:03:33,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.43 vs. limit=22.5 2024-06-21 03:03:34,637 INFO [train.py:1028] (0/2) Epoch 17, batch 8150, loss[loss=0.2217, simple_loss=0.2773, pruned_loss=0.08303, over 13119.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2836, pruned_loss=0.08747, over 2578646.03 frames. ], batch size: 121, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:03:35,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=311712.5, ans=0.07 2024-06-21 03:03:36,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=311712.5, ans=0.125 2024-06-21 03:03:42,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. 
limit=15.0 2024-06-21 03:03:45,519 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.090e+02 2.194e+02 2.430e+02 2.940e+02, threshold=4.389e+02, percent-clipped=0.0 2024-06-21 03:03:47,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=311730.8333333333, ans=0.125 2024-06-21 03:03:48,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=311730.8333333333, ans=0.125 2024-06-21 03:03:51,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=311749.1666666667, ans=0.125 2024-06-21 03:03:56,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.82 vs. limit=6.0 2024-06-21 03:04:02,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2024-06-21 03:04:09,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.86 vs. limit=10.0 2024-06-21 03:04:13,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=311785.8333333333, ans=0.125 2024-06-21 03:04:15,000 INFO [train.py:1028] (0/2) Epoch 17, batch 8200, loss[loss=0.2276, simple_loss=0.2825, pruned_loss=0.08631, over 13109.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.283, pruned_loss=0.08709, over 2582467.90 frames. ], batch size: 112, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:04:21,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.60 vs. limit=5.0 2024-06-21 03:04:30,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=311840.8333333333, ans=0.2 2024-06-21 03:04:31,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2024-06-21 03:04:36,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311840.8333333333, ans=0.1 2024-06-21 03:04:43,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=311859.1666666667, ans=0.125 2024-06-21 03:04:54,344 INFO [train.py:1028] (0/2) Epoch 17, batch 8250, loss[loss=0.2112, simple_loss=0.2765, pruned_loss=0.07299, over 13327.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2843, pruned_loss=0.08752, over 2582134.58 frames. 
], batch size: 52, lr: 3.40e-03, grad_scale: 32.0 2024-06-21 03:04:55,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=311895.8333333333, ans=0.125 2024-06-21 03:05:04,411 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.107e+02 2.277e+02 2.503e+02 3.038e+02, threshold=4.554e+02, percent-clipped=0.0 2024-06-21 03:05:04,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=311914.1666666667, ans=0.125 2024-06-21 03:05:05,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.33 vs. limit=15.0 2024-06-21 03:05:13,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=311932.5, ans=0.0 2024-06-21 03:05:22,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.09 vs. limit=12.0 2024-06-21 03:05:32,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311969.1666666667, ans=0.1 2024-06-21 03:05:40,982 INFO [train.py:1028] (0/2) Epoch 17, batch 8300, loss[loss=0.2309, simple_loss=0.2825, pruned_loss=0.0897, over 13086.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2838, pruned_loss=0.08721, over 2579703.40 frames. ], batch size: 102, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:05:52,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312005.8333333333, ans=0.1 2024-06-21 03:06:00,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.53 vs. limit=10.0 2024-06-21 03:06:02,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.56 vs. limit=22.5 2024-06-21 03:06:05,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=15.0 2024-06-21 03:06:12,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=312060.8333333333, ans=0.125 2024-06-21 03:06:18,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=312060.8333333333, ans=0.125 2024-06-21 03:06:20,603 INFO [train.py:1028] (0/2) Epoch 17, batch 8350, loss[loss=0.2265, simple_loss=0.2805, pruned_loss=0.08621, over 13182.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2832, pruned_loss=0.08667, over 2580243.39 frames. ], batch size: 112, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:06:21,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=312079.1666666667, ans=0.2 2024-06-21 03:06:23,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=22.5 2024-06-21 03:06:27,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=312079.1666666667, ans=0.0 2024-06-21 03:06:30,867 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.163e+02 2.342e+02 2.629e+02 3.742e+02, threshold=4.683e+02, percent-clipped=0.0 2024-06-21 03:06:36,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=312115.8333333333, ans=0.125 2024-06-21 03:06:44,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=312134.1666666667, ans=0.125 2024-06-21 03:06:47,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0 2024-06-21 03:06:48,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=312134.1666666667, ans=0.025 2024-06-21 03:06:48,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=312134.1666666667, ans=0.125 2024-06-21 03:06:56,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=312152.5, ans=0.125 2024-06-21 03:07:06,089 INFO [train.py:1028] (0/2) Epoch 17, batch 8400, loss[loss=0.2087, simple_loss=0.2529, pruned_loss=0.08225, over 12909.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2835, pruned_loss=0.08715, over 2576339.80 frames. ], batch size: 39, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:07:19,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.32 vs. limit=15.0 2024-06-21 03:07:21,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=312189.1666666667, ans=0.1 2024-06-21 03:07:24,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=312189.1666666667, ans=0.0 2024-06-21 03:07:46,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312225.8333333333, ans=0.1 2024-06-21 03:07:58,562 INFO [train.py:1028] (0/2) Epoch 17, batch 8450, loss[loss=0.2291, simple_loss=0.286, pruned_loss=0.08607, over 13157.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2848, pruned_loss=0.08741, over 2578740.07 frames. ], batch size: 112, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:08:05,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.82 vs. 
limit=15.0 2024-06-21 03:08:18,289 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.170e+02 2.340e+02 2.535e+02 3.087e+02, threshold=4.681e+02, percent-clipped=0.0 2024-06-21 03:08:18,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=312280.8333333333, ans=0.0 2024-06-21 03:08:35,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=312299.1666666667, ans=0.2 2024-06-21 03:08:40,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2024-06-21 03:08:40,913 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:08:41,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=312317.5, ans=0.125 2024-06-21 03:08:43,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=312317.5, ans=0.5 2024-06-21 03:08:57,636 INFO [train.py:1028] (0/2) Epoch 17, batch 8500, loss[loss=0.2362, simple_loss=0.2872, pruned_loss=0.09267, over 12605.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2855, pruned_loss=0.08782, over 2577968.64 frames. ], batch size: 29, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:08:58,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2024-06-21 03:09:11,892 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:09:13,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=312372.5, ans=0.2 2024-06-21 03:09:14,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312372.5, ans=0.1 2024-06-21 03:09:28,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=312409.1666666667, ans=0.125 2024-06-21 03:09:32,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=312409.1666666667, ans=0.125 2024-06-21 03:09:37,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=312409.1666666667, ans=0.2 2024-06-21 03:09:50,957 INFO [train.py:1028] (0/2) Epoch 17, batch 8550, loss[loss=0.2161, simple_loss=0.276, pruned_loss=0.07814, over 12327.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2854, pruned_loss=0.08799, over 2576547.53 frames. 
], batch size: 22, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:10:00,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=312464.1666666667, ans=0.0 2024-06-21 03:10:03,674 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.096e+02 2.193e+02 2.434e+02 3.488e+02, threshold=4.386e+02, percent-clipped=0.0 2024-06-21 03:10:14,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=312482.5, ans=0.125 2024-06-21 03:10:17,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=312482.5, ans=0.0 2024-06-21 03:10:32,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.29 vs. limit=22.5 2024-06-21 03:10:34,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=312519.1666666667, ans=0.0 2024-06-21 03:10:34,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=312519.1666666667, ans=0.125 2024-06-21 03:10:41,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=312537.5, ans=0.1 2024-06-21 03:10:41,943 INFO [train.py:1028] (0/2) Epoch 17, batch 8600, loss[loss=0.2206, simple_loss=0.2797, pruned_loss=0.08071, over 13134.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2852, pruned_loss=0.08789, over 2574535.97 frames. ], batch size: 112, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:10:48,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=312537.5, ans=0.0 2024-06-21 03:11:02,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=312555.8333333333, ans=0.02 2024-06-21 03:11:47,733 INFO [train.py:1028] (0/2) Epoch 17, batch 8650, loss[loss=0.2342, simple_loss=0.2871, pruned_loss=0.09069, over 13005.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2857, pruned_loss=0.08763, over 2577743.72 frames. ], batch size: 102, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:11:58,352 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.110e+02 2.234e+02 2.522e+02 3.672e+02, threshold=4.469e+02, percent-clipped=0.0 2024-06-21 03:12:08,288 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:12:11,281 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=7.685e-03 2024-06-21 03:12:17,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=15.0 2024-06-21 03:12:27,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=312702.5, ans=15.0 2024-06-21 03:12:30,686 INFO [train.py:1028] (0/2) Epoch 17, batch 8700, loss[loss=0.2378, simple_loss=0.3005, pruned_loss=0.08756, over 13165.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2862, pruned_loss=0.08816, over 2575162.42 frames. 
], batch size: 59, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:12:37,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=312720.8333333333, ans=0.125 2024-06-21 03:12:41,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=312739.1666666667, ans=0.2 2024-06-21 03:12:48,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=312739.1666666667, ans=0.0 2024-06-21 03:12:58,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=312757.5, ans=0.125 2024-06-21 03:13:01,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=15.0 2024-06-21 03:13:10,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=312775.8333333333, ans=0.125 2024-06-21 03:13:19,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=312794.1666666667, ans=0.0 2024-06-21 03:13:22,765 INFO [train.py:1028] (0/2) Epoch 17, batch 8750, loss[loss=0.2362, simple_loss=0.2807, pruned_loss=0.09584, over 13105.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2855, pruned_loss=0.0879, over 2571935.45 frames. ], batch size: 121, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:13:27,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.82 vs. limit=10.0 2024-06-21 03:13:36,037 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.131e+02 2.293e+02 2.540e+02 3.300e+02, threshold=4.586e+02, percent-clipped=0.0 2024-06-21 03:13:45,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=312849.1666666667, ans=0.125 2024-06-21 03:13:54,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=312867.5, ans=0.125 2024-06-21 03:13:54,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=312867.5, ans=0.2 2024-06-21 03:13:57,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.11 vs. limit=15.0 2024-06-21 03:13:59,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=312867.5, ans=0.025 2024-06-21 03:13:59,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=312867.5, ans=0.1 2024-06-21 03:14:21,351 INFO [train.py:1028] (0/2) Epoch 17, batch 8800, loss[loss=0.2442, simple_loss=0.3004, pruned_loss=0.09401, over 13240.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.286, pruned_loss=0.08826, over 2576991.33 frames. 
], batch size: 72, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:14:22,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=312904.1666666667, ans=0.1 2024-06-21 03:14:50,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=312940.8333333333, ans=0.125 2024-06-21 03:14:52,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=312940.8333333333, ans=0.025 2024-06-21 03:15:15,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=312977.5, ans=0.025 2024-06-21 03:15:20,731 INFO [train.py:1028] (0/2) Epoch 17, batch 8850, loss[loss=0.267, simple_loss=0.3034, pruned_loss=0.1153, over 12556.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2863, pruned_loss=0.08865, over 2566359.95 frames. ], batch size: 202, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:15:33,318 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 2.228e+02 2.381e+02 2.691e+02 3.706e+02, threshold=4.762e+02, percent-clipped=0.0 2024-06-21 03:15:35,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=313014.1666666667, ans=0.0 2024-06-21 03:15:36,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=313014.1666666667, ans=0.125 2024-06-21 03:15:37,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=313014.1666666667, ans=0.125 2024-06-21 03:15:42,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=313032.5, ans=0.125 2024-06-21 03:15:44,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=313032.5, ans=0.0 2024-06-21 03:16:01,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=313069.1666666667, ans=0.125 2024-06-21 03:16:03,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313087.5, ans=0.1 2024-06-21 03:16:03,968 INFO [train.py:1028] (0/2) Epoch 17, batch 8900, loss[loss=0.2252, simple_loss=0.2839, pruned_loss=0.08323, over 12834.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2868, pruned_loss=0.089, over 2564877.70 frames. ], batch size: 33, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:16:08,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=313087.5, ans=15.0 2024-06-21 03:16:34,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. 
limit=15.0 2024-06-21 03:16:36,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=313142.5, ans=0.125 2024-06-21 03:16:46,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=313160.8333333333, ans=0.125 2024-06-21 03:16:49,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=313160.8333333333, ans=0.125 2024-06-21 03:16:54,673 INFO [train.py:1028] (0/2) Epoch 17, batch 8950, loss[loss=0.2615, simple_loss=0.3087, pruned_loss=0.1071, over 12429.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2877, pruned_loss=0.08912, over 2564980.19 frames. ], batch size: 202, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:16:54,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=313179.1666666667, ans=0.125 2024-06-21 03:16:57,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=313179.1666666667, ans=0.125 2024-06-21 03:17:08,193 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.099e+02 2.235e+02 2.416e+02 3.104e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 03:17:09,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=313197.5, ans=0.125 2024-06-21 03:17:11,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.76 vs. limit=6.0 2024-06-21 03:17:12,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313197.5, ans=0.125 2024-06-21 03:17:46,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.44 vs. limit=15.0 2024-06-21 03:17:50,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=313252.5, ans=0.125 2024-06-21 03:17:53,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=313252.5, ans=0.025 2024-06-21 03:17:56,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313252.5, ans=0.0 2024-06-21 03:17:56,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=313252.5, ans=0.025 2024-06-21 03:18:01,782 INFO [train.py:1028] (0/2) Epoch 17, batch 9000, loss[loss=0.212, simple_loss=0.2733, pruned_loss=0.07536, over 13272.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2878, pruned_loss=0.08895, over 2569734.51 frames. ], batch size: 46, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:18:01,786 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 03:18:14,043 INFO [train.py:1060] (0/2) Epoch 17, validation: loss=0.1873, simple_loss=0.2522, pruned_loss=0.06125, over 351949.00 frames. 
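The validation pass at batch 9000 above runs over the same 351949.00 frames as every validation pass in this run, so its numbers are directly comparable across epochs. Note also how the three reported figures relate: with simple_loss_scale=0.5 in the training config, the logged loss is 0.5 * simple_loss + pruned_loss, and the triples in this log satisfy that to rounding error. A quick check:

```python
# The logged loss is the pruned-transducer combination with
# simple_loss_scale=0.5 from the config: loss = 0.5*simple + pruned.
# (simple_loss, pruned_loss, logged loss) triples from nearby log lines:
checks = [
    (0.2522, 0.06125, 0.1873),  # epoch 17 validation at batch 9000
    (0.2878, 0.08895, 0.2328),  # tot_loss at batch 9000
    (0.2875, 0.08878, 0.2325),  # tot_loss at batch 9150
]
for simple, pruned, logged in checks:
    combined = 0.5 * simple + pruned
    assert abs(combined - logged) < 5e-4, (combined, logged)
    print(f"0.5*{simple} + {pruned} = {combined:.4f} ~ logged {logged}")
```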
2024-06-21 03:18:14,045 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 03:18:15,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=313270.8333333333, ans=0.025 2024-06-21 03:18:30,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.35 vs. limit=22.5 2024-06-21 03:18:40,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=313307.5, ans=0.125 2024-06-21 03:18:45,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=313325.8333333333, ans=10.0 2024-06-21 03:18:56,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=313344.1666666667, ans=0.2 2024-06-21 03:18:58,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=313344.1666666667, ans=0.1 2024-06-21 03:18:59,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=313344.1666666667, ans=0.0 2024-06-21 03:19:07,606 INFO [train.py:1028] (0/2) Epoch 17, batch 9050, loss[loss=0.2247, simple_loss=0.2825, pruned_loss=0.08348, over 11301.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2881, pruned_loss=0.08911, over 2567939.02 frames. ], batch size: 17, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:19:20,389 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.075e+02 2.198e+02 2.447e+02 3.562e+02, threshold=4.396e+02, percent-clipped=0.0 2024-06-21 03:19:20,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.58 vs. limit=15.0 2024-06-21 03:19:28,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2024-06-21 03:19:33,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=313417.5, ans=0.125 2024-06-21 03:19:33,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=313417.5, ans=0.95 2024-06-21 03:19:34,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=313417.5, ans=0.1 2024-06-21 03:19:47,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=313435.8333333333, ans=0.1 2024-06-21 03:19:53,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313454.1666666667, ans=0.0 2024-06-21 03:19:54,559 INFO [train.py:1028] (0/2) Epoch 17, batch 9100, loss[loss=0.2393, simple_loss=0.2961, pruned_loss=0.0913, over 13269.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2872, pruned_loss=0.08841, over 2568092.72 frames. 
], batch size: 72, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:20:05,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=313472.5, ans=0.0 2024-06-21 03:20:40,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=313527.5, ans=0.0 2024-06-21 03:20:41,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.15 vs. limit=15.0 2024-06-21 03:20:42,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=313527.5, ans=0.125 2024-06-21 03:20:45,273 INFO [train.py:1028] (0/2) Epoch 17, batch 9150, loss[loss=0.231, simple_loss=0.2871, pruned_loss=0.08747, over 13133.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2875, pruned_loss=0.08878, over 2569041.06 frames. ], batch size: 77, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:20:46,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=313545.8333333333, ans=0.125 2024-06-21 03:20:51,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=313545.8333333333, ans=0.125 2024-06-21 03:20:52,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=313545.8333333333, ans=0.025 2024-06-21 03:20:58,422 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.106e+02 2.233e+02 2.420e+02 3.001e+02, threshold=4.466e+02, percent-clipped=0.0 2024-06-21 03:21:12,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=313582.5, ans=12.0 2024-06-21 03:21:22,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.72 vs. limit=22.5 2024-06-21 03:21:23,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=313619.1666666667, ans=0.025 2024-06-21 03:21:31,259 INFO [train.py:1028] (0/2) Epoch 17, batch 9200, loss[loss=0.2138, simple_loss=0.2761, pruned_loss=0.07578, over 13320.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2871, pruned_loss=0.08817, over 2571415.19 frames. 
], batch size: 37, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:21:31,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=313637.5, ans=0.0 2024-06-21 03:21:32,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=313637.5, ans=0.2 2024-06-21 03:21:36,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=313637.5, ans=0.125 2024-06-21 03:21:45,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=313655.8333333333, ans=0.0 2024-06-21 03:22:02,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=313692.5, ans=0.125 2024-06-21 03:22:09,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=313710.8333333333, ans=0.0 2024-06-21 03:22:16,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=313710.8333333333, ans=0.125 2024-06-21 03:22:18,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=313729.1666666667, ans=0.125 2024-06-21 03:22:19,709 INFO [train.py:1028] (0/2) Epoch 17, batch 9250, loss[loss=0.2002, simple_loss=0.2633, pruned_loss=0.06852, over 13255.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2869, pruned_loss=0.08794, over 2572080.44 frames. ], batch size: 67, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:22:22,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.20 vs. limit=22.5 2024-06-21 03:22:32,884 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.071e+02 2.199e+02 2.336e+02 3.323e+02, threshold=4.399e+02, percent-clipped=0.0 2024-06-21 03:22:43,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=313747.5, ans=0.0 2024-06-21 03:22:44,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2024-06-21 03:22:47,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=313765.8333333333, ans=0.0 2024-06-21 03:22:56,581 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=15.0 2024-06-21 03:22:56,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=313784.1666666667, ans=0.0 2024-06-21 03:23:15,670 INFO [train.py:1028] (0/2) Epoch 17, batch 9300, loss[loss=0.2205, simple_loss=0.2763, pruned_loss=0.08232, over 12935.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2872, pruned_loss=0.08792, over 2569652.85 frames. 
], batch size: 39, lr: 3.39e-03, grad_scale: 32.0 2024-06-21 03:23:15,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=313820.8333333333, ans=0.125 2024-06-21 03:23:24,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313839.1666666667, ans=0.1 2024-06-21 03:23:46,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=313875.8333333333, ans=0.0 2024-06-21 03:23:48,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=313875.8333333333, ans=0.125 2024-06-21 03:23:51,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=313894.1666666667, ans=0.125 2024-06-21 03:23:51,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.70 vs. limit=15.0 2024-06-21 03:23:53,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=313894.1666666667, ans=0.0 2024-06-21 03:23:55,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0 2024-06-21 03:24:00,845 INFO [train.py:1028] (0/2) Epoch 17, batch 9350, loss[loss=0.2384, simple_loss=0.2892, pruned_loss=0.09376, over 12595.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2878, pruned_loss=0.08841, over 2567161.96 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 32.0 2024-06-21 03:24:01,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=313912.5, ans=0.0 2024-06-21 03:24:01,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=313912.5, ans=0.0 2024-06-21 03:24:14,191 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.155e+02 2.381e+02 2.659e+02 4.074e+02, threshold=4.762e+02, percent-clipped=0.0 2024-06-21 03:24:38,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=313967.5, ans=0.025 2024-06-21 03:24:49,131 INFO [train.py:1028] (0/2) Epoch 17, batch 9400, loss[loss=0.2229, simple_loss=0.2909, pruned_loss=0.0774, over 13213.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2878, pruned_loss=0.0883, over 2567715.89 frames. ], batch size: 52, lr: 3.38e-03, grad_scale: 32.0 2024-06-21 03:24:57,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=314022.5, ans=0.125 2024-06-21 03:25:12,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=314040.8333333333, ans=0.1 2024-06-21 03:25:23,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.75 vs. 
limit=15.0 2024-06-21 03:25:26,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=314077.5, ans=0.0 2024-06-21 03:25:31,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=314077.5, ans=0.0 2024-06-21 03:25:34,479 INFO [train.py:1028] (0/2) Epoch 17, batch 9450, loss[loss=0.2492, simple_loss=0.3016, pruned_loss=0.0984, over 12444.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.2887, pruned_loss=0.08916, over 2567591.03 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:25:35,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=314095.8333333333, ans=0.125 2024-06-21 03:25:44,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=314114.1666666667, ans=0.2 2024-06-21 03:25:45,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=314114.1666666667, ans=0.125 2024-06-21 03:25:46,506 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.054e+02 2.206e+02 2.385e+02 3.032e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 03:25:47,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=314114.1666666667, ans=0.07 2024-06-21 03:25:49,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=314114.1666666667, ans=0.125 2024-06-21 03:25:50,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=314114.1666666667, ans=0.125 2024-06-21 03:25:50,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=314114.1666666667, ans=0.125 2024-06-21 03:25:56,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=314132.5, ans=0.0 2024-06-21 03:25:57,379 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2024-06-21 03:26:11,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=314169.1666666667, ans=0.125 2024-06-21 03:26:17,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=314187.5, ans=0.125 2024-06-21 03:26:18,412 INFO [train.py:1028] (0/2) Epoch 17, batch 9500, loss[loss=0.2135, simple_loss=0.2733, pruned_loss=0.07689, over 13287.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.288, pruned_loss=0.08842, over 2576323.24 frames. ], batch size: 43, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:26:23,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=314187.5, ans=0.025 2024-06-21 03:26:26,730 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. 
limit=15.0 2024-06-21 03:26:29,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=314205.8333333333, ans=0.125 2024-06-21 03:26:30,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=314205.8333333333, ans=0.125 2024-06-21 03:26:33,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=314205.8333333333, ans=0.0 2024-06-21 03:26:35,086 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.78 vs. limit=10.0 2024-06-21 03:26:40,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=314224.1666666667, ans=0.1 2024-06-21 03:26:42,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=314224.1666666667, ans=0.2 2024-06-21 03:26:46,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=314242.5, ans=0.0 2024-06-21 03:26:49,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.37 vs. limit=15.0 2024-06-21 03:27:01,537 INFO [train.py:1028] (0/2) Epoch 17, batch 9550, loss[loss=0.2109, simple_loss=0.2655, pruned_loss=0.07815, over 13174.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2873, pruned_loss=0.08804, over 2572512.28 frames. ], batch size: 40, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:27:13,293 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.068e+02 2.209e+02 2.456e+02 2.946e+02, threshold=4.418e+02, percent-clipped=0.0 2024-06-21 03:27:16,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=314297.5, ans=0.0 2024-06-21 03:27:27,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=314315.8333333333, ans=0.125 2024-06-21 03:27:31,269 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.35 vs. limit=15.0 2024-06-21 03:27:44,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.20 vs. limit=10.0 2024-06-21 03:27:51,321 INFO [train.py:1028] (0/2) Epoch 17, batch 9600, loss[loss=0.2477, simple_loss=0.2883, pruned_loss=0.1035, over 10653.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2872, pruned_loss=0.08804, over 2572902.35 frames. ], batch size: 303, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:27:57,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=314370.8333333333, ans=0.0 2024-06-21 03:28:30,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2024-06-21 03:28:38,760 INFO [train.py:1028] (0/2) Epoch 17, batch 9650, loss[loss=0.2129, simple_loss=0.2636, pruned_loss=0.08111, over 13069.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2872, pruned_loss=0.08825, over 2562896.35 frames. 
], batch size: 132, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:28:48,315 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.119e+02 2.324e+02 2.560e+02 3.634e+02, threshold=4.647e+02, percent-clipped=0.0 2024-06-21 03:28:52,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2024-06-21 03:28:52,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=314480.8333333333, ans=0.125 2024-06-21 03:28:54,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314499.1666666667, ans=0.1 2024-06-21 03:29:07,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=314517.5, ans=0.2 2024-06-21 03:29:11,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2024-06-21 03:29:12,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.0 2024-06-21 03:29:12,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=314535.8333333333, ans=0.125 2024-06-21 03:29:19,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0 2024-06-21 03:29:23,286 INFO [train.py:1028] (0/2) Epoch 17, batch 9700, loss[loss=0.2282, simple_loss=0.2771, pruned_loss=0.08967, over 13037.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2876, pruned_loss=0.08875, over 2556594.23 frames. ], batch size: 144, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:29:32,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=314572.5, ans=0.125 2024-06-21 03:29:35,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.98 vs. limit=22.5 2024-06-21 03:29:41,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=314590.8333333333, ans=0.0 2024-06-21 03:29:48,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=314590.8333333333, ans=0.125 2024-06-21 03:29:56,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2024-06-21 03:30:12,756 INFO [train.py:1028] (0/2) Epoch 17, batch 9750, loss[loss=0.2403, simple_loss=0.2883, pruned_loss=0.09615, over 13121.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2859, pruned_loss=0.08813, over 2553189.61 frames. 
], batch size: 132, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:30:14,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=314645.8333333333, ans=0.125 2024-06-21 03:30:18,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=314645.8333333333, ans=0.125 2024-06-21 03:30:24,115 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.066e+02 2.274e+02 2.497e+02 3.042e+02, threshold=4.547e+02, percent-clipped=0.0 2024-06-21 03:30:27,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=314664.1666666667, ans=0.125 2024-06-21 03:30:28,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=314664.1666666667, ans=0.125 2024-06-21 03:30:33,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=314682.5, ans=0.1 2024-06-21 03:30:47,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=314719.1666666667, ans=0.125 2024-06-21 03:30:50,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=314719.1666666667, ans=0.125 2024-06-21 03:30:52,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.29 vs. limit=15.0 2024-06-21 03:30:53,039 INFO [train.py:1028] (0/2) Epoch 17, batch 9800, loss[loss=0.2223, simple_loss=0.2797, pruned_loss=0.08248, over 12944.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2857, pruned_loss=0.08769, over 2545795.48 frames. ], batch size: 39, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:31:06,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. limit=12.0 2024-06-21 03:31:17,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=314774.1666666667, ans=0.0 2024-06-21 03:31:17,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=314774.1666666667, ans=0.125 2024-06-21 03:31:22,869 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.07 vs. limit=15.0 2024-06-21 03:31:23,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314792.5, ans=0.1 2024-06-21 03:31:32,928 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.71 vs. 
limit=15.0 2024-06-21 03:31:33,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=314810.8333333333, ans=0.2 2024-06-21 03:31:34,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314810.8333333333, ans=0.1 2024-06-21 03:31:38,887 INFO [train.py:1028] (0/2) Epoch 17, batch 9850, loss[loss=0.207, simple_loss=0.2651, pruned_loss=0.07445, over 13035.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2851, pruned_loss=0.08723, over 2537716.80 frames. ], batch size: 102, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:31:48,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=314847.5, ans=0.125 2024-06-21 03:31:50,938 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.098e+02 2.218e+02 2.413e+02 3.321e+02, threshold=4.437e+02, percent-clipped=0.0 2024-06-21 03:31:55,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=314847.5, ans=0.125 2024-06-21 03:31:59,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=314865.8333333333, ans=0.0 2024-06-21 03:32:21,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314902.5, ans=0.1 2024-06-21 03:32:26,115 INFO [train.py:1028] (0/2) Epoch 17, batch 9900, loss[loss=0.2307, simple_loss=0.2947, pruned_loss=0.08335, over 12988.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2851, pruned_loss=0.08774, over 2530756.91 frames. ], batch size: 39, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:32:28,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=314920.8333333333, ans=0.025 2024-06-21 03:32:42,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.95 vs. limit=15.0 2024-06-21 03:32:44,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=314957.5, ans=0.0 2024-06-21 03:33:07,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.55 vs. limit=12.0 2024-06-21 03:33:11,951 INFO [train.py:1028] (0/2) Epoch 17, batch 9950, loss[loss=0.2109, simple_loss=0.2709, pruned_loss=0.07541, over 12690.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2828, pruned_loss=0.08708, over 2526696.28 frames. ], batch size: 29, lr: 3.38e-03, grad_scale: 64.0 2024-06-21 03:33:13,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.14 vs. limit=15.0 2024-06-21 03:33:24,123 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.068e+02 2.291e+02 2.525e+02 3.594e+02, threshold=4.582e+02, percent-clipped=0.0 2024-06-21 03:33:32,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=315049.1666666667, ans=0.0 2024-06-21 03:33:58,147 INFO [train.py:1028] (0/2) Epoch 17, batch 10000, loss[loss=0.2392, simple_loss=0.2972, pruned_loss=0.09067, over 12399.00 frames. 
], tot_loss[loss=0.2295, simple_loss=0.2835, pruned_loss=0.08773, over 2487282.97 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 32.0 2024-06-21 03:34:20,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315140.8333333333, ans=0.125 2024-06-21 03:34:25,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=315140.8333333333, ans=0.125 2024-06-21 03:34:28,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.14 vs. limit=12.0 2024-06-21 03:34:40,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=315177.5, ans=0.0 2024-06-21 03:34:40,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=315177.5, ans=0.2 2024-06-21 03:34:42,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315177.5, ans=0.1 2024-06-21 03:34:42,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=12.0 2024-06-21 03:34:44,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=315195.8333333333, ans=0.2 2024-06-21 03:34:44,471 INFO [train.py:1028] (0/2) Epoch 17, batch 10050, loss[loss=0.2201, simple_loss=0.28, pruned_loss=0.08005, over 12655.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2839, pruned_loss=0.08845, over 2445499.67 frames. ], batch size: 22, lr: 3.38e-03, grad_scale: 32.0 2024-06-21 03:34:46,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=315195.8333333333, ans=0.025 2024-06-21 03:34:49,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=315195.8333333333, ans=0.5 2024-06-21 03:34:55,520 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.178e+02 2.340e+02 2.506e+02 4.052e+02, threshold=4.680e+02, percent-clipped=0.0 2024-06-21 03:35:19,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=315269.1666666667, ans=0.0 2024-06-21 03:35:23,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=315269.1666666667, ans=0.2 2024-06-21 03:35:26,265 INFO [train.py:1028] (0/2) Epoch 17, batch 10100, loss[loss=0.2451, simple_loss=0.291, pruned_loss=0.09957, over 11305.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2835, pruned_loss=0.08817, over 2428430.77 frames. ], batch size: 17, lr: 3.38e-03, grad_scale: 32.0 2024-06-21 03:35:46,123 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-17.pt 2024-06-21 03:38:50,495 INFO [train.py:1028] (0/2) Epoch 18, batch 0, loss[loss=0.2081, simple_loss=0.2623, pruned_loss=0.07697, over 12979.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2623, pruned_loss=0.07697, over 12979.00 frames. 
], batch size: 36, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:38:50,496 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 03:38:58,572 INFO [train.py:1060] (0/2) Epoch 18, validation: loss=0.1887, simple_loss=0.2537, pruned_loss=0.0619, over 351949.00 frames. 2024-06-21 03:38:58,572 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 03:39:01,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=315318.6666666667, ans=0.0 2024-06-21 03:39:02,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=315318.6666666667, ans=0.125 2024-06-21 03:39:03,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=315318.6666666667, ans=22.5 2024-06-21 03:39:04,564 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-172000.pt 2024-06-21 03:39:13,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=15.0 2024-06-21 03:39:25,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=315355.3333333333, ans=0.025 2024-06-21 03:39:31,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=315355.3333333333, ans=0.2 2024-06-21 03:39:44,587 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.45 vs. limit=22.5 2024-06-21 03:39:48,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.72 vs. limit=15.0 2024-06-21 03:39:50,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=315392.0, ans=0.0 2024-06-21 03:39:54,017 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 1.973e+02 2.112e+02 2.294e+02 2.900e+02, threshold=4.224e+02, percent-clipped=0.0 2024-06-21 03:39:57,295 INFO [train.py:1028] (0/2) Epoch 18, batch 50, loss[loss=0.1802, simple_loss=0.2422, pruned_loss=0.05905, over 12604.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2647, pruned_loss=0.08096, over 575382.45 frames. ], batch size: 29, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:40:12,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=315428.6666666667, ans=0.0 2024-06-21 03:40:21,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=315447.0, ans=0.125 2024-06-21 03:40:44,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=315483.6666666667, ans=0.0 2024-06-21 03:40:49,969 INFO [train.py:1028] (0/2) Epoch 18, batch 100, loss[loss=0.2134, simple_loss=0.272, pruned_loss=0.07735, over 13275.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.264, pruned_loss=0.08062, over 1018045.88 frames. 
], batch size: 46, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:40:50,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=315502.0, ans=0.5 2024-06-21 03:41:06,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=315538.6666666667, ans=0.025 2024-06-21 03:41:19,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=315538.6666666667, ans=0.125 2024-06-21 03:41:20,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=315538.6666666667, ans=0.0 2024-06-21 03:41:40,145 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.699e+02 2.014e+02 2.173e+02 2.471e+02 3.470e+02, threshold=4.347e+02, percent-clipped=0.0 2024-06-21 03:41:41,029 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2024-06-21 03:41:41,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=315575.3333333333, ans=0.2 2024-06-21 03:41:43,398 INFO [train.py:1028] (0/2) Epoch 18, batch 150, loss[loss=0.1833, simple_loss=0.242, pruned_loss=0.06231, over 12640.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2627, pruned_loss=0.07894, over 1366254.56 frames. ], batch size: 29, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:41:51,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=315612.0, ans=0.125 2024-06-21 03:42:11,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=22.5 2024-06-21 03:42:13,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.77 vs. limit=10.0 2024-06-21 03:42:18,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=315648.6666666667, ans=0.0 2024-06-21 03:42:24,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=315667.0, ans=0.025 2024-06-21 03:42:34,085 INFO [train.py:1028] (0/2) Epoch 18, batch 200, loss[loss=0.2377, simple_loss=0.2833, pruned_loss=0.09605, over 12535.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2636, pruned_loss=0.0794, over 1636663.76 frames. ], batch size: 202, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:42:42,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.47 vs. 
limit=15.0 2024-06-21 03:42:57,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=315722.0, ans=0.125 2024-06-21 03:43:00,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=315740.3333333333, ans=0.0 2024-06-21 03:43:07,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=315740.3333333333, ans=0.125 2024-06-21 03:43:08,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=12.0 2024-06-21 03:43:18,743 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 1.985e+02 2.121e+02 2.296e+02 2.873e+02, threshold=4.243e+02, percent-clipped=0.0 2024-06-21 03:43:22,011 INFO [train.py:1028] (0/2) Epoch 18, batch 250, loss[loss=0.21, simple_loss=0.2525, pruned_loss=0.08378, over 13000.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2639, pruned_loss=0.07915, over 1848432.55 frames. ], batch size: 144, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:43:30,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=315777.0, ans=0.0 2024-06-21 03:43:33,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=315795.3333333333, ans=0.95 2024-06-21 03:43:45,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=315795.3333333333, ans=0.0 2024-06-21 03:44:01,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=315813.6666666667, ans=0.125 2024-06-21 03:44:08,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=315832.0, ans=0.125 2024-06-21 03:44:08,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=315832.0, ans=0.125 2024-06-21 03:44:14,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.61 vs. limit=5.0 2024-06-21 03:44:18,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2024-06-21 03:44:21,767 INFO [train.py:1028] (0/2) Epoch 18, batch 300, loss[loss=0.2065, simple_loss=0.2538, pruned_loss=0.07957, over 13192.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2641, pruned_loss=0.07934, over 2011281.91 frames. ], batch size: 112, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:44:38,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=315887.0, ans=0.125 2024-06-21 03:44:57,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. 
limit=15.0 2024-06-21 03:44:59,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=315923.6666666667, ans=0.09899494936611666 2024-06-21 03:45:00,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=315923.6666666667, ans=0.0 2024-06-21 03:45:04,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.76 vs. limit=10.0 2024-06-21 03:45:08,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=315942.0, ans=0.0 2024-06-21 03:45:11,356 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.010e+02 2.150e+02 2.373e+02 3.036e+02, threshold=4.299e+02, percent-clipped=0.0 2024-06-21 03:45:12,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=315942.0, ans=0.125 2024-06-21 03:45:14,535 INFO [train.py:1028] (0/2) Epoch 18, batch 350, loss[loss=0.1971, simple_loss=0.254, pruned_loss=0.07013, over 12812.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2636, pruned_loss=0.07912, over 2140422.76 frames. ], batch size: 33, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:45:25,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315978.6666666667, ans=0.1 2024-06-21 03:45:25,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.82 vs. limit=15.0 2024-06-21 03:45:43,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=315997.0, ans=10.0 2024-06-21 03:46:04,958 INFO [train.py:1028] (0/2) Epoch 18, batch 400, loss[loss=0.2089, simple_loss=0.2588, pruned_loss=0.07948, over 13218.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.263, pruned_loss=0.07868, over 2240541.42 frames. ], batch size: 63, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:46:11,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=316052.0, ans=0.2 2024-06-21 03:46:37,092 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:46:38,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=316107.0, ans=0.125 2024-06-21 03:46:46,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.978e+02 2.083e+02 2.198e+02 3.635e+02, threshold=4.166e+02, percent-clipped=0.0 2024-06-21 03:46:48,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=316125.3333333333, ans=0.0 2024-06-21 03:46:58,374 INFO [train.py:1028] (0/2) Epoch 18, batch 450, loss[loss=0.2299, simple_loss=0.2872, pruned_loss=0.08631, over 13211.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2632, pruned_loss=0.07851, over 2314461.83 frames. ], batch size: 67, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:47:11,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.94 vs. 
limit=15.0 2024-06-21 03:47:15,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=316162.0, ans=0.1 2024-06-21 03:47:17,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=316162.0, ans=0.125 2024-06-21 03:47:28,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316198.6666666667, ans=0.125 2024-06-21 03:47:54,701 INFO [train.py:1028] (0/2) Epoch 18, batch 500, loss[loss=0.1923, simple_loss=0.2401, pruned_loss=0.07226, over 13106.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2637, pruned_loss=0.07863, over 2377077.86 frames. ], batch size: 121, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:47:54,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=316235.3333333333, ans=10.0 2024-06-21 03:47:56,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=316235.3333333333, ans=0.125 2024-06-21 03:48:11,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=316253.6666666667, ans=0.0 2024-06-21 03:48:15,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2024-06-21 03:48:31,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=316308.6666666667, ans=0.0 2024-06-21 03:48:34,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=316308.6666666667, ans=0.0 2024-06-21 03:48:36,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=316308.6666666667, ans=15.0 2024-06-21 03:48:38,201 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 1.997e+02 2.144e+02 2.377e+02 3.208e+02, threshold=4.288e+02, percent-clipped=0.0 2024-06-21 03:48:41,225 INFO [train.py:1028] (0/2) Epoch 18, batch 550, loss[loss=0.2264, simple_loss=0.2686, pruned_loss=0.09212, over 12955.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2631, pruned_loss=0.07821, over 2421445.93 frames. ], batch size: 158, lr: 3.28e-03, grad_scale: 32.0 2024-06-21 03:48:48,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=316327.0, ans=0.05 2024-06-21 03:48:57,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=316345.3333333333, ans=0.125 2024-06-21 03:49:04,854 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:49:25,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316400.3333333333, ans=0.125 2024-06-21 03:49:30,146 INFO [train.py:1028] (0/2) Epoch 18, batch 600, loss[loss=0.2027, simple_loss=0.2471, pruned_loss=0.07915, over 13060.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2633, pruned_loss=0.0781, over 2458978.79 frames. 
], batch size: 144, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:49:31,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316418.6666666667, ans=0.125 2024-06-21 03:49:35,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.36 vs. limit=22.5 2024-06-21 03:49:43,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2024-06-21 03:49:47,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0 2024-06-21 03:50:08,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=316473.6666666667, ans=0.125 2024-06-21 03:50:19,527 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.935e+02 2.075e+02 2.295e+02 3.429e+02, threshold=4.150e+02, percent-clipped=0.0 2024-06-21 03:50:22,021 INFO [train.py:1028] (0/2) Epoch 18, batch 650, loss[loss=0.2214, simple_loss=0.2689, pruned_loss=0.08696, over 13203.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2636, pruned_loss=0.07809, over 2489396.20 frames. ], batch size: 59, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:50:25,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=316510.3333333333, ans=0.2 2024-06-21 03:50:26,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0 2024-06-21 03:50:51,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316547.0, ans=0.1 2024-06-21 03:50:57,958 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 03:51:03,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=316565.3333333333, ans=0.0 2024-06-21 03:51:15,204 INFO [train.py:1028] (0/2) Epoch 18, batch 700, loss[loss=0.2068, simple_loss=0.2611, pruned_loss=0.07629, over 13265.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2626, pruned_loss=0.0778, over 2512087.36 frames. ], batch size: 46, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:51:18,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=316602.0, ans=0.1 2024-06-21 03:51:22,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=316602.0, ans=0.0 2024-06-21 03:51:32,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=316620.3333333333, ans=0.025 2024-06-21 03:51:38,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=316638.6666666667, ans=0.125 2024-06-21 03:51:52,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=316657.0, ans=0.025 2024-06-21 03:51:58,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.83 vs. 
limit=15.0 2024-06-21 03:51:58,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.40 vs. limit=22.5 2024-06-21 03:52:00,683 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.943e+02 2.100e+02 2.255e+02 2.773e+02, threshold=4.199e+02, percent-clipped=0.0 2024-06-21 03:52:01,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2024-06-21 03:52:02,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=316693.6666666667, ans=0.0 2024-06-21 03:52:03,305 INFO [train.py:1028] (0/2) Epoch 18, batch 750, loss[loss=0.1966, simple_loss=0.2527, pruned_loss=0.0702, over 13296.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2628, pruned_loss=0.07756, over 2527560.84 frames. ], batch size: 63, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:52:03,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=316693.6666666667, ans=0.125 2024-06-21 03:52:11,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=316693.6666666667, ans=0.125 2024-06-21 03:52:13,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=316712.0, ans=0.125 2024-06-21 03:52:13,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.61 vs. limit=6.0 2024-06-21 03:52:21,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=316730.3333333333, ans=0.1 2024-06-21 03:52:29,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=316748.6666666667, ans=0.125 2024-06-21 03:52:40,267 INFO [train.py:1028] (0/2) Epoch 18, batch 800, loss[loss=0.2154, simple_loss=0.2746, pruned_loss=0.07813, over 12941.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2628, pruned_loss=0.07765, over 2540132.10 frames. ], batch size: 36, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:52:41,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=316785.3333333333, ans=0.125 2024-06-21 03:52:45,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2024-06-21 03:53:29,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.85 vs. limit=22.5 2024-06-21 03:53:30,017 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.965e+02 2.084e+02 2.227e+02 2.823e+02, threshold=4.168e+02, percent-clipped=0.0 2024-06-21 03:53:33,421 INFO [train.py:1028] (0/2) Epoch 18, batch 850, loss[loss=0.2214, simple_loss=0.2721, pruned_loss=0.08542, over 13182.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2628, pruned_loss=0.07782, over 2550842.35 frames. 
], batch size: 95, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:53:48,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=316895.3333333333, ans=0.015 2024-06-21 03:54:08,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. limit=6.0 2024-06-21 03:54:08,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=316932.0, ans=0.125 2024-06-21 03:54:12,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=316932.0, ans=0.1 2024-06-21 03:54:14,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316932.0, ans=0.1 2024-06-21 03:54:28,710 INFO [train.py:1028] (0/2) Epoch 18, batch 900, loss[loss=0.2075, simple_loss=0.266, pruned_loss=0.07454, over 12940.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2626, pruned_loss=0.07797, over 2555488.98 frames. ], batch size: 36, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:54:33,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=316968.6666666667, ans=0.1 2024-06-21 03:54:38,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.37 vs. limit=22.5 2024-06-21 03:54:55,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=317005.3333333333, ans=0.125 2024-06-21 03:55:03,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=317023.6666666667, ans=0.0 2024-06-21 03:55:09,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.86 vs. limit=15.0 2024-06-21 03:55:11,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=317042.0, ans=0.125 2024-06-21 03:55:15,354 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.009e+02 2.079e+02 2.246e+02 2.911e+02, threshold=4.159e+02, percent-clipped=0.0 2024-06-21 03:55:18,316 INFO [train.py:1028] (0/2) Epoch 18, batch 950, loss[loss=0.216, simple_loss=0.2715, pruned_loss=0.08018, over 12982.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2627, pruned_loss=0.07801, over 2559325.09 frames. ], batch size: 39, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:55:19,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=317060.3333333333, ans=0.0 2024-06-21 03:55:20,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.56 vs. limit=15.0 2024-06-21 03:55:26,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2024-06-21 03:55:44,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.88 vs. 
limit=22.5 2024-06-21 03:55:46,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=317097.0, ans=0.025 2024-06-21 03:55:53,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=317115.3333333333, ans=0.95 2024-06-21 03:56:06,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=317133.6666666667, ans=0.2 2024-06-21 03:56:13,288 INFO [train.py:1028] (0/2) Epoch 18, batch 1000, loss[loss=0.1967, simple_loss=0.2529, pruned_loss=0.07029, over 13265.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2628, pruned_loss=0.07816, over 2561311.31 frames. ], batch size: 49, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:56:25,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=317170.3333333333, ans=0.2 2024-06-21 03:56:30,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=317170.3333333333, ans=0.125 2024-06-21 03:56:32,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=317170.3333333333, ans=0.125 2024-06-21 03:56:33,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=317170.3333333333, ans=0.125 2024-06-21 03:57:01,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=317225.3333333333, ans=0.0 2024-06-21 03:57:06,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=317225.3333333333, ans=0.95 2024-06-21 03:57:08,946 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.027e+02 2.140e+02 2.367e+02 3.053e+02, threshold=4.280e+02, percent-clipped=0.0 2024-06-21 03:57:12,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=317243.6666666667, ans=0.025 2024-06-21 03:57:12,727 INFO [train.py:1028] (0/2) Epoch 18, batch 1050, loss[loss=0.2005, simple_loss=0.2473, pruned_loss=0.07691, over 13153.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2638, pruned_loss=0.07849, over 2564059.28 frames. ], batch size: 77, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:57:12,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=317243.6666666667, ans=10.0 2024-06-21 03:57:37,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=317280.3333333333, ans=0.0 2024-06-21 03:57:40,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=317280.3333333333, ans=0.0 2024-06-21 03:57:56,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=317317.0, ans=0.125 2024-06-21 03:57:59,577 INFO [train.py:1028] (0/2) Epoch 18, batch 1100, loss[loss=0.2071, simple_loss=0.261, pruned_loss=0.0766, over 13266.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2641, pruned_loss=0.07841, over 2569280.28 frames. 
], batch size: 52, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:58:02,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=317335.3333333333, ans=0.125 2024-06-21 03:58:10,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=317353.6666666667, ans=0.0 2024-06-21 03:58:24,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=317372.0, ans=0.125 2024-06-21 03:58:30,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=317390.3333333333, ans=0.0 2024-06-21 03:58:44,203 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.958e+02 2.069e+02 2.214e+02 2.943e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-21 03:58:47,424 INFO [train.py:1028] (0/2) Epoch 18, batch 1150, loss[loss=0.1962, simple_loss=0.2533, pruned_loss=0.06955, over 13219.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2644, pruned_loss=0.07882, over 2571168.27 frames. ], batch size: 52, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 03:59:06,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=317445.3333333333, ans=0.0 2024-06-21 03:59:33,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.58 vs. limit=15.0 2024-06-21 03:59:41,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=317500.3333333333, ans=0.0 2024-06-21 03:59:55,659 INFO [train.py:1028] (0/2) Epoch 18, batch 1200, loss[loss=0.1884, simple_loss=0.2478, pruned_loss=0.06457, over 13179.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2639, pruned_loss=0.07874, over 2572853.27 frames. ], batch size: 77, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:00:00,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=317518.6666666667, ans=0.2 2024-06-21 04:00:01,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=317518.6666666667, ans=0.0 2024-06-21 04:00:27,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317573.6666666667, ans=0.1 2024-06-21 04:00:30,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=317573.6666666667, ans=0.0 2024-06-21 04:00:30,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=317573.6666666667, ans=0.125 2024-06-21 04:00:31,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=317573.6666666667, ans=0.025 2024-06-21 04:00:41,784 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 1.994e+02 2.072e+02 2.262e+02 2.890e+02, threshold=4.144e+02, percent-clipped=0.0 2024-06-21 04:00:44,564 INFO [train.py:1028] (0/2) Epoch 18, batch 1250, loss[loss=0.2335, simple_loss=0.2748, pruned_loss=0.09614, over 13206.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2641, pruned_loss=0.07884, over 2582867.72 frames. 
], batch size: 112, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:01:16,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=317665.3333333333, ans=0.0 2024-06-21 04:01:24,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=317683.6666666667, ans=0.125 2024-06-21 04:01:34,000 INFO [train.py:1028] (0/2) Epoch 18, batch 1300, loss[loss=0.2016, simple_loss=0.2565, pruned_loss=0.07338, over 12726.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2644, pruned_loss=0.07884, over 2583740.34 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:01:48,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=317720.3333333333, ans=0.125 2024-06-21 04:01:55,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.19 vs. limit=22.5 2024-06-21 04:01:57,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.96 vs. limit=15.0 2024-06-21 04:01:59,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2024-06-21 04:02:08,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=317757.0, ans=0.1 2024-06-21 04:02:15,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.17 vs. limit=15.0 2024-06-21 04:02:15,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=317775.3333333333, ans=0.0 2024-06-21 04:02:17,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.974e+02 2.072e+02 2.201e+02 3.078e+02, threshold=4.143e+02, percent-clipped=0.0 2024-06-21 04:02:20,470 INFO [train.py:1028] (0/2) Epoch 18, batch 1350, loss[loss=0.2175, simple_loss=0.2769, pruned_loss=0.07903, over 13194.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2641, pruned_loss=0.07882, over 2586458.84 frames. ], batch size: 59, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:03:19,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2024-06-21 04:03:23,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=317867.0, ans=0.0 2024-06-21 04:03:24,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.83 vs. limit=10.0 2024-06-21 04:03:27,253 INFO [train.py:1028] (0/2) Epoch 18, batch 1400, loss[loss=0.2408, simple_loss=0.2899, pruned_loss=0.09581, over 13053.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2642, pruned_loss=0.07876, over 2588798.89 frames. 
], batch size: 26, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:03:33,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=317885.3333333333, ans=0.2 2024-06-21 04:03:34,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=317885.3333333333, ans=0.0 2024-06-21 04:04:02,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=317958.6666666667, ans=0.125 2024-06-21 04:04:07,883 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.979e+02 2.093e+02 2.234e+02 2.831e+02, threshold=4.186e+02, percent-clipped=0.0 2024-06-21 04:04:10,992 INFO [train.py:1028] (0/2) Epoch 18, batch 1450, loss[loss=0.2127, simple_loss=0.2599, pruned_loss=0.08277, over 13126.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2637, pruned_loss=0.07874, over 2588313.96 frames. ], batch size: 121, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:04:24,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317995.3333333333, ans=0.1 2024-06-21 04:04:35,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318013.6666666667, ans=0.1 2024-06-21 04:05:01,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=318068.6666666667, ans=0.125 2024-06-21 04:05:02,501 INFO [train.py:1028] (0/2) Epoch 18, batch 1500, loss[loss=0.2022, simple_loss=0.2533, pruned_loss=0.07558, over 13162.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2635, pruned_loss=0.07885, over 2590467.18 frames. ], batch size: 83, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:05:04,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=318068.6666666667, ans=0.125 2024-06-21 04:05:04,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=318068.6666666667, ans=0.0 2024-06-21 04:05:09,279 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:05:40,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=318123.6666666667, ans=0.025 2024-06-21 04:05:40,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=318123.6666666667, ans=0.125 2024-06-21 04:05:55,276 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 1.979e+02 2.158e+02 2.473e+02 3.670e+02, threshold=4.315e+02, percent-clipped=0.0 2024-06-21 04:05:57,655 INFO [train.py:1028] (0/2) Epoch 18, batch 1550, loss[loss=0.1911, simple_loss=0.2359, pruned_loss=0.07319, over 13051.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2638, pruned_loss=0.07903, over 2584810.59 frames. 
], batch size: 102, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:06:01,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=318160.3333333333, ans=0.125 2024-06-21 04:06:09,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=318160.3333333333, ans=0.125 2024-06-21 04:06:11,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=318160.3333333333, ans=10.0 2024-06-21 04:06:12,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=318178.6666666667, ans=0.125 2024-06-21 04:06:16,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2024-06-21 04:06:26,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=318197.0, ans=0.125 2024-06-21 04:06:37,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2024-06-21 04:06:41,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=15.0 2024-06-21 04:06:50,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2024-06-21 04:06:53,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318233.6666666667, ans=0.125 2024-06-21 04:06:55,347 INFO [train.py:1028] (0/2) Epoch 18, batch 1600, loss[loss=0.2028, simple_loss=0.2557, pruned_loss=0.07495, over 13168.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2642, pruned_loss=0.0792, over 2580618.24 frames. ], batch size: 77, lr: 3.27e-03, grad_scale: 32.0 2024-06-21 04:06:57,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=318252.0, ans=0.1 2024-06-21 04:07:09,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=318270.3333333333, ans=15.0 2024-06-21 04:07:13,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=318288.6666666667, ans=0.1 2024-06-21 04:07:31,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=318307.0, ans=0.125 2024-06-21 04:07:41,432 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 1.961e+02 2.083e+02 2.311e+02 3.038e+02, threshold=4.165e+02, percent-clipped=0.0 2024-06-21 04:07:44,021 INFO [train.py:1028] (0/2) Epoch 18, batch 1650, loss[loss=0.2166, simple_loss=0.2718, pruned_loss=0.0807, over 13150.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2642, pruned_loss=0.07938, over 2576006.02 frames. 
], batch size: 95, lr: 3.26e-03, grad_scale: 32.0 2024-06-21 04:07:59,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=318362.0, ans=0.125 2024-06-21 04:08:00,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=318362.0, ans=0.125 2024-06-21 04:08:07,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2024-06-21 04:08:10,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=12.0 2024-06-21 04:08:29,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=318398.6666666667, ans=0.125 2024-06-21 04:08:36,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=318417.0, ans=0.04949747468305833 2024-06-21 04:08:39,321 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.47 vs. limit=15.0 2024-06-21 04:08:42,613 INFO [train.py:1028] (0/2) Epoch 18, batch 1700, loss[loss=0.2212, simple_loss=0.2812, pruned_loss=0.08059, over 12480.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2641, pruned_loss=0.07919, over 2581006.15 frames. ], batch size: 25, lr: 3.26e-03, grad_scale: 32.0 2024-06-21 04:08:52,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=318453.6666666667, ans=0.125 2024-06-21 04:09:00,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=318472.0, ans=0.04949747468305833 2024-06-21 04:09:01,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=318472.0, ans=0.2 2024-06-21 04:09:19,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=318490.3333333333, ans=0.0 2024-06-21 04:09:19,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=318490.3333333333, ans=0.05 2024-06-21 04:09:23,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=318490.3333333333, ans=0.125 2024-06-21 04:09:31,516 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.963e+02 2.125e+02 2.422e+02 3.069e+02, threshold=4.250e+02, percent-clipped=0.0 2024-06-21 04:09:35,023 INFO [train.py:1028] (0/2) Epoch 18, batch 1750, loss[loss=0.2144, simple_loss=0.2765, pruned_loss=0.07615, over 12533.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2648, pruned_loss=0.07922, over 2582398.68 frames. 
], batch size: 22, lr: 3.26e-03, grad_scale: 32.0 2024-06-21 04:09:47,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=318545.3333333333, ans=0.125 2024-06-21 04:09:49,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=318545.3333333333, ans=0.025 2024-06-21 04:09:58,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=318563.6666666667, ans=0.07 2024-06-21 04:10:24,394 INFO [train.py:1028] (0/2) Epoch 18, batch 1800, loss[loss=0.2121, simple_loss=0.2668, pruned_loss=0.07872, over 13266.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2653, pruned_loss=0.07949, over 2582839.32 frames. ], batch size: 67, lr: 3.26e-03, grad_scale: 32.0 2024-06-21 04:10:25,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318618.6666666667, ans=0.125 2024-06-21 04:10:25,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=318618.6666666667, ans=0.125 2024-06-21 04:10:30,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=318618.6666666667, ans=0.5 2024-06-21 04:10:34,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=318637.0, ans=0.035 2024-06-21 04:10:37,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=318637.0, ans=0.2 2024-06-21 04:10:42,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=318637.0, ans=0.2 2024-06-21 04:10:50,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=318655.3333333333, ans=0.125 2024-06-21 04:10:57,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=318673.6666666667, ans=0.0 2024-06-21 04:11:13,274 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.986e+02 2.122e+02 2.342e+02 2.807e+02, threshold=4.244e+02, percent-clipped=0.0 2024-06-21 04:11:13,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.37 vs. limit=15.0 2024-06-21 04:11:16,264 INFO [train.py:1028] (0/2) Epoch 18, batch 1850, loss[loss=0.2185, simple_loss=0.269, pruned_loss=0.08398, over 13231.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2656, pruned_loss=0.07946, over 2583624.56 frames. 
], batch size: 83, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:11:26,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=318728.6666666667, ans=0.125 2024-06-21 04:11:39,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=318747.0, ans=0.025 2024-06-21 04:11:49,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=318765.3333333333, ans=0.95 2024-06-21 04:11:58,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=318783.6666666667, ans=0.0 2024-06-21 04:12:01,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=318783.6666666667, ans=0.035 2024-06-21 04:12:03,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318783.6666666667, ans=0.1 2024-06-21 04:12:08,098 INFO [train.py:1028] (0/2) Epoch 18, batch 1900, loss[loss=0.2139, simple_loss=0.2677, pruned_loss=0.08005, over 13164.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.265, pruned_loss=0.07939, over 2585769.95 frames. ], batch size: 95, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:12:24,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.09 vs. limit=10.0 2024-06-21 04:12:27,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=318820.3333333333, ans=0.07 2024-06-21 04:12:38,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.79 vs. limit=22.5 2024-06-21 04:12:50,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=318857.0, ans=0.125 2024-06-21 04:12:56,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=318875.3333333333, ans=0.2 2024-06-21 04:12:56,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0 2024-06-21 04:13:02,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.945e+02 2.054e+02 2.248e+02 2.808e+02, threshold=4.108e+02, percent-clipped=0.0 2024-06-21 04:13:06,557 INFO [train.py:1028] (0/2) Epoch 18, batch 1950, loss[loss=0.1902, simple_loss=0.2521, pruned_loss=0.06416, over 13273.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2644, pruned_loss=0.0795, over 2591229.49 frames. ], batch size: 52, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:13:12,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=318893.6666666667, ans=0.125 2024-06-21 04:13:12,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=318893.6666666667, ans=0.125 2024-06-21 04:13:21,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.31 vs. 
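limit=10.0

The scaling.py Whitening lines throughout this log compare a per-module whitening metric against a limit (here 9.31 vs. 10.0); when the metric exceeds its limit, a penalty nudges that module's feature covariance back toward a scaled identity. Below is a minimal sketch of one plausible metric of this kind: the ratio of the mean squared eigenvalue of the feature covariance to its squared mean eigenvalue. The formula, names, and shapes are illustrative assumptions, not necessarily the exact computation in scaling.py.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: features of shape (num_frames, num_channels). Returns ~1.0 when the
    # per-group covariance is proportional to the identity and grows with the
    # eigenvalue spread. Assumed formula, for illustration only.
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    cov = x.transpose(1, 2) @ x / n           # (num_groups, d, d) covariance
    eigs = torch.linalg.eigvalsh(cov)         # eigenvalue spectrum per group
    metric = (eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)
    return metric.item()

x = torch.randn(1000, 384)                    # near-white features
print(f"metric={whitening_metric(x):.2f} vs. limit=10.0")

Under a measure like this a perfectly white signal scores 1.0, so the limits seen in the log (6.0, 10.0, 15.0, 22.5) would bound how anisotropic each module's activations may become before the penalty engages.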
2024-06-21 04:13:25,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=318930.3333333333, ans=0.125 2024-06-21 04:13:34,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=318948.6666666667, ans=0.0 2024-06-21 04:13:45,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=318985.3333333333, ans=0.125 2024-06-21 04:13:46,138 INFO [train.py:1028] (0/2) Epoch 18, batch 2000, loss[loss=0.2283, simple_loss=0.2912, pruned_loss=0.08267, over 12598.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2635, pruned_loss=0.07904, over 2587441.34 frames. ], batch size: 22, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:13:46,318 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:14:03,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=319022.0, ans=0.1 2024-06-21 04:14:05,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=319022.0, ans=0.0 2024-06-21 04:14:16,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319040.3333333333, ans=0.1 2024-06-21 04:14:25,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.21 vs. limit=15.0 2024-06-21 04:14:32,455 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.083e+02 2.254e+02 2.464e+02 3.221e+02, threshold=4.509e+02, percent-clipped=0.0 2024-06-21 04:14:35,481 INFO [train.py:1028] (0/2) Epoch 18, batch 2050, loss[loss=0.1951, simple_loss=0.2429, pruned_loss=0.07365, over 12697.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2636, pruned_loss=0.07933, over 2582188.83 frames. ], batch size: 29, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:15:09,746 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=15.0 2024-06-21 04:15:10,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2024-06-21 04:15:11,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=319132.0, ans=0.125 2024-06-21 04:15:21,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.21 vs. limit=22.5 2024-06-21 04:15:31,744 INFO [train.py:1028] (0/2) Epoch 18, batch 2100, loss[loss=0.2207, simple_loss=0.28, pruned_loss=0.08068, over 13223.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2644, pruned_loss=0.07921, over 2585309.57 frames.
], batch size: 59, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:15:34,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=319168.6666666667, ans=0.025 2024-06-21 04:15:40,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=319168.6666666667, ans=0.2 2024-06-21 04:16:00,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319205.3333333333, ans=0.1 2024-06-21 04:16:11,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=319223.6666666667, ans=0.0 2024-06-21 04:16:20,940 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.007e+02 2.126e+02 2.379e+02 3.400e+02, threshold=4.252e+02, percent-clipped=0.0 2024-06-21 04:16:24,130 INFO [train.py:1028] (0/2) Epoch 18, batch 2150, loss[loss=0.1896, simple_loss=0.2503, pruned_loss=0.06442, over 13309.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2641, pruned_loss=0.07906, over 2587952.84 frames. ], batch size: 52, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:16:24,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=319260.3333333333, ans=0.125 2024-06-21 04:16:31,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=319260.3333333333, ans=0.0 2024-06-21 04:16:44,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.13 vs. limit=12.0 2024-06-21 04:16:56,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=319315.3333333333, ans=0.0 2024-06-21 04:17:11,771 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:17:14,686 INFO [train.py:1028] (0/2) Epoch 18, batch 2200, loss[loss=0.2211, simple_loss=0.2648, pruned_loss=0.08872, over 13195.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2639, pruned_loss=0.07883, over 2589216.11 frames. ], batch size: 83, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:17:37,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319370.3333333333, ans=0.1 2024-06-21 04:17:47,671 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2024-06-21 04:17:53,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.30 vs. limit=15.0 2024-06-21 04:17:56,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=319425.3333333333, ans=0.025 2024-06-21 04:18:01,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.50 vs. 
limit=22.5 2024-06-21 04:18:04,355 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.715e+02 1.992e+02 2.153e+02 2.453e+02 3.553e+02, threshold=4.305e+02, percent-clipped=0.0 2024-06-21 04:18:04,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2024-06-21 04:18:07,244 INFO [train.py:1028] (0/2) Epoch 18, batch 2250, loss[loss=0.2088, simple_loss=0.2722, pruned_loss=0.07273, over 13241.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2634, pruned_loss=0.07849, over 2587506.70 frames. ], batch size: 63, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:18:16,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=319443.6666666667, ans=0.125 2024-06-21 04:18:18,079 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:18:35,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=319480.3333333333, ans=0.125 2024-06-21 04:18:53,566 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.846e-03 2024-06-21 04:18:56,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=319517.0, ans=0.0 2024-06-21 04:18:56,773 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=15.0 2024-06-21 04:18:58,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-21 04:19:00,789 INFO [train.py:1028] (0/2) Epoch 18, batch 2300, loss[loss=0.2066, simple_loss=0.2626, pruned_loss=0.07536, over 12826.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2632, pruned_loss=0.07821, over 2581649.49 frames. ], batch size: 33, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:19:00,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319535.3333333333, ans=0.1 2024-06-21 04:19:09,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=319535.3333333333, ans=0.1 2024-06-21 04:19:24,080 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:19:31,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=319590.3333333333, ans=0.125 2024-06-21 04:19:31,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=319590.3333333333, ans=0.025 2024-06-21 04:19:47,563 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.976e+02 2.143e+02 2.352e+02 3.273e+02, threshold=4.286e+02, percent-clipped=0.0 2024-06-21 04:19:50,317 INFO [train.py:1028] (0/2) Epoch 18, batch 2350, loss[loss=0.2007, simple_loss=0.2513, pruned_loss=0.07499, over 13200.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2638, pruned_loss=0.07835, over 2585045.29 frames. 
], batch size: 67, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:20:04,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=319645.3333333333, ans=0.025 2024-06-21 04:20:05,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=319645.3333333333, ans=0.2 2024-06-21 04:20:06,280 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:20:11,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.13 vs. limit=22.5 2024-06-21 04:20:16,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=319663.6666666667, ans=0.0 2024-06-21 04:20:33,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=319682.0, ans=0.125 2024-06-21 04:20:33,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=319682.0, ans=0.0 2024-06-21 04:20:34,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319682.0, ans=0.1 2024-06-21 04:20:42,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319700.3333333333, ans=0.1 2024-06-21 04:20:49,587 INFO [train.py:1028] (0/2) Epoch 18, batch 2400, loss[loss=0.2239, simple_loss=0.2734, pruned_loss=0.08717, over 13304.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2632, pruned_loss=0.07838, over 2587513.72 frames. ], batch size: 46, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:21:06,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=319737.0, ans=0.025 2024-06-21 04:21:13,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=319755.3333333333, ans=0.125 2024-06-21 04:21:32,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=319792.0, ans=0.0 2024-06-21 04:21:39,803 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 1.990e+02 2.134e+02 2.325e+02 3.270e+02, threshold=4.269e+02, percent-clipped=0.0 2024-06-21 04:21:42,721 INFO [train.py:1028] (0/2) Epoch 18, batch 2450, loss[loss=0.2217, simple_loss=0.2676, pruned_loss=0.08793, over 13287.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2624, pruned_loss=0.07849, over 2584371.42 frames. ], batch size: 63, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:21:52,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=319828.6666666667, ans=0.0 2024-06-21 04:21:56,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319828.6666666667, ans=0.1 2024-06-21 04:21:59,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=319828.6666666667, ans=0.125 2024-06-21 04:22:00,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.89 vs. 
limit=12.0 2024-06-21 04:22:34,105 INFO [train.py:1028] (0/2) Epoch 18, batch 2500, loss[loss=0.2026, simple_loss=0.2485, pruned_loss=0.0783, over 13230.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2613, pruned_loss=0.07789, over 2587191.80 frames. ], batch size: 83, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:22:35,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=319902.0, ans=0.125 2024-06-21 04:22:49,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2024-06-21 04:22:54,475 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:23:06,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2024-06-21 04:23:09,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=319957.0, ans=0.0 2024-06-21 04:23:10,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=319957.0, ans=0.125 2024-06-21 04:23:18,652 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0 2024-06-21 04:23:20,978 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 1.935e+02 2.017e+02 2.201e+02 2.844e+02, threshold=4.033e+02, percent-clipped=0.0 2024-06-21 04:23:23,269 INFO [train.py:1028] (0/2) Epoch 18, batch 2550, loss[loss=0.1919, simple_loss=0.2529, pruned_loss=0.06541, over 12618.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.261, pruned_loss=0.0778, over 2587289.57 frames. ], batch size: 22, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:23:35,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=320012.0, ans=0.125 2024-06-21 04:23:59,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=320048.6666666667, ans=0.2 2024-06-21 04:24:02,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=320067.0, ans=0.0 2024-06-21 04:24:03,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=320067.0, ans=0.0 2024-06-21 04:24:14,869 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.067e+01 2024-06-21 04:24:18,867 INFO [train.py:1028] (0/2) Epoch 18, batch 2600, loss[loss=0.2049, simple_loss=0.2557, pruned_loss=0.07703, over 13291.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2597, pruned_loss=0.07768, over 2587680.56 frames. ], batch size: 52, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:24:20,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=15.0 2024-06-21 04:24:20,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=320085.3333333333, ans=22.5 2024-06-21 04:24:26,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=15.0 2024-06-21 04:24:59,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320158.6666666667, ans=0.1 2024-06-21 04:25:05,969 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.956e+02 2.099e+02 2.313e+02 2.691e+02, threshold=4.199e+02, percent-clipped=0.0 2024-06-21 04:25:08,886 INFO [train.py:1028] (0/2) Epoch 18, batch 2650, loss[loss=0.1897, simple_loss=0.2385, pruned_loss=0.07041, over 13115.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2577, pruned_loss=0.07689, over 2587352.06 frames. ], batch size: 144, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:25:13,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=320177.0, ans=0.125 2024-06-21 04:25:19,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.77 vs. limit=15.0 2024-06-21 04:26:00,637 INFO [train.py:1028] (0/2) Epoch 18, batch 2700, loss[loss=0.2096, simple_loss=0.2544, pruned_loss=0.08236, over 13254.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2567, pruned_loss=0.07682, over 2584796.63 frames. ], batch size: 89, lr: 3.26e-03, grad_scale: 64.0 2024-06-21 04:26:08,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=15.0 2024-06-21 04:26:19,496 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-21 04:26:21,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=320287.0, ans=0.0 2024-06-21 04:26:22,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=320287.0, ans=0.04949747468305833 2024-06-21 04:26:41,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=320323.6666666667, ans=0.125 2024-06-21 04:26:43,592 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. limit=6.0 2024-06-21 04:26:52,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=320342.0, ans=0.1 2024-06-21 04:26:54,932 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.921e+02 2.032e+02 2.140e+02 2.489e+02, threshold=4.065e+02, percent-clipped=0.0 2024-06-21 04:26:57,558 INFO [train.py:1028] (0/2) Epoch 18, batch 2750, loss[loss=0.2095, simple_loss=0.2567, pruned_loss=0.0811, over 13198.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2558, pruned_loss=0.07633, over 2581094.48 frames. 
], batch size: 43, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:27:03,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=320360.3333333333, ans=0.125 2024-06-21 04:27:04,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=320360.3333333333, ans=0.95 2024-06-21 04:27:14,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=320378.6666666667, ans=0.125 2024-06-21 04:27:18,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=15.0 2024-06-21 04:27:34,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=320415.3333333333, ans=0.1 2024-06-21 04:27:44,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=320433.6666666667, ans=0.125 2024-06-21 04:27:51,696 INFO [train.py:1028] (0/2) Epoch 18, batch 2800, loss[loss=0.21, simple_loss=0.2539, pruned_loss=0.08307, over 10918.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2547, pruned_loss=0.07614, over 2579335.76 frames. ], batch size: 304, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:28:04,486 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:28:38,101 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 1.969e+02 2.138e+02 2.322e+02 3.413e+02, threshold=4.276e+02, percent-clipped=0.0 2024-06-21 04:28:41,288 INFO [train.py:1028] (0/2) Epoch 18, batch 2850, loss[loss=0.1928, simple_loss=0.255, pruned_loss=0.06535, over 13303.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2545, pruned_loss=0.07622, over 2577097.12 frames. ], batch size: 49, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:28:49,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=320562.0, ans=0.125 2024-06-21 04:29:03,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=320562.0, ans=0.05 2024-06-21 04:29:14,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=320598.6666666667, ans=0.0 2024-06-21 04:29:28,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=320617.0, ans=0.125 2024-06-21 04:29:33,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=320635.3333333333, ans=0.125 2024-06-21 04:29:34,121 INFO [train.py:1028] (0/2) Epoch 18, batch 2900, loss[loss=0.1871, simple_loss=0.2454, pruned_loss=0.06444, over 13129.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2526, pruned_loss=0.07538, over 2585250.19 frames. 
], batch size: 55, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:29:39,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=320635.3333333333, ans=0.2 2024-06-21 04:29:39,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=320635.3333333333, ans=0.1 2024-06-21 04:29:41,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2024-06-21 04:29:47,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=320653.6666666667, ans=0.125 2024-06-21 04:30:09,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=320672.0, ans=0.125 2024-06-21 04:30:12,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=320690.3333333333, ans=0.125 2024-06-21 04:30:30,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.961e+02 2.108e+02 2.347e+02 3.699e+02, threshold=4.217e+02, percent-clipped=0.0 2024-06-21 04:30:33,256 INFO [train.py:1028] (0/2) Epoch 18, batch 2950, loss[loss=0.202, simple_loss=0.2607, pruned_loss=0.0717, over 13272.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2524, pruned_loss=0.07551, over 2581019.93 frames. ], batch size: 43, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:30:38,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=320727.0, ans=0.0 2024-06-21 04:30:43,999 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=15.0 2024-06-21 04:30:44,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320745.3333333333, ans=0.1 2024-06-21 04:31:25,194 INFO [train.py:1028] (0/2) Epoch 18, batch 3000, loss[loss=0.1926, simple_loss=0.2453, pruned_loss=0.06994, over 13238.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2516, pruned_loss=0.07522, over 2580120.41 frames. ], batch size: 59, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:31:25,196 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 04:31:36,628 INFO [train.py:1060] (0/2) Epoch 18, validation: loss=0.1863, simple_loss=0.2512, pruned_loss=0.06076, over 351949.00 frames. 2024-06-21 04:31:36,629 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 04:31:37,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=320818.6666666667, ans=0.125 2024-06-21 04:32:05,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320855.3333333333, ans=0.0 2024-06-21 04:32:07,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. 
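limit=15.0

Each train.py:1028 record above pairs a per-batch loss[..., over N frames. ] with a running tot_loss[..., over M frames. ], and the train.py:1060 validation record averages over the full dev set (351949.00 frames), suggesting losses are aggregated per frame rather than per batch. Below is a minimal sketch of such a frame-weighted running average; the decay constant and the class name are assumptions for illustration, and icefall's own tracker may combine batches differently.

class RunningLoss:
    # Frame-weighted running average in the spirit of the tot_loss[...] fields.
    def __init__(self, decay: float = 0.999):  # assumed decay constant
        self.decay = decay
        self.weighted_loss = 0.0   # running sum of loss * num_frames
        self.num_frames = 0.0      # running (decayed) frame count

    def update(self, loss: float, num_frames: float) -> None:
        # Older batches are exponentially down-weighted, which is why the
        # reported running frame count can also drift down between intervals.
        self.weighted_loss = self.decay * self.weighted_loss + loss * num_frames
        self.num_frames = self.decay * self.num_frames + num_frames

    @property
    def value(self) -> float:
        return self.weighted_loss / max(self.num_frames, 1.0)

tracker = RunningLoss()
tracker.update(loss=0.1976, num_frames=13281.0)
print(f"tot_loss[loss={tracker.value:.4g}, over {tracker.num_frames:.2f} frames. ]")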
2024-06-21 04:32:16,598 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:32:29,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.12 vs. limit=22.5 2024-06-21 04:32:30,132 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 1.946e+02 2.076e+02 2.279e+02 3.062e+02, threshold=4.152e+02, percent-clipped=0.0 2024-06-21 04:32:32,061 INFO [train.py:1028] (0/2) Epoch 18, batch 3050, loss[loss=0.1976, simple_loss=0.2534, pruned_loss=0.07089, over 13281.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2512, pruned_loss=0.07515, over 2579252.42 frames. ], batch size: 46, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:32:40,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=320928.6666666667, ans=0.125 2024-06-21 04:32:44,978 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:33:27,808 INFO [train.py:1028] (0/2) Epoch 18, batch 3100, loss[loss=0.1757, simple_loss=0.2285, pruned_loss=0.06148, over 13076.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2502, pruned_loss=0.07457, over 2579795.44 frames. ], batch size: 144, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:33:47,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=321020.3333333333, ans=0.125 2024-06-21 04:33:51,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=321038.6666666667, ans=0.04949747468305833 2024-06-21 04:34:02,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.63 vs. limit=22.5 2024-06-21 04:34:04,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=321057.0, ans=0.125 2024-06-21 04:34:08,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=321075.3333333333, ans=0.125 2024-06-21 04:34:11,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.61 vs. limit=10.0 2024-06-21 04:34:11,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=11.50 vs. limit=15.0 2024-06-21 04:34:15,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.69 vs. limit=15.0 2024-06-21 04:34:16,298 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 1.953e+02 2.151e+02 2.324e+02 3.004e+02, threshold=4.303e+02, percent-clipped=0.0 2024-06-21 04:34:18,898 INFO [train.py:1028] (0/2) Epoch 18, batch 3150, loss[loss=0.1985, simple_loss=0.2365, pruned_loss=0.08025, over 12930.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2495, pruned_loss=0.07428, over 2582252.54 frames. ], batch size: 158, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:34:21,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.55 vs.
limit=15.0 2024-06-21 04:34:52,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=321148.6666666667, ans=0.0 2024-06-21 04:34:57,224 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:35:08,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321167.0, ans=0.1 2024-06-21 04:35:11,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2024-06-21 04:35:12,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=321167.0, ans=0.125 2024-06-21 04:35:14,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=15.0 2024-06-21 04:35:15,583 INFO [train.py:1028] (0/2) Epoch 18, batch 3200, loss[loss=0.1929, simple_loss=0.2443, pruned_loss=0.07078, over 13201.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2481, pruned_loss=0.0735, over 2581498.59 frames. ], batch size: 55, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:35:35,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321203.6666666667, ans=0.1 2024-06-21 04:35:36,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=321222.0, ans=0.0 2024-06-21 04:35:40,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321222.0, ans=0.1 2024-06-21 04:35:44,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=321222.0, ans=0.0 2024-06-21 04:35:45,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=321222.0, ans=0.0 2024-06-21 04:35:56,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=321258.6666666667, ans=0.125 2024-06-21 04:36:04,385 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 1.939e+02 2.093e+02 2.285e+02 2.778e+02, threshold=4.186e+02, percent-clipped=0.0 2024-06-21 04:36:06,918 INFO [train.py:1028] (0/2) Epoch 18, batch 3250, loss[loss=0.1796, simple_loss=0.2288, pruned_loss=0.06516, over 13215.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2476, pruned_loss=0.07343, over 2584906.77 frames. ], batch size: 72, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:36:20,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=321277.0, ans=0.125 2024-06-21 04:36:50,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321332.0, ans=0.1 2024-06-21 04:36:56,587 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. 
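limit=6.0

The recurring optim.py:487 warnings summarize recent gradient norms as five quantiles (min, the three quartiles, max), derive a clipping threshold from them, and report the share of recently clipped batches; in this stretch the threshold tracks roughly twice the reported median (consistent with Clipping_scale=2.0) and percent-clipped holds at 0.0. A rough sketch of tracking such statistics follows; the window size, the exact threshold rule, and all names here are assumptions rather than the ScaledAdam optimizer's actual logic.

import torch

class GradNormStats:
    def __init__(self, window: int = 128, scale: float = 2.0):
        self.window = window    # assumed number of recent norms to keep
        self.scale = scale      # assumed rule: threshold = scale * median
        self.norms: list = []
        self.clipped = 0
        self.seen = 0

    def update(self, model: torch.nn.Module) -> None:
        grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
        norm = torch.norm(torch.stack(grads)).item()   # global L2 grad norm
        self.norms = (self.norms + [norm])[-self.window:]
        self.seen += 1
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * q[2].item()           # q[2] is the median
        if norm > threshold:
            self.clipped += 1
        print("grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q.tolist())
              + f", threshold={threshold:.3e}"
              + f", percent-clipped={100.0 * self.clipped / self.seen:.1f}")

model = torch.nn.Linear(4, 4)
model(torch.randn(8, 4)).sum().backward()
GradNormStats().update(model)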
limit=6.0 2024-06-21 04:37:00,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=321350.3333333333, ans=0.5 2024-06-21 04:37:01,810 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.64 vs. limit=15.0 2024-06-21 04:37:02,950 INFO [train.py:1028] (0/2) Epoch 18, batch 3300, loss[loss=0.208, simple_loss=0.2568, pruned_loss=0.07962, over 12746.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2474, pruned_loss=0.0733, over 2581469.87 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:37:12,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0 2024-06-21 04:37:14,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=321387.0, ans=0.07 2024-06-21 04:37:50,356 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 1.927e+02 2.055e+02 2.184e+02 2.821e+02, threshold=4.110e+02, percent-clipped=0.0 2024-06-21 04:37:50,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=321442.0, ans=0.04949747468305833 2024-06-21 04:37:53,521 INFO [train.py:1028] (0/2) Epoch 18, batch 3350, loss[loss=0.2035, simple_loss=0.2495, pruned_loss=0.07874, over 13016.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2469, pruned_loss=0.07354, over 2576972.85 frames. ], batch size: 158, lr: 3.25e-03, grad_scale: 64.0 2024-06-21 04:38:00,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=321460.3333333333, ans=0.125 2024-06-21 04:38:17,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=321478.6666666667, ans=0.09899494936611666 2024-06-21 04:38:45,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=321533.6666666667, ans=0.0 2024-06-21 04:38:49,744 INFO [train.py:1028] (0/2) Epoch 18, batch 3400, loss[loss=0.2161, simple_loss=0.2717, pruned_loss=0.08025, over 12599.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2463, pruned_loss=0.07333, over 2574457.05 frames. ], batch size: 22, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:38:50,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=321552.0, ans=0.125 2024-06-21 04:38:53,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=321552.0, ans=0.125 2024-06-21 04:39:15,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=321570.3333333333, ans=0.125 2024-06-21 04:39:37,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=321607.0, ans=0.125 2024-06-21 04:39:47,705 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 1.929e+02 2.081e+02 2.268e+02 3.200e+02, threshold=4.162e+02, percent-clipped=0.0 2024-06-21 04:39:49,424 INFO [train.py:1028] (0/2) Epoch 18, batch 3450, loss[loss=0.2044, simple_loss=0.2489, pruned_loss=0.07996, over 12708.00 frames. 
], tot_loss[loss=0.1961, simple_loss=0.2459, pruned_loss=0.0732, over 2575836.25 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:39:56,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=321643.6666666667, ans=0.5 2024-06-21 04:40:02,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.08 vs. limit=22.5 2024-06-21 04:40:08,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.71 vs. limit=15.0 2024-06-21 04:40:16,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=321680.3333333333, ans=0.2 2024-06-21 04:40:24,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=321698.6666666667, ans=0.07 2024-06-21 04:40:25,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=321698.6666666667, ans=0.0 2024-06-21 04:40:28,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=321717.0, ans=0.0 2024-06-21 04:40:38,880 INFO [train.py:1028] (0/2) Epoch 18, batch 3500, loss[loss=0.196, simple_loss=0.2412, pruned_loss=0.07545, over 12887.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2455, pruned_loss=0.07304, over 2574298.91 frames. ], batch size: 33, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:40:39,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=321735.3333333333, ans=0.0 2024-06-21 04:40:45,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321735.3333333333, ans=0.1 2024-06-21 04:40:48,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=321753.6666666667, ans=0.04949747468305833 2024-06-21 04:41:05,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=321772.0, ans=0.125 2024-06-21 04:41:15,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=321772.0, ans=0.125 2024-06-21 04:41:34,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=321808.6666666667, ans=0.125 2024-06-21 04:41:35,462 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.876e+02 1.977e+02 2.125e+02 3.648e+02, threshold=3.955e+02, percent-clipped=0.0 2024-06-21 04:41:38,095 INFO [train.py:1028] (0/2) Epoch 18, batch 3550, loss[loss=0.1919, simple_loss=0.2385, pruned_loss=0.07263, over 13157.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2445, pruned_loss=0.07228, over 2576501.44 frames. ], batch size: 95, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:41:41,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=321827.0, ans=0.0 2024-06-21 04:41:50,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.97 vs. 
limit=15.0 2024-06-21 04:41:55,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=321863.6666666667, ans=0.125 2024-06-21 04:42:34,910 INFO [train.py:1028] (0/2) Epoch 18, batch 3600, loss[loss=0.2041, simple_loss=0.258, pruned_loss=0.07506, over 13015.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2446, pruned_loss=0.07258, over 2579298.44 frames. ], batch size: 48, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:43:16,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=321992.0, ans=0.125 2024-06-21 04:43:19,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.87 vs. limit=15.0 2024-06-21 04:43:20,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=321992.0, ans=0.125 2024-06-21 04:43:20,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.74 vs. limit=12.0 2024-06-21 04:43:23,198 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.914e+02 2.081e+02 2.227e+02 3.119e+02, threshold=4.162e+02, percent-clipped=0.0 2024-06-21 04:43:25,635 INFO [train.py:1028] (0/2) Epoch 18, batch 3650, loss[loss=0.2032, simple_loss=0.2426, pruned_loss=0.08187, over 13141.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2445, pruned_loss=0.07236, over 2576292.11 frames. ], batch size: 103, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:43:29,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=322010.3333333333, ans=10.0 2024-06-21 04:43:43,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=322028.6666666667, ans=0.125 2024-06-21 04:44:02,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.77 vs. limit=22.5 2024-06-21 04:44:06,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=322083.6666666667, ans=0.025 2024-06-21 04:44:15,835 INFO [train.py:1028] (0/2) Epoch 18, batch 3700, loss[loss=0.1912, simple_loss=0.2424, pruned_loss=0.07, over 13283.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2435, pruned_loss=0.07201, over 2581089.26 frames. 
], batch size: 72, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:44:41,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=322120.3333333333, ans=0.125 2024-06-21 04:44:43,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=322138.6666666667, ans=0.125 2024-06-21 04:45:01,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=322157.0, ans=0.2 2024-06-21 04:45:09,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=322175.3333333333, ans=0.125 2024-06-21 04:45:11,391 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 1.941e+02 2.050e+02 2.205e+02 2.895e+02, threshold=4.099e+02, percent-clipped=0.0 2024-06-21 04:45:13,294 INFO [train.py:1028] (0/2) Epoch 18, batch 3750, loss[loss=0.2126, simple_loss=0.2705, pruned_loss=0.07737, over 12667.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.244, pruned_loss=0.07214, over 2584194.24 frames. ], batch size: 22, lr: 3.25e-03, grad_scale: 32.0 2024-06-21 04:45:15,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=322193.6666666667, ans=0.125 2024-06-21 04:45:19,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=322193.6666666667, ans=0.025 2024-06-21 04:45:21,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=322212.0, ans=0.125 2024-06-21 04:45:27,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322212.0, ans=0.1 2024-06-21 04:45:37,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=322230.3333333333, ans=0.125 2024-06-21 04:45:38,577 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:45:51,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2024-06-21 04:45:57,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.04 vs. limit=12.0 2024-06-21 04:46:03,966 INFO [train.py:1028] (0/2) Epoch 18, batch 3800, loss[loss=0.193, simple_loss=0.2425, pruned_loss=0.07177, over 13231.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2441, pruned_loss=0.07218, over 2582455.98 frames. ], batch size: 83, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:46:13,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=322285.3333333333, ans=0.125 2024-06-21 04:46:18,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.22 vs. limit=22.5 2024-06-21 04:46:35,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.36 vs. 
limit=22.5 2024-06-21 04:46:37,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322340.3333333333, ans=0.1 2024-06-21 04:46:39,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=322340.3333333333, ans=0.1 2024-06-21 04:46:53,833 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.687e+02 1.875e+02 1.986e+02 2.144e+02 2.561e+02, threshold=3.973e+02, percent-clipped=0.0 2024-06-21 04:46:55,843 INFO [train.py:1028] (0/2) Epoch 18, batch 3850, loss[loss=0.1957, simple_loss=0.2412, pruned_loss=0.07509, over 12973.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2432, pruned_loss=0.07164, over 2581407.80 frames. ], batch size: 144, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:47:35,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=322450.3333333333, ans=0.07 2024-06-21 04:47:35,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=322450.3333333333, ans=0.125 2024-06-21 04:47:40,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=322450.3333333333, ans=0.0 2024-06-21 04:47:42,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322450.3333333333, ans=0.1 2024-06-21 04:47:47,112 INFO [train.py:1028] (0/2) Epoch 18, batch 3900, loss[loss=0.1889, simple_loss=0.2424, pruned_loss=0.06775, over 13200.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2434, pruned_loss=0.07186, over 2585165.18 frames. ], batch size: 83, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:48:34,030 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.935e+02 2.070e+02 2.369e+02 3.349e+02, threshold=4.140e+02, percent-clipped=0.0 2024-06-21 04:48:35,837 INFO [train.py:1028] (0/2) Epoch 18, batch 3950, loss[loss=0.1877, simple_loss=0.2334, pruned_loss=0.07105, over 13082.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2426, pruned_loss=0.07146, over 2587173.23 frames. ], batch size: 132, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:48:50,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=322578.6666666667, ans=0.0 2024-06-21 04:49:02,179 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:49:06,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=322597.0, ans=0.0 2024-06-21 04:49:25,143 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:49:26,372 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2024-06-21 04:49:30,378 INFO [train.py:1028] (0/2) Epoch 18, batch 4000, loss[loss=0.1862, simple_loss=0.2416, pruned_loss=0.06537, over 12830.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2418, pruned_loss=0.07144, over 2582009.53 frames. 
], batch size: 39, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:49:37,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=322652.0, ans=0.125 2024-06-21 04:49:38,573 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-176000.pt 2024-06-21 04:49:54,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=322670.3333333333, ans=0.04949747468305833 2024-06-21 04:49:58,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=322688.6666666667, ans=0.0 2024-06-21 04:49:58,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=322688.6666666667, ans=0.125 2024-06-21 04:50:03,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=322688.6666666667, ans=0.0 2024-06-21 04:50:16,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=322707.0, ans=0.125 2024-06-21 04:50:23,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.68 vs. limit=15.0 2024-06-21 04:50:25,827 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.910e+02 2.057e+02 2.320e+02 3.640e+02, threshold=4.114e+02, percent-clipped=0.0 2024-06-21 04:50:28,335 INFO [train.py:1028] (0/2) Epoch 18, batch 4050, loss[loss=0.2113, simple_loss=0.2517, pruned_loss=0.0854, over 11031.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2414, pruned_loss=0.07137, over 2579510.42 frames. ], batch size: 303, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:50:33,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-21 04:50:58,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=322780.3333333333, ans=0.5 2024-06-21 04:51:12,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=322798.6666666667, ans=0.125 2024-06-21 04:51:27,872 INFO [train.py:1028] (0/2) Epoch 18, batch 4100, loss[loss=0.176, simple_loss=0.2225, pruned_loss=0.06475, over 13007.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2414, pruned_loss=0.07146, over 2577607.69 frames. ], batch size: 102, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:51:43,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.59 vs. limit=22.5 2024-06-21 04:52:12,965 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.75 vs. limit=22.5 2024-06-21 04:52:14,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=322890.3333333333, ans=0.125 2024-06-21 04:52:18,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=15.0 2024-06-21 04:52:19,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=322908.6666666667, ans=0.025 2024-06-21 04:52:27,005 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 1.888e+02 2.036e+02 2.230e+02 2.989e+02, threshold=4.073e+02, percent-clipped=0.0 2024-06-21 04:52:29,141 INFO [train.py:1028] (0/2) Epoch 18, batch 4150, loss[loss=0.1846, simple_loss=0.2373, pruned_loss=0.06601, over 13121.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2411, pruned_loss=0.07128, over 2577100.63 frames. ], batch size: 55, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:52:33,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=322927.0, ans=0.125 2024-06-21 04:52:41,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2024-06-21 04:52:43,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=322945.3333333333, ans=0.0 2024-06-21 04:52:45,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=322945.3333333333, ans=0.125 2024-06-21 04:52:56,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=322963.6666666667, ans=10.0 2024-06-21 04:53:21,107 INFO [train.py:1028] (0/2) Epoch 18, batch 4200, loss[loss=0.1943, simple_loss=0.2426, pruned_loss=0.07295, over 12998.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.241, pruned_loss=0.07131, over 2579497.12 frames. ], batch size: 102, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:53:21,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=323018.6666666667, ans=0.125 2024-06-21 04:53:34,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=323037.0, ans=0.125 2024-06-21 04:53:35,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=323037.0, ans=0.2 2024-06-21 04:53:37,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=323037.0, ans=0.125 2024-06-21 04:53:49,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=323055.3333333333, ans=0.2 2024-06-21 04:54:13,178 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.660e+02 1.922e+02 2.011e+02 2.222e+02 2.925e+02, threshold=4.023e+02, percent-clipped=0.0 2024-06-21 04:54:14,893 INFO [train.py:1028] (0/2) Epoch 18, batch 4250, loss[loss=0.1804, simple_loss=0.2413, pruned_loss=0.05972, over 13344.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.241, pruned_loss=0.07108, over 2583371.84 frames. 
], batch size: 46, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:54:31,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=323128.6666666667, ans=0.125 2024-06-21 04:54:42,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=323147.0, ans=0.5 2024-06-21 04:55:03,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323183.6666666667, ans=0.1 2024-06-21 04:55:07,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.01 vs. limit=15.0 2024-06-21 04:55:08,605 INFO [train.py:1028] (0/2) Epoch 18, batch 4300, loss[loss=0.1805, simple_loss=0.2373, pruned_loss=0.0618, over 13266.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2408, pruned_loss=0.07122, over 2583134.86 frames. ], batch size: 59, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:55:20,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=323202.0, ans=0.1 2024-06-21 04:55:28,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=323220.3333333333, ans=0.0 2024-06-21 04:55:48,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=323257.0, ans=0.125 2024-06-21 04:55:49,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=323257.0, ans=0.125 2024-06-21 04:55:56,538 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 04:56:00,238 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.879e+02 2.012e+02 2.167e+02 2.997e+02, threshold=4.024e+02, percent-clipped=0.0 2024-06-21 04:56:01,921 INFO [train.py:1028] (0/2) Epoch 18, batch 4350, loss[loss=0.1801, simple_loss=0.2394, pruned_loss=0.06037, over 13241.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2403, pruned_loss=0.07082, over 2587050.31 frames. ], batch size: 59, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:56:03,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.38 vs. limit=15.0 2024-06-21 04:56:06,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=323293.6666666667, ans=0.125 2024-06-21 04:56:15,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=323312.0, ans=0.125 2024-06-21 04:56:28,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=323330.3333333333, ans=0.0 2024-06-21 04:56:53,826 INFO [train.py:1028] (0/2) Epoch 18, batch 4400, loss[loss=0.1904, simple_loss=0.238, pruned_loss=0.07146, over 13203.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2402, pruned_loss=0.07078, over 2586348.56 frames. 
], batch size: 83, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:57:19,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=323403.6666666667, ans=0.1 2024-06-21 04:57:28,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=323422.0, ans=0.0 2024-06-21 04:57:46,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=323458.6666666667, ans=0.0 2024-06-21 04:57:48,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2024-06-21 04:57:52,993 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 1.920e+02 2.048e+02 2.257e+02 2.924e+02, threshold=4.096e+02, percent-clipped=0.0 2024-06-21 04:57:54,934 INFO [train.py:1028] (0/2) Epoch 18, batch 4450, loss[loss=0.1746, simple_loss=0.233, pruned_loss=0.05813, over 12951.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2411, pruned_loss=0.07141, over 2581350.04 frames. ], batch size: 33, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:58:05,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=323495.3333333333, ans=0.125 2024-06-21 04:58:09,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=323495.3333333333, ans=0.0 2024-06-21 04:58:27,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=323513.6666666667, ans=0.125 2024-06-21 04:58:54,568 INFO [train.py:1028] (0/2) Epoch 18, batch 4500, loss[loss=0.1768, simple_loss=0.2259, pruned_loss=0.06379, over 13239.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2407, pruned_loss=0.07127, over 2586151.17 frames. ], batch size: 89, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 04:58:58,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=323568.6666666667, ans=0.125 2024-06-21 04:59:01,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2024-06-21 04:59:10,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=323587.0, ans=0.035 2024-06-21 04:59:11,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2024-06-21 04:59:18,922 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.00 vs. limit=22.5 2024-06-21 04:59:42,741 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 1.882e+02 2.014e+02 2.131e+02 2.807e+02, threshold=4.027e+02, percent-clipped=0.0 2024-06-21 04:59:45,179 INFO [train.py:1028] (0/2) Epoch 18, batch 4550, loss[loss=0.1829, simple_loss=0.2385, pruned_loss=0.06366, over 13262.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2404, pruned_loss=0.07108, over 2590002.07 frames. 
], batch size: 52, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:00:11,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=323697.0, ans=0.125 2024-06-21 05:00:37,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=323733.6666666667, ans=0.125 2024-06-21 05:00:47,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=323752.0, ans=0.0 2024-06-21 05:00:48,254 INFO [train.py:1028] (0/2) Epoch 18, batch 4600, loss[loss=0.1987, simple_loss=0.2415, pruned_loss=0.07793, over 12513.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2406, pruned_loss=0.07114, over 2585594.07 frames. ], batch size: 202, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:01:00,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.38 vs. limit=15.0 2024-06-21 05:01:01,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2024-06-21 05:01:18,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=323807.0, ans=0.0 2024-06-21 05:01:31,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=323825.3333333333, ans=0.125 2024-06-21 05:01:34,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=323825.3333333333, ans=0.0 2024-06-21 05:01:35,451 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.658e+02 1.863e+02 1.968e+02 2.146e+02 3.437e+02, threshold=3.936e+02, percent-clipped=0.0 2024-06-21 05:01:36,980 INFO [train.py:1028] (0/2) Epoch 18, batch 4650, loss[loss=0.1689, simple_loss=0.2136, pruned_loss=0.0621, over 13041.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2395, pruned_loss=0.071, over 2588230.00 frames. ], batch size: 132, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:01:39,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=323843.6666666667, ans=0.125 2024-06-21 05:01:44,722 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.76 vs. limit=10.0 2024-06-21 05:01:48,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=323862.0, ans=0.125 2024-06-21 05:02:26,316 INFO [train.py:1028] (0/2) Epoch 18, batch 4700, loss[loss=0.1712, simple_loss=0.2248, pruned_loss=0.05882, over 12841.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2396, pruned_loss=0.07095, over 2584338.92 frames. ], batch size: 26, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:02:26,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.40 vs. 
limit=15.0 2024-06-21 05:02:36,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=323953.6666666667, ans=0.125 2024-06-21 05:02:45,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323953.6666666667, ans=0.1 2024-06-21 05:02:57,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323990.3333333333, ans=0.1 2024-06-21 05:02:57,808 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.64 vs. limit=22.5 2024-06-21 05:03:02,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=323990.3333333333, ans=0.05 2024-06-21 05:03:02,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.49 vs. limit=22.5 2024-06-21 05:03:08,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=324008.6666666667, ans=0.125 2024-06-21 05:03:16,755 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.837e+02 1.943e+02 2.074e+02 2.626e+02, threshold=3.886e+02, percent-clipped=0.0 2024-06-21 05:03:18,535 INFO [train.py:1028] (0/2) Epoch 18, batch 4750, loss[loss=0.2041, simple_loss=0.2474, pruned_loss=0.08036, over 12529.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.239, pruned_loss=0.071, over 2581772.95 frames. ], batch size: 202, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:03:18,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=324027.0, ans=0.1 2024-06-21 05:03:28,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=15.0 2024-06-21 05:03:33,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=324045.3333333333, ans=0.125 2024-06-21 05:03:37,836 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 05:03:39,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324063.6666666667, ans=0.1 2024-06-21 05:03:40,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=324063.6666666667, ans=0.2 2024-06-21 05:03:45,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=324063.6666666667, ans=15.0 2024-06-21 05:03:49,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.33 vs. limit=22.5 2024-06-21 05:03:52,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=324082.0, ans=0.09899494936611666 2024-06-21 05:03:53,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.92 vs. 
limit=15.0 2024-06-21 05:04:05,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.50 vs. limit=22.5 2024-06-21 05:04:10,144 INFO [train.py:1028] (0/2) Epoch 18, batch 4800, loss[loss=0.1905, simple_loss=0.2439, pruned_loss=0.06854, over 13265.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2392, pruned_loss=0.07093, over 2577966.23 frames. ], batch size: 63, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:04:10,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=324118.6666666667, ans=0.125 2024-06-21 05:04:10,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=324118.6666666667, ans=0.0 2024-06-21 05:04:17,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2024-06-21 05:05:03,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=324192.0, ans=0.125 2024-06-21 05:05:08,980 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.555e+02 1.902e+02 2.110e+02 2.331e+02 3.355e+02, threshold=4.220e+02, percent-clipped=0.0 2024-06-21 05:05:11,021 INFO [train.py:1028] (0/2) Epoch 18, batch 4850, loss[loss=0.1849, simple_loss=0.2295, pruned_loss=0.0702, over 13280.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.239, pruned_loss=0.0708, over 2574155.02 frames. ], batch size: 89, lr: 3.24e-03, grad_scale: 32.0 2024-06-21 05:05:11,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0 2024-06-21 05:05:19,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=324228.6666666667, ans=0.125 2024-06-21 05:05:23,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=324228.6666666667, ans=0.125 2024-06-21 05:05:58,961 INFO [train.py:1028] (0/2) Epoch 18, batch 4900, loss[loss=0.1789, simple_loss=0.2333, pruned_loss=0.06225, over 13214.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2391, pruned_loss=0.07079, over 2574995.07 frames. 
], batch size: 59, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:06:09,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=324320.3333333333, ans=0.1 2024-06-21 05:06:13,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=324320.3333333333, ans=0.125 2024-06-21 05:06:46,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=324357.0, ans=0.0 2024-06-21 05:06:55,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=324375.3333333333, ans=0.2 2024-06-21 05:06:56,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=324375.3333333333, ans=0.125 2024-06-21 05:06:57,537 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 1.885e+02 1.988e+02 2.119e+02 3.268e+02, threshold=3.976e+02, percent-clipped=0.0 2024-06-21 05:06:59,576 INFO [train.py:1028] (0/2) Epoch 18, batch 4950, loss[loss=0.2038, simple_loss=0.2365, pruned_loss=0.08556, over 11022.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2394, pruned_loss=0.0714, over 2569179.74 frames. ], batch size: 304, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:07:02,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=324393.6666666667, ans=0.125 2024-06-21 05:07:08,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=324393.6666666667, ans=0.025 2024-06-21 05:07:11,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.39 vs. limit=15.0 2024-06-21 05:07:17,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=324412.0, ans=0.125 2024-06-21 05:07:44,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=324448.6666666667, ans=0.0 2024-06-21 05:07:58,884 INFO [train.py:1028] (0/2) Epoch 18, batch 5000, loss[loss=0.1834, simple_loss=0.2313, pruned_loss=0.06782, over 13082.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2392, pruned_loss=0.07116, over 2574462.60 frames. ], batch size: 95, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:08:04,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=324485.3333333333, ans=0.0 2024-06-21 05:08:06,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=324485.3333333333, ans=0.125 2024-06-21 05:08:21,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=324522.0, ans=0.125 2024-06-21 05:08:27,491 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.19 vs. 
limit=15.0 2024-06-21 05:08:45,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=324558.6666666667, ans=0.0 2024-06-21 05:08:48,698 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.885e+02 2.009e+02 2.183e+02 3.110e+02, threshold=4.017e+02, percent-clipped=0.0 2024-06-21 05:08:50,584 INFO [train.py:1028] (0/2) Epoch 18, batch 5050, loss[loss=0.1941, simple_loss=0.2479, pruned_loss=0.07021, over 12936.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2397, pruned_loss=0.07111, over 2576140.45 frames. ], batch size: 36, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:08:51,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=324577.0, ans=0.0 2024-06-21 05:09:01,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=324595.3333333333, ans=0.0 2024-06-21 05:09:06,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=324613.6666666667, ans=0.125 2024-06-21 05:09:10,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=324613.6666666667, ans=0.125 2024-06-21 05:09:12,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=324613.6666666667, ans=0.125 2024-06-21 05:09:15,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=324632.0, ans=0.09899494936611666 2024-06-21 05:09:16,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=324632.0, ans=0.95 2024-06-21 05:09:39,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2024-06-21 05:09:42,550 INFO [train.py:1028] (0/2) Epoch 18, batch 5100, loss[loss=0.1898, simple_loss=0.2479, pruned_loss=0.06588, over 12918.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2395, pruned_loss=0.07124, over 2572506.93 frames. ], batch size: 39, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:09:45,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=15.29 vs. limit=15.0 2024-06-21 05:09:49,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=324668.6666666667, ans=0.025 2024-06-21 05:10:04,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324705.3333333333, ans=0.1 2024-06-21 05:10:40,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.919e+02 2.015e+02 2.153e+02 3.751e+02, threshold=4.030e+02, percent-clipped=0.0 2024-06-21 05:10:41,778 INFO [train.py:1028] (0/2) Epoch 18, batch 5150, loss[loss=0.1902, simple_loss=0.2282, pruned_loss=0.07613, over 13112.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2392, pruned_loss=0.07148, over 2574378.29 frames. 
], batch size: 132, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:10:54,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324778.6666666667, ans=0.1 2024-06-21 05:10:55,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=324778.6666666667, ans=0.0 2024-06-21 05:11:01,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=324797.0, ans=0.025 2024-06-21 05:11:02,131 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.266e+01 2024-06-21 05:11:03,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=324797.0, ans=0.1 2024-06-21 05:11:12,999 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.04 vs. limit=12.0 2024-06-21 05:11:26,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=324833.6666666667, ans=0.2 2024-06-21 05:11:28,338 INFO [train.py:1028] (0/2) Epoch 18, batch 5200, loss[loss=0.1959, simple_loss=0.2407, pruned_loss=0.07554, over 13202.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.238, pruned_loss=0.07068, over 2576850.46 frames. ], batch size: 95, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:11:38,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=324870.3333333333, ans=0.125 2024-06-21 05:11:44,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=324888.6666666667, ans=0.035 2024-06-21 05:11:44,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=324888.6666666667, ans=0.0 2024-06-21 05:11:46,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=324888.6666666667, ans=0.2 2024-06-21 05:11:54,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=324907.0, ans=0.125 2024-06-21 05:11:54,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=324907.0, ans=0.025 2024-06-21 05:12:08,699 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.686e+02 1.891e+02 2.033e+02 2.163e+02 3.297e+02, threshold=4.067e+02, percent-clipped=0.0 2024-06-21 05:12:18,030 INFO [train.py:1028] (0/2) Epoch 18, batch 5250, loss[loss=0.1828, simple_loss=0.2306, pruned_loss=0.06744, over 13249.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2377, pruned_loss=0.07063, over 2573268.97 frames. 
], batch size: 52, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:12:26,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=324943.6666666667, ans=0.125 2024-06-21 05:12:48,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=324998.6666666667, ans=0.2 2024-06-21 05:12:48,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=324998.6666666667, ans=0.025 2024-06-21 05:12:59,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=325017.0, ans=0.125 2024-06-21 05:13:09,682 INFO [train.py:1028] (0/2) Epoch 18, batch 5300, loss[loss=0.1805, simple_loss=0.2261, pruned_loss=0.06745, over 13069.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2376, pruned_loss=0.07042, over 2569993.09 frames. ], batch size: 144, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:13:12,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=325035.3333333333, ans=0.0 2024-06-21 05:13:21,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=325053.6666666667, ans=0.1 2024-06-21 05:13:23,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=325053.6666666667, ans=0.2 2024-06-21 05:13:45,565 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. limit=6.0 2024-06-21 05:13:57,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325108.6666666667, ans=0.125 2024-06-21 05:14:00,713 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.920e+02 2.030e+02 2.183e+02 3.054e+02, threshold=4.060e+02, percent-clipped=0.0 2024-06-21 05:14:02,542 INFO [train.py:1028] (0/2) Epoch 18, batch 5350, loss[loss=0.1878, simple_loss=0.2454, pruned_loss=0.06504, over 11101.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2378, pruned_loss=0.07045, over 2574871.03 frames. ], batch size: 16, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:14:24,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2024-06-21 05:14:42,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=325200.3333333333, ans=0.2 2024-06-21 05:14:53,184 INFO [train.py:1028] (0/2) Epoch 18, batch 5400, loss[loss=0.2181, simple_loss=0.2529, pruned_loss=0.09162, over 12132.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2383, pruned_loss=0.07087, over 2567639.60 frames. 
], batch size: 240, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:15:14,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=325237.0, ans=0.0 2024-06-21 05:15:14,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=325237.0, ans=0.0 2024-06-21 05:15:29,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=325255.3333333333, ans=0.125 2024-06-21 05:15:31,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=325273.6666666667, ans=0.07 2024-06-21 05:15:46,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325292.0, ans=0.1 2024-06-21 05:15:47,437 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 1.938e+02 2.079e+02 2.224e+02 2.503e+02, threshold=4.158e+02, percent-clipped=0.0 2024-06-21 05:15:49,122 INFO [train.py:1028] (0/2) Epoch 18, batch 5450, loss[loss=0.1956, simple_loss=0.248, pruned_loss=0.07165, over 12784.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2384, pruned_loss=0.07087, over 2571099.73 frames. ], batch size: 26, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:15:53,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325310.3333333333, ans=0.125 2024-06-21 05:15:55,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=325310.3333333333, ans=0.0 2024-06-21 05:15:58,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325328.6666666667, ans=0.125 2024-06-21 05:16:01,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=325328.6666666667, ans=0.025 2024-06-21 05:16:25,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.46 vs. limit=15.0 2024-06-21 05:16:42,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=325383.6666666667, ans=0.0 2024-06-21 05:16:46,100 INFO [train.py:1028] (0/2) Epoch 18, batch 5500, loss[loss=0.2089, simple_loss=0.2481, pruned_loss=0.08488, over 12183.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2382, pruned_loss=0.07079, over 2565092.14 frames. ], batch size: 241, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:17:08,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=325438.6666666667, ans=0.025 2024-06-21 05:17:34,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=325475.3333333333, ans=0.025 2024-06-21 05:17:38,358 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.900e+02 2.074e+02 2.229e+02 3.003e+02, threshold=4.148e+02, percent-clipped=0.0 2024-06-21 05:17:39,949 INFO [train.py:1028] (0/2) Epoch 18, batch 5550, loss[loss=0.1872, simple_loss=0.2327, pruned_loss=0.07082, over 13283.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2378, pruned_loss=0.07025, over 2568710.50 frames. 
], batch size: 43, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:17:46,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=325512.0, ans=0.0 2024-06-21 05:17:57,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=325530.3333333333, ans=0.0 2024-06-21 05:17:58,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=325530.3333333333, ans=0.2 2024-06-21 05:17:59,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=325530.3333333333, ans=0.125 2024-06-21 05:18:00,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=325530.3333333333, ans=0.125 2024-06-21 05:18:25,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=325567.0, ans=0.125 2024-06-21 05:18:31,121 INFO [train.py:1028] (0/2) Epoch 18, batch 5600, loss[loss=0.1724, simple_loss=0.2192, pruned_loss=0.06276, over 13242.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2372, pruned_loss=0.07002, over 2570207.38 frames. ], batch size: 89, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:18:33,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=325585.3333333333, ans=0.0 2024-06-21 05:18:43,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=325603.6666666667, ans=0.125 2024-06-21 05:18:44,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=325603.6666666667, ans=0.125 2024-06-21 05:18:45,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325603.6666666667, ans=0.1 2024-06-21 05:18:53,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=325622.0, ans=0.125 2024-06-21 05:19:05,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=15.0 2024-06-21 05:19:07,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.94 vs. limit=15.0 2024-06-21 05:19:10,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=325640.3333333333, ans=0.2 2024-06-21 05:19:15,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=325658.6666666667, ans=0.0 2024-06-21 05:19:22,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 1.922e+02 2.058e+02 2.231e+02 3.529e+02, threshold=4.117e+02, percent-clipped=0.0 2024-06-21 05:19:23,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=325677.0, ans=0.125 2024-06-21 05:19:24,523 INFO [train.py:1028] (0/2) Epoch 18, batch 5650, loss[loss=0.2058, simple_loss=0.2492, pruned_loss=0.08116, over 12498.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2371, pruned_loss=0.06983, over 2575160.14 frames. 
], batch size: 202, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:20:00,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=325732.0, ans=0.0 2024-06-21 05:20:02,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=325732.0, ans=0.0 2024-06-21 05:20:03,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=325750.3333333333, ans=0.0 2024-06-21 05:20:04,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.34 vs. limit=15.0 2024-06-21 05:20:11,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325750.3333333333, ans=0.125 2024-06-21 05:20:15,815 INFO [train.py:1028] (0/2) Epoch 18, batch 5700, loss[loss=0.1934, simple_loss=0.2458, pruned_loss=0.07053, over 13318.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2371, pruned_loss=0.06979, over 2578333.43 frames. ], batch size: 63, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:20:19,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=325768.6666666667, ans=0.0 2024-06-21 05:20:30,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325787.0, ans=0.1 2024-06-21 05:20:38,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325805.3333333333, ans=0.1 2024-06-21 05:20:45,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=15.0 2024-06-21 05:20:51,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=325823.6666666667, ans=0.125 2024-06-21 05:20:52,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=325823.6666666667, ans=0.125 2024-06-21 05:21:03,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2024-06-21 05:21:04,680 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.928e+02 2.087e+02 2.285e+02 3.074e+02, threshold=4.174e+02, percent-clipped=0.0 2024-06-21 05:21:06,655 INFO [train.py:1028] (0/2) Epoch 18, batch 5750, loss[loss=0.179, simple_loss=0.2284, pruned_loss=0.06484, over 12724.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.238, pruned_loss=0.07013, over 2578380.86 frames. 
], batch size: 176, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:21:16,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=325860.3333333333, ans=0.125 2024-06-21 05:21:30,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=325878.6666666667, ans=0.125 2024-06-21 05:21:34,011 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 05:21:49,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=325933.6666666667, ans=0.125 2024-06-21 05:21:59,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=15.0 2024-06-21 05:21:59,407 INFO [train.py:1028] (0/2) Epoch 18, batch 5800, loss[loss=0.1876, simple_loss=0.2332, pruned_loss=0.07103, over 12734.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2391, pruned_loss=0.07083, over 2577366.65 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:22:21,875 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0 2024-06-21 05:22:47,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=326025.3333333333, ans=6.0 2024-06-21 05:22:51,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=326025.3333333333, ans=0.0 2024-06-21 05:22:53,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=326025.3333333333, ans=0.125 2024-06-21 05:22:57,901 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.729e+02 1.973e+02 2.141e+02 2.299e+02 3.812e+02, threshold=4.282e+02, percent-clipped=0.0 2024-06-21 05:22:59,851 INFO [train.py:1028] (0/2) Epoch 18, batch 5850, loss[loss=0.2125, simple_loss=0.2565, pruned_loss=0.08419, over 12577.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.241, pruned_loss=0.07171, over 2575365.93 frames. ], batch size: 202, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:23:37,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326117.0, ans=0.1 2024-06-21 05:23:47,181 INFO [train.py:1028] (0/2) Epoch 18, batch 5900, loss[loss=0.1892, simple_loss=0.2371, pruned_loss=0.07062, over 13082.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2432, pruned_loss=0.07247, over 2576871.71 frames. ], batch size: 121, lr: 3.23e-03, grad_scale: 64.0 2024-06-21 05:24:19,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.24 vs. limit=15.0 2024-06-21 05:24:24,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=326190.3333333333, ans=0.1 2024-06-21 05:24:29,731 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.25 vs. 
limit=15.0 2024-06-21 05:24:30,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=326190.3333333333, ans=0.125 2024-06-21 05:24:30,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=326190.3333333333, ans=0.0 2024-06-21 05:24:43,448 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 1.966e+02 2.155e+02 2.357e+02 3.049e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-21 05:24:44,706 INFO [train.py:1028] (0/2) Epoch 18, batch 5950, loss[loss=0.1981, simple_loss=0.2381, pruned_loss=0.07904, over 13100.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2443, pruned_loss=0.07301, over 2582040.41 frames. ], batch size: 121, lr: 3.23e-03, grad_scale: 32.0 2024-06-21 05:24:47,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=326227.0, ans=0.125 2024-06-21 05:25:01,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2024-06-21 05:25:03,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=326263.6666666667, ans=0.05 2024-06-21 05:25:10,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2024-06-21 05:25:10,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.29 vs. limit=15.0 2024-06-21 05:25:11,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=326282.0, ans=0.0 2024-06-21 05:25:31,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=326318.6666666667, ans=0.025 2024-06-21 05:25:32,153 INFO [train.py:1028] (0/2) Epoch 18, batch 6000, loss[loss=0.2469, simple_loss=0.2858, pruned_loss=0.104, over 12202.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2455, pruned_loss=0.07355, over 2574543.47 frames. ], batch size: 241, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:25:32,154 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 05:25:44,037 INFO [train.py:1060] (0/2) Epoch 18, validation: loss=0.1882, simple_loss=0.2528, pruned_loss=0.06183, over 351949.00 frames. 
2024-06-21 05:25:44,038 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 05:25:45,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=326318.6666666667, ans=0.025 2024-06-21 05:25:52,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=326318.6666666667, ans=0.125 2024-06-21 05:26:11,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=326355.3333333333, ans=0.09899494936611666 2024-06-21 05:26:12,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=326355.3333333333, ans=0.0 2024-06-21 05:26:14,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=326355.3333333333, ans=0.025 2024-06-21 05:26:37,575 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 1.974e+02 2.109e+02 2.263e+02 3.024e+02, threshold=4.218e+02, percent-clipped=0.0 2024-06-21 05:26:37,611 INFO [train.py:1028] (0/2) Epoch 18, batch 6050, loss[loss=0.1825, simple_loss=0.2392, pruned_loss=0.06292, over 12950.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2476, pruned_loss=0.07402, over 2578209.30 frames. ], batch size: 39, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:26:58,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=326447.0, ans=0.125 2024-06-21 05:27:04,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=326447.0, ans=0.0 2024-06-21 05:27:04,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=326447.0, ans=0.0 2024-06-21 05:27:14,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=326465.3333333333, ans=0.125 2024-06-21 05:27:16,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=326483.6666666667, ans=0.0 2024-06-21 05:27:26,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=326483.6666666667, ans=0.07 2024-06-21 05:27:27,858 INFO [train.py:1028] (0/2) Epoch 18, batch 6100, loss[loss=0.1816, simple_loss=0.2291, pruned_loss=0.06701, over 13112.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2489, pruned_loss=0.07451, over 2580904.04 frames. ], batch size: 121, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:27:36,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=326520.3333333333, ans=0.125 2024-06-21 05:27:40,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.25 vs. limit=15.0 2024-06-21 05:27:45,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.98 vs. 
limit=15.0 2024-06-21 05:27:53,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=326538.6666666667, ans=0.07 2024-06-21 05:28:18,446 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 1.958e+02 2.118e+02 2.331e+02 3.403e+02, threshold=4.236e+02, percent-clipped=0.0 2024-06-21 05:28:18,489 INFO [train.py:1028] (0/2) Epoch 18, batch 6150, loss[loss=0.2077, simple_loss=0.2441, pruned_loss=0.08562, over 11031.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2503, pruned_loss=0.07528, over 2580717.41 frames. ], batch size: 304, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:28:24,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=326593.6666666667, ans=0.2 2024-06-21 05:28:30,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=326612.0, ans=0.125 2024-06-21 05:28:56,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2024-06-21 05:29:15,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=326667.0, ans=0.0 2024-06-21 05:29:16,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=326667.0, ans=0.125 2024-06-21 05:29:17,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=326667.0, ans=0.125 2024-06-21 05:29:18,507 INFO [train.py:1028] (0/2) Epoch 18, batch 6200, loss[loss=0.2146, simple_loss=0.2696, pruned_loss=0.07978, over 13261.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2519, pruned_loss=0.0758, over 2577766.52 frames. ], batch size: 89, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:29:28,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=326703.6666666667, ans=0.2 2024-06-21 05:29:39,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=326722.0, ans=0.125 2024-06-21 05:29:46,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=326740.3333333333, ans=0.1 2024-06-21 05:30:07,665 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.002e+02 2.194e+02 2.572e+02 4.041e+02, threshold=4.388e+02, percent-clipped=0.0 2024-06-21 05:30:07,717 INFO [train.py:1028] (0/2) Epoch 18, batch 6250, loss[loss=0.2066, simple_loss=0.2547, pruned_loss=0.07921, over 13248.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2527, pruned_loss=0.07605, over 2571192.62 frames. ], batch size: 83, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:30:23,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=326777.0, ans=0.2 2024-06-21 05:30:23,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=326777.0, ans=0.0 2024-06-21 05:30:36,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. 
limit=15.0 2024-06-21 05:30:42,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=326813.6666666667, ans=0.1 2024-06-21 05:31:00,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=326850.3333333333, ans=0.125 2024-06-21 05:31:05,803 INFO [train.py:1028] (0/2) Epoch 18, batch 6300, loss[loss=0.2116, simple_loss=0.2685, pruned_loss=0.0774, over 11299.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2537, pruned_loss=0.07654, over 2566358.53 frames. ], batch size: 16, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:31:17,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=326887.0, ans=0.025 2024-06-21 05:31:23,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326905.3333333333, ans=0.1 2024-06-21 05:31:29,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.48 vs. limit=22.5 2024-06-21 05:31:53,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=326942.0, ans=0.0 2024-06-21 05:31:56,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=326942.0, ans=0.1 2024-06-21 05:31:56,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.78 vs. limit=10.0 2024-06-21 05:32:01,553 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.982e+02 2.137e+02 2.387e+02 3.416e+02, threshold=4.274e+02, percent-clipped=0.0 2024-06-21 05:32:01,592 INFO [train.py:1028] (0/2) Epoch 18, batch 6350, loss[loss=0.2288, simple_loss=0.2726, pruned_loss=0.09252, over 12505.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2551, pruned_loss=0.07667, over 2576076.88 frames. ], batch size: 202, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:32:04,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.91 vs. limit=12.0 2024-06-21 05:32:20,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326978.6666666667, ans=0.1 2024-06-21 05:32:35,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=327015.3333333333, ans=0.2 2024-06-21 05:32:35,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=22.5 2024-06-21 05:32:36,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327015.3333333333, ans=0.1 2024-06-21 05:32:40,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=327015.3333333333, ans=0.125 2024-06-21 05:32:46,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=327033.6666666667, ans=0.025 2024-06-21 05:32:53,908 INFO [train.py:1028] (0/2) Epoch 18, batch 6400, loss[loss=0.1793, simple_loss=0.2421, pruned_loss=0.05822, over 13209.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2571, pruned_loss=0.07738, over 2577308.95 frames. ], batch size: 67, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:32:59,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.59 vs. limit=15.0 2024-06-21 05:32:59,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=327052.0, ans=0.125 2024-06-21 05:33:30,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=327107.0, ans=0.0 2024-06-21 05:33:33,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=327107.0, ans=0.125 2024-06-21 05:33:39,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=327125.3333333333, ans=0.09899494936611666 2024-06-21 05:33:39,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=327125.3333333333, ans=0.125 2024-06-21 05:33:41,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=327125.3333333333, ans=0.125 2024-06-21 05:33:42,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=327125.3333333333, ans=0.125 2024-06-21 05:33:47,706 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.074e+02 2.211e+02 2.412e+02 2.962e+02, threshold=4.421e+02, percent-clipped=0.0 2024-06-21 05:33:47,750 INFO [train.py:1028] (0/2) Epoch 18, batch 6450, loss[loss=0.2259, simple_loss=0.2748, pruned_loss=0.08846, over 12546.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2587, pruned_loss=0.07805, over 2582345.68 frames. ], batch size: 202, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:34:13,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. 
limit=15.0 2024-06-21 05:34:13,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327180.3333333333, ans=0.125 2024-06-21 05:34:16,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327180.3333333333, ans=0.125 2024-06-21 05:34:22,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=327198.6666666667, ans=0.0 2024-06-21 05:34:25,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=327198.6666666667, ans=0.1 2024-06-21 05:34:40,630 INFO [train.py:1028] (0/2) Epoch 18, batch 6500, loss[loss=0.2182, simple_loss=0.2596, pruned_loss=0.08835, over 10829.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2603, pruned_loss=0.07859, over 2585261.30 frames. ], batch size: 303, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:35:02,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327272.0, ans=0.1 2024-06-21 05:35:04,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=327272.0, ans=0.125 2024-06-21 05:35:06,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2024-06-21 05:35:10,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=327272.0, ans=0.0 2024-06-21 05:35:12,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.81 vs. limit=15.0 2024-06-21 05:35:27,958 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.038e+02 2.196e+02 2.474e+02 3.910e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-21 05:35:27,987 INFO [train.py:1028] (0/2) Epoch 18, batch 6550, loss[loss=0.1914, simple_loss=0.2548, pruned_loss=0.06404, over 12561.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2615, pruned_loss=0.07885, over 2590070.99 frames. ], batch size: 22, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:35:43,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=327345.3333333333, ans=0.015 2024-06-21 05:36:04,454 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2024-06-21 05:36:07,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=327400.3333333333, ans=0.0 2024-06-21 05:36:10,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=327400.3333333333, ans=0.125 2024-06-21 05:36:11,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=327418.6666666667, ans=0.125 2024-06-21 05:36:12,553 INFO [train.py:1028] (0/2) Epoch 18, batch 6600, loss[loss=0.2155, simple_loss=0.2767, pruned_loss=0.07718, over 13234.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2618, pruned_loss=0.07885, over 2591847.64 frames. 
], batch size: 72, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:36:17,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=327418.6666666667, ans=0.0 2024-06-21 05:36:22,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=327437.0, ans=0.125 2024-06-21 05:36:52,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.89 vs. limit=15.0 2024-06-21 05:36:59,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.87 vs. limit=15.0 2024-06-21 05:37:02,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=327492.0, ans=0.0 2024-06-21 05:37:05,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=327492.0, ans=0.125 2024-06-21 05:37:10,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.042e+02 2.176e+02 2.372e+02 3.948e+02, threshold=4.351e+02, percent-clipped=0.0 2024-06-21 05:37:10,647 INFO [train.py:1028] (0/2) Epoch 18, batch 6650, loss[loss=0.2318, simple_loss=0.2847, pruned_loss=0.08941, over 12914.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2641, pruned_loss=0.07962, over 2587350.36 frames. ], batch size: 158, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:37:28,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=327528.6666666667, ans=0.125 2024-06-21 05:37:31,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327547.0, ans=0.1 2024-06-21 05:37:52,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.07 vs. limit=15.0 2024-06-21 05:38:06,271 INFO [train.py:1028] (0/2) Epoch 18, batch 6700, loss[loss=0.2204, simple_loss=0.2711, pruned_loss=0.08487, over 12747.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2648, pruned_loss=0.07994, over 2585640.83 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:38:10,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327602.0, ans=0.1 2024-06-21 05:38:12,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=327602.0, ans=0.09899494936611666 2024-06-21 05:38:16,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=327620.3333333333, ans=0.04949747468305833 2024-06-21 05:38:18,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.71 vs. 
limit=15.0 2024-06-21 05:38:18,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=327620.3333333333, ans=0.125 2024-06-21 05:38:29,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=327638.6666666667, ans=0.125 2024-06-21 05:38:35,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.62 vs. limit=22.5 2024-06-21 05:38:37,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=327657.0, ans=0.025 2024-06-21 05:38:46,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.37 vs. limit=22.5 2024-06-21 05:38:54,361 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.107e+02 2.360e+02 2.692e+02 3.618e+02, threshold=4.720e+02, percent-clipped=0.0 2024-06-21 05:38:54,408 INFO [train.py:1028] (0/2) Epoch 18, batch 6750, loss[loss=0.2659, simple_loss=0.3074, pruned_loss=0.1122, over 12252.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2658, pruned_loss=0.08051, over 2580710.98 frames. ], batch size: 241, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:39:00,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=327693.6666666667, ans=0.02 2024-06-21 05:39:05,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=327712.0, ans=0.125 2024-06-21 05:39:11,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=12.0 2024-06-21 05:39:11,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=327712.0, ans=0.125 2024-06-21 05:39:21,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=327730.3333333333, ans=0.0 2024-06-21 05:39:26,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=327748.6666666667, ans=0.125 2024-06-21 05:39:27,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=327748.6666666667, ans=0.125 2024-06-21 05:39:45,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=327767.0, ans=0.2 2024-06-21 05:39:49,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=327767.0, ans=0.0 2024-06-21 05:39:52,383 INFO [train.py:1028] (0/2) Epoch 18, batch 6800, loss[loss=0.2062, simple_loss=0.2638, pruned_loss=0.07431, over 13247.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2666, pruned_loss=0.08068, over 2582658.73 frames. 
], batch size: 67, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:40:01,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=327803.6666666667, ans=0.0 2024-06-21 05:40:03,965 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 05:40:13,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327822.0, ans=0.1 2024-06-21 05:40:14,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=327822.0, ans=0.125 2024-06-21 05:40:35,483 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.060e+02 2.214e+02 2.441e+02 3.106e+02, threshold=4.427e+02, percent-clipped=0.0 2024-06-21 05:40:35,521 INFO [train.py:1028] (0/2) Epoch 18, batch 6850, loss[loss=0.2069, simple_loss=0.2679, pruned_loss=0.07292, over 13247.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2672, pruned_loss=0.08065, over 2585661.76 frames. ], batch size: 63, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:40:45,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=327895.3333333333, ans=0.2 2024-06-21 05:41:18,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=327932.0, ans=0.125 2024-06-21 05:41:27,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=327950.3333333333, ans=0.125 2024-06-21 05:41:28,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=327950.3333333333, ans=0.125 2024-06-21 05:41:34,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=327968.6666666667, ans=0.1 2024-06-21 05:41:35,089 INFO [train.py:1028] (0/2) Epoch 18, batch 6900, loss[loss=0.2014, simple_loss=0.2609, pruned_loss=0.07098, over 13278.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2684, pruned_loss=0.08119, over 2586841.55 frames. ], batch size: 49, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:41:50,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=327987.0, ans=0.0 2024-06-21 05:41:50,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.77 vs. 
limit=15.0 2024-06-21 05:41:55,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=328005.3333333333, ans=0.0 2024-06-21 05:41:55,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=328005.3333333333, ans=22.5 2024-06-21 05:42:04,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=328023.6666666667, ans=0.5 2024-06-21 05:42:09,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=328023.6666666667, ans=0.125 2024-06-21 05:42:20,344 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.036e+02 2.155e+02 2.392e+02 3.183e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-21 05:42:20,389 INFO [train.py:1028] (0/2) Epoch 18, batch 6950, loss[loss=0.1923, simple_loss=0.2467, pruned_loss=0.06892, over 12303.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2678, pruned_loss=0.08051, over 2580996.01 frames. ], batch size: 18, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:42:31,604 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.06 vs. limit=15.0 2024-06-21 05:42:37,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2024-06-21 05:42:37,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=328078.6666666667, ans=0.025 2024-06-21 05:42:45,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=328097.0, ans=0.0 2024-06-21 05:42:45,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=328097.0, ans=0.2 2024-06-21 05:43:08,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.92 vs. limit=22.5 2024-06-21 05:43:12,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=328133.6666666667, ans=0.125 2024-06-21 05:43:13,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.12 vs. limit=10.0 2024-06-21 05:43:20,112 INFO [train.py:1028] (0/2) Epoch 18, batch 7000, loss[loss=0.2169, simple_loss=0.2697, pruned_loss=0.08205, over 12932.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2676, pruned_loss=0.08045, over 2577297.54 frames. ], batch size: 158, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:43:38,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=14.09 vs. limit=15.0 2024-06-21 05:44:13,365 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.044e+02 2.195e+02 2.306e+02 3.731e+02, threshold=4.390e+02, percent-clipped=0.0 2024-06-21 05:44:13,395 INFO [train.py:1028] (0/2) Epoch 18, batch 7050, loss[loss=0.2242, simple_loss=0.2747, pruned_loss=0.08687, over 12815.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2688, pruned_loss=0.08073, over 2583210.39 frames. 
], batch size: 177, lr: 3.22e-03, grad_scale: 32.0 2024-06-21 05:44:18,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=328243.6666666667, ans=0.125 2024-06-21 05:44:25,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=328262.0, ans=0.0 2024-06-21 05:44:34,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=328280.3333333333, ans=0.125 2024-06-21 05:44:39,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=328280.3333333333, ans=0.0 2024-06-21 05:44:39,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=328280.3333333333, ans=0.125 2024-06-21 05:44:54,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=15.0 2024-06-21 05:44:59,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=328317.0, ans=0.0 2024-06-21 05:45:05,221 INFO [train.py:1028] (0/2) Epoch 18, batch 7100, loss[loss=0.2311, simple_loss=0.2792, pruned_loss=0.09153, over 13192.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2698, pruned_loss=0.08153, over 2575138.02 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:45:20,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.75 vs. limit=15.0 2024-06-21 05:45:24,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=328353.6666666667, ans=0.0 2024-06-21 05:45:26,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=328372.0, ans=0.05 2024-06-21 05:46:07,675 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.083e+02 2.243e+02 2.462e+02 3.681e+02, threshold=4.487e+02, percent-clipped=0.0 2024-06-21 05:46:07,717 INFO [train.py:1028] (0/2) Epoch 18, batch 7150, loss[loss=0.2332, simple_loss=0.2848, pruned_loss=0.09082, over 12588.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2709, pruned_loss=0.08175, over 2574256.71 frames. ], batch size: 202, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:46:12,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.85 vs. limit=10.0 2024-06-21 05:46:32,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=15.0 2024-06-21 05:46:33,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=328463.6666666667, ans=0.0 2024-06-21 05:46:40,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=328482.0, ans=0.125 2024-06-21 05:46:41,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=328482.0, ans=0.2 2024-06-21 05:46:59,153 INFO [train.py:1028] (0/2) Epoch 18, batch 7200, loss[loss=0.239, simple_loss=0.2989, pruned_loss=0.08959, over 13232.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2722, pruned_loss=0.08226, over 2578868.46 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:47:02,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=328518.6666666667, ans=0.0 2024-06-21 05:47:09,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2024-06-21 05:47:11,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=328537.0, ans=0.2 2024-06-21 05:47:49,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=328592.0, ans=0.0 2024-06-21 05:47:57,777 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.139e+02 2.254e+02 2.492e+02 2.952e+02, threshold=4.508e+02, percent-clipped=0.0 2024-06-21 05:47:57,806 INFO [train.py:1028] (0/2) Epoch 18, batch 7250, loss[loss=0.2208, simple_loss=0.2749, pruned_loss=0.08339, over 12898.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2731, pruned_loss=0.08244, over 2579433.69 frames. ], batch size: 36, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:48:02,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328610.3333333333, ans=0.125 2024-06-21 05:48:13,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=328647.0, ans=0.125 2024-06-21 05:48:37,913 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.47 vs. limit=15.0 2024-06-21 05:48:41,038 INFO [train.py:1028] (0/2) Epoch 18, batch 7300, loss[loss=0.228, simple_loss=0.2806, pruned_loss=0.08764, over 12932.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2744, pruned_loss=0.08324, over 2578666.05 frames. ], batch size: 36, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:48:48,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=328702.0, ans=0.2 2024-06-21 05:48:51,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.22 vs. 
limit=6.0 2024-06-21 05:49:25,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=328775.3333333333, ans=0.0 2024-06-21 05:49:36,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.146e+02 2.288e+02 2.461e+02 3.426e+02, threshold=4.576e+02, percent-clipped=0.0 2024-06-21 05:49:36,501 INFO [train.py:1028] (0/2) Epoch 18, batch 7350, loss[loss=0.2515, simple_loss=0.3044, pruned_loss=0.09929, over 13294.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2755, pruned_loss=0.08365, over 2579521.11 frames. ], batch size: 46, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:49:36,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=328793.6666666667, ans=0.0 2024-06-21 05:49:44,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=328793.6666666667, ans=0.125 2024-06-21 05:49:53,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=328812.0, ans=0.125 2024-06-21 05:49:59,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.42 vs. limit=22.5 2024-06-21 05:50:06,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=15.0 2024-06-21 05:50:12,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2024-06-21 05:50:14,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.92 vs. limit=22.5 2024-06-21 05:50:15,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=328867.0, ans=0.2 2024-06-21 05:50:20,264 INFO [train.py:1028] (0/2) Epoch 18, batch 7400, loss[loss=0.2363, simple_loss=0.3038, pruned_loss=0.08443, over 13215.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2757, pruned_loss=0.08351, over 2585592.82 frames. ], batch size: 63, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:50:22,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2024-06-21 05:50:27,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=328885.3333333333, ans=0.125 2024-06-21 05:50:29,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=328885.3333333333, ans=0.125 2024-06-21 05:50:32,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=328903.6666666667, ans=22.5 2024-06-21 05:51:01,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. 
limit=10.0 2024-06-21 05:51:11,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=328958.6666666667, ans=0.035 2024-06-21 05:51:20,887 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.098e+02 2.281e+02 2.540e+02 3.820e+02, threshold=4.562e+02, percent-clipped=0.0 2024-06-21 05:51:20,920 INFO [train.py:1028] (0/2) Epoch 18, batch 7450, loss[loss=0.2154, simple_loss=0.2726, pruned_loss=0.07917, over 12603.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.275, pruned_loss=0.08284, over 2578911.53 frames. ], batch size: 29, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:51:24,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=328977.0, ans=0.125 2024-06-21 05:51:28,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328977.0, ans=0.125 2024-06-21 05:51:34,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=328995.3333333333, ans=0.125 2024-06-21 05:52:17,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=329050.3333333333, ans=0.125 2024-06-21 05:52:19,522 INFO [train.py:1028] (0/2) Epoch 18, batch 7500, loss[loss=0.2153, simple_loss=0.2644, pruned_loss=0.08316, over 10596.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2759, pruned_loss=0.08341, over 2577402.93 frames. ], batch size: 304, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:52:30,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329087.0, ans=0.1 2024-06-21 05:52:42,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=329105.3333333333, ans=0.125 2024-06-21 05:52:45,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=329105.3333333333, ans=0.0 2024-06-21 05:53:10,302 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.092e+02 2.265e+02 2.570e+02 3.672e+02, threshold=4.530e+02, percent-clipped=0.0 2024-06-21 05:53:10,336 INFO [train.py:1028] (0/2) Epoch 18, batch 7550, loss[loss=0.2317, simple_loss=0.2838, pruned_loss=0.08983, over 12933.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2769, pruned_loss=0.08434, over 2577073.11 frames. ], batch size: 158, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:53:13,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=329160.3333333333, ans=0.125 2024-06-21 05:53:20,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=329178.6666666667, ans=0.125 2024-06-21 05:53:32,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2024-06-21 05:53:39,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=329215.3333333333, ans=0.025 2024-06-21 05:53:45,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.51 vs. 
limit=22.5 2024-06-21 05:53:58,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=329215.3333333333, ans=0.0 2024-06-21 05:54:04,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2024-06-21 05:54:11,952 INFO [train.py:1028] (0/2) Epoch 18, batch 7600, loss[loss=0.2333, simple_loss=0.2836, pruned_loss=0.09146, over 13214.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2769, pruned_loss=0.08411, over 2576577.48 frames. ], batch size: 83, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:54:12,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=329252.0, ans=0.0 2024-06-21 05:54:46,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=329307.0, ans=0.125 2024-06-21 05:54:47,905 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2024-06-21 05:55:01,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=329325.3333333333, ans=0.125 2024-06-21 05:55:07,906 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.147e+02 2.403e+02 2.598e+02 3.884e+02, threshold=4.805e+02, percent-clipped=0.0 2024-06-21 05:55:07,942 INFO [train.py:1028] (0/2) Epoch 18, batch 7650, loss[loss=0.2304, simple_loss=0.2876, pruned_loss=0.0866, over 12962.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2776, pruned_loss=0.08454, over 2573263.44 frames. ], batch size: 33, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:55:17,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=329362.0, ans=0.125 2024-06-21 05:55:18,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.85 vs. limit=22.5 2024-06-21 05:55:27,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.91 vs. limit=15.0 2024-06-21 05:55:30,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=329380.3333333333, ans=0.0 2024-06-21 05:55:32,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=329380.3333333333, ans=0.125 2024-06-21 05:55:39,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=329398.6666666667, ans=0.125 2024-06-21 05:55:44,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=329398.6666666667, ans=0.0 2024-06-21 05:55:46,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329398.6666666667, ans=0.1 2024-06-21 05:55:52,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=329417.0, ans=0.025 2024-06-21 05:56:00,833 INFO [train.py:1028] (0/2) Epoch 18, batch 7700, loss[loss=0.2422, simple_loss=0.2978, pruned_loss=0.09333, over 13269.00 frames. 
], tot_loss[loss=0.2243, simple_loss=0.2785, pruned_loss=0.085, over 2570056.46 frames. ], batch size: 63, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:56:04,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=329435.3333333333, ans=0.125 2024-06-21 05:56:06,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=329435.3333333333, ans=0.2 2024-06-21 05:56:09,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=329435.3333333333, ans=0.1 2024-06-21 05:56:14,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=329453.6666666667, ans=0.0 2024-06-21 05:56:22,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.48 vs. limit=22.5 2024-06-21 05:56:51,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=329527.0, ans=0.2 2024-06-21 05:56:52,428 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.091e+02 2.239e+02 2.424e+02 3.231e+02, threshold=4.478e+02, percent-clipped=0.0 2024-06-21 05:56:52,476 INFO [train.py:1028] (0/2) Epoch 18, batch 7750, loss[loss=0.2188, simple_loss=0.2767, pruned_loss=0.08048, over 13277.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2791, pruned_loss=0.08537, over 2573376.74 frames. ], batch size: 72, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:57:07,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=329527.0, ans=0.125 2024-06-21 05:57:08,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=329527.0, ans=0.0 2024-06-21 05:57:22,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.33 vs. limit=15.0 2024-06-21 05:57:29,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=329582.0, ans=0.125 2024-06-21 05:57:34,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=329582.0, ans=0.2 2024-06-21 05:57:42,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=329600.3333333333, ans=0.125 2024-06-21 05:57:51,500 INFO [train.py:1028] (0/2) Epoch 18, batch 7800, loss[loss=0.2168, simple_loss=0.2726, pruned_loss=0.0805, over 13198.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2799, pruned_loss=0.08542, over 2578420.11 frames. ], batch size: 95, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:57:52,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=329618.6666666667, ans=0.125 2024-06-21 05:58:17,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=329655.3333333333, ans=0.2 2024-06-21 05:58:19,044 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.21 vs. 
limit=22.5 2024-06-21 05:58:25,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=329655.3333333333, ans=0.125 2024-06-21 05:58:26,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2024-06-21 05:58:45,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=329692.0, ans=0.125 2024-06-21 05:58:49,217 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.121e+02 2.301e+02 2.539e+02 3.490e+02, threshold=4.601e+02, percent-clipped=0.0 2024-06-21 05:58:49,251 INFO [train.py:1028] (0/2) Epoch 18, batch 7850, loss[loss=0.1988, simple_loss=0.2557, pruned_loss=0.07092, over 11558.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2805, pruned_loss=0.08595, over 2571899.51 frames. ], batch size: 16, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:59:14,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=329747.0, ans=0.125 2024-06-21 05:59:42,777 INFO [train.py:1028] (0/2) Epoch 18, batch 7900, loss[loss=0.1949, simple_loss=0.2591, pruned_loss=0.06535, over 13153.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2802, pruned_loss=0.08555, over 2572559.04 frames. ], batch size: 77, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 05:59:42,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=329802.0, ans=0.125 2024-06-21 05:59:49,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2024-06-21 06:00:06,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=329820.3333333333, ans=0.0 2024-06-21 06:00:17,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=329838.6666666667, ans=15.0 2024-06-21 06:00:18,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=329838.6666666667, ans=0.125 2024-06-21 06:00:25,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.60 vs. limit=15.0 2024-06-21 06:00:26,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=329857.0, ans=0.0 2024-06-21 06:00:33,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.76 vs. limit=15.0 2024-06-21 06:00:34,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=329875.3333333333, ans=0.125 2024-06-21 06:00:37,205 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.111e+02 2.272e+02 2.551e+02 3.559e+02, threshold=4.544e+02, percent-clipped=0.0 2024-06-21 06:00:37,241 INFO [train.py:1028] (0/2) Epoch 18, batch 7950, loss[loss=0.2113, simple_loss=0.2603, pruned_loss=0.08117, over 10658.00 frames. 
], tot_loss[loss=0.2264, simple_loss=0.2809, pruned_loss=0.08593, over 2575472.21 frames. ], batch size: 304, lr: 3.21e-03, grad_scale: 32.0 2024-06-21 06:00:59,884 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.49 vs. limit=22.5 2024-06-21 06:01:04,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329948.6666666667, ans=0.1 2024-06-21 06:01:08,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=329948.6666666667, ans=0.125 2024-06-21 06:01:31,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=329967.0, ans=0.125 2024-06-21 06:01:34,941 INFO [train.py:1028] (0/2) Epoch 18, batch 8000, loss[loss=0.2105, simple_loss=0.2658, pruned_loss=0.07756, over 12658.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2812, pruned_loss=0.08603, over 2572439.21 frames. ], batch size: 29, lr: 3.21e-03, grad_scale: 64.0 2024-06-21 06:01:43,150 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-180000.pt 2024-06-21 06:02:09,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=330022.0, ans=0.07 2024-06-21 06:02:15,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=330040.3333333333, ans=0.025 2024-06-21 06:02:33,933 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.163e+02 2.316e+02 2.569e+02 3.179e+02, threshold=4.632e+02, percent-clipped=0.0 2024-06-21 06:02:33,972 INFO [train.py:1028] (0/2) Epoch 18, batch 8050, loss[loss=0.2208, simple_loss=0.2726, pruned_loss=0.08448, over 13219.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2816, pruned_loss=0.08615, over 2571858.22 frames. ], batch size: 83, lr: 3.21e-03, grad_scale: 64.0 2024-06-21 06:02:46,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.11 vs. limit=15.0 2024-06-21 06:02:49,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=330113.6666666667, ans=0.0 2024-06-21 06:02:49,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=330113.6666666667, ans=0.05 2024-06-21 06:02:51,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330113.6666666667, ans=0.125 2024-06-21 06:02:52,672 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.44 vs. limit=15.0 2024-06-21 06:03:01,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=330132.0, ans=0.125 2024-06-21 06:03:04,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=330132.0, ans=0.0 2024-06-21 06:03:17,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.34 vs. 
limit=22.5 2024-06-21 06:03:27,902 INFO [train.py:1028] (0/2) Epoch 18, batch 8100, loss[loss=0.219, simple_loss=0.2648, pruned_loss=0.0866, over 13169.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2816, pruned_loss=0.08616, over 2576160.90 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 64.0 2024-06-21 06:03:33,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=330168.6666666667, ans=0.125 2024-06-21 06:03:37,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2024-06-21 06:03:38,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=330187.0, ans=0.2 2024-06-21 06:03:54,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=330205.3333333333, ans=0.125 2024-06-21 06:04:00,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.46 vs. limit=22.5 2024-06-21 06:04:02,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=330223.6666666667, ans=0.125 2024-06-21 06:04:09,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=330242.0, ans=0.2 2024-06-21 06:04:21,318 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.104e+02 2.232e+02 2.387e+02 3.049e+02, threshold=4.464e+02, percent-clipped=0.0 2024-06-21 06:04:21,356 INFO [train.py:1028] (0/2) Epoch 18, batch 8150, loss[loss=0.2191, simple_loss=0.2738, pruned_loss=0.08219, over 13118.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2817, pruned_loss=0.08584, over 2580028.91 frames. ], batch size: 121, lr: 3.21e-03, grad_scale: 64.0 2024-06-21 06:04:22,507 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:04:32,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=330278.6666666667, ans=0.07 2024-06-21 06:04:35,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330278.6666666667, ans=0.1 2024-06-21 06:04:54,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=330297.0, ans=0.0 2024-06-21 06:05:14,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=330333.6666666667, ans=0.0 2024-06-21 06:05:20,428 INFO [train.py:1028] (0/2) Epoch 18, batch 8200, loss[loss=0.226, simple_loss=0.2787, pruned_loss=0.08665, over 13132.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2816, pruned_loss=0.08556, over 2583723.06 frames. 
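], batch size: 112, lr: 3.21e-03, grad_scale: 64.0

The optim.py:487 warnings report the quartiles (min, 25%, median, 75%, max) of recently observed gradient norms plus a clipping threshold; in these records the threshold is, up to rounding, Clipping_scale=2.0 times the logged median (e.g. 2 x 2.232e+02 = 4.464e+02), and percent-clipped=0.0 means no recent step exceeded it. A rough sketch of that bookkeeping, assuming the threshold really is scale-times-median; the helper below is hypothetical, not ScaledAdam's internals.

    import torch

    def clip_by_recent_norms(parameters, recent_norms, clipping_scale=2.0):
        # Quartiles of the gradient norms seen over recent steps.
        qs = torch.quantile(
            torch.tensor(recent_norms),
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
        )
        # Assumed rule: clip against clipping_scale times the median.
        threshold = clipping_scale * qs[2].item()
        total_norm = torch.nn.utils.clip_grad_norm_(parameters, max_norm=threshold)
        return qs.tolist(), threshold, total_norm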
2024-06-21 06:05:35,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=330370.3333333333, ans=15.0
2024-06-21 06:05:48,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330388.6666666667, ans=0.1
2024-06-21 06:05:50,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=330388.6666666667, ans=0.125
2024-06-21 06:05:56,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=330407.0, ans=0.0
2024-06-21 06:06:00,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=330407.0, ans=0.04949747468305833
2024-06-21 06:06:05,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=330425.3333333333, ans=0.09899494936611666
2024-06-21 06:06:10,372 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.26 vs. limit=15.0
2024-06-21 06:06:12,739 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.050e+02 2.196e+02 2.395e+02 2.969e+02, threshold=4.392e+02, percent-clipped=0.0
2024-06-21 06:06:12,767 INFO [train.py:1028] (0/2) Epoch 18, batch 8250, loss[loss=0.2192, simple_loss=0.2887, pruned_loss=0.07488, over 13295.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2817, pruned_loss=0.08531, over 2583654.34 frames. ], batch size: 52, lr: 3.20e-03, grad_scale: 64.0
2024-06-21 06:06:23,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=330462.0, ans=0.125
2024-06-21 06:06:29,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.72 vs. limit=22.5
2024-06-21 06:06:32,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=330480.3333333333, ans=0.125
2024-06-21 06:06:48,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=330517.0, ans=0.125
2024-06-21 06:06:51,238 INFO [train.py:1028] (0/2) Epoch 18, batch 8300, loss[loss=0.2262, simple_loss=0.2767, pruned_loss=0.08782, over 13041.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2808, pruned_loss=0.08474, over 2580094.59 frames. ], batch size: 102, lr: 3.20e-03, grad_scale: 64.0
2024-06-21 06:07:18,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=330572.0, ans=0.0
2024-06-21 06:07:34,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=330590.3333333333, ans=0.025
2024-06-21 06:07:45,673 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs.
limit=15.0 2024-06-21 06:07:48,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=330608.6666666667, ans=0.125 2024-06-21 06:07:51,739 INFO [train.py:1028] (0/2) Epoch 18, batch 8350, loss[loss=0.2247, simple_loss=0.2791, pruned_loss=0.08511, over 13180.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2805, pruned_loss=0.08464, over 2580418.44 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:07:52,666 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.154e+02 2.311e+02 2.459e+02 3.428e+02, threshold=4.622e+02, percent-clipped=0.0 2024-06-21 06:08:02,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=330645.3333333333, ans=0.125 2024-06-21 06:08:24,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=330682.0, ans=0.125 2024-06-21 06:08:31,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=330682.0, ans=0.2 2024-06-21 06:08:38,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=330700.3333333333, ans=0.0 2024-06-21 06:08:44,163 INFO [train.py:1028] (0/2) Epoch 18, batch 8400, loss[loss=0.2028, simple_loss=0.2601, pruned_loss=0.07274, over 12947.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2807, pruned_loss=0.08494, over 2576288.56 frames. ], batch size: 39, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:08:46,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.85 vs. limit=10.0 2024-06-21 06:09:02,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=330737.0, ans=0.0 2024-06-21 06:09:20,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=330773.6666666667, ans=0.0 2024-06-21 06:09:20,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2024-06-21 06:09:21,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.74 vs. limit=8.0 2024-06-21 06:09:42,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=330810.3333333333, ans=0.125 2024-06-21 06:09:42,725 INFO [train.py:1028] (0/2) Epoch 18, batch 8450, loss[loss=0.2176, simple_loss=0.2821, pruned_loss=0.07651, over 13178.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2814, pruned_loss=0.08527, over 2578734.66 frames. 
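], batch size: 112, lr: 3.20e-03, grad_scale: 32.0

The scaling.py:1023 lines are whitening diagnostics: metric measures how far a module's output covariance is from a multiple of the identity (1.0 would be perfectly white), and limit is the budget beyond which the Whiten module starts pushing back. One plausible form of such a metric, shown purely as an illustration and not necessarily the exact formula in scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        """x: (num_frames, num_channels). Returns 1.0 when the channel
        covariance is a multiple of the identity, larger otherwise."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]  # (C, C) channel covariance
        num_channels = cov.shape[0]
        # Mean squared covariance entry, normalized by the squared mean
        # variance and scaled so an identity-like covariance scores 1.0.
        return (cov ** 2).mean() / (cov.diag().mean() ** 2) * num_channels

Read this way, a record like encoder_embed.out_whiten, metric=7.74 vs. limit=8.0 above says the embedding output is strongly non-white but still just inside its allowed budget.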
2024-06-21 06:09:43,712 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.128e+02 2.308e+02 2.595e+02 3.438e+02, threshold=4.616e+02, percent-clipped=0.0
2024-06-21 06:10:04,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=330847.0, ans=0.125
2024-06-21 06:10:13,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=330865.3333333333, ans=0.125
2024-06-21 06:10:29,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=330883.6666666667, ans=0.2
2024-06-21 06:10:31,307 INFO [train.py:1028] (0/2) Epoch 18, batch 8500, loss[loss=0.2123, simple_loss=0.2712, pruned_loss=0.0767, over 12700.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2824, pruned_loss=0.08589, over 2576832.59 frames. ], batch size: 29, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:10:43,028 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.615e-01
2024-06-21 06:10:44,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=330902.0, ans=0.125
2024-06-21 06:10:52,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=330920.3333333333, ans=0.125
2024-06-21 06:11:01,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330938.6666666667, ans=0.125
2024-06-21 06:11:13,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=330957.0, ans=0.0
2024-06-21 06:11:29,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=330993.6666666667, ans=0.125
2024-06-21 06:11:30,171 INFO [train.py:1028] (0/2) Epoch 18, batch 8550, loss[loss=0.2204, simple_loss=0.2684, pruned_loss=0.08621, over 12340.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2816, pruned_loss=0.08515, over 2575562.96 frames. ], batch size: 22, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:11:31,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.092e+02 2.212e+02 2.480e+02 4.496e+02, threshold=4.425e+02, percent-clipped=0.0
2024-06-21 06:11:31,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=330993.6666666667, ans=0.125
2024-06-21 06:11:34,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=330993.6666666667, ans=0.0
2024-06-21 06:11:53,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=331030.3333333333, ans=0.0
2024-06-21 06:11:56,250 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=22.5
2024-06-21 06:12:21,915 INFO [train.py:1028] (0/2) Epoch 18, batch 8600, loss[loss=0.2125, simple_loss=0.2658, pruned_loss=0.07962, over 13116.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2823, pruned_loss=0.08557, over 2573307.66 frames.
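], batch size: 112, lr: 3.20e-03, grad_scale: 32.0

The scaling.py:214 lines dump ScheduledFloat values: regularization knobs (skip rates, balancer probabilities, dropout) that are functions of the global batch_count. By ~330k batches nearly all of them have settled at their final constants (0.0, 0.125, 0.2, ...). A minimal sketch of a piecewise-linear schedule keyed on batch count follows; the breakpoints are invented for illustration.

    class ScheduledFloatSketch:
        """Float hyperparameter interpolated piecewise-linearly against
        the global batch count, constant beyond the last breakpoint."""

        def __init__(self, *points):
            self.points = sorted(points)  # (batch_count, value) pairs

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            return pts[-1][1]

    # Assumed breakpoints: ramp from 0.1 at batch 0 to 0.0 by batch 20000.
    ff2_skip_rate = ScheduledFloatSketch((0.0, 0.1), (20000.0, 0.0))
    assert ff2_skip_rate(330000.0) == 0.0  # flat this late in training, as logged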
2024-06-21 06:12:27,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=331085.3333333333, ans=0.09899494936611666
2024-06-21 06:12:27,993 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:12:58,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=331122.0, ans=0.1
2024-06-21 06:13:03,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=331140.3333333333, ans=0.125
2024-06-21 06:13:22,188 INFO [train.py:1028] (0/2) Epoch 18, batch 8650, loss[loss=0.2234, simple_loss=0.2737, pruned_loss=0.08655, over 13014.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2822, pruned_loss=0.08523, over 2575954.30 frames. ], batch size: 102, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:13:23,029 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.131e+02 2.273e+02 2.423e+02 2.858e+02, threshold=4.546e+02, percent-clipped=0.0
2024-06-21 06:13:29,335 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0
2024-06-21 06:13:51,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=331213.6666666667, ans=0.025
2024-06-21 06:14:11,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.80 vs. limit=22.5
2024-06-21 06:14:15,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0
2024-06-21 06:14:18,083 INFO [train.py:1028] (0/2) Epoch 18, batch 8700, loss[loss=0.2545, simple_loss=0.3128, pruned_loss=0.0981, over 13145.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2829, pruned_loss=0.08572, over 2573468.23 frames. ], batch size: 59, lr: 3.20e-03, grad_scale: 32.0
2024-06-21 06:14:18,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=331268.6666666667, ans=0.0
2024-06-21 06:14:25,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=331287.0, ans=0.2
2024-06-21 06:14:27,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=331287.0, ans=0.0
2024-06-21 06:14:29,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=331287.0, ans=0.0
2024-06-21 06:14:33,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.11 vs. limit=15.0
2024-06-21 06:14:34,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.34 vs.
limit=15.0 2024-06-21 06:14:36,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=331305.3333333333, ans=0.5 2024-06-21 06:14:37,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331305.3333333333, ans=0.1 2024-06-21 06:14:45,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=331305.3333333333, ans=0.2 2024-06-21 06:14:50,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=331323.6666666667, ans=0.0 2024-06-21 06:15:01,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=331342.0, ans=0.125 2024-06-21 06:15:07,515 INFO [train.py:1028] (0/2) Epoch 18, batch 8750, loss[loss=0.2351, simple_loss=0.2861, pruned_loss=0.09206, over 13113.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2832, pruned_loss=0.0862, over 2570162.77 frames. ], batch size: 121, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:15:08,306 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.187e+02 2.311e+02 2.588e+02 3.245e+02, threshold=4.622e+02, percent-clipped=0.0 2024-06-21 06:15:15,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331360.3333333333, ans=0.1 2024-06-21 06:15:33,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.89 vs. limit=12.0 2024-06-21 06:15:39,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=331415.3333333333, ans=10.0 2024-06-21 06:16:07,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=331452.0, ans=0.0 2024-06-21 06:16:08,133 INFO [train.py:1028] (0/2) Epoch 18, batch 8800, loss[loss=0.2184, simple_loss=0.2737, pruned_loss=0.08152, over 13227.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2836, pruned_loss=0.08621, over 2575085.82 frames. ], batch size: 72, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:16:08,939 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:16:17,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=331470.3333333333, ans=0.125 2024-06-21 06:16:19,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331470.3333333333, ans=0.1 2024-06-21 06:16:34,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.35 vs. 
limit=15.0 2024-06-21 06:16:36,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=331507.0, ans=0.125 2024-06-21 06:16:51,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=331507.0, ans=15.0 2024-06-21 06:16:56,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=331525.3333333333, ans=0.0 2024-06-21 06:17:02,583 INFO [train.py:1028] (0/2) Epoch 18, batch 8850, loss[loss=0.2343, simple_loss=0.2929, pruned_loss=0.08784, over 12527.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2837, pruned_loss=0.08661, over 2564521.94 frames. ], batch size: 202, lr: 3.20e-03, grad_scale: 16.0 2024-06-21 06:17:04,375 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.169e+02 2.295e+02 2.512e+02 3.208e+02, threshold=4.590e+02, percent-clipped=0.0 2024-06-21 06:17:44,082 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2024-06-21 06:17:44,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=331617.0, ans=0.2 2024-06-21 06:17:56,419 INFO [train.py:1028] (0/2) Epoch 18, batch 8900, loss[loss=0.2519, simple_loss=0.3107, pruned_loss=0.09657, over 12884.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2849, pruned_loss=0.0871, over 2562072.61 frames. ], batch size: 33, lr: 3.20e-03, grad_scale: 16.0 2024-06-21 06:18:07,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=331653.6666666667, ans=0.025 2024-06-21 06:18:23,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=331672.0, ans=0.125 2024-06-21 06:18:31,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=331690.3333333333, ans=0.0 2024-06-21 06:18:35,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2024-06-21 06:18:54,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.95 vs. limit=15.0 2024-06-21 06:18:55,810 INFO [train.py:1028] (0/2) Epoch 18, batch 8950, loss[loss=0.242, simple_loss=0.2916, pruned_loss=0.09619, over 12480.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2849, pruned_loss=0.08683, over 2561036.17 frames. 
], batch size: 202, lr: 3.20e-03, grad_scale: 16.0 2024-06-21 06:18:57,558 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.172e+02 2.329e+02 2.469e+02 3.325e+02, threshold=4.658e+02, percent-clipped=0.0 2024-06-21 06:19:06,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331745.3333333333, ans=0.1 2024-06-21 06:19:28,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=331782.0, ans=0.05 2024-06-21 06:19:33,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=331782.0, ans=0.125 2024-06-21 06:19:35,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=331782.0, ans=0.2 2024-06-21 06:19:47,149 INFO [train.py:1028] (0/2) Epoch 18, batch 9000, loss[loss=0.2194, simple_loss=0.2774, pruned_loss=0.08074, over 13330.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2849, pruned_loss=0.08645, over 2567415.74 frames. ], batch size: 46, lr: 3.20e-03, grad_scale: 16.0 2024-06-21 06:19:47,150 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 06:19:56,435 INFO [train.py:1060] (0/2) Epoch 18, validation: loss=0.187, simple_loss=0.2519, pruned_loss=0.06106, over 351949.00 frames. 2024-06-21 06:19:56,436 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 06:20:07,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=331837.0, ans=0.125 2024-06-21 06:20:21,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=331837.0, ans=0.025 2024-06-21 06:20:26,560 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.86 vs. limit=15.0 2024-06-21 06:20:29,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=331855.3333333333, ans=0.2 2024-06-21 06:20:30,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.68 vs. limit=12.0 2024-06-21 06:20:32,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=331855.3333333333, ans=0.125 2024-06-21 06:20:37,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=331873.6666666667, ans=0.125 2024-06-21 06:20:38,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=331873.6666666667, ans=0.0 2024-06-21 06:20:55,073 INFO [train.py:1028] (0/2) Epoch 18, batch 9050, loss[loss=0.2007, simple_loss=0.2579, pruned_loss=0.07179, over 12109.00 frames. ], tot_loss[loss=0.23, simple_loss=0.286, pruned_loss=0.08701, over 2567546.32 frames. 
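], batch size: 18, lr: 3.20e-03, grad_scale: 16.0

At train.py:1051/1060 above, training pauses to compute a validation loss over the full dev set (the same 351949 frames every time) and then logs peak GPU memory. A compact sketch of such a pass; the batch key and the model call signature are assumptions, not the recipe's real interfaces.

    import torch

    def compute_validation_loss(model, valid_loader, device) -> float:
        """Frames-weighted average loss over the whole dev set."""
        model.eval()
        loss_sum, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                features = batch["features"].to(device)    # assumed key
                loss, num_frames = model(features, batch)  # hypothetical signature
                loss_sum += loss.item() * num_frames
                frames += num_frames
        model.train()
        return loss_sum / max(frames, 1.0)

    # The memory line is standard PyTorch accounting, along the lines of
    # torch.cuda.max_memory_allocated() // (1024 * 1024)  ->  17480 (MB)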
2024-06-21 06:20:57,384 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.123e+02 2.252e+02 2.459e+02 3.045e+02, threshold=4.503e+02, percent-clipped=0.0
2024-06-21 06:21:33,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=331965.3333333333, ans=0.125
2024-06-21 06:21:36,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=331983.6666666667, ans=0.125
2024-06-21 06:21:45,613 INFO [train.py:1028] (0/2) Epoch 18, batch 9100, loss[loss=0.2212, simple_loss=0.2837, pruned_loss=0.07935, over 13240.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.285, pruned_loss=0.08638, over 2570265.06 frames. ], batch size: 72, lr: 3.20e-03, grad_scale: 16.0
2024-06-21 06:21:45,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=332002.0, ans=0.0
2024-06-21 06:21:54,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=332020.3333333333, ans=0.025
2024-06-21 06:22:01,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=332020.3333333333, ans=0.125
2024-06-21 06:22:07,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332038.6666666667, ans=0.125
2024-06-21 06:22:15,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332057.0, ans=0.1
2024-06-21 06:22:17,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=332057.0, ans=0.2
2024-06-21 06:22:20,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=332057.0, ans=0.0
2024-06-21 06:22:23,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0
2024-06-21 06:22:32,032 INFO [train.py:1028] (0/2) Epoch 18, batch 9150, loss[loss=0.2293, simple_loss=0.2891, pruned_loss=0.08475, over 13149.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.285, pruned_loss=0.08628, over 2570706.17 frames.
], batch size: 77, lr: 3.20e-03, grad_scale: 16.0 2024-06-21 06:22:32,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=332093.6666666667, ans=0.025 2024-06-21 06:22:34,060 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.104e+02 2.273e+02 2.392e+02 2.929e+02, threshold=4.545e+02, percent-clipped=0.0 2024-06-21 06:22:47,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=332112.0, ans=0.2 2024-06-21 06:22:55,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332130.3333333333, ans=0.1 2024-06-21 06:23:01,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=332148.6666666667, ans=0.125 2024-06-21 06:23:05,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=332148.6666666667, ans=0.0 2024-06-21 06:23:21,896 INFO [train.py:1028] (0/2) Epoch 18, batch 9200, loss[loss=0.2216, simple_loss=0.28, pruned_loss=0.08161, over 12847.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2847, pruned_loss=0.08585, over 2573601.07 frames. ], batch size: 36, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:23:26,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=332185.3333333333, ans=0.125 2024-06-21 06:23:40,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.05 vs. limit=15.0 2024-06-21 06:23:40,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=332203.6666666667, ans=0.125 2024-06-21 06:24:11,978 INFO [train.py:1028] (0/2) Epoch 18, batch 9250, loss[loss=0.2247, simple_loss=0.2782, pruned_loss=0.08561, over 13242.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2838, pruned_loss=0.08541, over 2574638.91 frames. ], batch size: 67, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:24:14,196 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.076e+02 2.234e+02 2.439e+02 3.603e+02, threshold=4.468e+02, percent-clipped=0.0 2024-06-21 06:24:17,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=332277.0, ans=0.025 2024-06-21 06:24:21,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=332295.3333333333, ans=0.0 2024-06-21 06:24:34,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=332313.6666666667, ans=0.0 2024-06-21 06:24:55,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.66 vs. limit=22.5 2024-06-21 06:24:59,905 INFO [train.py:1028] (0/2) Epoch 18, batch 9300, loss[loss=0.1939, simple_loss=0.258, pruned_loss=0.06495, over 12984.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2835, pruned_loss=0.08494, over 2571000.39 frames. 
], batch size: 39, lr: 3.20e-03, grad_scale: 32.0 2024-06-21 06:25:02,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332368.6666666667, ans=0.125 2024-06-21 06:25:03,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=332368.6666666667, ans=0.0 2024-06-21 06:25:14,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=332387.0, ans=0.125 2024-06-21 06:25:16,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=332387.0, ans=0.125 2024-06-21 06:25:21,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=332405.3333333333, ans=0.2 2024-06-21 06:25:22,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=332405.3333333333, ans=0.0 2024-06-21 06:25:29,507 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2024-06-21 06:25:30,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.28 vs. limit=15.0 2024-06-21 06:25:31,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=332423.6666666667, ans=0.2 2024-06-21 06:25:35,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=332423.6666666667, ans=0.0 2024-06-21 06:25:38,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=332442.0, ans=0.0 2024-06-21 06:25:39,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-21 06:25:41,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=332442.0, ans=0.125 2024-06-21 06:25:44,228 INFO [train.py:1028] (0/2) Epoch 18, batch 9350, loss[loss=0.2434, simple_loss=0.3025, pruned_loss=0.09215, over 12528.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.284, pruned_loss=0.08521, over 2568283.11 frames. 
], batch size: 22, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:25:45,381 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.166e+02 2.296e+02 2.538e+02 3.596e+02, threshold=4.591e+02, percent-clipped=0.0 2024-06-21 06:25:46,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=332460.3333333333, ans=0.1 2024-06-21 06:25:51,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=332460.3333333333, ans=0.07 2024-06-21 06:25:56,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=332478.6666666667, ans=0.0 2024-06-21 06:25:59,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=332497.0, ans=0.125 2024-06-21 06:26:00,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=15.0 2024-06-21 06:26:01,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.22 vs. limit=22.5 2024-06-21 06:26:03,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2024-06-21 06:26:05,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=332515.3333333333, ans=0.2 2024-06-21 06:26:09,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=332515.3333333333, ans=0.125 2024-06-21 06:26:15,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=332533.6666666667, ans=0.025 2024-06-21 06:26:16,814 INFO [train.py:1028] (0/2) Epoch 18, batch 9400, loss[loss=0.2345, simple_loss=0.3006, pruned_loss=0.08422, over 13274.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2835, pruned_loss=0.08523, over 2567242.35 frames. ], batch size: 52, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:26:17,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=332552.0, ans=0.2 2024-06-21 06:26:28,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=332588.6666666667, ans=0.125 2024-06-21 06:26:32,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=332588.6666666667, ans=0.125 2024-06-21 06:26:32,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=332588.6666666667, ans=0.0 2024-06-21 06:26:37,708 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:26:38,755 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.11 vs. 
limit=15.0 2024-06-21 06:26:40,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332625.3333333333, ans=0.125 2024-06-21 06:26:41,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332625.3333333333, ans=0.1 2024-06-21 06:26:46,633 INFO [train.py:1028] (0/2) Epoch 18, batch 9450, loss[loss=0.2459, simple_loss=0.2978, pruned_loss=0.09703, over 12428.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2847, pruned_loss=0.08625, over 2567049.67 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:26:47,792 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.113e+02 2.300e+02 2.541e+02 3.122e+02, threshold=4.599e+02, percent-clipped=0.0 2024-06-21 06:26:51,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332643.6666666667, ans=0.125 2024-06-21 06:26:55,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2024-06-21 06:27:05,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.01 vs. limit=6.0 2024-06-21 06:27:15,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.66 vs. limit=12.0 2024-06-21 06:27:16,062 INFO [train.py:1028] (0/2) Epoch 18, batch 9500, loss[loss=0.2276, simple_loss=0.2882, pruned_loss=0.08357, over 13234.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2849, pruned_loss=0.08575, over 2575972.77 frames. ], batch size: 43, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:27:22,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=332753.6666666667, ans=0.125 2024-06-21 06:27:29,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=332772.0, ans=0.125 2024-06-21 06:27:42,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=332808.6666666667, ans=0.2 2024-06-21 06:27:48,515 INFO [train.py:1028] (0/2) Epoch 18, batch 9550, loss[loss=0.2033, simple_loss=0.2626, pruned_loss=0.07196, over 12875.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2847, pruned_loss=0.08595, over 2573861.57 frames. ], batch size: 39, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:27:49,895 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.114e+02 2.250e+02 2.517e+02 3.191e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 06:27:52,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=332827.0, ans=0.125 2024-06-21 06:27:57,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. 
limit=12.0 2024-06-21 06:27:57,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=332845.3333333333, ans=0.125 2024-06-21 06:28:03,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=332863.6666666667, ans=0.125 2024-06-21 06:28:14,746 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 06:28:15,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=332900.3333333333, ans=0.025 2024-06-21 06:28:21,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=332900.3333333333, ans=0.2 2024-06-21 06:28:22,222 INFO [train.py:1028] (0/2) Epoch 18, batch 9600, loss[loss=0.2531, simple_loss=0.2912, pruned_loss=0.1074, over 10465.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2847, pruned_loss=0.08621, over 2571324.15 frames. ], batch size: 303, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:28:23,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.90 vs. limit=22.5 2024-06-21 06:28:29,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=332937.0, ans=0.125 2024-06-21 06:28:36,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=332955.3333333333, ans=0.025 2024-06-21 06:28:37,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=332955.3333333333, ans=0.125 2024-06-21 06:28:42,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=332973.6666666667, ans=0.125 2024-06-21 06:28:50,409 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.96 vs. limit=22.5 2024-06-21 06:28:53,202 INFO [train.py:1028] (0/2) Epoch 18, batch 9650, loss[loss=0.2329, simple_loss=0.2807, pruned_loss=0.09257, over 13066.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2846, pruned_loss=0.08661, over 2559964.20 frames. ], batch size: 132, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:28:54,388 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.136e+02 2.290e+02 2.620e+02 3.254e+02, threshold=4.580e+02, percent-clipped=0.0 2024-06-21 06:29:05,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=333047.0, ans=0.125 2024-06-21 06:29:12,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.38 vs. limit=22.5 2024-06-21 06:29:16,528 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=4.845e-02 2024-06-21 06:29:18,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=333083.6666666667, ans=0.05 2024-06-21 06:29:22,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.90 vs. 
limit=12.0 2024-06-21 06:29:24,156 INFO [train.py:1028] (0/2) Epoch 18, batch 9700, loss[loss=0.2349, simple_loss=0.2831, pruned_loss=0.09335, over 13054.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2846, pruned_loss=0.0867, over 2554784.06 frames. ], batch size: 144, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:29:27,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.78 vs. limit=15.0 2024-06-21 06:29:28,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=333102.0, ans=0.125 2024-06-21 06:29:29,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=333120.3333333333, ans=0.5 2024-06-21 06:29:33,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=333120.3333333333, ans=0.125 2024-06-21 06:29:35,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=333120.3333333333, ans=0.125 2024-06-21 06:29:46,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=333157.0, ans=0.2 2024-06-21 06:29:47,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=333157.0, ans=0.04949747468305833 2024-06-21 06:29:51,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333175.3333333333, ans=0.1 2024-06-21 06:29:52,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=333175.3333333333, ans=0.025 2024-06-21 06:29:53,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.57 vs. limit=15.0 2024-06-21 06:29:55,656 INFO [train.py:1028] (0/2) Epoch 18, batch 9750, loss[loss=0.2042, simple_loss=0.2513, pruned_loss=0.07854, over 13047.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2838, pruned_loss=0.08616, over 2551588.21 frames. ], batch size: 132, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:29:56,845 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 2.126e+02 2.250e+02 2.401e+02 3.043e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 06:29:59,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=333193.6666666667, ans=0.125 2024-06-21 06:30:05,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-21 06:30:29,205 INFO [train.py:1028] (0/2) Epoch 18, batch 9800, loss[loss=0.2241, simple_loss=0.2848, pruned_loss=0.0817, over 12976.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2836, pruned_loss=0.0862, over 2543728.32 frames. 
], batch size: 39, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:30:39,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=333303.6666666667, ans=0.125 2024-06-21 06:30:42,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=333322.0, ans=0.125 2024-06-21 06:30:59,845 INFO [train.py:1028] (0/2) Epoch 18, batch 9850, loss[loss=0.2207, simple_loss=0.2805, pruned_loss=0.08042, over 13071.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2827, pruned_loss=0.08566, over 2537060.56 frames. ], batch size: 102, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:31:01,034 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.091e+02 2.238e+02 2.447e+02 3.018e+02, threshold=4.477e+02, percent-clipped=0.0 2024-06-21 06:31:02,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=333377.0, ans=0.0 2024-06-21 06:31:02,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=333377.0, ans=0.125 2024-06-21 06:31:04,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333377.0, ans=0.1 2024-06-21 06:31:05,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=333395.3333333333, ans=0.125 2024-06-21 06:31:08,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=333395.3333333333, ans=0.0 2024-06-21 06:31:15,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333413.6666666667, ans=0.125 2024-06-21 06:31:22,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.63 vs. limit=15.0 2024-06-21 06:31:22,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=333432.0, ans=0.2 2024-06-21 06:31:30,998 INFO [train.py:1028] (0/2) Epoch 18, batch 9900, loss[loss=0.2052, simple_loss=0.2703, pruned_loss=0.07009, over 13168.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.282, pruned_loss=0.0854, over 2529678.74 frames. ], batch size: 40, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:31:48,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=333505.3333333333, ans=0.0 2024-06-21 06:31:54,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=333523.6666666667, ans=0.0 2024-06-21 06:32:03,681 INFO [train.py:1028] (0/2) Epoch 18, batch 9950, loss[loss=0.2123, simple_loss=0.2769, pruned_loss=0.07384, over 12722.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2808, pruned_loss=0.08512, over 2524809.89 frames. 
], batch size: 29, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:32:04,986 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.051e+02 2.279e+02 2.503e+02 4.129e+02, threshold=4.558e+02, percent-clipped=0.0 2024-06-21 06:32:17,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=333597.0, ans=0.125 2024-06-21 06:32:18,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.72 vs. limit=15.0 2024-06-21 06:32:24,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=333615.3333333333, ans=0.04949747468305833 2024-06-21 06:32:27,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=333615.3333333333, ans=0.0 2024-06-21 06:32:28,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=333615.3333333333, ans=0.125 2024-06-21 06:32:32,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.58 vs. limit=15.0 2024-06-21 06:32:36,000 INFO [train.py:1028] (0/2) Epoch 18, batch 10000, loss[loss=0.2164, simple_loss=0.2781, pruned_loss=0.07731, over 12497.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2816, pruned_loss=0.08602, over 2487198.84 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:32:58,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=333707.0, ans=0.0 2024-06-21 06:33:07,002 INFO [train.py:1028] (0/2) Epoch 18, batch 10050, loss[loss=0.2546, simple_loss=0.3085, pruned_loss=0.1003, over 12397.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2818, pruned_loss=0.08667, over 2444013.30 frames. ], batch size: 22, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:33:08,669 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.138e+02 2.350e+02 2.532e+02 3.465e+02, threshold=4.701e+02, percent-clipped=0.0 2024-06-21 06:33:22,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=333780.3333333333, ans=0.2 2024-06-21 06:33:23,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.25 vs. limit=15.0 2024-06-21 06:33:37,374 INFO [train.py:1028] (0/2) Epoch 18, batch 10100, loss[loss=0.2073, simple_loss=0.2591, pruned_loss=0.07771, over 11503.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2808, pruned_loss=0.08569, over 2423747.55 frames. 
], batch size: 17, lr: 3.19e-03, grad_scale: 32.0 2024-06-21 06:33:37,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=333835.3333333333, ans=0.2 2024-06-21 06:33:40,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=333835.3333333333, ans=0.0 2024-06-21 06:33:40,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=333835.3333333333, ans=0.0 2024-06-21 06:33:49,835 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-18.pt 2024-06-21 06:35:51,173 INFO [train.py:1028] (0/2) Epoch 19, batch 0, loss[loss=0.2034, simple_loss=0.2662, pruned_loss=0.07028, over 12924.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2662, pruned_loss=0.07028, over 12924.00 frames. ], batch size: 36, lr: 3.10e-03, grad_scale: 32.0 2024-06-21 06:35:51,175 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 06:35:55,301 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.9634, 2.7022, 1.8298, 2.7262], device='cuda:0') 2024-06-21 06:35:58,386 INFO [train.py:1060] (0/2) Epoch 19, validation: loss=0.1875, simple_loss=0.2524, pruned_loss=0.06132, over 351949.00 frames. 2024-06-21 06:35:58,386 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 06:36:05,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=333883.0, ans=0.125 2024-06-21 06:36:10,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=333883.0, ans=0.2 2024-06-21 06:36:19,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.43 vs. limit=10.0 2024-06-21 06:36:21,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333919.6666666667, ans=0.125 2024-06-21 06:36:23,539 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.711e+02 2.030e+02 2.223e+02 2.404e+02 3.272e+02, threshold=4.447e+02, percent-clipped=0.0 2024-06-21 06:36:23,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333919.6666666667, ans=0.1 2024-06-21 06:36:30,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333938.0, ans=0.125 2024-06-21 06:36:32,147 INFO [train.py:1028] (0/2) Epoch 19, batch 50, loss[loss=0.1991, simple_loss=0.2591, pruned_loss=0.06949, over 12623.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2663, pruned_loss=0.07956, over 574739.51 frames. 
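The learning rate steps from 3.19e-03 to 3.10e-03 exactly at the epoch-18/19 boundary above, which matches an Eden-style schedule with this run's base_lr=0.035, lr_batches=7500 and lr_epochs=3.5. A sketch follows, with the step/epoch bookkeeping simplified (the epoch term appears to use completed epochs):

    # Sketch of the Eden learning-rate rule used by these recipes; base_lr,
    # lr_batches and lr_epochs are this run's config values, while the exact
    # step/epoch accounting below is a simplification.
    def eden_lr(base_lr: float, step: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # ~182k optimizer steps and 18 completed epochs give ~3.1e-03, matching
    # the "lr: 3.10e-03" printed at the start of epoch 19 above:
    print(eden_lr(0.035, step=182_000, epoch=18))  # ~0.0031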
2024-06-21 06:36:35,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=333956.3333333333, ans=0.125
2024-06-21 06:36:41,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=333974.6666666667, ans=0.035
2024-06-21 06:36:42,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=333974.6666666667, ans=0.125
2024-06-21 06:36:51,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=334011.3333333333, ans=0.125
2024-06-21 06:36:51,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=334011.3333333333, ans=0.0
2024-06-21 06:36:51,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5
2024-06-21 06:37:03,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=334048.0, ans=0.05
2024-06-21 06:37:03,818 INFO [train.py:1028] (0/2) Epoch 19, batch 100, loss[loss=0.201, simple_loss=0.2664, pruned_loss=0.06782, over 13291.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2621, pruned_loss=0.07794, over 1017083.99 frames. ], batch size: 46, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:37:05,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=334048.0, ans=0.5
2024-06-21 06:37:07,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=334048.0, ans=15.0
2024-06-21 06:37:11,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=334066.3333333333, ans=0.125
2024-06-21 06:37:14,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=334066.3333333333, ans=0.09899494936611666
2024-06-21 06:37:14,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=334066.3333333333, ans=0.025
2024-06-21 06:37:28,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=334103.0, ans=0.125
2024-06-21 06:37:33,145 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.005e+02 2.167e+02 2.304e+02 2.832e+02, threshold=4.335e+02, percent-clipped=0.0
2024-06-21 06:37:41,322 INFO [train.py:1028] (0/2) Epoch 19, batch 150, loss[loss=0.1925, simple_loss=0.2535, pruned_loss=0.06577, over 12495.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2616, pruned_loss=0.07699, over 1364775.10 frames. ], batch size: 29, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:37:52,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=334158.0, ans=0.07
2024-06-21 06:37:57,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.41 vs. limit=15.0
2024-06-21 06:37:59,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=334176.3333333333, ans=0.125
2024-06-21 06:38:04,375 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0
2024-06-21 06:38:06,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=334194.6666666667, ans=0.025
2024-06-21 06:38:13,791 INFO [train.py:1028] (0/2) Epoch 19, batch 200, loss[loss=0.2321, simple_loss=0.2803, pruned_loss=0.0919, over 12515.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2615, pruned_loss=0.07689, over 1634897.89 frames. ], batch size: 202, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:38:14,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5
2024-06-21 06:38:14,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.34 vs. limit=15.0
2024-06-21 06:38:15,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=334231.3333333333, ans=0.1
2024-06-21 06:38:37,491 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.986e+02 2.104e+02 2.298e+02 3.037e+02, threshold=4.208e+02, percent-clipped=0.0
2024-06-21 06:38:45,949 INFO [train.py:1028] (0/2) Epoch 19, batch 250, loss[loss=0.2004, simple_loss=0.2485, pruned_loss=0.07617, over 13073.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.261, pruned_loss=0.07684, over 1846709.79 frames. ], batch size: 144, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:38:49,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=334323.0, ans=0.1
2024-06-21 06:38:56,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.49 vs. limit=10.0
2024-06-21 06:39:03,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=334359.6666666667, ans=0.0
2024-06-21 06:39:10,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=334378.0, ans=0.07
2024-06-21 06:39:13,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334396.3333333333, ans=0.1
2024-06-21 06:39:18,504 INFO [train.py:1028] (0/2) Epoch 19, batch 300, loss[loss=0.1958, simple_loss=0.2517, pruned_loss=0.06993, over 13219.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2613, pruned_loss=0.07696, over 2008776.22 frames. ], batch size: 112, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:39:26,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=334414.6666666667, ans=0.125
2024-06-21 06:39:46,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334469.6666666667, ans=0.125
2024-06-21 06:39:47,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=334469.6666666667, ans=0.0
2024-06-21 06:39:48,262 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 1.980e+02 2.106e+02 2.270e+02 3.333e+02, threshold=4.211e+02, percent-clipped=0.0
2024-06-21 06:39:55,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=334506.3333333333, ans=0.125
2024-06-21 06:39:56,126 INFO [train.py:1028] (0/2) Epoch 19, batch 350, loss[loss=0.2095, simple_loss=0.2619, pruned_loss=0.07853, over 12897.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2603, pruned_loss=0.0764, over 2137996.50 frames. ], batch size: 33, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:40:07,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334524.6666666667, ans=0.125
2024-06-21 06:40:16,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334561.3333333333, ans=0.125
2024-06-21 06:40:17,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=334561.3333333333, ans=0.0
2024-06-21 06:40:23,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.19 vs. limit=15.0
2024-06-21 06:40:26,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=334598.0, ans=0.0
2024-06-21 06:40:27,309 INFO [train.py:1028] (0/2) Epoch 19, batch 400, loss[loss=0.2094, simple_loss=0.2714, pruned_loss=0.07373, over 13216.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2605, pruned_loss=0.07623, over 2238850.05 frames. ], batch size: 63, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:40:31,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=334598.0, ans=0.0
2024-06-21 06:40:37,512 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:40:41,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=334634.6666666667, ans=0.125
2024-06-21 06:40:46,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.26 vs. limit=12.0
2024-06-21 06:40:47,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.80 vs. limit=15.0
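The batch_count in the ScheduledFloat lines is not the raw batch index: the values are consistent with batch_idx_train scaled by max_duration * world_size / ref_duration = 550 * 2 / 600 ≈ 1.833 (this run's settings), e.g. 333303.67 / 1.833 ≈ 181,802 batches, and the count crosses 184,000 * 1.833 ≈ 337,333 right where checkpoint-184000.pt is written further below. The scheduled value ("ans") is then interpolated against that count; the sketch below uses illustrative breakpoints:

    # Sketch: "batch_count" as a duration-adjusted batch index, and "ans" as a
    # value interpolated piecewise-linearly against that count. The schedule
    # breakpoints here are hypothetical, chosen only to illustrate the shape.
    def adjusted_batch_count(batch_idx: int, max_duration: float = 550.0,
                             world_size: int = 2, ref_duration: float = 600.0) -> float:
        return batch_idx * (max_duration * world_size) / ref_duration

    assert abs(adjusted_batch_count(181_802) - 333_303.67) < 1.0

    def scheduled_float(count: float,
                        schedule=((0.0, 0.3), (20_000.0, 0.125))) -> float:
        (x0, y0), (x1, y1) = schedule  # hypothetical breakpoints
        if count <= x0:
            return y0
        if count >= x1:
            return y1
        t = (count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)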
2024-06-21 06:40:49,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=334653.0, ans=0.0
2024-06-21 06:40:51,165 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 1.984e+02 2.110e+02 2.297e+02 2.944e+02, threshold=4.220e+02, percent-clipped=0.0
2024-06-21 06:40:59,046 INFO [train.py:1028] (0/2) Epoch 19, batch 450, loss[loss=0.204, simple_loss=0.2633, pruned_loss=0.07234, over 13216.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2604, pruned_loss=0.07601, over 2313036.23 frames. ], batch size: 67, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:41:00,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.83 vs. limit=15.0
2024-06-21 06:41:04,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=334689.6666666667, ans=0.0
2024-06-21 06:41:05,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=334708.0, ans=0.1
2024-06-21 06:41:10,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=334708.0, ans=0.0
2024-06-21 06:41:10,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=334708.0, ans=0.0
2024-06-21 06:41:11,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=334726.3333333333, ans=0.0
2024-06-21 06:41:17,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.93 vs. limit=15.0
2024-06-21 06:41:23,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=334744.6666666667, ans=0.07
2024-06-21 06:41:34,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=334763.0, ans=0.125
2024-06-21 06:41:37,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=334781.3333333333, ans=0.125
2024-06-21 06:41:38,348 INFO [train.py:1028] (0/2) Epoch 19, batch 500, loss[loss=0.1884, simple_loss=0.2397, pruned_loss=0.0686, over 13110.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2615, pruned_loss=0.07652, over 2376497.16 frames. ], batch size: 121, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:41:38,470 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:41:45,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0
2024-06-21 06:41:50,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.88 vs. limit=15.0
2024-06-21 06:41:51,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=334818.0, ans=0.05
2024-06-21 06:41:56,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=334818.0, ans=0.0
2024-06-21 06:41:59,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=334836.3333333333, ans=0.2
2024-06-21 06:42:03,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 1.947e+02 2.048e+02 2.204e+02 3.030e+02, threshold=4.095e+02, percent-clipped=0.0
2024-06-21 06:42:06,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=334854.6666666667, ans=0.125
2024-06-21 06:42:10,521 INFO [train.py:1028] (0/2) Epoch 19, batch 550, loss[loss=0.2102, simple_loss=0.263, pruned_loss=0.07869, over 12888.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2609, pruned_loss=0.07628, over 2421309.73 frames. ], batch size: 158, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:42:26,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=334909.6666666667, ans=0.125
2024-06-21 06:42:29,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=334928.0, ans=0.125
2024-06-21 06:42:37,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=334946.3333333333, ans=0.07
2024-06-21 06:42:42,229 INFO [train.py:1028] (0/2) Epoch 19, batch 600, loss[loss=0.2055, simple_loss=0.2541, pruned_loss=0.07849, over 13052.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2618, pruned_loss=0.07699, over 2459602.01 frames. ], batch size: 144, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:42:45,166 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:42:49,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=334983.0, ans=0.09899494936611666
2024-06-21 06:42:54,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=334983.0, ans=0.0
2024-06-21 06:43:02,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=335019.6666666667, ans=0.125
2024-06-21 06:43:06,743 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.624e+02 2.003e+02 2.149e+02 2.414e+02 2.992e+02, threshold=4.298e+02, percent-clipped=0.0
2024-06-21 06:43:07,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=335038.0, ans=0.125
2024-06-21 06:43:08,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.74 vs. limit=15.0
2024-06-21 06:43:11,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=335038.0, ans=0.125
2024-06-21 06:43:11,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=335038.0, ans=0.125
2024-06-21 06:43:14,359 INFO [train.py:1028] (0/2) Epoch 19, batch 650, loss[loss=0.2101, simple_loss=0.271, pruned_loss=0.07455, over 13235.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2618, pruned_loss=0.07652, over 2490711.59 frames. ], batch size: 59, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:43:14,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=335056.3333333333, ans=0.0
2024-06-21 06:43:21,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=15.0
2024-06-21 06:43:21,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=335074.6666666667, ans=0.0
2024-06-21 06:43:37,435 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.19 vs. limit=22.5
2024-06-21 06:43:45,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=335129.6666666667, ans=0.125
2024-06-21 06:43:46,747 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.61 vs. limit=15.0
2024-06-21 06:43:49,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=335129.6666666667, ans=0.04949747468305833
2024-06-21 06:43:51,998 INFO [train.py:1028] (0/2) Epoch 19, batch 700, loss[loss=0.2059, simple_loss=0.2617, pruned_loss=0.07508, over 13308.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2614, pruned_loss=0.07648, over 2512095.70 frames. ], batch size: 46, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:43:57,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0
2024-06-21 06:43:59,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=335166.3333333333, ans=0.95
2024-06-21 06:44:09,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=335184.6666666667, ans=0.05
2024-06-21 06:44:09,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.86 vs. limit=15.0
2024-06-21 06:44:13,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335203.0, ans=0.1
2024-06-21 06:44:15,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=335203.0, ans=0.125
2024-06-21 06:44:16,012 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 1.968e+02 2.123e+02 2.241e+02 3.000e+02, threshold=4.247e+02, percent-clipped=0.0
2024-06-21 06:44:17,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=335221.3333333333, ans=0.0
2024-06-21 06:44:23,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=335239.6666666667, ans=0.125
2024-06-21 06:44:23,510 INFO [train.py:1028] (0/2) Epoch 19, batch 750, loss[loss=0.2046, simple_loss=0.2662, pruned_loss=0.0715, over 13287.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2612, pruned_loss=0.07612, over 2528344.06 frames. ], batch size: 63, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:44:24,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=335239.6666666667, ans=0.0
2024-06-21 06:44:26,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0
2024-06-21 06:44:32,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=335258.0, ans=0.125
2024-06-21 06:44:47,011 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.33 vs. limit=15.0
2024-06-21 06:44:55,693 INFO [train.py:1028] (0/2) Epoch 19, batch 800, loss[loss=0.1997, simple_loss=0.2568, pruned_loss=0.07129, over 12876.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2612, pruned_loss=0.07624, over 2542599.73 frames. ], batch size: 36, lr: 3.10e-03, grad_scale: 32.0
2024-06-21 06:45:04,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.69 vs. limit=22.5
2024-06-21 06:45:05,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=335349.6666666667, ans=0.0
2024-06-21 06:45:08,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0
2024-06-21 06:45:14,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=335386.3333333333, ans=0.125
2024-06-21 06:45:14,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.97 vs. limit=22.5
2024-06-21 06:45:20,217 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 1.995e+02 2.111e+02 2.221e+02 4.184e+02, threshold=4.222e+02, percent-clipped=0.0
2024-06-21 06:45:28,369 INFO [train.py:1028] (0/2) Epoch 19, batch 850, loss[loss=0.2029, simple_loss=0.2612, pruned_loss=0.07234, over 13132.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2613, pruned_loss=0.07617, over 2552443.77 frames. ], batch size: 95, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:45:28,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0
2024-06-21 06:45:30,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=335423.0, ans=0.125
2024-06-21 06:45:47,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=12.0
2024-06-21 06:45:47,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=335441.3333333333, ans=0.2
2024-06-21 06:45:55,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=335478.0, ans=10.0
2024-06-21 06:46:02,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=335496.3333333333, ans=0.0
2024-06-21 06:46:04,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=335496.3333333333, ans=0.125
2024-06-21 06:46:04,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=335496.3333333333, ans=0.07
2024-06-21 06:46:04,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=335496.3333333333, ans=0.125
2024-06-21 06:46:05,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=335496.3333333333, ans=0.125
2024-06-21 06:46:05,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. limit=6.0
2024-06-21 06:46:08,730 INFO [train.py:1028] (0/2) Epoch 19, batch 900, loss[loss=0.1917, simple_loss=0.2428, pruned_loss=0.07033, over 12953.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2609, pruned_loss=0.07639, over 2557049.98 frames. ], batch size: 36, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:46:12,649 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.34 vs. limit=15.0
2024-06-21 06:46:19,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0
2024-06-21 06:46:24,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335551.3333333333, ans=0.1
2024-06-21 06:46:31,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=335569.6666666667, ans=0.125
2024-06-21 06:46:33,165 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 1.987e+02 2.123e+02 2.258e+02 3.013e+02, threshold=4.245e+02, percent-clipped=0.0
2024-06-21 06:46:40,906 INFO [train.py:1028] (0/2) Epoch 19, batch 950, loss[loss=0.207, simple_loss=0.2638, pruned_loss=0.0751, over 12900.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.261, pruned_loss=0.07626, over 2559517.43 frames. ], batch size: 39, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:46:47,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=335624.6666666667, ans=0.125
2024-06-21 06:46:52,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=335643.0, ans=0.0
2024-06-21 06:46:53,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=335643.0, ans=0.0
2024-06-21 06:46:56,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335643.0, ans=0.1
2024-06-21 06:47:11,986 INFO [train.py:1028] (0/2) Epoch 19, batch 1000, loss[loss=0.214, simple_loss=0.2789, pruned_loss=0.07458, over 13278.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2609, pruned_loss=0.0765, over 2561120.33 frames. ], batch size: 49, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:47:19,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335716.3333333333, ans=0.1
2024-06-21 06:47:23,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=335716.3333333333, ans=0.0
2024-06-21 06:47:38,926 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.047e+02 2.169e+02 2.393e+02 2.853e+02, threshold=4.339e+02, percent-clipped=0.0
2024-06-21 06:47:40,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=335771.3333333333, ans=0.125
2024-06-21 06:47:46,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335771.3333333333, ans=0.1
2024-06-21 06:47:49,338 INFO [train.py:1028] (0/2) Epoch 19, batch 1050, loss[loss=0.1987, simple_loss=0.2569, pruned_loss=0.07027, over 13169.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2617, pruned_loss=0.07687, over 2563582.29 frames. ], batch size: 77, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:47:56,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=335808.0, ans=0.025
2024-06-21 06:47:56,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=335808.0, ans=0.125
2024-06-21 06:47:59,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=15.0
2024-06-21 06:48:09,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.01 vs. limit=15.0
2024-06-21 06:48:22,214 INFO [train.py:1028] (0/2) Epoch 19, batch 1100, loss[loss=0.2074, simple_loss=0.2669, pruned_loss=0.07394, over 13217.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2623, pruned_loss=0.07718, over 2569247.44 frames. ], batch size: 52, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:48:28,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=335899.6666666667, ans=0.025
2024-06-21 06:48:42,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=335936.3333333333, ans=0.125
2024-06-21 06:48:47,361 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.973e+02 2.124e+02 2.300e+02 3.230e+02, threshold=4.247e+02, percent-clipped=0.0
2024-06-21 06:48:55,214 INFO [train.py:1028] (0/2) Epoch 19, batch 1150, loss[loss=0.2235, simple_loss=0.2763, pruned_loss=0.08535, over 13271.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2617, pruned_loss=0.07702, over 2571177.94 frames. ], batch size: 52, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:48:59,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=335973.0, ans=0.0
2024-06-21 06:49:06,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=335991.3333333333, ans=15.0
2024-06-21 06:49:23,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=336046.3333333333, ans=0.125
2024-06-21 06:49:29,811 INFO [train.py:1028] (0/2) Epoch 19, batch 1200, loss[loss=0.1924, simple_loss=0.2521, pruned_loss=0.06638, over 13145.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2621, pruned_loss=0.07725, over 2573144.14 frames. ], batch size: 77, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:49:48,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.39 vs. limit=22.5
2024-06-21 06:49:50,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=336101.3333333333, ans=0.025
2024-06-21 06:49:53,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=336119.6666666667, ans=0.125
2024-06-21 06:49:57,059 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.988e+02 2.130e+02 2.298e+02 2.759e+02, threshold=4.261e+02, percent-clipped=0.0
2024-06-21 06:49:57,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=336119.6666666667, ans=0.125
2024-06-21 06:49:57,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=336119.6666666667, ans=0.2
2024-06-21 06:49:58,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5
2024-06-21 06:49:59,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336138.0, ans=0.1
2024-06-21 06:50:04,647 INFO [train.py:1028] (0/2) Epoch 19, batch 1250, loss[loss=0.1974, simple_loss=0.2501, pruned_loss=0.07237, over 13221.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2615, pruned_loss=0.07684, over 2582773.10 frames. ], batch size: 112, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:50:21,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=336193.0, ans=0.0
2024-06-21 06:50:22,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=336193.0, ans=0.125
2024-06-21 06:50:26,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.52 vs. limit=22.5
2024-06-21 06:50:30,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0
2024-06-21 06:50:36,379 INFO [train.py:1028] (0/2) Epoch 19, batch 1300, loss[loss=0.2193, simple_loss=0.2706, pruned_loss=0.08401, over 12776.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2616, pruned_loss=0.07669, over 2583030.52 frames. ], batch size: 176, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:50:39,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=336248.0, ans=0.0
2024-06-21 06:50:44,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=336266.3333333333, ans=0.125
2024-06-21 06:50:47,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.53 vs. limit=10.0
2024-06-21 06:50:48,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.68 vs. limit=12.0
2024-06-21 06:50:49,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=336284.6666666667, ans=0.0
2024-06-21 06:50:49,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=336284.6666666667, ans=0.125
2024-06-21 06:50:55,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=336303.0, ans=0.125
2024-06-21 06:51:00,767 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.032e+02 2.117e+02 2.336e+02 3.053e+02, threshold=4.234e+02, percent-clipped=0.0
2024-06-21 06:51:07,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=336321.3333333333, ans=0.125
2024-06-21 06:51:08,758 INFO [train.py:1028] (0/2) Epoch 19, batch 1350, loss[loss=0.195, simple_loss=0.2562, pruned_loss=0.06687, over 13277.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2613, pruned_loss=0.07633, over 2585268.07 frames. ], batch size: 59, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:51:08,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=336339.6666666667, ans=0.0
2024-06-21 06:51:17,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.04 vs. limit=15.0
2024-06-21 06:51:45,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=336413.0, ans=0.125
2024-06-21 06:51:46,827 INFO [train.py:1028] (0/2) Epoch 19, batch 1400, loss[loss=0.2198, simple_loss=0.2786, pruned_loss=0.08048, over 12311.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2613, pruned_loss=0.07643, over 2586350.38 frames. ], batch size: 25, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:51:56,127 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:52:01,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=336468.0, ans=0.2
2024-06-21 06:52:11,269 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 1.985e+02 2.096e+02 2.205e+02 3.142e+02, threshold=4.192e+02, percent-clipped=0.0
2024-06-21 06:52:12,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=336504.6666666667, ans=0.1
2024-06-21 06:52:14,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=336504.6666666667, ans=0.2
2024-06-21 06:52:16,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=336504.6666666667, ans=0.0
2024-06-21 06:52:19,099 INFO [train.py:1028] (0/2) Epoch 19, batch 1450, loss[loss=0.2024, simple_loss=0.2505, pruned_loss=0.07718, over 13103.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2614, pruned_loss=0.07658, over 2586366.35 frames. ], batch size: 121, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:52:19,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=336523.0, ans=0.2
2024-06-21 06:52:43,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=336578.0, ans=0.125
2024-06-21 06:52:45,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=336596.3333333333, ans=0.125
2024-06-21 06:52:46,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=336596.3333333333, ans=0.125
2024-06-21 06:52:51,328 INFO [train.py:1028] (0/2) Epoch 19, batch 1500, loss[loss=0.2091, simple_loss=0.2644, pruned_loss=0.07692, over 13176.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2615, pruned_loss=0.07688, over 2588784.14 frames. ], batch size: 83, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:52:53,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=336614.6666666667, ans=0.05
2024-06-21 06:53:05,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=336651.3333333333, ans=0.125
2024-06-21 06:53:15,406 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.027e+02 2.146e+02 2.360e+02 2.942e+02, threshold=4.292e+02, percent-clipped=0.0
2024-06-21 06:53:20,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=336688.0, ans=0.2
2024-06-21 06:53:26,724 INFO [train.py:1028] (0/2) Epoch 19, batch 1550, loss[loss=0.2154, simple_loss=0.2671, pruned_loss=0.08187, over 13011.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2621, pruned_loss=0.07704, over 2584140.78 frames. ], batch size: 102, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:53:29,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.33 vs. limit=15.0
2024-06-21 06:53:30,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=336706.3333333333, ans=0.125
2024-06-21 06:53:31,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=15.0
2024-06-21 06:53:36,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=336724.6666666667, ans=0.0
2024-06-21 06:53:42,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=336724.6666666667, ans=0.1
2024-06-21 06:53:51,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=336761.3333333333, ans=0.1
2024-06-21 06:53:54,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=336761.3333333333, ans=0.125
2024-06-21 06:54:02,586 INFO [train.py:1028] (0/2) Epoch 19, batch 1600, loss[loss=0.2002, simple_loss=0.2589, pruned_loss=0.07074, over 13123.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2622, pruned_loss=0.07704, over 2579341.26 frames. ], batch size: 77, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:54:09,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=336816.3333333333, ans=15.0
2024-06-21 06:54:11,555 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=6.368e-02
2024-06-21 06:54:13,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0
2024-06-21 06:54:14,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=336816.3333333333, ans=0.0
2024-06-21 06:54:14,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=336834.6666666667, ans=0.1
2024-06-21 06:54:26,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=336853.0, ans=0.0
2024-06-21 06:54:26,410 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.072e+02 2.212e+02 2.413e+02 3.055e+02, threshold=4.423e+02, percent-clipped=0.0
2024-06-21 06:54:26,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=336853.0, ans=0.0
2024-06-21 06:54:34,161 INFO [train.py:1028] (0/2) Epoch 19, batch 1650, loss[loss=0.2187, simple_loss=0.2631, pruned_loss=0.08719, over 13161.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2624, pruned_loss=0.07747, over 2575721.56 frames. ], batch size: 95, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:54:39,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=336889.6666666667, ans=0.0
2024-06-21 06:54:56,619 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.52 vs. limit=15.0
2024-06-21 06:55:03,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=336963.0, ans=0.125
2024-06-21 06:55:05,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=336963.0, ans=0.0
2024-06-21 06:55:06,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.49 vs. limit=5.0
2024-06-21 06:55:07,563 INFO [train.py:1028] (0/2) Epoch 19, batch 1700, loss[loss=0.1984, simple_loss=0.2578, pruned_loss=0.06951, over 12849.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2623, pruned_loss=0.07721, over 2580770.69 frames. ], batch size: 26, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:55:13,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.58 vs. limit=5.0
2024-06-21 06:55:24,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=337018.0, ans=0.0
2024-06-21 06:55:27,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=337018.0, ans=0.125
2024-06-21 06:55:27,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=337018.0, ans=0.0
2024-06-21 06:55:33,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=337036.3333333333, ans=0.2
2024-06-21 06:55:36,815 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.103e+02 2.260e+02 2.509e+02 3.791e+02, threshold=4.520e+02, percent-clipped=0.0
2024-06-21 06:55:48,949 INFO [train.py:1028] (0/2) Epoch 19, batch 1750, loss[loss=0.1792, simple_loss=0.2402, pruned_loss=0.05906, over 12630.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2624, pruned_loss=0.07703, over 2582394.07 frames. ], batch size: 22, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:55:50,385 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:55:52,016 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0
2024-06-21 06:55:53,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=337073.0, ans=0.125
2024-06-21 06:55:58,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.45 vs. limit=10.0
2024-06-21 06:56:02,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=337109.6666666667, ans=0.0
2024-06-21 06:56:03,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=337109.6666666667, ans=0.125
2024-06-21 06:56:19,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=337146.3333333333, ans=0.0
2024-06-21 06:56:21,865 INFO [train.py:1028] (0/2) Epoch 19, batch 1800, loss[loss=0.2093, simple_loss=0.2655, pruned_loss=0.07653, over 13263.00 frames. ], tot_loss[loss=0.208, simple_loss=0.262, pruned_loss=0.07698, over 2582863.21 frames. ], batch size: 67, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:56:23,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=337164.6666666667, ans=0.025
2024-06-21 06:56:35,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0
2024-06-21 06:56:46,641 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.001e+02 2.118e+02 2.267e+02 4.544e+02, threshold=4.236e+02, percent-clipped=1.0
2024-06-21 06:56:51,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=337238.0, ans=0.125
2024-06-21 06:56:54,793 INFO [train.py:1028] (0/2) Epoch 19, batch 1850, loss[loss=0.1991, simple_loss=0.2507, pruned_loss=0.07374, over 13250.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.262, pruned_loss=0.07719, over 2584093.42 frames. ], batch size: 83, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:56:58,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=337256.3333333333, ans=0.125
2024-06-21 06:56:58,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=337256.3333333333, ans=0.0
2024-06-21 06:57:04,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=337274.6666666667, ans=0.125
2024-06-21 06:57:06,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.54 vs. limit=15.0
2024-06-21 06:57:17,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=337311.3333333333, ans=0.125
2024-06-21 06:57:20,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.66 vs. limit=15.0
2024-06-21 06:57:21,468 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-184000.pt
2024-06-21 06:57:35,758 INFO [train.py:1028] (0/2) Epoch 19, batch 1900, loss[loss=0.2105, simple_loss=0.263, pruned_loss=0.07899, over 13157.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2612, pruned_loss=0.07697, over 2587162.15 frames. ], batch size: 95, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:57:35,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=337348.0, ans=0.2
2024-06-21 06:57:44,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=337366.3333333333, ans=0.125
2024-06-21 06:57:46,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=337366.3333333333, ans=0.07
2024-06-21 06:57:59,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=337384.6666666667, ans=0.125
2024-06-21 06:58:02,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=337403.0, ans=0.2
2024-06-21 06:58:05,365 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 1.991e+02 2.133e+02 2.313e+02 2.789e+02, threshold=4.265e+02, percent-clipped=0.0
2024-06-21 06:58:09,942 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:58:11,331 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:58:13,053 INFO [train.py:1028] (0/2) Epoch 19, batch 1950, loss[loss=0.203, simple_loss=0.258, pruned_loss=0.07403, over 13221.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2615, pruned_loss=0.07737, over 2592946.80 frames. ], batch size: 52, lr: 3.09e-03, grad_scale: 32.0
2024-06-21 06:58:18,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=337458.0, ans=0.0
2024-06-21 06:58:44,813 INFO [train.py:1028] (0/2) Epoch 19, batch 2000, loss[loss=0.1902, simple_loss=0.2507, pruned_loss=0.06486, over 12649.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2618, pruned_loss=0.07742, over 2588549.49 frames. ], batch size: 22, lr: 3.08e-03, grad_scale: 32.0
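Two checkpoint cadences are visible in this log: epoch-18.pt written at the epoch boundary earlier, and checkpoint-184000.pt above, consistent with the run's save_every_n=4000 batch-level interval (184000 % 4000 == 0). A small sketch of that cadence follows; maybe_save and its arguments are illustrative, not icefall's API:

    # Sketch of the two checkpoint cadences seen in this log: an epoch-N.pt
    # at every epoch boundary, and a checkpoint-<batch_idx>.pt every
    # save_every_n batches (4000 in this run, hence checkpoint-184000.pt).
    from pathlib import Path
    import torch

    def save_checkpoint(model: torch.nn.Module, path: Path) -> None:
        torch.save({"model": model.state_dict()}, path)

    def maybe_save(model, exp_dir: Path, batch_idx_train: int, epoch: int,
                   end_of_epoch: bool, save_every_n: int = 4000) -> None:
        # Batch-level cadence: 184000 % 4000 == 0 -> checkpoint-184000.pt
        if batch_idx_train % save_every_n == 0:
            save_checkpoint(model, exp_dir / f"checkpoint-{batch_idx_train}.pt")
        # Epoch boundary: epoch-18.pt, epoch-19.pt, ...
        if end_of_epoch:
            save_checkpoint(model, exp_dir / f"epoch-{epoch}.pt")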
2024-06-21 06:58:50,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=337549.6666666667, ans=0.125
2024-06-21 06:58:51,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=337549.6666666667, ans=0.5
2024-06-21 06:58:52,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=337549.6666666667, ans=0.0
2024-06-21 06:58:56,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=337549.6666666667, ans=15.0
2024-06-21 06:58:58,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=337568.0, ans=0.0
2024-06-21 06:59:02,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=337568.0, ans=0.0
2024-06-21 06:59:08,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=337586.3333333333, ans=0.2
2024-06-21 06:59:09,747 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.045e+02 2.213e+02 2.420e+02 3.112e+02, threshold=4.427e+02, percent-clipped=0.0
2024-06-21 06:59:10,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=337586.3333333333, ans=0.1
2024-06-21 06:59:10,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=337586.3333333333, ans=0.125
2024-06-21 06:59:15,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.41 vs. limit=10.0
2024-06-21 06:59:16,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=337623.0, ans=0.0
2024-06-21 06:59:17,329 INFO [train.py:1028] (0/2) Epoch 19, batch 2050, loss[loss=0.1777, simple_loss=0.2383, pruned_loss=0.05853, over 12418.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2618, pruned_loss=0.07743, over 2583462.45 frames. ], batch size: 29, lr: 3.08e-03, grad_scale: 32.0
2024-06-21 06:59:17,497 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 06:59:19,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=337623.0, ans=0.025
2024-06-21 06:59:32,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=337641.3333333333, ans=0.125
2024-06-21 06:59:35,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337659.6666666667, ans=0.1
2024-06-21 06:59:55,542 INFO [train.py:1028] (0/2) Epoch 19, batch 2100, loss[loss=0.1886, simple_loss=0.2447, pruned_loss=0.06623, over 13204.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2622, pruned_loss=0.07731, over 2585724.19 frames. ], batch size: 59, lr: 3.08e-03, grad_scale: 32.0
2024-06-21 06:59:55,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=337714.6666666667, ans=0.025
2024-06-21 06:59:58,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=337714.6666666667, ans=0.125
2024-06-21 07:00:00,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.77 vs. limit=15.0
2024-06-21 07:00:06,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=337733.0, ans=0.0
2024-06-21 07:00:09,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=337751.3333333333, ans=0.125
2024-06-21 07:00:17,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=337769.6666666667, ans=15.0
2024-06-21 07:00:20,649 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.004e+02 2.106e+02 2.256e+02 2.796e+02, threshold=4.212e+02, percent-clipped=0.0
2024-06-21 07:00:23,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=337788.0, ans=0.0
2024-06-21 07:00:28,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=337806.3333333333, ans=0.0
2024-06-21 07:00:29,120 INFO [train.py:1028] (0/2) Epoch 19, batch 2150, loss[loss=0.2179, simple_loss=0.2772, pruned_loss=0.07933, over 13288.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2619, pruned_loss=0.07682, over 2589132.48 frames. ], batch size: 52, lr: 3.08e-03, grad_scale: 32.0
2024-06-21 07:00:35,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=337806.3333333333, ans=15.0
2024-06-21 07:00:43,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=337843.0, ans=0.125
2024-06-21 07:00:49,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=337861.3333333333, ans=0.95
2024-06-21 07:01:00,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=337879.6666666667, ans=0.0
2024-06-21 07:01:03,321 INFO [train.py:1028] (0/2) Epoch 19, batch 2200, loss[loss=0.2193, simple_loss=0.2711, pruned_loss=0.08371, over 13262.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2622, pruned_loss=0.0769, over 2589099.45 frames. ], batch size: 83, lr: 3.08e-03, grad_scale: 32.0
2024-06-21 07:01:10,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=337916.3333333333, ans=0.0
2024-06-21 07:01:22,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=337953.0, ans=0.125
2024-06-21 07:01:28,396 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 1.962e+02 2.150e+02 2.342e+02 4.060e+02, threshold=4.301e+02, percent-clipped=0.0
2024-06-21 07:01:36,719 INFO [train.py:1028] (0/2) Epoch 19, batch 2250, loss[loss=0.1925, simple_loss=0.2509, pruned_loss=0.0671, over 13305.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2618, pruned_loss=0.07675, over 2588391.61 frames. ], batch size: 63, lr: 3.08e-03, grad_scale: 32.0
2024-06-21 07:01:40,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.47 vs. limit=15.0
2024-06-21 07:01:46,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=338008.0, ans=0.125
2024-06-21 07:01:52,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.62 vs. limit=15.0
2024-06-21 07:02:04,702 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 07:02:07,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338044.6666666667, ans=0.1
2024-06-21 07:02:07,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=338044.6666666667, ans=0.125
2024-06-21 07:02:10,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=338063.0, ans=0.09899494936611666
2024-06-21 07:02:11,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=338063.0, ans=0.0
2024-06-21 07:02:12,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0
2024-06-21 07:02:14,778 INFO [train.py:1028] (0/2) Epoch 19, batch 2300, loss[loss=0.2219, simple_loss=0.2725, pruned_loss=0.08567, over 12939.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2617, pruned_loss=0.07665, over 2582655.00 frames. ], batch size: 33, lr: 3.08e-03, grad_scale: 32.0
2024-06-21 07:02:31,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=338118.0, ans=0.035
2024-06-21 07:02:37,812 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 07:02:38,756 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.55 vs. limit=22.5
2024-06-21 07:02:39,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.52 vs.
limit=15.0 2024-06-21 07:02:39,602 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.032e+02 2.142e+02 2.325e+02 3.906e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-21 07:02:40,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=338154.6666666667, ans=0.1 2024-06-21 07:02:47,379 INFO [train.py:1028] (0/2) Epoch 19, batch 2350, loss[loss=0.2203, simple_loss=0.2727, pruned_loss=0.08397, over 13250.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.262, pruned_loss=0.07687, over 2585985.32 frames. ], batch size: 67, lr: 3.08e-03, grad_scale: 32.0 2024-06-21 07:02:51,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338173.0, ans=0.1 2024-06-21 07:02:56,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=338191.3333333333, ans=0.0 2024-06-21 07:02:56,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=338191.3333333333, ans=0.0 2024-06-21 07:02:56,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=338191.3333333333, ans=0.09899494936611666 2024-06-21 07:02:59,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=338191.3333333333, ans=0.1 2024-06-21 07:02:59,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=338191.3333333333, ans=0.125 2024-06-21 07:03:02,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=15.0 2024-06-21 07:03:07,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=338228.0, ans=0.125 2024-06-21 07:03:11,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=338228.0, ans=0.025 2024-06-21 07:03:18,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=338246.3333333333, ans=0.125 2024-06-21 07:03:20,401 INFO [train.py:1028] (0/2) Epoch 19, batch 2400, loss[loss=0.2334, simple_loss=0.2868, pruned_loss=0.08997, over 13359.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2615, pruned_loss=0.07694, over 2588771.86 frames. ], batch size: 46, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:03:30,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=338283.0, ans=15.0 2024-06-21 07:03:50,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.65 vs. limit=22.5 2024-06-21 07:03:50,416 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 1.989e+02 2.102e+02 2.305e+02 2.886e+02, threshold=4.204e+02, percent-clipped=0.0 2024-06-21 07:03:58,143 INFO [train.py:1028] (0/2) Epoch 19, batch 2450, loss[loss=0.2129, simple_loss=0.2728, pruned_loss=0.0765, over 13301.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2605, pruned_loss=0.07693, over 2584452.26 frames. 
], batch size: 63, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:04:21,167 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2024-06-21 07:04:21,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2024-06-21 07:04:25,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=338429.6666666667, ans=0.125 2024-06-21 07:04:26,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=338429.6666666667, ans=0.0 2024-06-21 07:04:31,142 INFO [train.py:1028] (0/2) Epoch 19, batch 2500, loss[loss=0.1933, simple_loss=0.2438, pruned_loss=0.07141, over 13165.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2591, pruned_loss=0.07635, over 2587957.22 frames. ], batch size: 83, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:04:34,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.38 vs. limit=6.0 2024-06-21 07:04:35,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=338448.0, ans=0.125 2024-06-21 07:04:40,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2024-06-21 07:04:41,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2024-06-21 07:04:45,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=338484.6666666667, ans=0.125 2024-06-21 07:04:56,442 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.628e+02 1.988e+02 2.092e+02 2.207e+02 4.147e+02, threshold=4.184e+02, percent-clipped=0.0 2024-06-21 07:05:02,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=338521.3333333333, ans=0.125 2024-06-21 07:05:04,348 INFO [train.py:1028] (0/2) Epoch 19, batch 2550, loss[loss=0.2207, simple_loss=0.277, pruned_loss=0.08215, over 12483.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2581, pruned_loss=0.07583, over 2587640.26 frames. 
], batch size: 22, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:05:04,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=338539.6666666667, ans=0.125 2024-06-21 07:05:04,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=338539.6666666667, ans=0.1 2024-06-21 07:05:22,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=338576.3333333333, ans=0.125 2024-06-21 07:05:28,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=338594.6666666667, ans=10.0 2024-06-21 07:05:30,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=338594.6666666667, ans=0.2 2024-06-21 07:05:35,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=338613.0, ans=0.0 2024-06-21 07:05:36,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=338613.0, ans=0.0 2024-06-21 07:05:41,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=338613.0, ans=0.125 2024-06-21 07:05:42,691 INFO [train.py:1028] (0/2) Epoch 19, batch 2600, loss[loss=0.1979, simple_loss=0.2504, pruned_loss=0.07273, over 13195.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2569, pruned_loss=0.07556, over 2588182.75 frames. ], batch size: 52, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:05:43,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.55 vs. limit=22.5 2024-06-21 07:05:48,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=338649.6666666667, ans=0.125 2024-06-21 07:05:54,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.30 vs. limit=10.0 2024-06-21 07:05:56,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0 2024-06-21 07:06:07,361 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.030e+02 2.187e+02 2.443e+02 3.083e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-21 07:06:10,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=338704.6666666667, ans=10.0 2024-06-21 07:06:15,345 INFO [train.py:1028] (0/2) Epoch 19, batch 2650, loss[loss=0.18, simple_loss=0.2278, pruned_loss=0.06609, over 13013.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2552, pruned_loss=0.07515, over 2587381.61 frames. 
], batch size: 144, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:06:15,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=338723.0, ans=0.125 2024-06-21 07:06:17,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=338723.0, ans=0.0 2024-06-21 07:06:17,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=338723.0, ans=0.125 2024-06-21 07:06:21,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=338741.3333333333, ans=0.0 2024-06-21 07:06:28,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.21 vs. limit=15.0 2024-06-21 07:06:30,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=338759.6666666667, ans=0.2 2024-06-21 07:06:37,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.12 vs. limit=15.0 2024-06-21 07:06:43,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=338796.3333333333, ans=0.0 2024-06-21 07:06:48,201 INFO [train.py:1028] (0/2) Epoch 19, batch 2700, loss[loss=0.1954, simple_loss=0.2507, pruned_loss=0.07007, over 13249.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2535, pruned_loss=0.0746, over 2585940.08 frames. ], batch size: 89, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:06:53,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0 2024-06-21 07:07:13,115 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.746e+02 1.953e+02 2.102e+02 2.262e+02 2.942e+02, threshold=4.204e+02, percent-clipped=0.0 2024-06-21 07:07:13,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=338869.6666666667, ans=0.2 2024-06-21 07:07:17,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2024-06-21 07:07:24,382 INFO [train.py:1028] (0/2) Epoch 19, batch 2750, loss[loss=0.2277, simple_loss=0.276, pruned_loss=0.0897, over 13178.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2531, pruned_loss=0.07418, over 2583694.37 frames. 
], batch size: 43, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:07:41,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338943.0, ans=0.125 2024-06-21 07:07:44,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=338943.0, ans=0.125 2024-06-21 07:07:48,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338961.3333333333, ans=0.1 2024-06-21 07:07:51,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=338961.3333333333, ans=0.125 2024-06-21 07:08:01,914 INFO [train.py:1028] (0/2) Epoch 19, batch 2800, loss[loss=0.217, simple_loss=0.2622, pruned_loss=0.08588, over 10921.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2526, pruned_loss=0.0741, over 2580517.09 frames. ], batch size: 303, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:08:02,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=338998.0, ans=0.125 2024-06-21 07:08:09,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=339016.3333333333, ans=0.125 2024-06-21 07:08:20,167 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. limit=10.0 2024-06-21 07:08:21,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=339053.0, ans=0.0 2024-06-21 07:08:26,906 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 1.988e+02 2.133e+02 2.273e+02 2.773e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 07:08:34,985 INFO [train.py:1028] (0/2) Epoch 19, batch 2850, loss[loss=0.1855, simple_loss=0.2437, pruned_loss=0.06366, over 13022.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2516, pruned_loss=0.07372, over 2577928.95 frames. ], batch size: 48, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:08:39,952 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=12.0 2024-06-21 07:08:53,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.15 vs. limit=15.0 2024-06-21 07:09:03,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=339163.0, ans=0.125 2024-06-21 07:09:04,461 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.718e-01 2024-06-21 07:09:07,428 INFO [train.py:1028] (0/2) Epoch 19, batch 2900, loss[loss=0.1733, simple_loss=0.2303, pruned_loss=0.05812, over 13191.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2499, pruned_loss=0.07323, over 2586222.41 frames. ], batch size: 55, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:09:32,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.07 vs. 
limit=15.0 2024-06-21 07:09:38,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=339236.3333333333, ans=0.0 2024-06-21 07:09:40,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 1.923e+02 2.018e+02 2.205e+02 2.853e+02, threshold=4.037e+02, percent-clipped=0.0 2024-06-21 07:09:40,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=339236.3333333333, ans=0.0 2024-06-21 07:09:48,716 INFO [train.py:1028] (0/2) Epoch 19, batch 2950, loss[loss=0.1958, simple_loss=0.2496, pruned_loss=0.071, over 13219.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2498, pruned_loss=0.0731, over 2579870.65 frames. ], batch size: 43, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:09:53,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=339273.0, ans=0.0 2024-06-21 07:09:56,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339291.3333333333, ans=0.1 2024-06-21 07:10:01,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=339291.3333333333, ans=0.1 2024-06-21 07:10:11,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2024-06-21 07:10:14,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=339328.0, ans=0.125 2024-06-21 07:10:16,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=339346.3333333333, ans=0.0 2024-06-21 07:10:18,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=339346.3333333333, ans=0.2 2024-06-21 07:10:22,645 INFO [train.py:1028] (0/2) Epoch 19, batch 3000, loss[loss=0.2047, simple_loss=0.2618, pruned_loss=0.07377, over 13220.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.249, pruned_loss=0.07296, over 2577893.75 frames. ], batch size: 59, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:10:22,646 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 07:10:30,411 INFO [train.py:1060] (0/2) Epoch 19, validation: loss=0.186, simple_loss=0.2507, pruned_loss=0.06065, over 351949.00 frames. 2024-06-21 07:10:30,411 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 07:10:32,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=339364.6666666667, ans=0.0 2024-06-21 07:10:39,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2024-06-21 07:10:42,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=339401.3333333333, ans=0.0 2024-06-21 07:10:51,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=22.5 2024-06-21 07:10:52,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=339419.6666666667, ans=0.125 2024-06-21 07:10:53,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=339419.6666666667, ans=0.5 2024-06-21 07:10:53,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339419.6666666667, ans=0.1 2024-06-21 07:10:54,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=339419.6666666667, ans=0.0 2024-06-21 07:10:54,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.93 vs. limit=15.0 2024-06-21 07:10:55,046 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.023e+02 2.182e+02 2.347e+02 3.005e+02, threshold=4.364e+02, percent-clipped=0.0 2024-06-21 07:11:00,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=339438.0, ans=0.025 2024-06-21 07:11:02,752 INFO [train.py:1028] (0/2) Epoch 19, batch 3050, loss[loss=0.1873, simple_loss=0.2438, pruned_loss=0.06543, over 13357.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2486, pruned_loss=0.07325, over 2578256.41 frames. ], batch size: 46, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:11:04,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=339456.3333333333, ans=0.125 2024-06-21 07:11:17,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=339474.6666666667, ans=0.2 2024-06-21 07:11:41,448 INFO [train.py:1028] (0/2) Epoch 19, batch 3100, loss[loss=0.194, simple_loss=0.2399, pruned_loss=0.07408, over 13004.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2483, pruned_loss=0.07307, over 2578691.83 frames. ], batch size: 144, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:11:47,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=339566.3333333333, ans=0.0 2024-06-21 07:11:48,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339566.3333333333, ans=0.0 2024-06-21 07:11:51,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=339566.3333333333, ans=0.025 2024-06-21 07:12:05,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=339603.0, ans=0.125 2024-06-21 07:12:06,990 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.957e+02 2.090e+02 2.363e+02 2.907e+02, threshold=4.180e+02, percent-clipped=0.0 2024-06-21 07:12:07,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=339603.0, ans=0.0 2024-06-21 07:12:14,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.17 vs. limit=15.0 2024-06-21 07:12:14,578 INFO [train.py:1028] (0/2) Epoch 19, batch 3150, loss[loss=0.2058, simple_loss=0.2477, pruned_loss=0.08197, over 12940.00 frames. 
], tot_loss[loss=0.1959, simple_loss=0.2469, pruned_loss=0.07243, over 2580805.18 frames. ], batch size: 158, lr: 3.08e-03, grad_scale: 64.0 2024-06-21 07:12:29,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=339676.3333333333, ans=0.125 2024-06-21 07:12:33,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=339676.3333333333, ans=0.0 2024-06-21 07:12:40,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=339713.0, ans=0.125 2024-06-21 07:12:40,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=339713.0, ans=0.125 2024-06-21 07:12:46,964 INFO [train.py:1028] (0/2) Epoch 19, batch 3200, loss[loss=0.1939, simple_loss=0.2453, pruned_loss=0.07127, over 13124.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2466, pruned_loss=0.07237, over 2580872.60 frames. ], batch size: 55, lr: 3.07e-03, grad_scale: 64.0 2024-06-21 07:12:55,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2024-06-21 07:12:57,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=339749.6666666667, ans=0.125 2024-06-21 07:12:59,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=339768.0, ans=0.2 2024-06-21 07:13:01,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=339768.0, ans=0.0 2024-06-21 07:13:06,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=339786.3333333333, ans=0.2 2024-06-21 07:13:12,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=339786.3333333333, ans=0.125 2024-06-21 07:13:14,708 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.946e+02 2.055e+02 2.198e+02 2.815e+02, threshold=4.109e+02, percent-clipped=0.0 2024-06-21 07:13:18,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=339804.6666666667, ans=0.5 2024-06-21 07:13:22,369 INFO [train.py:1028] (0/2) Epoch 19, batch 3250, loss[loss=0.1907, simple_loss=0.2368, pruned_loss=0.07232, over 13266.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2461, pruned_loss=0.0724, over 2584996.21 frames. ], batch size: 72, lr: 3.07e-03, grad_scale: 64.0 2024-06-21 07:13:24,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.05 vs. 
limit=22.5 2024-06-21 07:13:33,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=339841.3333333333, ans=0.0 2024-06-21 07:13:35,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=339841.3333333333, ans=0.0 2024-06-21 07:13:40,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=339859.6666666667, ans=0.2 2024-06-21 07:13:56,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=339896.3333333333, ans=0.125 2024-06-21 07:13:59,199 INFO [train.py:1028] (0/2) Epoch 19, batch 3300, loss[loss=0.1977, simple_loss=0.2438, pruned_loss=0.07584, over 12743.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2461, pruned_loss=0.07214, over 2581093.13 frames. ], batch size: 176, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:14:01,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=339914.6666666667, ans=0.0 2024-06-21 07:14:12,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2024-06-21 07:14:24,818 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.009e+02 2.139e+02 2.349e+02 3.100e+02, threshold=4.277e+02, percent-clipped=0.0 2024-06-21 07:14:30,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=339988.0, ans=0.125 2024-06-21 07:14:32,125 INFO [train.py:1028] (0/2) Epoch 19, batch 3350, loss[loss=0.1809, simple_loss=0.2292, pruned_loss=0.06632, over 12924.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2454, pruned_loss=0.07206, over 2575703.14 frames. ], batch size: 158, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:14:32,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=340006.3333333333, ans=0.0 2024-06-21 07:14:37,938 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:14:40,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.65 vs. limit=15.0 2024-06-21 07:14:45,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=340043.0, ans=0.1 2024-06-21 07:14:45,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=15.0 2024-06-21 07:14:46,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. 
limit=15.0 2024-06-21 07:14:49,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=340043.0, ans=0.125 2024-06-21 07:14:58,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=340061.3333333333, ans=0.125 2024-06-21 07:15:04,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340079.6666666667, ans=0.1 2024-06-21 07:15:08,173 INFO [train.py:1028] (0/2) Epoch 19, batch 3400, loss[loss=0.1928, simple_loss=0.2505, pruned_loss=0.06753, over 12596.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2447, pruned_loss=0.07196, over 2573518.14 frames. ], batch size: 22, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:15:08,944 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:15:08,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=340098.0, ans=0.0 2024-06-21 07:15:19,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=340116.3333333333, ans=0.125 2024-06-21 07:15:23,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=340134.6666666667, ans=0.0 2024-06-21 07:15:32,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.74 vs. limit=10.0 2024-06-21 07:15:36,748 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 1.945e+02 2.046e+02 2.317e+02 2.778e+02, threshold=4.091e+02, percent-clipped=0.0 2024-06-21 07:15:37,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=340171.3333333333, ans=0.025 2024-06-21 07:15:38,805 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.57 vs. limit=15.0 2024-06-21 07:15:40,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.88 vs. limit=15.0 2024-06-21 07:15:44,117 INFO [train.py:1028] (0/2) Epoch 19, batch 3450, loss[loss=0.1931, simple_loss=0.2418, pruned_loss=0.07223, over 12783.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2447, pruned_loss=0.072, over 2573928.07 frames. ], batch size: 176, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:15:53,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.25 vs. limit=15.0 2024-06-21 07:15:55,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0 2024-06-21 07:15:57,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=340226.3333333333, ans=0.125 2024-06-21 07:15:57,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.49 vs. 
limit=6.0 2024-06-21 07:16:01,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=15.0 2024-06-21 07:16:03,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340244.6666666667, ans=0.1 2024-06-21 07:16:09,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=340263.0, ans=0.2 2024-06-21 07:16:10,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.22 vs. limit=15.0 2024-06-21 07:16:14,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=340263.0, ans=0.0 2024-06-21 07:16:16,557 INFO [train.py:1028] (0/2) Epoch 19, batch 3500, loss[loss=0.1969, simple_loss=0.2407, pruned_loss=0.07661, over 12745.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2444, pruned_loss=0.07158, over 2572899.50 frames. ], batch size: 33, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:16:18,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340281.3333333333, ans=0.125 2024-06-21 07:16:18,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=340281.3333333333, ans=0.0 2024-06-21 07:16:28,135 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:16:29,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.60 vs. limit=10.0 2024-06-21 07:16:31,858 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. limit=10.0 2024-06-21 07:16:32,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=340318.0, ans=0.125 2024-06-21 07:16:35,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=340318.0, ans=0.025 2024-06-21 07:16:42,579 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.899e+02 2.001e+02 2.180e+02 3.338e+02, threshold=4.001e+02, percent-clipped=0.0 2024-06-21 07:16:54,628 INFO [train.py:1028] (0/2) Epoch 19, batch 3550, loss[loss=0.2044, simple_loss=0.2441, pruned_loss=0.08238, over 13085.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2435, pruned_loss=0.07106, over 2574942.68 frames. ], batch size: 95, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:17:02,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2024-06-21 07:17:03,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340391.3333333333, ans=0.1 2024-06-21 07:17:29,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=340446.3333333333, ans=0.125 2024-06-21 07:17:30,345 INFO [train.py:1028] (0/2) Epoch 19, batch 3600, loss[loss=0.1984, simple_loss=0.2516, pruned_loss=0.07262, over 13378.00 frames. 
], tot_loss[loss=0.1929, simple_loss=0.2435, pruned_loss=0.07112, over 2579930.62 frames. ], batch size: 49, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:17:41,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2024-06-21 07:17:47,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=340501.3333333333, ans=0.125 2024-06-21 07:17:50,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340519.6666666667, ans=0.1 2024-06-21 07:17:56,006 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.999e+02 2.212e+02 2.432e+02 3.452e+02, threshold=4.424e+02, percent-clipped=0.0 2024-06-21 07:18:03,162 INFO [train.py:1028] (0/2) Epoch 19, batch 3650, loss[loss=0.1937, simple_loss=0.2343, pruned_loss=0.07654, over 12981.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2433, pruned_loss=0.07095, over 2577555.89 frames. ], batch size: 102, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:18:03,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=340556.3333333333, ans=0.125 2024-06-21 07:18:07,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=340556.3333333333, ans=0.125 2024-06-21 07:18:11,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=340574.6666666667, ans=0.125 2024-06-21 07:18:11,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.14 vs. limit=15.0 2024-06-21 07:18:20,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=340593.0, ans=0.125 2024-06-21 07:18:27,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=340611.3333333333, ans=0.125 2024-06-21 07:18:28,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=340611.3333333333, ans=0.125 2024-06-21 07:18:28,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340611.3333333333, ans=0.1 2024-06-21 07:18:33,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=340629.6666666667, ans=0.125 2024-06-21 07:18:34,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=340629.6666666667, ans=0.125 2024-06-21 07:18:36,079 INFO [train.py:1028] (0/2) Epoch 19, batch 3700, loss[loss=0.1707, simple_loss=0.2287, pruned_loss=0.05631, over 13267.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2423, pruned_loss=0.07058, over 2583109.95 frames. 
], batch size: 72, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:18:45,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=340666.3333333333, ans=0.125 2024-06-21 07:18:54,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=340684.6666666667, ans=0.125 2024-06-21 07:19:00,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0 2024-06-21 07:19:05,074 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 1.972e+02 2.096e+02 2.246e+02 2.799e+02, threshold=4.192e+02, percent-clipped=0.0 2024-06-21 07:19:05,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=340721.3333333333, ans=0.5 2024-06-21 07:19:07,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=340721.3333333333, ans=0.125 2024-06-21 07:19:12,295 INFO [train.py:1028] (0/2) Epoch 19, batch 3750, loss[loss=0.1899, simple_loss=0.2502, pruned_loss=0.06481, over 12753.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2418, pruned_loss=0.07051, over 2585798.10 frames. ], batch size: 22, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:19:17,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=340739.6666666667, ans=0.0 2024-06-21 07:19:22,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340758.0, ans=0.1 2024-06-21 07:19:26,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=340776.3333333333, ans=0.125 2024-06-21 07:19:44,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=340813.0, ans=0.09899494936611666 2024-06-21 07:19:47,499 INFO [train.py:1028] (0/2) Epoch 19, batch 3800, loss[loss=0.1894, simple_loss=0.2395, pruned_loss=0.06969, over 13179.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2419, pruned_loss=0.07054, over 2583523.06 frames. ], batch size: 83, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:19:55,756 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0 2024-06-21 07:20:06,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=340886.3333333333, ans=0.2 2024-06-21 07:20:07,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.34 vs. 
limit=10.0 2024-06-21 07:20:10,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=340886.3333333333, ans=0.2 2024-06-21 07:20:12,717 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.882e+02 2.058e+02 2.202e+02 2.940e+02, threshold=4.116e+02, percent-clipped=0.0 2024-06-21 07:20:14,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=340904.6666666667, ans=0.2 2024-06-21 07:20:18,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=340904.6666666667, ans=0.125 2024-06-21 07:20:20,155 INFO [train.py:1028] (0/2) Epoch 19, batch 3850, loss[loss=0.1933, simple_loss=0.2388, pruned_loss=0.07393, over 13008.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2416, pruned_loss=0.07022, over 2582655.90 frames. ], batch size: 144, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:20:23,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.72 vs. limit=10.0 2024-06-21 07:20:49,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=340996.3333333333, ans=0.125 2024-06-21 07:20:49,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=340996.3333333333, ans=0.125 2024-06-21 07:20:52,657 INFO [train.py:1028] (0/2) Epoch 19, batch 3900, loss[loss=0.1817, simple_loss=0.2293, pruned_loss=0.06701, over 13251.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2415, pruned_loss=0.0705, over 2586292.30 frames. ], batch size: 83, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:20:58,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341014.6666666667, ans=0.1 2024-06-21 07:21:00,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=341033.0, ans=0.125 2024-06-21 07:21:09,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=341051.3333333333, ans=0.125 2024-06-21 07:21:21,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=341069.6666666667, ans=0.0 2024-06-21 07:21:21,489 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 1.938e+02 2.061e+02 2.241e+02 2.813e+02, threshold=4.122e+02, percent-clipped=0.0 2024-06-21 07:21:23,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=341088.0, ans=0.04949747468305833 2024-06-21 07:21:25,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=341088.0, ans=0.2 2024-06-21 07:21:28,716 INFO [train.py:1028] (0/2) Epoch 19, batch 3950, loss[loss=0.2005, simple_loss=0.2387, pruned_loss=0.08117, over 13094.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2404, pruned_loss=0.06999, over 2588569.60 frames. ], batch size: 132, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:21:41,673 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=12.0 2024-06-21 07:21:45,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=341143.0, ans=0.5 2024-06-21 07:21:47,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=341143.0, ans=0.125 2024-06-21 07:21:48,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=341143.0, ans=10.0 2024-06-21 07:21:52,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=341161.3333333333, ans=0.125 2024-06-21 07:22:04,974 INFO [train.py:1028] (0/2) Epoch 19, batch 4000, loss[loss=0.2001, simple_loss=0.2539, pruned_loss=0.07316, over 12901.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2405, pruned_loss=0.07029, over 2583498.20 frames. ], batch size: 39, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:22:07,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=341198.0, ans=0.025 2024-06-21 07:22:12,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341216.3333333333, ans=0.1 2024-06-21 07:22:14,521 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.67 vs. limit=22.5 2024-06-21 07:22:15,042 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=15.0 2024-06-21 07:22:19,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.99 vs. limit=22.5 2024-06-21 07:22:19,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2024-06-21 07:22:21,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341234.6666666667, ans=0.1 2024-06-21 07:22:27,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=341253.0, ans=10.0 2024-06-21 07:22:28,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=341253.0, ans=0.025 2024-06-21 07:22:30,240 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 1.947e+02 2.062e+02 2.205e+02 3.340e+02, threshold=4.124e+02, percent-clipped=0.0 2024-06-21 07:22:37,713 INFO [train.py:1028] (0/2) Epoch 19, batch 4050, loss[loss=0.1922, simple_loss=0.2329, pruned_loss=0.0758, over 11015.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2398, pruned_loss=0.07002, over 2581756.46 frames. 
], batch size: 304, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:22:46,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=341308.0, ans=0.035 2024-06-21 07:22:52,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=341326.3333333333, ans=0.05 2024-06-21 07:22:55,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=341326.3333333333, ans=0.125 2024-06-21 07:23:06,639 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2024-06-21 07:23:13,798 INFO [train.py:1028] (0/2) Epoch 19, batch 4100, loss[loss=0.2026, simple_loss=0.2494, pruned_loss=0.0779, over 12982.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2405, pruned_loss=0.07049, over 2578148.27 frames. ], batch size: 102, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:23:23,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=341399.6666666667, ans=0.125 2024-06-21 07:23:24,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=341399.6666666667, ans=0.0 2024-06-21 07:23:25,436 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=5.110e-03 2024-06-21 07:23:26,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=341399.6666666667, ans=0.1 2024-06-21 07:23:28,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341418.0, ans=0.1 2024-06-21 07:23:41,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=341436.3333333333, ans=0.125 2024-06-21 07:23:43,228 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 1.993e+02 2.146e+02 2.420e+02 3.460e+02, threshold=4.292e+02, percent-clipped=0.0 2024-06-21 07:23:44,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=341454.6666666667, ans=0.125 2024-06-21 07:23:50,356 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2024-06-21 07:23:50,677 INFO [train.py:1028] (0/2) Epoch 19, batch 4150, loss[loss=0.1849, simple_loss=0.2408, pruned_loss=0.06445, over 13134.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2399, pruned_loss=0.07021, over 2577254.20 frames. ], batch size: 55, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:23:53,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341473.0, ans=0.1 2024-06-21 07:23:54,831 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.086e+01 2024-06-21 07:23:55,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.90 vs. 
limit=12.0 2024-06-21 07:23:58,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=341491.3333333333, ans=0.0 2024-06-21 07:24:11,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=341528.0, ans=0.125 2024-06-21 07:24:13,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.98 vs. limit=15.0 2024-06-21 07:24:15,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.80 vs. limit=15.0 2024-06-21 07:24:18,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=341546.3333333333, ans=0.125 2024-06-21 07:24:23,575 INFO [train.py:1028] (0/2) Epoch 19, batch 4200, loss[loss=0.1853, simple_loss=0.2354, pruned_loss=0.06764, over 12999.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2393, pruned_loss=0.06985, over 2579361.90 frames. ], batch size: 102, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:24:32,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2024-06-21 07:24:43,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0 2024-06-21 07:24:47,922 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:24:49,098 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 1.889e+02 2.049e+02 2.187e+02 2.672e+02, threshold=4.099e+02, percent-clipped=0.0 2024-06-21 07:24:56,437 INFO [train.py:1028] (0/2) Epoch 19, batch 4250, loss[loss=0.1689, simple_loss=0.2234, pruned_loss=0.05722, over 13286.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2391, pruned_loss=0.06967, over 2582286.06 frames. 
], batch size: 46, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:25:00,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=341656.3333333333, ans=0.0 2024-06-21 07:25:01,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=341656.3333333333, ans=0.0 2024-06-21 07:25:01,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=341656.3333333333, ans=0.125 2024-06-21 07:25:01,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=341656.3333333333, ans=0.1 2024-06-21 07:25:03,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=341674.6666666667, ans=0.125 2024-06-21 07:25:03,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341674.6666666667, ans=0.1 2024-06-21 07:25:17,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=341693.0, ans=0.0 2024-06-21 07:25:22,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341711.3333333333, ans=0.0 2024-06-21 07:25:24,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=22.12 vs. limit=15.0 2024-06-21 07:25:26,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. limit=6.0 2024-06-21 07:25:32,344 INFO [train.py:1028] (0/2) Epoch 19, batch 4300, loss[loss=0.1786, simple_loss=0.2297, pruned_loss=0.06378, over 13216.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2387, pruned_loss=0.0698, over 2582504.38 frames. ], batch size: 59, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:25:34,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.06 vs. limit=15.0 2024-06-21 07:25:38,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=341748.0, ans=10.0 2024-06-21 07:25:39,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=341748.0, ans=0.125 2024-06-21 07:25:57,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=341803.0, ans=0.0 2024-06-21 07:25:57,604 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2024-06-21 07:25:58,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.99 vs. 
limit=15.0 2024-06-21 07:25:58,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=341803.0, ans=0.125 2024-06-21 07:26:00,216 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.956e+02 2.044e+02 2.173e+02 2.894e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-21 07:26:01,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=341821.3333333333, ans=0.125 2024-06-21 07:26:07,366 INFO [train.py:1028] (0/2) Epoch 19, batch 4350, loss[loss=0.1938, simple_loss=0.2424, pruned_loss=0.07265, over 13212.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2384, pruned_loss=0.06954, over 2587385.55 frames. ], batch size: 59, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:26:10,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=341839.6666666667, ans=0.125 2024-06-21 07:26:19,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=341876.3333333333, ans=0.0 2024-06-21 07:26:34,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=341913.0, ans=0.125 2024-06-21 07:26:36,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=341913.0, ans=0.0 2024-06-21 07:26:36,196 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.73 vs. limit=12.0 2024-06-21 07:26:37,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=341913.0, ans=0.0 2024-06-21 07:26:40,076 INFO [train.py:1028] (0/2) Epoch 19, batch 4400, loss[loss=0.186, simple_loss=0.234, pruned_loss=0.06906, over 13230.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2391, pruned_loss=0.06992, over 2587909.96 frames. ], batch size: 83, lr: 3.07e-03, grad_scale: 32.0 2024-06-21 07:26:40,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=341931.3333333333, ans=0.2 2024-06-21 07:26:53,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=341968.0, ans=0.125 2024-06-21 07:26:54,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2024-06-21 07:27:05,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 1.892e+02 1.967e+02 2.124e+02 2.885e+02, threshold=3.934e+02, percent-clipped=0.0 2024-06-21 07:27:06,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342004.6666666667, ans=0.1 2024-06-21 07:27:15,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=342023.0, ans=0.125 2024-06-21 07:27:15,967 INFO [train.py:1028] (0/2) Epoch 19, batch 4450, loss[loss=0.1778, simple_loss=0.2314, pruned_loss=0.06209, over 12973.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2394, pruned_loss=0.07007, over 2582767.78 frames. 
], batch size: 33, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:27:23,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.28 vs. limit=15.0 2024-06-21 07:27:23,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0 2024-06-21 07:27:27,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.93 vs. limit=15.0 2024-06-21 07:27:37,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342078.0, ans=0.1 2024-06-21 07:27:40,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=342078.0, ans=0.125 2024-06-21 07:27:51,123 INFO [train.py:1028] (0/2) Epoch 19, batch 4500, loss[loss=0.1777, simple_loss=0.2299, pruned_loss=0.06271, over 13272.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2389, pruned_loss=0.0697, over 2586227.78 frames. ], batch size: 89, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:27:55,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=342114.6666666667, ans=0.125 2024-06-21 07:27:55,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=342114.6666666667, ans=0.0 2024-06-21 07:27:58,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=342133.0, ans=0.0 2024-06-21 07:28:00,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=342133.0, ans=0.0 2024-06-21 07:28:02,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=342133.0, ans=0.0 2024-06-21 07:28:04,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=342151.3333333333, ans=0.125 2024-06-21 07:28:11,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=342169.6666666667, ans=0.0 2024-06-21 07:28:16,871 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 1.948e+02 2.036e+02 2.180e+02 2.904e+02, threshold=4.072e+02, percent-clipped=0.0 2024-06-21 07:28:17,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2024-06-21 07:28:24,249 INFO [train.py:1028] (0/2) Epoch 19, batch 4550, loss[loss=0.1831, simple_loss=0.238, pruned_loss=0.06405, over 13271.00 frames. ], tot_loss[loss=0.189, simple_loss=0.239, pruned_loss=0.06951, over 2589908.67 frames. 
], batch size: 52, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:28:35,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=342224.6666666667, ans=0.125 2024-06-21 07:28:49,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=342261.3333333333, ans=0.125 2024-06-21 07:28:56,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=342298.0, ans=0.025 2024-06-21 07:28:56,706 INFO [train.py:1028] (0/2) Epoch 19, batch 4600, loss[loss=0.2113, simple_loss=0.2576, pruned_loss=0.08247, over 12501.00 frames. ], tot_loss[loss=0.189, simple_loss=0.239, pruned_loss=0.06952, over 2585090.38 frames. ], batch size: 202, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:29:01,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.50 vs. limit=22.5 2024-06-21 07:29:14,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=342334.6666666667, ans=0.07 2024-06-21 07:29:14,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=342334.6666666667, ans=0.125 2024-06-21 07:29:19,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=342353.0, ans=0.2 2024-06-21 07:29:23,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2024-06-21 07:29:25,648 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.706e+02 1.889e+02 2.049e+02 2.297e+02 2.729e+02, threshold=4.098e+02, percent-clipped=0.0 2024-06-21 07:29:32,961 INFO [train.py:1028] (0/2) Epoch 19, batch 4650, loss[loss=0.1735, simple_loss=0.2174, pruned_loss=0.06478, over 13103.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2383, pruned_loss=0.06948, over 2587560.52 frames. ], batch size: 132, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:29:33,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=342389.6666666667, ans=10.0 2024-06-21 07:29:40,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342389.6666666667, ans=0.1 2024-06-21 07:29:41,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2024-06-21 07:29:51,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=342426.3333333333, ans=0.0 2024-06-21 07:29:51,999 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.65 vs. limit=15.0 2024-06-21 07:29:58,583 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:29:59,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.74 vs. 
limit=15.0 2024-06-21 07:30:09,817 INFO [train.py:1028] (0/2) Epoch 19, batch 4700, loss[loss=0.1792, simple_loss=0.2323, pruned_loss=0.0631, over 12331.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.239, pruned_loss=0.06993, over 2582608.02 frames. ], batch size: 25, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:30:19,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=342499.6666666667, ans=0.0 2024-06-21 07:30:29,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.52 vs. limit=15.0 2024-06-21 07:30:32,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=342536.3333333333, ans=0.0 2024-06-21 07:30:33,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.98 vs. limit=15.0 2024-06-21 07:30:33,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=342536.3333333333, ans=0.0 2024-06-21 07:30:34,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=342536.3333333333, ans=0.125 2024-06-21 07:30:35,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=342536.3333333333, ans=0.025 2024-06-21 07:30:36,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=342536.3333333333, ans=0.0 2024-06-21 07:30:36,589 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 1.980e+02 2.165e+02 2.370e+02 2.910e+02, threshold=4.331e+02, percent-clipped=0.0 2024-06-21 07:30:37,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=342554.6666666667, ans=0.025 2024-06-21 07:30:43,788 INFO [train.py:1028] (0/2) Epoch 19, batch 4750, loss[loss=0.2023, simple_loss=0.244, pruned_loss=0.08031, over 12590.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2382, pruned_loss=0.06974, over 2580057.94 frames. ], batch size: 202, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:30:48,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=342573.0, ans=0.125 2024-06-21 07:30:58,771 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:31:08,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=342628.0, ans=0.0 2024-06-21 07:31:18,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=342646.3333333333, ans=0.125 2024-06-21 07:31:21,691 INFO [train.py:1028] (0/2) Epoch 19, batch 4800, loss[loss=0.1798, simple_loss=0.239, pruned_loss=0.06028, over 13252.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2383, pruned_loss=0.06968, over 2576088.49 frames. 
], batch size: 63, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:31:27,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=342664.6666666667, ans=0.0 2024-06-21 07:31:49,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.41 vs. limit=15.0 2024-06-21 07:31:51,720 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 1.953e+02 2.094e+02 2.323e+02 2.994e+02, threshold=4.188e+02, percent-clipped=0.0 2024-06-21 07:31:53,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=342738.0, ans=0.2 2024-06-21 07:31:53,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=342738.0, ans=0.125 2024-06-21 07:31:56,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.84 vs. limit=15.0 2024-06-21 07:31:57,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=342738.0, ans=0.0 2024-06-21 07:31:59,009 INFO [train.py:1028] (0/2) Epoch 19, batch 4850, loss[loss=0.1753, simple_loss=0.2313, pruned_loss=0.05966, over 13279.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2371, pruned_loss=0.06894, over 2573612.45 frames. ], batch size: 89, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:32:04,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342756.3333333333, ans=0.1 2024-06-21 07:32:18,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=342793.0, ans=0.125 2024-06-21 07:32:21,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=342811.3333333333, ans=0.2 2024-06-21 07:32:22,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=342811.3333333333, ans=0.0 2024-06-21 07:32:24,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342811.3333333333, ans=0.1 2024-06-21 07:32:28,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=342829.6666666667, ans=0.95 2024-06-21 07:32:32,905 INFO [train.py:1028] (0/2) Epoch 19, batch 4900, loss[loss=0.1821, simple_loss=0.2333, pruned_loss=0.0654, over 13186.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2369, pruned_loss=0.06922, over 2575288.73 frames. 
], batch size: 59, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:32:38,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=342866.3333333333, ans=0.125 2024-06-21 07:32:55,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=342903.0, ans=0.125 2024-06-21 07:33:02,252 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.584e+02 1.897e+02 2.022e+02 2.196e+02 2.774e+02, threshold=4.044e+02, percent-clipped=0.0 2024-06-21 07:33:03,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=342921.3333333333, ans=0.2 2024-06-21 07:33:08,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=342939.6666666667, ans=0.0 2024-06-21 07:33:09,473 INFO [train.py:1028] (0/2) Epoch 19, batch 4950, loss[loss=0.2007, simple_loss=0.2379, pruned_loss=0.08173, over 10993.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2369, pruned_loss=0.06938, over 2569367.47 frames. ], batch size: 303, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:33:13,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=342939.6666666667, ans=0.125 2024-06-21 07:33:27,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=342976.3333333333, ans=0.0 2024-06-21 07:33:35,444 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.35 vs. limit=12.0 2024-06-21 07:33:45,532 INFO [train.py:1028] (0/2) Epoch 19, batch 5000, loss[loss=0.1795, simple_loss=0.2305, pruned_loss=0.06425, over 13219.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.237, pruned_loss=0.06913, over 2574195.72 frames. ], batch size: 95, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:33:46,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=343031.3333333333, ans=0.125 2024-06-21 07:33:54,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=343049.6666666667, ans=0.0 2024-06-21 07:33:57,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=343049.6666666667, ans=0.2 2024-06-21 07:34:08,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=343086.3333333333, ans=0.125 2024-06-21 07:34:11,659 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 1.843e+02 1.957e+02 2.118e+02 2.953e+02, threshold=3.913e+02, percent-clipped=0.0 2024-06-21 07:34:12,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=343104.6666666667, ans=0.0 2024-06-21 07:34:14,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=343104.6666666667, ans=0.04949747468305833 2024-06-21 07:34:18,930 INFO [train.py:1028] (0/2) Epoch 19, batch 5050, loss[loss=0.1697, simple_loss=0.2231, pruned_loss=0.05813, over 12827.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2373, pruned_loss=0.06903, over 2573072.74 frames. 
], batch size: 36, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:34:20,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=343123.0, ans=0.125 2024-06-21 07:34:20,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=343123.0, ans=0.125 2024-06-21 07:34:25,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=343141.3333333333, ans=0.0 2024-06-21 07:34:30,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=343141.3333333333, ans=0.125 2024-06-21 07:34:35,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343159.6666666667, ans=0.1 2024-06-21 07:34:46,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=343196.3333333333, ans=0.0 2024-06-21 07:34:50,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0 2024-06-21 07:34:52,184 INFO [train.py:1028] (0/2) Epoch 19, batch 5100, loss[loss=0.1663, simple_loss=0.221, pruned_loss=0.05579, over 12927.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2371, pruned_loss=0.06919, over 2569850.20 frames. ], batch size: 39, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:34:54,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343214.6666666667, ans=0.1 2024-06-21 07:35:06,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=343233.0, ans=0.0 2024-06-21 07:35:06,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.15 vs. limit=12.0 2024-06-21 07:35:12,652 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2024-06-21 07:35:15,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=343269.6666666667, ans=0.0 2024-06-21 07:35:16,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343269.6666666667, ans=0.1 2024-06-21 07:35:17,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=343269.6666666667, ans=0.0 2024-06-21 07:35:21,385 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 1.928e+02 2.122e+02 2.324e+02 3.199e+02, threshold=4.244e+02, percent-clipped=0.0 2024-06-21 07:35:26,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=343288.0, ans=0.125 2024-06-21 07:35:26,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
limit=22.5 2024-06-21 07:35:28,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=343306.3333333333, ans=0.1 2024-06-21 07:35:28,545 INFO [train.py:1028] (0/2) Epoch 19, batch 5150, loss[loss=0.1811, simple_loss=0.2239, pruned_loss=0.06916, over 13104.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2369, pruned_loss=0.06913, over 2571572.48 frames. ], batch size: 132, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:35:48,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=343343.0, ans=0.125 2024-06-21 07:35:54,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=343361.3333333333, ans=0.125 2024-06-21 07:35:56,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=343361.3333333333, ans=0.0 2024-06-21 07:36:02,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=343379.6666666667, ans=0.2 2024-06-21 07:36:05,374 INFO [train.py:1028] (0/2) Epoch 19, batch 5200, loss[loss=0.1855, simple_loss=0.2314, pruned_loss=0.06985, over 13165.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2367, pruned_loss=0.06892, over 2574152.84 frames. ], batch size: 95, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:36:06,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=343398.0, ans=10.0 2024-06-21 07:36:14,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=343416.3333333333, ans=0.0 2024-06-21 07:36:30,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.68 vs. limit=15.0 2024-06-21 07:36:31,524 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 1.889e+02 2.044e+02 2.199e+02 2.726e+02, threshold=4.087e+02, percent-clipped=0.0 2024-06-21 07:36:38,661 INFO [train.py:1028] (0/2) Epoch 19, batch 5250, loss[loss=0.1625, simple_loss=0.2159, pruned_loss=0.05459, over 13217.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2363, pruned_loss=0.06893, over 2569121.00 frames. ], batch size: 52, lr: 3.06e-03, grad_scale: 32.0 2024-06-21 07:36:38,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=343489.6666666667, ans=0.0 2024-06-21 07:36:49,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=343508.0, ans=0.125 2024-06-21 07:37:04,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=343544.6666666667, ans=0.0 2024-06-21 07:37:06,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343544.6666666667, ans=0.1 2024-06-21 07:37:10,388 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:37:15,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. 
limit=22.5 2024-06-21 07:37:15,473 INFO [train.py:1028] (0/2) Epoch 19, batch 5300, loss[loss=0.1736, simple_loss=0.225, pruned_loss=0.06112, over 13071.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2365, pruned_loss=0.06905, over 2565632.52 frames. ], batch size: 144, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:37:15,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=343581.3333333333, ans=0.2 2024-06-21 07:37:20,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=343581.3333333333, ans=0.125 2024-06-21 07:37:21,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=343599.6666666667, ans=0.125 2024-06-21 07:37:33,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2024-06-21 07:37:35,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=343618.0, ans=0.125 2024-06-21 07:37:43,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=343636.3333333333, ans=0.0 2024-06-21 07:37:44,977 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.721e+02 1.924e+02 2.069e+02 2.270e+02 3.512e+02, threshold=4.138e+02, percent-clipped=0.0 2024-06-21 07:37:45,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=343654.6666666667, ans=0.0 2024-06-21 07:37:52,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=343673.0, ans=0.0 2024-06-21 07:37:52,890 INFO [train.py:1028] (0/2) Epoch 19, batch 5350, loss[loss=0.1971, simple_loss=0.2491, pruned_loss=0.07252, over 12002.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2362, pruned_loss=0.06901, over 2573336.07 frames. ], batch size: 17, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:38:03,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=343691.3333333333, ans=0.125 2024-06-21 07:38:05,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=343709.6666666667, ans=0.125 2024-06-21 07:38:06,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=343709.6666666667, ans=0.125 2024-06-21 07:38:07,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=343709.6666666667, ans=0.0 2024-06-21 07:38:18,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.61 vs. limit=15.0 2024-06-21 07:38:21,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=343746.3333333333, ans=0.0 2024-06-21 07:38:24,898 INFO [train.py:1028] (0/2) Epoch 19, batch 5400, loss[loss=0.1981, simple_loss=0.236, pruned_loss=0.08009, over 12201.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2361, pruned_loss=0.06927, over 2565846.69 frames. 
], batch size: 240, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:38:34,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=343783.0, ans=0.2 2024-06-21 07:38:38,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=343801.3333333333, ans=0.1 2024-06-21 07:38:48,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.53 vs. limit=15.0 2024-06-21 07:38:54,116 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 1.913e+02 2.027e+02 2.190e+02 2.653e+02, threshold=4.054e+02, percent-clipped=0.0 2024-06-21 07:39:01,375 INFO [train.py:1028] (0/2) Epoch 19, batch 5450, loss[loss=0.2015, simple_loss=0.2526, pruned_loss=0.0752, over 12860.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2366, pruned_loss=0.06909, over 2570370.77 frames. ], batch size: 26, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:39:30,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=343929.6666666667, ans=0.5 2024-06-21 07:39:36,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=343948.0, ans=0.0 2024-06-21 07:39:37,298 INFO [train.py:1028] (0/2) Epoch 19, batch 5500, loss[loss=0.2201, simple_loss=0.2615, pruned_loss=0.0893, over 12211.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.237, pruned_loss=0.06935, over 2564550.19 frames. ], batch size: 240, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:39:41,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=343948.0, ans=0.0 2024-06-21 07:39:42,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.34 vs. limit=10.0 2024-06-21 07:39:48,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2024-06-21 07:39:50,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=343984.6666666667, ans=0.125 2024-06-21 07:39:52,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=343984.6666666667, ans=0.125 2024-06-21 07:39:53,642 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.718e-03 2024-06-21 07:40:02,699 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.684e+02 1.901e+02 2.066e+02 2.390e+02 3.746e+02, threshold=4.132e+02, percent-clipped=0.0 2024-06-21 07:40:08,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=344021.3333333333, ans=0.125 2024-06-21 07:40:10,238 INFO [train.py:1028] (0/2) Epoch 19, batch 5550, loss[loss=0.1883, simple_loss=0.2379, pruned_loss=0.06939, over 13254.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2366, pruned_loss=0.06895, over 2568083.41 frames. ], batch size: 43, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:40:42,503 INFO [train.py:1028] (0/2) Epoch 19, batch 5600, loss[loss=0.2045, simple_loss=0.2482, pruned_loss=0.08045, over 13253.00 frames. 
], tot_loss[loss=0.1864, simple_loss=0.2359, pruned_loss=0.06845, over 2570182.39 frames. ], batch size: 89, lr: 3.06e-03, grad_scale: 64.0 2024-06-21 07:40:55,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=344149.6666666667, ans=0.125 2024-06-21 07:40:55,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=344149.6666666667, ans=0.125 2024-06-21 07:41:02,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0 2024-06-21 07:41:12,042 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 1.903e+02 2.029e+02 2.180e+02 2.583e+02, threshold=4.059e+02, percent-clipped=0.0 2024-06-21 07:41:18,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=12.0 2024-06-21 07:41:19,075 INFO [train.py:1028] (0/2) Epoch 19, batch 5650, loss[loss=0.1979, simple_loss=0.239, pruned_loss=0.07834, over 12570.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2361, pruned_loss=0.06863, over 2575691.36 frames. ], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:41:22,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=344223.0, ans=0.125 2024-06-21 07:41:37,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2024-06-21 07:41:43,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2024-06-21 07:41:46,637 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=15.0 2024-06-21 07:41:47,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=344278.0, ans=0.125 2024-06-21 07:41:47,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2024-06-21 07:41:55,450 INFO [train.py:1028] (0/2) Epoch 19, batch 5700, loss[loss=0.1884, simple_loss=0.2396, pruned_loss=0.06862, over 13302.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2356, pruned_loss=0.06855, over 2578267.95 frames. ], batch size: 63, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:42:02,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=344333.0, ans=0.0 2024-06-21 07:42:19,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=344369.6666666667, ans=0.035 2024-06-21 07:42:20,455 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.888e+02 2.025e+02 2.203e+02 2.999e+02, threshold=4.051e+02, percent-clipped=0.0 2024-06-21 07:42:20,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344388.0, ans=0.1 2024-06-21 07:42:27,736 INFO [train.py:1028] (0/2) Epoch 19, batch 5750, loss[loss=0.1987, simple_loss=0.247, pruned_loss=0.07524, over 12758.00 frames. 
], tot_loss[loss=0.1872, simple_loss=0.2366, pruned_loss=0.06889, over 2578914.26 frames. ], batch size: 176, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:42:27,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=344406.3333333333, ans=0.125 2024-06-21 07:42:41,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=344443.0, ans=0.125 2024-06-21 07:42:56,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=344479.6666666667, ans=0.0 2024-06-21 07:43:02,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=15.0 2024-06-21 07:43:03,713 INFO [train.py:1028] (0/2) Epoch 19, batch 5800, loss[loss=0.2124, simple_loss=0.2517, pruned_loss=0.0866, over 12870.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2383, pruned_loss=0.06984, over 2579338.67 frames. ], batch size: 177, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:43:11,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=344516.3333333333, ans=0.0 2024-06-21 07:43:14,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2024-06-21 07:43:22,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=344534.6666666667, ans=0.1 2024-06-21 07:43:32,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=344553.0, ans=0.2 2024-06-21 07:43:33,349 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.958e+02 2.123e+02 2.335e+02 3.232e+02, threshold=4.247e+02, percent-clipped=0.0 2024-06-21 07:43:39,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=344571.3333333333, ans=0.125 2024-06-21 07:43:40,566 INFO [train.py:1028] (0/2) Epoch 19, batch 5850, loss[loss=0.1908, simple_loss=0.243, pruned_loss=0.06931, over 12532.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2401, pruned_loss=0.07064, over 2577364.03 frames. 
], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:43:48,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=344608.0, ans=0.95 2024-06-21 07:43:49,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=344608.0, ans=0.125 2024-06-21 07:43:52,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=344608.0, ans=0.015 2024-06-21 07:43:56,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=344626.3333333333, ans=0.125 2024-06-21 07:44:03,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=344644.6666666667, ans=0.1 2024-06-21 07:44:07,483 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-188000.pt 2024-06-21 07:44:18,472 INFO [train.py:1028] (0/2) Epoch 19, batch 5900, loss[loss=0.1741, simple_loss=0.2197, pruned_loss=0.06422, over 13110.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2416, pruned_loss=0.07126, over 2578740.17 frames. ], batch size: 121, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:44:22,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=344681.3333333333, ans=0.1 2024-06-21 07:44:23,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2024-06-21 07:44:26,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344699.6666666667, ans=0.0 2024-06-21 07:44:34,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=344718.0, ans=0.0 2024-06-21 07:44:34,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=344718.0, ans=0.0 2024-06-21 07:44:44,392 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.639e+02 1.955e+02 2.100e+02 2.342e+02 3.281e+02, threshold=4.200e+02, percent-clipped=0.0 2024-06-21 07:44:55,065 INFO [train.py:1028] (0/2) Epoch 19, batch 5950, loss[loss=0.1891, simple_loss=0.2362, pruned_loss=0.07103, over 13131.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2428, pruned_loss=0.07142, over 2583956.20 frames. 
], batch size: 121, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:44:55,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=344773.0, ans=10.0 2024-06-21 07:44:57,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=344773.0, ans=0.125 2024-06-21 07:45:02,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=344791.3333333333, ans=22.5 2024-06-21 07:45:04,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=344791.3333333333, ans=0.0 2024-06-21 07:45:07,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=344791.3333333333, ans=0.2 2024-06-21 07:45:11,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344809.6666666667, ans=0.1 2024-06-21 07:45:20,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=344828.0, ans=0.2 2024-06-21 07:45:22,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0 2024-06-21 07:45:28,461 INFO [train.py:1028] (0/2) Epoch 19, batch 6000, loss[loss=0.2641, simple_loss=0.2955, pruned_loss=0.1164, over 12186.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2438, pruned_loss=0.07201, over 2576816.96 frames. ], batch size: 240, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:45:28,462 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 07:45:36,184 INFO [train.py:1060] (0/2) Epoch 19, validation: loss=0.1869, simple_loss=0.2515, pruned_loss=0.06121, over 351949.00 frames. 2024-06-21 07:45:36,185 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 07:45:52,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344901.3333333333, ans=0.1 2024-06-21 07:45:58,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=344901.3333333333, ans=0.125 2024-06-21 07:46:02,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=344919.6666666667, ans=0.2 2024-06-21 07:46:05,923 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.949e+02 2.070e+02 2.211e+02 3.861e+02, threshold=4.141e+02, percent-clipped=0.0 2024-06-21 07:46:07,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=344938.0, ans=0.0 2024-06-21 07:46:13,178 INFO [train.py:1028] (0/2) Epoch 19, batch 6050, loss[loss=0.2052, simple_loss=0.2566, pruned_loss=0.07694, over 12904.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2454, pruned_loss=0.07271, over 2580112.20 frames. ], batch size: 39, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:46:16,431 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.16 vs. 
limit=15.0 2024-06-21 07:46:24,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=344974.6666666667, ans=0.0 2024-06-21 07:46:27,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=344993.0, ans=0.0 2024-06-21 07:46:30,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=344993.0, ans=0.0 2024-06-21 07:46:32,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=344993.0, ans=0.2 2024-06-21 07:46:34,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=345011.3333333333, ans=0.025 2024-06-21 07:46:46,814 INFO [train.py:1028] (0/2) Epoch 19, batch 6100, loss[loss=0.1758, simple_loss=0.229, pruned_loss=0.06133, over 13120.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2466, pruned_loss=0.07266, over 2581619.78 frames. ], batch size: 121, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:46:51,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=345048.0, ans=0.125 2024-06-21 07:46:58,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345066.3333333333, ans=0.1 2024-06-21 07:47:01,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.13 vs. limit=8.0 2024-06-21 07:47:01,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=345066.3333333333, ans=0.025 2024-06-21 07:47:03,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=345084.6666666667, ans=0.1 2024-06-21 07:47:05,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=345084.6666666667, ans=0.1 2024-06-21 07:47:05,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2024-06-21 07:47:15,924 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 2.007e+02 2.141e+02 2.368e+02 3.897e+02, threshold=4.282e+02, percent-clipped=0.0 2024-06-21 07:47:16,156 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:47:23,328 INFO [train.py:1028] (0/2) Epoch 19, batch 6150, loss[loss=0.2034, simple_loss=0.2473, pruned_loss=0.07974, over 10861.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2486, pruned_loss=0.07329, over 2579714.60 frames. ], batch size: 304, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:47:29,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=345158.0, ans=0.125 2024-06-21 07:47:33,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=345158.0, ans=0.125 2024-06-21 07:47:36,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.45 vs. 
limit=15.0 2024-06-21 07:47:45,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=345176.3333333333, ans=0.04949747468305833 2024-06-21 07:48:00,265 INFO [train.py:1028] (0/2) Epoch 19, batch 6200, loss[loss=0.1969, simple_loss=0.2545, pruned_loss=0.06961, over 13252.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2504, pruned_loss=0.07405, over 2576714.67 frames. ], batch size: 89, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:48:26,229 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.100e+02 2.345e+02 2.601e+02 3.428e+02, threshold=4.690e+02, percent-clipped=0.0 2024-06-21 07:48:33,728 INFO [train.py:1028] (0/2) Epoch 19, batch 6250, loss[loss=0.1692, simple_loss=0.2226, pruned_loss=0.0579, over 13198.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2513, pruned_loss=0.07446, over 2569628.25 frames. ], batch size: 83, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:48:35,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=345323.0, ans=0.0 2024-06-21 07:48:37,411 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.72 vs. limit=12.0 2024-06-21 07:48:39,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=345323.0, ans=0.02 2024-06-21 07:48:58,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=15.0 2024-06-21 07:48:58,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=345378.0, ans=0.2 2024-06-21 07:49:04,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=12.0 2024-06-21 07:49:09,165 INFO [train.py:1028] (0/2) Epoch 19, batch 6300, loss[loss=0.1886, simple_loss=0.2448, pruned_loss=0.06622, over 11859.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2529, pruned_loss=0.07514, over 2564693.84 frames. ], batch size: 17, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:49:19,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=345433.0, ans=0.125 2024-06-21 07:49:34,607 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.757e+02 2.023e+02 2.163e+02 2.379e+02 3.057e+02, threshold=4.325e+02, percent-clipped=0.0 2024-06-21 07:49:45,802 INFO [train.py:1028] (0/2) Epoch 19, batch 6350, loss[loss=0.2377, simple_loss=0.2909, pruned_loss=0.09226, over 12585.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2543, pruned_loss=0.07505, over 2573902.92 frames. ], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:49:48,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.22 vs. limit=22.5 2024-06-21 07:49:49,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. 
limit=15.0 2024-06-21 07:49:50,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=345506.3333333333, ans=0.0 2024-06-21 07:50:08,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=345561.3333333333, ans=0.125 2024-06-21 07:50:18,186 INFO [train.py:1028] (0/2) Epoch 19, batch 6400, loss[loss=0.1944, simple_loss=0.2458, pruned_loss=0.07146, over 13239.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2565, pruned_loss=0.0762, over 2574234.90 frames. ], batch size: 67, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:50:24,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=345616.3333333333, ans=0.125 2024-06-21 07:50:26,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=345616.3333333333, ans=0.0 2024-06-21 07:50:31,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345634.6666666667, ans=0.1 2024-06-21 07:50:39,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=345653.0, ans=0.125 2024-06-21 07:50:42,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=345653.0, ans=0.125 2024-06-21 07:50:43,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.065e+02 2.280e+02 2.608e+02 4.185e+02, threshold=4.561e+02, percent-clipped=0.0 2024-06-21 07:50:46,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=345671.3333333333, ans=0.125 2024-06-21 07:50:48,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=345671.3333333333, ans=0.0 2024-06-21 07:50:48,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=345671.3333333333, ans=0.125 2024-06-21 07:50:48,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=345671.3333333333, ans=0.05 2024-06-21 07:50:49,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=345671.3333333333, ans=0.0 2024-06-21 07:50:50,585 INFO [train.py:1028] (0/2) Epoch 19, batch 6450, loss[loss=0.2495, simple_loss=0.2944, pruned_loss=0.1023, over 12566.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2578, pruned_loss=0.07679, over 2580598.28 frames. ], batch size: 202, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:50:55,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.24 vs. 
limit=10.0 2024-06-21 07:51:09,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=345726.3333333333, ans=0.0 2024-06-21 07:51:15,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=345744.6666666667, ans=0.025 2024-06-21 07:51:16,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=345744.6666666667, ans=0.0 2024-06-21 07:51:16,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=345744.6666666667, ans=0.125 2024-06-21 07:51:22,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=345763.0, ans=0.025 2024-06-21 07:51:28,602 INFO [train.py:1028] (0/2) Epoch 19, batch 6500, loss[loss=0.2312, simple_loss=0.2716, pruned_loss=0.09538, over 10648.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2598, pruned_loss=0.07736, over 2584147.16 frames. ], batch size: 304, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:51:30,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345781.3333333333, ans=0.1 2024-06-21 07:51:53,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=345836.3333333333, ans=0.035 2024-06-21 07:51:57,861 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.101e+02 2.298e+02 2.636e+02 3.938e+02, threshold=4.596e+02, percent-clipped=0.0 2024-06-21 07:51:58,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345854.6666666667, ans=0.1 2024-06-21 07:51:59,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=345854.6666666667, ans=0.125 2024-06-21 07:52:04,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=345873.0, ans=0.125 2024-06-21 07:52:05,032 INFO [train.py:1028] (0/2) Epoch 19, batch 6550, loss[loss=0.1945, simple_loss=0.2441, pruned_loss=0.07244, over 12610.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2604, pruned_loss=0.07744, over 2587801.88 frames. ], batch size: 22, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:52:07,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.04 vs. limit=15.0 2024-06-21 07:52:19,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0 2024-06-21 07:52:37,334 INFO [train.py:1028] (0/2) Epoch 19, batch 6600, loss[loss=0.2016, simple_loss=0.2569, pruned_loss=0.07309, over 13219.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2605, pruned_loss=0.07739, over 2589254.58 frames. 
], batch size: 72, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:52:38,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=345964.6666666667, ans=0.125 2024-06-21 07:52:44,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=345983.0, ans=0.0 2024-06-21 07:52:44,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345983.0, ans=0.1 2024-06-21 07:52:48,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345983.0, ans=0.1 2024-06-21 07:52:48,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=345983.0, ans=0.125 2024-06-21 07:52:49,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=345983.0, ans=0.125 2024-06-21 07:52:50,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=346001.3333333333, ans=0.0 2024-06-21 07:52:51,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.56 vs. limit=22.5 2024-06-21 07:52:52,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-21 07:53:03,062 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.038e+02 2.225e+02 2.438e+02 3.295e+02, threshold=4.451e+02, percent-clipped=0.0 2024-06-21 07:53:12,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=346038.0, ans=0.125 2024-06-21 07:53:13,721 INFO [train.py:1028] (0/2) Epoch 19, batch 6650, loss[loss=0.2296, simple_loss=0.2791, pruned_loss=0.09004, over 12920.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2621, pruned_loss=0.07785, over 2585131.21 frames. ], batch size: 158, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:53:15,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=346056.3333333333, ans=0.125 2024-06-21 07:53:28,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=346093.0, ans=0.0 2024-06-21 07:53:32,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=346093.0, ans=0.0 2024-06-21 07:53:33,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=346111.3333333333, ans=0.125 2024-06-21 07:53:37,721 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:53:38,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.79 vs. 
limit=10.0 2024-06-21 07:53:42,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346129.6666666667, ans=0.125 2024-06-21 07:53:46,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.38 vs. limit=22.5 2024-06-21 07:53:47,090 INFO [train.py:1028] (0/2) Epoch 19, batch 6700, loss[loss=0.2197, simple_loss=0.2676, pruned_loss=0.08587, over 12800.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2637, pruned_loss=0.07872, over 2584402.22 frames. ], batch size: 176, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:54:04,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=346184.6666666667, ans=0.025 2024-06-21 07:54:07,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=346184.6666666667, ans=0.04949747468305833 2024-06-21 07:54:16,710 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.076e+02 2.216e+02 2.445e+02 3.352e+02, threshold=4.433e+02, percent-clipped=0.0 2024-06-21 07:54:24,467 INFO [train.py:1028] (0/2) Epoch 19, batch 6750, loss[loss=0.2673, simple_loss=0.3047, pruned_loss=0.115, over 12281.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2639, pruned_loss=0.07892, over 2577217.04 frames. ], batch size: 241, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:54:24,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=346239.6666666667, ans=0.0 2024-06-21 07:54:31,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=346258.0, ans=0.125 2024-06-21 07:54:32,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=346258.0, ans=0.125 2024-06-21 07:54:33,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346258.0, ans=0.125 2024-06-21 07:54:37,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=346276.3333333333, ans=0.0 2024-06-21 07:54:41,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=346276.3333333333, ans=0.0 2024-06-21 07:54:45,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346294.6666666667, ans=0.1 2024-06-21 07:54:47,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=346294.6666666667, ans=0.0 2024-06-21 07:54:47,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=346294.6666666667, ans=0.0 2024-06-21 07:54:57,073 INFO [train.py:1028] (0/2) Epoch 19, batch 6800, loss[loss=0.1861, simple_loss=0.244, pruned_loss=0.06411, over 13190.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2647, pruned_loss=0.07895, over 2579299.45 frames. 
], batch size: 67, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:55:04,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=346349.6666666667, ans=0.1 2024-06-21 07:55:12,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=346349.6666666667, ans=0.125 2024-06-21 07:55:19,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=346386.3333333333, ans=0.0 2024-06-21 07:55:25,892 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.017e+02 2.126e+02 2.343e+02 3.420e+02, threshold=4.251e+02, percent-clipped=0.0 2024-06-21 07:55:26,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=346404.6666666667, ans=0.125 2024-06-21 07:55:33,254 INFO [train.py:1028] (0/2) Epoch 19, batch 6850, loss[loss=0.246, simple_loss=0.3041, pruned_loss=0.09391, over 13270.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2652, pruned_loss=0.07881, over 2582445.53 frames. ], batch size: 63, lr: 3.05e-03, grad_scale: 64.0 2024-06-21 07:55:36,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.59 vs. limit=15.0 2024-06-21 07:55:44,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=346441.3333333333, ans=0.0 2024-06-21 07:55:45,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=346441.3333333333, ans=0.125 2024-06-21 07:55:48,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=346459.6666666667, ans=0.125 2024-06-21 07:55:50,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2024-06-21 07:56:02,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=346496.3333333333, ans=0.0 2024-06-21 07:56:06,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2024-06-21 07:56:09,520 INFO [train.py:1028] (0/2) Epoch 19, batch 6900, loss[loss=0.228, simple_loss=0.2776, pruned_loss=0.08916, over 13008.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2659, pruned_loss=0.07919, over 2583586.43 frames. ], batch size: 48, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:56:10,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=346514.6666666667, ans=0.05 2024-06-21 07:56:11,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.75 vs. 
limit=15.0 2024-06-21 07:56:13,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=346514.6666666667, ans=0.0 2024-06-21 07:56:15,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346533.0, ans=0.125 2024-06-21 07:56:17,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=346533.0, ans=0.125 2024-06-21 07:56:20,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=346533.0, ans=0.125 2024-06-21 07:56:21,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=346551.3333333333, ans=0.125 2024-06-21 07:56:24,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2024-06-21 07:56:25,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=346551.3333333333, ans=0.125 2024-06-21 07:56:26,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346551.3333333333, ans=0.1 2024-06-21 07:56:34,622 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.092e+02 2.216e+02 2.491e+02 3.460e+02, threshold=4.431e+02, percent-clipped=0.0 2024-06-21 07:56:42,072 INFO [train.py:1028] (0/2) Epoch 19, batch 6950, loss[loss=0.2109, simple_loss=0.2688, pruned_loss=0.07648, over 11769.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2664, pruned_loss=0.07895, over 2578548.47 frames. ], batch size: 17, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:56:42,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=346606.3333333333, ans=0.125 2024-06-21 07:56:44,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=346606.3333333333, ans=0.125 2024-06-21 07:56:59,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=346643.0, ans=0.025 2024-06-21 07:57:01,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=346661.3333333333, ans=0.0 2024-06-21 07:57:12,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=346679.6666666667, ans=0.125 2024-06-21 07:57:17,980 INFO [train.py:1028] (0/2) Epoch 19, batch 7000, loss[loss=0.238, simple_loss=0.2808, pruned_loss=0.09757, over 12963.00 frames. ], tot_loss[loss=0.212, simple_loss=0.266, pruned_loss=0.07895, over 2575653.43 frames. ], batch size: 158, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:57:21,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.37 vs. 
limit=22.5 2024-06-21 07:57:25,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=346716.3333333333, ans=0.0 2024-06-21 07:57:44,880 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.166e+02 2.379e+02 2.652e+02 3.633e+02, threshold=4.758e+02, percent-clipped=0.0 2024-06-21 07:57:52,016 INFO [train.py:1028] (0/2) Epoch 19, batch 7050, loss[loss=0.23, simple_loss=0.2797, pruned_loss=0.09013, over 12741.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2676, pruned_loss=0.07945, over 2583335.74 frames. ], batch size: 176, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:57:52,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=12.0 2024-06-21 07:57:58,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=346808.0, ans=0.05 2024-06-21 07:58:07,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=346826.3333333333, ans=0.1 2024-06-21 07:58:08,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=346826.3333333333, ans=0.125 2024-06-21 07:58:13,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0 2024-06-21 07:58:19,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=346844.6666666667, ans=0.125 2024-06-21 07:58:27,664 INFO [train.py:1028] (0/2) Epoch 19, batch 7100, loss[loss=0.2535, simple_loss=0.3024, pruned_loss=0.1024, over 13232.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2688, pruned_loss=0.08017, over 2574821.26 frames. ], batch size: 112, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:58:36,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=346899.6666666667, ans=0.125 2024-06-21 07:58:47,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=346936.3333333333, ans=0.125 2024-06-21 07:58:52,512 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.099e+02 2.259e+02 2.470e+02 3.444e+02, threshold=4.518e+02, percent-clipped=0.0 2024-06-21 07:58:55,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346954.6666666667, ans=0.1 2024-06-21 07:58:59,673 INFO [train.py:1028] (0/2) Epoch 19, batch 7150, loss[loss=0.2561, simple_loss=0.2981, pruned_loss=0.1071, over 12574.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2692, pruned_loss=0.08012, over 2572632.00 frames. 
], batch size: 202, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:59:09,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=346991.3333333333, ans=0.125 2024-06-21 07:59:19,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=347009.6666666667, ans=0.0 2024-06-21 07:59:22,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=347028.0, ans=0.09899494936611666 2024-06-21 07:59:31,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=347046.3333333333, ans=0.125 2024-06-21 07:59:35,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=15.0 2024-06-21 07:59:35,430 INFO [train.py:1028] (0/2) Epoch 19, batch 7200, loss[loss=0.2297, simple_loss=0.2873, pruned_loss=0.08602, over 13178.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2703, pruned_loss=0.08045, over 2578657.69 frames. ], batch size: 112, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 07:59:38,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=347064.6666666667, ans=0.125 2024-06-21 07:59:39,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=347064.6666666667, ans=0.125 2024-06-21 07:59:40,807 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 07:59:42,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=347083.0, ans=0.125 2024-06-21 07:59:43,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=347083.0, ans=0.07 2024-06-21 07:59:45,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=347083.0, ans=0.1 2024-06-21 07:59:57,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=347119.6666666667, ans=0.125 2024-06-21 08:00:00,974 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 2.111e+02 2.252e+02 2.466e+02 3.220e+02, threshold=4.503e+02, percent-clipped=0.0 2024-06-21 08:00:06,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.86 vs. limit=15.0 2024-06-21 08:00:07,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=347138.0, ans=0.125 2024-06-21 08:00:08,690 INFO [train.py:1028] (0/2) Epoch 19, batch 7250, loss[loss=0.1995, simple_loss=0.2629, pruned_loss=0.068, over 12931.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2705, pruned_loss=0.08022, over 2578731.08 frames. 
], batch size: 36, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:00:21,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=347174.6666666667, ans=0.125 2024-06-21 08:00:25,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=347193.0, ans=0.125 2024-06-21 08:00:33,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=347211.3333333333, ans=0.0 2024-06-21 08:00:43,948 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.42 vs. limit=12.0 2024-06-21 08:00:47,421 INFO [train.py:1028] (0/2) Epoch 19, batch 7300, loss[loss=0.2098, simple_loss=0.2664, pruned_loss=0.07661, over 12859.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2719, pruned_loss=0.08074, over 2578099.17 frames. ], batch size: 36, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:00:57,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=347266.3333333333, ans=0.0 2024-06-21 08:00:59,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=347266.3333333333, ans=0.125 2024-06-21 08:01:13,684 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.070e+02 2.210e+02 2.407e+02 3.874e+02, threshold=4.419e+02, percent-clipped=0.0 2024-06-21 08:01:21,226 INFO [train.py:1028] (0/2) Epoch 19, batch 7350, loss[loss=0.2409, simple_loss=0.2937, pruned_loss=0.09409, over 13336.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2728, pruned_loss=0.08125, over 2578389.95 frames. ], batch size: 46, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:01:31,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=347339.6666666667, ans=0.125 2024-06-21 08:01:31,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=347339.6666666667, ans=0.025 2024-06-21 08:01:34,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.38 vs. limit=15.0 2024-06-21 08:01:46,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2024-06-21 08:01:49,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=347394.6666666667, ans=0.0 2024-06-21 08:02:00,047 INFO [train.py:1028] (0/2) Epoch 19, batch 7400, loss[loss=0.2351, simple_loss=0.2926, pruned_loss=0.08884, over 13290.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2728, pruned_loss=0.08152, over 2584004.49 frames. 
], batch size: 63, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:02:05,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=347431.3333333333, ans=0.125 2024-06-21 08:02:17,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=347468.0, ans=0.125 2024-06-21 08:02:19,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347468.0, ans=0.1 2024-06-21 08:02:26,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=347486.3333333333, ans=0.125 2024-06-21 08:02:30,128 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.138e+02 2.343e+02 2.549e+02 3.300e+02, threshold=4.685e+02, percent-clipped=0.0 2024-06-21 08:02:33,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=347504.6666666667, ans=0.0 2024-06-21 08:02:35,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=347504.6666666667, ans=0.125 2024-06-21 08:02:37,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2024-06-21 08:02:37,804 INFO [train.py:1028] (0/2) Epoch 19, batch 7450, loss[loss=0.1931, simple_loss=0.2495, pruned_loss=0.06829, over 12596.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2729, pruned_loss=0.08133, over 2577627.93 frames. ], batch size: 29, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:03:00,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=347578.0, ans=0.035 2024-06-21 08:03:01,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=347578.0, ans=0.125 2024-06-21 08:03:01,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=347578.0, ans=0.125 2024-06-21 08:03:06,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2024-06-21 08:03:08,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=347596.3333333333, ans=10.0 2024-06-21 08:03:09,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=347596.3333333333, ans=0.125 2024-06-21 08:03:12,305 INFO [train.py:1028] (0/2) Epoch 19, batch 7500, loss[loss=0.2173, simple_loss=0.2589, pruned_loss=0.08787, over 10636.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2737, pruned_loss=0.08208, over 2575893.40 frames. 
], batch size: 304, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:03:22,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=347633.0, ans=0.0 2024-06-21 08:03:24,069 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:03:39,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=347669.6666666667, ans=0.125 2024-06-21 08:03:43,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.130e+02 2.256e+02 2.505e+02 3.086e+02, threshold=4.512e+02, percent-clipped=0.0 2024-06-21 08:03:49,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=347688.0, ans=0.125 2024-06-21 08:03:50,332 INFO [train.py:1028] (0/2) Epoch 19, batch 7550, loss[loss=0.2171, simple_loss=0.2704, pruned_loss=0.08188, over 12950.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2747, pruned_loss=0.08278, over 2575981.15 frames. ], batch size: 158, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:03:57,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=347724.6666666667, ans=0.2 2024-06-21 08:04:04,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.98 vs. limit=12.0 2024-06-21 08:04:05,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=347743.0, ans=0.1 2024-06-21 08:04:18,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=347761.3333333333, ans=0.015 2024-06-21 08:04:18,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=347761.3333333333, ans=10.0 2024-06-21 08:04:26,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-21 08:04:27,790 INFO [train.py:1028] (0/2) Epoch 19, batch 7600, loss[loss=0.219, simple_loss=0.2737, pruned_loss=0.0821, over 13241.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2748, pruned_loss=0.08257, over 2575792.36 frames. ], batch size: 83, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:04:32,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=12.0 2024-06-21 08:04:33,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=347816.3333333333, ans=0.125 2024-06-21 08:04:43,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=347834.6666666667, ans=0.0 2024-06-21 08:04:45,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. 
limit=6.0 2024-06-21 08:04:48,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=347853.0, ans=0.0 2024-06-21 08:04:53,253 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.224e+02 2.502e+02 2.898e+02 4.118e+02, threshold=5.003e+02, percent-clipped=0.0 2024-06-21 08:05:00,998 INFO [train.py:1028] (0/2) Epoch 19, batch 7650, loss[loss=0.2054, simple_loss=0.2604, pruned_loss=0.07519, over 12859.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2748, pruned_loss=0.08227, over 2571160.59 frames. ], batch size: 33, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:05:05,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=347889.6666666667, ans=0.125 2024-06-21 08:05:09,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=347908.0, ans=0.125 2024-06-21 08:05:10,400 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:05:13,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=347926.3333333333, ans=0.0 2024-06-21 08:05:34,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=347963.0, ans=0.0 2024-06-21 08:05:37,785 INFO [train.py:1028] (0/2) Epoch 19, batch 7700, loss[loss=0.2211, simple_loss=0.2825, pruned_loss=0.07986, over 13277.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2753, pruned_loss=0.0825, over 2568775.36 frames. ], batch size: 63, lr: 3.04e-03, grad_scale: 128.0 2024-06-21 08:05:38,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=347981.3333333333, ans=0.125 2024-06-21 08:05:47,873 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.63 vs. limit=12.0 2024-06-21 08:05:51,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=348018.0, ans=0.2 2024-06-21 08:05:57,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=348036.3333333333, ans=0.125 2024-06-21 08:06:00,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=348036.3333333333, ans=0.0 2024-06-21 08:06:03,152 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.111e+02 2.206e+02 2.416e+02 3.191e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 08:06:07,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=348054.6666666667, ans=0.5 2024-06-21 08:06:10,345 INFO [train.py:1028] (0/2) Epoch 19, batch 7750, loss[loss=0.2266, simple_loss=0.2943, pruned_loss=0.07946, over 13244.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2764, pruned_loss=0.08291, over 2573745.91 frames. ], batch size: 72, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:06:22,247 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. 
limit=6.0 2024-06-21 08:06:46,699 INFO [train.py:1028] (0/2) Epoch 19, batch 7800, loss[loss=0.2318, simple_loss=0.2834, pruned_loss=0.09012, over 13196.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2765, pruned_loss=0.08281, over 2578435.91 frames. ], batch size: 95, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:07:00,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=348201.3333333333, ans=0.2 2024-06-21 08:07:01,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=348201.3333333333, ans=0.0 2024-06-21 08:07:05,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=348201.3333333333, ans=0.1 2024-06-21 08:07:15,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=348219.6666666667, ans=0.1 2024-06-21 08:07:16,901 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.127e+02 2.318e+02 2.553e+02 3.908e+02, threshold=4.636e+02, percent-clipped=0.0 2024-06-21 08:07:23,370 INFO [train.py:1028] (0/2) Epoch 19, batch 7850, loss[loss=0.2286, simple_loss=0.2878, pruned_loss=0.0847, over 11500.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2773, pruned_loss=0.08334, over 2572310.79 frames. ], batch size: 17, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:07:30,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=348274.6666666667, ans=0.0 2024-06-21 08:07:31,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=348274.6666666667, ans=0.2 2024-06-21 08:07:32,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=348274.6666666667, ans=0.2 2024-06-21 08:07:37,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=348293.0, ans=0.125 2024-06-21 08:07:48,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=348311.3333333333, ans=0.5 2024-06-21 08:07:49,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.34 vs. limit=22.5 2024-06-21 08:07:56,383 INFO [train.py:1028] (0/2) Epoch 19, batch 7900, loss[loss=0.2252, simple_loss=0.2876, pruned_loss=0.08145, over 13157.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2775, pruned_loss=0.08355, over 2572130.94 frames. ], batch size: 77, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:08:15,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=348384.6666666667, ans=0.125 2024-06-21 08:08:26,020 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.169e+02 2.314e+02 2.473e+02 3.845e+02, threshold=4.627e+02, percent-clipped=0.0 2024-06-21 08:08:28,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=348421.3333333333, ans=0.125 2024-06-21 08:08:32,382 INFO [train.py:1028] (0/2) Epoch 19, batch 7950, loss[loss=0.2138, simple_loss=0.2558, pruned_loss=0.08587, over 10531.00 frames. 
], tot_loss[loss=0.222, simple_loss=0.2773, pruned_loss=0.08334, over 2575266.23 frames. ], batch size: 303, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:08:44,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=348458.0, ans=0.125 2024-06-21 08:08:58,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348513.0, ans=0.1 2024-06-21 08:09:05,016 INFO [train.py:1028] (0/2) Epoch 19, batch 8000, loss[loss=0.2303, simple_loss=0.2864, pruned_loss=0.08708, over 12755.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2785, pruned_loss=0.0839, over 2572805.13 frames. ], batch size: 29, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:09:12,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348549.6666666667, ans=0.1 2024-06-21 08:09:21,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=348568.0, ans=0.125 2024-06-21 08:09:25,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=348568.0, ans=0.125 2024-06-21 08:09:33,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348604.6666666667, ans=0.1 2024-06-21 08:09:34,266 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.155e+02 2.286e+02 2.574e+02 3.438e+02, threshold=4.571e+02, percent-clipped=0.0 2024-06-21 08:09:38,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=348604.6666666667, ans=0.0 2024-06-21 08:09:40,726 INFO [train.py:1028] (0/2) Epoch 19, batch 8050, loss[loss=0.2206, simple_loss=0.2793, pruned_loss=0.08095, over 13198.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2785, pruned_loss=0.0839, over 2572230.07 frames. ], batch size: 83, lr: 3.04e-03, grad_scale: 64.0 2024-06-21 08:09:42,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=348623.0, ans=0.125 2024-06-21 08:09:45,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=348623.0, ans=0.05 2024-06-21 08:09:49,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. 
limit=6.0 2024-06-21 08:09:53,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=348659.6666666667, ans=0.025 2024-06-21 08:09:53,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=348659.6666666667, ans=0.025 2024-06-21 08:09:57,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=348659.6666666667, ans=0.125 2024-06-21 08:09:57,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=348659.6666666667, ans=0.025 2024-06-21 08:10:05,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=348696.3333333333, ans=0.125 2024-06-21 08:10:15,853 INFO [train.py:1028] (0/2) Epoch 19, batch 8100, loss[loss=0.2118, simple_loss=0.2726, pruned_loss=0.07554, over 13166.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2792, pruned_loss=0.08411, over 2576335.00 frames. ], batch size: 112, lr: 3.04e-03, grad_scale: 32.0 2024-06-21 08:10:25,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=348733.0, ans=0.125 2024-06-21 08:10:40,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=348769.6666666667, ans=0.125 2024-06-21 08:10:42,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=348788.0, ans=0.5 2024-06-21 08:10:43,081 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.117e+02 2.249e+02 2.407e+02 3.517e+02, threshold=4.499e+02, percent-clipped=0.0 2024-06-21 08:10:45,820 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2024-06-21 08:10:48,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2024-06-21 08:10:49,412 INFO [train.py:1028] (0/2) Epoch 19, batch 8150, loss[loss=0.206, simple_loss=0.2584, pruned_loss=0.0768, over 13122.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2796, pruned_loss=0.08392, over 2579829.20 frames. ], batch size: 121, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:11:01,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=348824.6666666667, ans=0.2 2024-06-21 08:11:01,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=348843.0, ans=0.125 2024-06-21 08:11:03,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.56 vs. 
limit=15.0 2024-06-21 08:11:06,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=348843.0, ans=0.125 2024-06-21 08:11:13,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=348861.3333333333, ans=0.0 2024-06-21 08:11:14,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=348879.6666666667, ans=0.125 2024-06-21 08:11:22,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=348879.6666666667, ans=0.0 2024-06-21 08:11:25,094 INFO [train.py:1028] (0/2) Epoch 19, batch 8200, loss[loss=0.2413, simple_loss=0.2928, pruned_loss=0.09487, over 13097.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2804, pruned_loss=0.0841, over 2583212.71 frames. ], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:11:40,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348934.6666666667, ans=0.1 2024-06-21 08:11:42,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=348934.6666666667, ans=0.09899494936611666 2024-06-21 08:11:43,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=348934.6666666667, ans=0.2 2024-06-21 08:11:44,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=348953.0, ans=0.2 2024-06-21 08:11:46,216 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:11:52,719 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.175e+02 2.320e+02 2.656e+02 3.379e+02, threshold=4.640e+02, percent-clipped=0.0 2024-06-21 08:11:55,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.65 vs. limit=15.0 2024-06-21 08:11:56,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=348971.3333333333, ans=0.0 2024-06-21 08:11:58,642 INFO [train.py:1028] (0/2) Epoch 19, batch 8250, loss[loss=0.2123, simple_loss=0.2747, pruned_loss=0.07494, over 13240.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2806, pruned_loss=0.08404, over 2584653.56 frames. ], batch size: 52, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:12:06,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=349008.0, ans=0.0 2024-06-21 08:12:08,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=349008.0, ans=0.125 2024-06-21 08:12:17,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.06 vs. limit=15.0 2024-06-21 08:12:19,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=349026.3333333333, ans=0.125 2024-06-21 08:12:27,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. 
limit=6.0 2024-06-21 08:12:35,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=349063.0, ans=0.2 2024-06-21 08:12:36,222 INFO [train.py:1028] (0/2) Epoch 19, batch 8300, loss[loss=0.2363, simple_loss=0.2818, pruned_loss=0.09537, over 12997.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.28, pruned_loss=0.08384, over 2580004.96 frames. ], batch size: 102, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:12:47,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=349099.6666666667, ans=0.035 2024-06-21 08:12:49,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=349118.0, ans=0.025 2024-06-21 08:12:54,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=349118.0, ans=0.0 2024-06-21 08:12:57,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=349136.3333333333, ans=0.0 2024-06-21 08:13:03,813 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.149e+02 2.314e+02 2.548e+02 3.509e+02, threshold=4.628e+02, percent-clipped=0.0 2024-06-21 08:13:05,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=349154.6666666667, ans=0.125 2024-06-21 08:13:09,579 INFO [train.py:1028] (0/2) Epoch 19, batch 8350, loss[loss=0.237, simple_loss=0.2907, pruned_loss=0.09165, over 13204.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2798, pruned_loss=0.08346, over 2581793.20 frames. ], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:13:20,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=349191.3333333333, ans=0.1 2024-06-21 08:13:23,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=349191.3333333333, ans=0.0 2024-06-21 08:13:33,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=349228.0, ans=0.125 2024-06-21 08:13:38,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=349228.0, ans=0.0 2024-06-21 08:13:41,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=349246.3333333333, ans=0.125 2024-06-21 08:13:46,280 INFO [train.py:1028] (0/2) Epoch 19, batch 8400, loss[loss=0.1944, simple_loss=0.2468, pruned_loss=0.07099, over 12899.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2804, pruned_loss=0.08403, over 2577064.51 frames. ], batch size: 39, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:13:58,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2024-06-21 08:14:01,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=349301.3333333333, ans=0.125 2024-06-21 08:14:10,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=349319.6666666667, ans=0.125 2024-06-21 08:14:11,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=349319.6666666667, ans=0.2 2024-06-21 08:14:12,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=349338.0, ans=0.025 2024-06-21 08:14:12,770 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.113e+02 2.250e+02 2.433e+02 3.273e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-21 08:14:21,763 INFO [train.py:1028] (0/2) Epoch 19, batch 8450, loss[loss=0.2145, simple_loss=0.2749, pruned_loss=0.07705, over 13182.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2803, pruned_loss=0.08392, over 2579409.91 frames. ], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:14:24,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=349356.3333333333, ans=0.125 2024-06-21 08:14:25,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=349356.3333333333, ans=0.0 2024-06-21 08:14:38,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=349393.0, ans=0.125 2024-06-21 08:14:51,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=349429.6666666667, ans=0.125 2024-06-21 08:14:53,968 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.00 vs. limit=22.5 2024-06-21 08:14:54,917 INFO [train.py:1028] (0/2) Epoch 19, batch 8500, loss[loss=0.2416, simple_loss=0.2911, pruned_loss=0.09601, over 12698.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2812, pruned_loss=0.08427, over 2577725.41 frames. ], batch size: 29, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:14:58,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=349448.0, ans=0.125 2024-06-21 08:15:06,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349466.3333333333, ans=0.1 2024-06-21 08:15:12,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=349484.6666666667, ans=0.2 2024-06-21 08:15:22,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=349503.0, ans=0.125 2024-06-21 08:15:25,493 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.133e+02 2.237e+02 2.425e+02 3.564e+02, threshold=4.474e+02, percent-clipped=0.0 2024-06-21 08:15:31,371 INFO [train.py:1028] (0/2) Epoch 19, batch 8550, loss[loss=0.2183, simple_loss=0.2777, pruned_loss=0.07944, over 12516.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2806, pruned_loss=0.08376, over 2575599.23 frames. 
], batch size: 22, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:15:31,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=349539.6666666667, ans=0.025 2024-06-21 08:15:44,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=349576.3333333333, ans=0.125 2024-06-21 08:15:45,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.55 vs. limit=12.0 2024-06-21 08:15:54,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=349594.6666666667, ans=0.125 2024-06-21 08:16:00,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=349613.0, ans=0.125 2024-06-21 08:16:05,000 INFO [train.py:1028] (0/2) Epoch 19, batch 8600, loss[loss=0.2019, simple_loss=0.2595, pruned_loss=0.0721, over 13124.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2807, pruned_loss=0.08373, over 2572884.09 frames. ], batch size: 112, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:16:12,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.76 vs. limit=22.5 2024-06-21 08:16:36,081 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.162e+02 2.401e+02 2.690e+02 3.858e+02, threshold=4.802e+02, percent-clipped=0.0 2024-06-21 08:16:42,317 INFO [train.py:1028] (0/2) Epoch 19, batch 8650, loss[loss=0.2206, simple_loss=0.2746, pruned_loss=0.08333, over 13057.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2807, pruned_loss=0.08374, over 2576082.74 frames. ], batch size: 102, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:16:42,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=349723.0, ans=0.2 2024-06-21 08:16:54,518 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2024-06-21 08:17:08,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.31 vs. limit=6.0 2024-06-21 08:17:15,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=349814.6666666667, ans=0.1 2024-06-21 08:17:18,998 INFO [train.py:1028] (0/2) Epoch 19, batch 8700, loss[loss=0.2404, simple_loss=0.301, pruned_loss=0.08987, over 13233.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2812, pruned_loss=0.08404, over 2573081.63 frames. ], batch size: 59, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:17:24,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=349814.6666666667, ans=0.125 2024-06-21 08:17:28,357 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.27 vs. 
limit=12.0 2024-06-21 08:17:34,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=349851.3333333333, ans=0.0 2024-06-21 08:17:36,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=349851.3333333333, ans=0.0 2024-06-21 08:17:40,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=349869.6666666667, ans=10.0 2024-06-21 08:17:44,833 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:17:46,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.152e+02 2.260e+02 2.429e+02 3.156e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 08:17:52,636 INFO [train.py:1028] (0/2) Epoch 19, batch 8750, loss[loss=0.211, simple_loss=0.2668, pruned_loss=0.0776, over 13088.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2816, pruned_loss=0.08446, over 2569182.27 frames. ], batch size: 121, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:17:55,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2024-06-21 08:18:05,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.10 vs. limit=15.0 2024-06-21 08:18:05,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=349943.0, ans=0.0 2024-06-21 08:18:06,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=349943.0, ans=0.125 2024-06-21 08:18:07,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.51 vs. limit=22.5 2024-06-21 08:18:09,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=349943.0, ans=0.2 2024-06-21 08:18:23,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=349979.6666666667, ans=0.125 2024-06-21 08:18:29,309 INFO [train.py:1028] (0/2) Epoch 19, batch 8800, loss[loss=0.2267, simple_loss=0.2963, pruned_loss=0.07852, over 13295.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2826, pruned_loss=0.08499, over 2574050.80 frames. 
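Note on the Whitening records ("metric=X vs. limit=Y"): scaling.py periodically measures how far a module's output covariance is from isotropic ("white") and logs the metric when it exceeds the module's configured limit (7.5, 12.0, 15.0, 22.5, ... above), at which point its whitening penalty kicks in. A sketch of one such metric, under the assumption that it is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue (1.0 for perfectly white features, larger for more anisotropic ones); the actual scaling.py formula may differ in detail:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """x: (num_frames, num_channels) activations for one module."""
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)
        metrics = []
        for g in range(num_groups):
            c = x[:, g, :].t() @ x[:, g, :] / num_frames  # covariance, (d, d)
            d = c.shape[0]
            # d * sum(eig^2) / (sum(eig))^2 == mean(eig^2) / mean(eig)^2 >= 1
            metrics.append((d * torch.trace(c @ c) / torch.trace(c) ** 2).item())
        return sum(metrics) / num_groups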
], batch size: 72, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:18:29,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349998.0, ans=0.1 2024-06-21 08:18:37,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=350016.3333333333, ans=0.125 2024-06-21 08:18:37,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=350016.3333333333, ans=0.0 2024-06-21 08:18:57,119 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.783e+02 2.127e+02 2.296e+02 2.456e+02 2.976e+02, threshold=4.593e+02, percent-clipped=0.0 2024-06-21 08:18:59,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=350071.3333333333, ans=0.07 2024-06-21 08:19:01,503 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.26 vs. limit=22.5 2024-06-21 08:19:01,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=350071.3333333333, ans=0.125 2024-06-21 08:19:03,287 INFO [train.py:1028] (0/2) Epoch 19, batch 8850, loss[loss=0.2338, simple_loss=0.2876, pruned_loss=0.09004, over 12594.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2815, pruned_loss=0.08472, over 2562581.79 frames. ], batch size: 202, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:19:09,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350089.6666666667, ans=0.1 2024-06-21 08:19:16,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=350108.0, ans=0.2 2024-06-21 08:19:22,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=350126.3333333333, ans=0.125 2024-06-21 08:19:29,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.09 vs. limit=22.5 2024-06-21 08:19:34,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350163.0, ans=0.1 2024-06-21 08:19:35,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350163.0, ans=0.1 2024-06-21 08:19:36,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350163.0, ans=0.1 2024-06-21 08:19:39,894 INFO [train.py:1028] (0/2) Epoch 19, batch 8900, loss[loss=0.222, simple_loss=0.2813, pruned_loss=0.08131, over 12828.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2825, pruned_loss=0.08528, over 2560812.00 frames. 
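Note on the per-batch loss records (loss=..., simple_loss=..., pruned_loss=...): the criterion is k2's pruned RNN-T loss, where a cheap "simple" joiner loss both regularizes training and supplies the pruning bounds for the exact loss. A hedged sketch of how the two are typically combined in icefall-style recipes; the warm-up ramp and its constants below are illustrative assumptions, not the verbatim train.py code, while the final 0.5 weight on simple_loss is inferred from the logged aggregates (e.g. 0.5 x 0.2800 + 0.08384 ≈ 0.2239):

    def combine_transducer_losses(
        simple_loss,
        pruned_loss,
        batch_idx_train: int,
        warm_step: int = 2000,           # illustrative warm-up length
        simple_loss_scale: float = 0.5,  # inferred from the logged loss arithmetic
    ):
        if batch_idx_train < warm_step:
            # early on, lean on the cheap simple loss and down-weight the
            # exact pruned loss while the pruning bounds are still unreliable
            frac = batch_idx_train / warm_step
            s_scale = 1.0 - (1.0 - simple_loss_scale) * frac
            p_scale = 0.1 + 0.9 * frac
        else:
            s_scale, p_scale = simple_loss_scale, 1.0
        return s_scale * simple_loss + p_scale * pruned_loss

At batch_count ≈ 350k both scales have long been at their final values, so loss ≈ 0.5 * simple_loss + pruned_loss throughout this stretch of the log.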
], batch size: 33, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:19:40,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=350181.3333333333, ans=0.2 2024-06-21 08:19:44,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=350181.3333333333, ans=10.0 2024-06-21 08:19:47,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=350199.6666666667, ans=0.0 2024-06-21 08:19:49,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=350199.6666666667, ans=0.125 2024-06-21 08:19:54,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.66 vs. limit=15.0 2024-06-21 08:19:57,846 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.76 vs. limit=22.5 2024-06-21 08:20:02,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=350236.3333333333, ans=0.0 2024-06-21 08:20:06,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.163e+02 2.301e+02 2.546e+02 3.527e+02, threshold=4.602e+02, percent-clipped=0.0 2024-06-21 08:20:06,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=350254.6666666667, ans=15.0 2024-06-21 08:20:08,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.19 vs. limit=12.0 2024-06-21 08:20:17,183 INFO [train.py:1028] (0/2) Epoch 19, batch 8950, loss[loss=0.2288, simple_loss=0.2822, pruned_loss=0.0877, over 12530.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2829, pruned_loss=0.08496, over 2561175.53 frames. ], batch size: 202, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:20:30,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=350309.6666666667, ans=0.2 2024-06-21 08:20:35,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=350309.6666666667, ans=0.125 2024-06-21 08:20:47,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=350346.3333333333, ans=0.125 2024-06-21 08:20:50,484 INFO [train.py:1028] (0/2) Epoch 19, batch 9000, loss[loss=0.2169, simple_loss=0.2735, pruned_loss=0.0801, over 13276.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2826, pruned_loss=0.08445, over 2566989.30 frames. ], batch size: 46, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:20:50,485 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 08:20:58,231 INFO [train.py:1060] (0/2) Epoch 19, validation: loss=0.1869, simple_loss=0.2513, pruned_loss=0.06122, over 351949.00 frames. 2024-06-21 08:20:58,231 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 08:21:00,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.22 vs. 
limit=22.5 2024-06-21 08:21:07,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=350383.0, ans=0.025 2024-06-21 08:21:17,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=350419.6666666667, ans=0.2 2024-06-21 08:21:24,433 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.167e+02 2.393e+02 2.729e+02 4.002e+02, threshold=4.787e+02, percent-clipped=0.0 2024-06-21 08:21:25,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350438.0, ans=0.1 2024-06-21 08:21:33,484 INFO [train.py:1028] (0/2) Epoch 19, batch 9050, loss[loss=0.2288, simple_loss=0.2862, pruned_loss=0.08568, over 10673.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2837, pruned_loss=0.0848, over 2565133.03 frames. ], batch size: 16, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:21:33,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=350456.3333333333, ans=0.125 2024-06-21 08:21:39,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=350474.6666666667, ans=0.125 2024-06-21 08:22:01,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=350529.6666666667, ans=0.0 2024-06-21 08:22:01,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=350529.6666666667, ans=10.0 2024-06-21 08:22:05,608 INFO [train.py:1028] (0/2) Epoch 19, batch 9100, loss[loss=0.2328, simple_loss=0.2921, pruned_loss=0.08676, over 13235.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2835, pruned_loss=0.08468, over 2566337.76 frames. ], batch size: 72, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:22:07,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350548.0, ans=0.1 2024-06-21 08:22:08,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=350548.0, ans=0.0 2024-06-21 08:22:18,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350584.6666666667, ans=0.0 2024-06-21 08:22:19,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=350584.6666666667, ans=0.125 2024-06-21 08:22:20,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=350584.6666666667, ans=0.125 2024-06-21 08:22:21,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=350584.6666666667, ans=0.07 2024-06-21 08:22:23,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350584.6666666667, ans=0.0 2024-06-21 08:22:28,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.61 vs. 
limit=15.0 2024-06-21 08:22:31,638 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.108e+02 2.251e+02 2.454e+02 3.174e+02, threshold=4.501e+02, percent-clipped=0.0 2024-06-21 08:22:34,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=350621.3333333333, ans=0.0 2024-06-21 08:22:37,207 INFO [train.py:1028] (0/2) Epoch 19, batch 9150, loss[loss=0.2122, simple_loss=0.2778, pruned_loss=0.07326, over 13175.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2836, pruned_loss=0.08486, over 2567515.73 frames. ], batch size: 77, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:22:41,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=350639.6666666667, ans=0.125 2024-06-21 08:22:48,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=350658.0, ans=0.1 2024-06-21 08:23:08,841 INFO [train.py:1028] (0/2) Epoch 19, batch 9200, loss[loss=0.2175, simple_loss=0.2798, pruned_loss=0.07762, over 12983.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2834, pruned_loss=0.08455, over 2570381.40 frames. ], batch size: 36, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:23:15,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=350749.6666666667, ans=0.125 2024-06-21 08:23:22,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=350768.0, ans=0.025 2024-06-21 08:23:26,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=350786.3333333333, ans=0.04949747468305833 2024-06-21 08:23:30,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=350786.3333333333, ans=0.0 2024-06-21 08:23:31,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=350786.3333333333, ans=0.2 2024-06-21 08:23:34,329 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.106e+02 2.278e+02 2.450e+02 3.188e+02, threshold=4.556e+02, percent-clipped=0.0 2024-06-21 08:23:44,372 INFO [train.py:1028] (0/2) Epoch 19, batch 9250, loss[loss=0.2103, simple_loss=0.2724, pruned_loss=0.07405, over 13200.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2831, pruned_loss=0.08422, over 2573097.96 frames. ], batch size: 67, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:23:50,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=350841.3333333333, ans=0.1 2024-06-21 08:23:54,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2024-06-21 08:24:01,545 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.31 vs. 
limit=6.0 2024-06-21 08:24:02,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=350878.0, ans=0.125 2024-06-21 08:24:06,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350878.0, ans=0.1 2024-06-21 08:24:11,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=350896.3333333333, ans=0.125 2024-06-21 08:24:11,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=350896.3333333333, ans=0.125 2024-06-21 08:24:16,043 INFO [train.py:1028] (0/2) Epoch 19, batch 9300, loss[loss=0.2048, simple_loss=0.258, pruned_loss=0.07576, over 12879.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2827, pruned_loss=0.08385, over 2570476.93 frames. ], batch size: 39, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:24:24,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=15.0 2024-06-21 08:24:30,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=350951.3333333333, ans=0.125 2024-06-21 08:24:35,872 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:24:36,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=350969.6666666667, ans=0.125 2024-06-21 08:24:41,346 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.120e+02 2.240e+02 2.415e+02 3.625e+02, threshold=4.481e+02, percent-clipped=0.0 2024-06-21 08:24:47,027 INFO [train.py:1028] (0/2) Epoch 19, batch 9350, loss[loss=0.2309, simple_loss=0.2859, pruned_loss=0.08792, over 12457.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2828, pruned_loss=0.08429, over 2567712.55 frames. ], batch size: 22, lr: 3.03e-03, grad_scale: 32.0 2024-06-21 08:24:52,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=351024.6666666667, ans=0.0 2024-06-21 08:24:57,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=351024.6666666667, ans=0.0 2024-06-21 08:24:57,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=351024.6666666667, ans=0.0 2024-06-21 08:25:05,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=351061.3333333333, ans=0.125 2024-06-21 08:25:11,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351061.3333333333, ans=0.1 2024-06-21 08:25:15,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=351079.6666666667, ans=0.0 2024-06-21 08:25:16,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=351079.6666666667, ans=0.0 2024-06-21 08:25:20,298 INFO [train.py:1028] (0/2) Epoch 19, batch 9400, loss[loss=0.254, simple_loss=0.3133, pruned_loss=0.09735, over 13257.00 frames. 
], tot_loss[loss=0.2263, simple_loss=0.2834, pruned_loss=0.08458, over 2567211.89 frames. ], batch size: 52, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:25:35,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351134.6666666667, ans=0.1 2024-06-21 08:25:40,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=351153.0, ans=0.125 2024-06-21 08:25:42,328 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.73 vs. limit=15.0 2024-06-21 08:25:42,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=351153.0, ans=0.0 2024-06-21 08:25:45,454 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.231e+02 2.406e+02 2.615e+02 3.702e+02, threshold=4.813e+02, percent-clipped=0.0 2024-06-21 08:25:46,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2024-06-21 08:25:49,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=351171.3333333333, ans=0.07 2024-06-21 08:25:50,937 INFO [train.py:1028] (0/2) Epoch 19, batch 9450, loss[loss=0.2362, simple_loss=0.2879, pruned_loss=0.09226, over 12553.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2844, pruned_loss=0.085, over 2567476.87 frames. ], batch size: 22, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:25:51,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.76 vs. limit=10.0 2024-06-21 08:26:02,725 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:26:04,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=351226.3333333333, ans=0.125 2024-06-21 08:26:13,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=351244.6666666667, ans=0.125 2024-06-21 08:26:15,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=351263.0, ans=0.125 2024-06-21 08:26:17,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=351263.0, ans=0.0 2024-06-21 08:26:17,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=351263.0, ans=0.025 2024-06-21 08:26:18,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=351263.0, ans=0.125 2024-06-21 08:26:21,204 INFO [train.py:1028] (0/2) Epoch 19, batch 9500, loss[loss=0.2253, simple_loss=0.2835, pruned_loss=0.08351, over 13250.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2843, pruned_loss=0.08461, over 2576140.43 frames. 
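Note on the recurring optim.py WARNING lines: they summarize the distribution of recent gradient norms as quantiles (min / 25% / median / 75% / max), the clipping threshold in force, and the fraction of recent batches actually clipped. The logged thresholds equal Clipping_scale = 2.0 times the logged median (e.g. 4.628e+02 = 2 x 2.314e+02), and since the max norm stays below the threshold, percent-clipped remains 0.0. A minimal sketch of threshold-from-quantiles clipping, assuming a rolling window of norms; the real ScaledAdam bookkeeping is more involved:

    from collections import deque
    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # rolling window of grad norms

        def clip_(self, params) -> None:
            grads = [p.grad for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
            self.norms.append(norm)
            t = torch.tensor(list(self.norms))
            quartiles = [torch.quantile(t, q).item()
                         for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.clipping_scale * quartiles[2]  # 2x the median
            if norm > threshold:  # such batches count toward percent-clipped
                for g in grads:
                    g.mul_(threshold / norm)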
], batch size: 43, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:26:28,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=351299.6666666667, ans=0.0 2024-06-21 08:26:39,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-21 08:26:41,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=351336.3333333333, ans=0.0 2024-06-21 08:26:48,175 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.133e+02 2.289e+02 2.499e+02 3.226e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 08:26:53,660 INFO [train.py:1028] (0/2) Epoch 19, batch 9550, loss[loss=0.2164, simple_loss=0.2769, pruned_loss=0.07795, over 12894.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2841, pruned_loss=0.08466, over 2572313.94 frames. ], batch size: 39, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:26:57,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=351373.0, ans=0.125 2024-06-21 08:27:12,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=351428.0, ans=0.125 2024-06-21 08:27:13,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2024-06-21 08:27:15,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=351428.0, ans=0.0 2024-06-21 08:27:23,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351464.6666666667, ans=0.125 2024-06-21 08:27:23,734 INFO [train.py:1028] (0/2) Epoch 19, batch 9600, loss[loss=0.2546, simple_loss=0.2993, pruned_loss=0.1049, over 10616.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2841, pruned_loss=0.08468, over 2570149.03 frames. ], batch size: 304, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:27:25,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=351464.6666666667, ans=0.125 2024-06-21 08:27:29,049 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2024-06-21 08:27:30,169 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.74 vs. 
limit=6.0 2024-06-21 08:27:35,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351501.3333333333, ans=0.125 2024-06-21 08:27:37,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=351501.3333333333, ans=0.125 2024-06-21 08:27:44,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=351519.6666666667, ans=0.125 2024-06-21 08:27:47,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=351519.6666666667, ans=0.125 2024-06-21 08:27:50,459 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.164e+02 2.321e+02 2.550e+02 3.114e+02, threshold=4.641e+02, percent-clipped=0.0 2024-06-21 08:27:52,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351538.0, ans=0.1 2024-06-21 08:27:55,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=351556.3333333333, ans=0.125 2024-06-21 08:27:56,080 INFO [train.py:1028] (0/2) Epoch 19, batch 9650, loss[loss=0.2376, simple_loss=0.2863, pruned_loss=0.09445, over 13094.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2838, pruned_loss=0.08491, over 2560148.37 frames. ], batch size: 132, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:27:58,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=351556.3333333333, ans=0.2 2024-06-21 08:27:59,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.61 vs. limit=22.5 2024-06-21 08:28:01,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=351574.6666666667, ans=0.1 2024-06-21 08:28:03,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=351574.6666666667, ans=0.1 2024-06-21 08:28:11,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=351593.0, ans=0.07 2024-06-21 08:28:12,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.22 vs. limit=15.0 2024-06-21 08:28:21,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=351629.6666666667, ans=0.1 2024-06-21 08:28:26,116 INFO [train.py:1028] (0/2) Epoch 19, batch 9700, loss[loss=0.2278, simple_loss=0.2783, pruned_loss=0.08865, over 13045.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2833, pruned_loss=0.08507, over 2555243.80 frames. 
], batch size: 144, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:28:43,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=351684.6666666667, ans=0.125 2024-06-21 08:28:49,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=351703.0, ans=0.0 2024-06-21 08:28:52,352 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.176e+02 2.339e+02 2.643e+02 3.345e+02, threshold=4.678e+02, percent-clipped=0.0 2024-06-21 08:28:57,727 INFO [train.py:1028] (0/2) Epoch 19, batch 9750, loss[loss=0.2186, simple_loss=0.2694, pruned_loss=0.08385, over 13060.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2819, pruned_loss=0.08443, over 2552055.74 frames. ], batch size: 132, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:29:00,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=351739.6666666667, ans=0.125 2024-06-21 08:29:05,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=351758.0, ans=0.2 2024-06-21 08:29:06,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=351758.0, ans=0.125 2024-06-21 08:29:08,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=351758.0, ans=0.125 2024-06-21 08:29:08,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=351758.0, ans=0.125 2024-06-21 08:29:18,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=351794.6666666667, ans=0.2 2024-06-21 08:29:19,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351794.6666666667, ans=0.1 2024-06-21 08:29:24,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=351813.0, ans=0.125 2024-06-21 08:29:27,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351813.0, ans=0.125 2024-06-21 08:29:28,425 INFO [train.py:1028] (0/2) Epoch 19, batch 9800, loss[loss=0.2135, simple_loss=0.2737, pruned_loss=0.07666, over 12905.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2817, pruned_loss=0.08423, over 2544890.39 frames. 
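Note on loss[... over N frames] vs. tot_loss[... over M frames]: the first triple is the current batch (N ≈ 13k frames), the second a frame-weighted running aggregate whose effective window here hovers near 2.5M frames. A sketch of such bookkeeping, assuming the accumulated (loss x frames, frames) statistics are decayed by a constant each batch; the decay constant below is chosen to reproduce the observed steady state and is an assumption:

    class RunningLoss:
        def __init__(self, decay: float = 0.995):
            self.decay = decay    # steady-state frames ~= batch_frames / (1 - decay)
            self.loss_sum = 0.0   # decayed sum of per-batch loss * frames
            self.frames = 0.0     # decayed sum of frames

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

With ~13k frames per batch and decay 0.995 the steady state is 13000 / 0.005 = 2.6e6 frames, in the same range as the 2.45M-2.58M logged above; the aggregate restarts at epoch boundaries, which is why epoch 20's tot_loss begins again from a single batch's statistics.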
], batch size: 39, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:29:44,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=351868.0, ans=0.0 2024-06-21 08:29:46,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351868.0, ans=0.1 2024-06-21 08:29:47,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=351886.3333333333, ans=0.025 2024-06-21 08:29:47,415 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:29:47,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=351886.3333333333, ans=0.125 2024-06-21 08:29:54,181 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.110e+02 2.278e+02 2.479e+02 2.878e+02, threshold=4.556e+02, percent-clipped=0.0 2024-06-21 08:29:59,518 INFO [train.py:1028] (0/2) Epoch 19, batch 9850, loss[loss=0.2273, simple_loss=0.2828, pruned_loss=0.08597, over 13027.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2809, pruned_loss=0.08373, over 2537360.47 frames. ], batch size: 102, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:30:03,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=351923.0, ans=0.0 2024-06-21 08:30:08,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=351941.3333333333, ans=0.2 2024-06-21 08:30:25,903 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-192000.pt 2024-06-21 08:30:32,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=351996.3333333333, ans=0.1 2024-06-21 08:30:34,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=351996.3333333333, ans=0.125 2024-06-21 08:30:36,592 INFO [train.py:1028] (0/2) Epoch 19, batch 9900, loss[loss=0.2125, simple_loss=0.285, pruned_loss=0.06996, over 12938.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2793, pruned_loss=0.08314, over 2531024.19 frames. ], batch size: 39, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:30:37,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.19 vs. limit=15.0 2024-06-21 08:30:51,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.43 vs. limit=12.0 2024-06-21 08:31:03,663 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.142e+02 2.293e+02 2.443e+02 3.034e+02, threshold=4.585e+02, percent-clipped=0.0 2024-06-21 08:31:09,362 INFO [train.py:1028] (0/2) Epoch 19, batch 9950, loss[loss=0.2218, simple_loss=0.2779, pruned_loss=0.08289, over 12739.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2783, pruned_loss=0.08347, over 2525692.31 frames. 
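Note on the checkpoint.py record above (Saving checkpoint to zipformer/exp/checkpoint-192000.pt): batch-count checkpoints are written every fixed number of batches on top of the per-epoch saves (epoch-19.pt follows a few records later), with old ones pruned. A sketch of that cadence; the helper is illustrative, and save_every_n = 4000 (consistent with 192000 being a multiple of 4000) and the keep_last_k pruning policy are assumptions:

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, exp_dir: Path, batch_idx_train: int,
                              save_every_n: int = 4000,
                              keep_last_k: int = 30) -> None:
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save({"model": model.state_dict(),
                    "batch_idx_train": batch_idx_train}, path)
        # keep only the newest keep_last_k batch checkpoints
        ckpts = sorted(exp_dir.glob("checkpoint-*.pt"),
                       key=lambda p: int(p.stem.split("-")[1]))
        for old in ckpts[:-keep_last_k]:
            old.unlink()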
], batch size: 29, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:31:11,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=352106.3333333333, ans=22.5 2024-06-21 08:31:13,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=352106.3333333333, ans=0.0 2024-06-21 08:31:15,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=352124.6666666667, ans=0.0 2024-06-21 08:31:16,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=352124.6666666667, ans=0.125 2024-06-21 08:31:24,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=352143.0, ans=0.125 2024-06-21 08:31:29,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=352161.3333333333, ans=0.125 2024-06-21 08:31:36,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=352179.6666666667, ans=0.2 2024-06-21 08:31:37,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=352179.6666666667, ans=0.125 2024-06-21 08:31:37,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=352179.6666666667, ans=0.125 2024-06-21 08:31:41,482 INFO [train.py:1028] (0/2) Epoch 19, batch 10000, loss[loss=0.2263, simple_loss=0.2813, pruned_loss=0.08562, over 12390.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2795, pruned_loss=0.08417, over 2489011.69 frames. ], batch size: 22, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:31:50,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2024-06-21 08:31:53,058 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.830e-01 2024-06-21 08:32:01,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=352253.0, ans=0.125 2024-06-21 08:32:02,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=352253.0, ans=0.125 2024-06-21 08:32:07,408 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.225e+02 2.432e+02 2.702e+02 3.815e+02, threshold=4.865e+02, percent-clipped=0.0 2024-06-21 08:32:08,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=352271.3333333333, ans=0.0 2024-06-21 08:32:08,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=352271.3333333333, ans=0.125 2024-06-21 08:32:09,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=352271.3333333333, ans=0.0 2024-06-21 08:32:13,403 INFO [train.py:1028] (0/2) Epoch 19, batch 10050, loss[loss=0.2159, simple_loss=0.276, pruned_loss=0.07793, over 12568.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2795, pruned_loss=0.08488, over 2446639.46 frames. 
], batch size: 22, lr: 3.02e-03, grad_scale: 32.0 2024-06-21 08:32:19,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=352308.0, ans=0.2 2024-06-21 08:32:24,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=352308.0, ans=0.125 2024-06-21 08:32:26,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=352326.3333333333, ans=0.125 2024-06-21 08:32:44,313 INFO [train.py:1028] (0/2) Epoch 19, batch 10100, loss[loss=0.1979, simple_loss=0.2559, pruned_loss=0.07, over 10890.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2795, pruned_loss=0.08468, over 2427527.22 frames. ], batch size: 16, lr: 3.02e-03, grad_scale: 64.0 2024-06-21 08:32:46,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.88 vs. limit=10.0 2024-06-21 08:32:47,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=352381.3333333333, ans=0.0 2024-06-21 08:32:51,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2024-06-21 08:32:57,415 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-19.pt 2024-06-21 08:35:01,684 INFO [train.py:1028] (0/2) Epoch 20, batch 0, loss[loss=0.2024, simple_loss=0.2575, pruned_loss=0.07364, over 12967.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2575, pruned_loss=0.07364, over 12967.00 frames. ], batch size: 36, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:35:01,685 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 08:35:05,420 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.1442, 2.3597, 2.8638, 1.6156], device='cuda:0') 2024-06-21 08:35:08,564 INFO [train.py:1060] (0/2) Epoch 20, validation: loss=0.1882, simple_loss=0.2529, pruned_loss=0.06178, over 351949.00 frames. 2024-06-21 08:35:08,564 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 08:35:11,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.72 vs. limit=15.0 2024-06-21 08:35:14,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=22.5 2024-06-21 08:35:16,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=352430.8333333333, ans=0.125 2024-06-21 08:35:17,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=352430.8333333333, ans=0.125 2024-06-21 08:35:25,172 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.051e+02 2.258e+02 2.481e+02 3.459e+02, threshold=4.516e+02, percent-clipped=0.0 2024-06-21 08:35:28,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.02 vs. 
limit=22.5 2024-06-21 08:35:29,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2024-06-21 08:35:30,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352467.5, ans=0.1 2024-06-21 08:35:34,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=352467.5, ans=0.0 2024-06-21 08:35:39,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=352485.8333333333, ans=0.1 2024-06-21 08:35:42,854 INFO [train.py:1028] (0/2) Epoch 20, batch 50, loss[loss=0.203, simple_loss=0.2682, pruned_loss=0.06889, over 12631.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2617, pruned_loss=0.07763, over 575081.36 frames. ], batch size: 29, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:35:47,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=352504.1666666667, ans=0.125 2024-06-21 08:35:49,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=352522.5, ans=0.0 2024-06-21 08:36:07,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=352577.5, ans=0.07 2024-06-21 08:36:16,474 INFO [train.py:1028] (0/2) Epoch 20, batch 100, loss[loss=0.2067, simple_loss=0.268, pruned_loss=0.07272, over 13300.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2609, pruned_loss=0.07726, over 1017875.74 frames. ], batch size: 46, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:36:26,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=352614.1666666667, ans=0.0 2024-06-21 08:36:28,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.27 vs. limit=22.5 2024-06-21 08:36:30,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=352632.5, ans=0.0 2024-06-21 08:36:31,797 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.041e+02 2.153e+02 2.355e+02 3.255e+02, threshold=4.307e+02, percent-clipped=0.0 2024-06-21 08:36:33,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.40 vs. limit=15.0 2024-06-21 08:36:39,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.49 vs. limit=12.0 2024-06-21 08:36:43,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=352650.8333333333, ans=0.0 2024-06-21 08:36:44,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=352650.8333333333, ans=0.0 2024-06-21 08:36:52,169 INFO [train.py:1028] (0/2) Epoch 20, batch 150, loss[loss=0.2046, simple_loss=0.2695, pruned_loss=0.06978, over 12649.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2599, pruned_loss=0.07544, over 1365883.78 frames. 
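Note on the attn_weights_entropy diagnostic printed during the epoch-20 validation pass: zipformer.py dumps, for one self-attention module, a per-head entropy of the attention distribution; higher values mean a head spreads its attention widely, lower values mean it concentrates on a few positions. A sketch of such a diagnostic, assuming attn_weights holds softmax probabilities of shape (num_heads, tgt_len, src_len) and entropy is averaged over target positions (the exact reduction used in zipformer.py is an assumption):

    import torch

    def attn_weights_entropy(attn_weights: torch.Tensor,
                             eps: float = 1e-20) -> torch.Tensor:
        """Returns entropy in nats, one value per head."""
        p = attn_weights.clamp_min(eps)
        entropy = -(p * p.log()).sum(dim=-1)  # (num_heads, tgt_len)
        return entropy.mean(dim=-1)           # (num_heads,)

For scale, uniform attention over 64 source positions would give log(64) ≈ 4.16 nats; the heads logged above range from 1.62 to 4.14, i.e. some are far more peaked than others.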
], batch size: 29, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:36:52,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=352687.5, ans=0.125 2024-06-21 08:36:53,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=352687.5, ans=0.025 2024-06-21 08:36:53,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=352687.5, ans=0.125 2024-06-21 08:37:03,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352705.8333333333, ans=0.0 2024-06-21 08:37:04,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=352724.1666666667, ans=0.125 2024-06-21 08:37:10,695 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0 2024-06-21 08:37:24,314 INFO [train.py:1028] (0/2) Epoch 20, batch 200, loss[loss=0.2231, simple_loss=0.2723, pruned_loss=0.08693, over 12545.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2605, pruned_loss=0.07551, over 1635254.86 frames. ], batch size: 202, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:37:25,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. limit=15.0 2024-06-21 08:37:30,418 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=12.0 2024-06-21 08:37:39,863 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 1.997e+02 2.133e+02 2.256e+02 3.157e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 08:37:44,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352834.1666666667, ans=0.1 2024-06-21 08:37:46,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=352834.1666666667, ans=0.0 2024-06-21 08:37:49,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=352834.1666666667, ans=0.125 2024-06-21 08:37:50,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352852.5, ans=0.0 2024-06-21 08:37:56,527 INFO [train.py:1028] (0/2) Epoch 20, batch 250, loss[loss=0.2034, simple_loss=0.2516, pruned_loss=0.07765, over 13060.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2592, pruned_loss=0.07523, over 1846833.25 frames. ], batch size: 144, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:38:03,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=352889.1666666667, ans=0.125 2024-06-21 08:38:13,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.80 vs. limit=22.5 2024-06-21 08:38:33,897 INFO [train.py:1028] (0/2) Epoch 20, batch 300, loss[loss=0.2025, simple_loss=0.25, pruned_loss=0.07756, over 13218.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2591, pruned_loss=0.07514, over 2010075.82 frames. 
], batch size: 112, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:38:41,097 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.59 vs. limit=22.5 2024-06-21 08:38:45,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=352980.8333333333, ans=0.0 2024-06-21 08:38:52,656 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.027e+02 2.166e+02 2.374e+02 3.059e+02, threshold=4.333e+02, percent-clipped=0.0 2024-06-21 08:39:08,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=353035.8333333333, ans=0.125 2024-06-21 08:39:09,299 INFO [train.py:1028] (0/2) Epoch 20, batch 350, loss[loss=0.2042, simple_loss=0.2633, pruned_loss=0.07258, over 12887.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2589, pruned_loss=0.07486, over 2139281.52 frames. ], batch size: 33, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:39:22,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=353090.8333333333, ans=0.125 2024-06-21 08:39:26,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=353090.8333333333, ans=0.125 2024-06-21 08:39:38,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=353127.5, ans=10.0 2024-06-21 08:39:41,317 INFO [train.py:1028] (0/2) Epoch 20, batch 400, loss[loss=0.1997, simple_loss=0.2625, pruned_loss=0.06843, over 13266.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2588, pruned_loss=0.0746, over 2240325.95 frames. ], batch size: 63, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:39:42,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.29 vs. limit=22.5 2024-06-21 08:39:56,375 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 1.993e+02 2.159e+02 2.410e+02 3.746e+02, threshold=4.317e+02, percent-clipped=0.0 2024-06-21 08:40:12,695 INFO [train.py:1028] (0/2) Epoch 20, batch 450, loss[loss=0.1903, simple_loss=0.2468, pruned_loss=0.06695, over 13216.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2589, pruned_loss=0.0745, over 2314281.13 frames. ], batch size: 67, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:40:47,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.72 vs. limit=22.5 2024-06-21 08:40:53,478 INFO [train.py:1028] (0/2) Epoch 20, batch 500, loss[loss=0.1955, simple_loss=0.249, pruned_loss=0.07102, over 13042.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.259, pruned_loss=0.07424, over 2376359.90 frames. ], batch size: 121, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:41:02,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.13 vs. 
limit=15.0 2024-06-21 08:41:08,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=353365.8333333333, ans=0.125 2024-06-21 08:41:08,685 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.038e+02 2.188e+02 2.436e+02 3.019e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-21 08:41:13,007 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:41:25,683 INFO [train.py:1028] (0/2) Epoch 20, batch 550, loss[loss=0.2188, simple_loss=0.2657, pruned_loss=0.086, over 12903.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2585, pruned_loss=0.07411, over 2421124.85 frames. ], batch size: 158, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:41:27,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.07 vs. limit=22.5 2024-06-21 08:41:43,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353457.5, ans=0.1 2024-06-21 08:41:43,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=353457.5, ans=0.0 2024-06-21 08:41:44,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=353475.8333333333, ans=15.0 2024-06-21 08:41:47,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. limit=6.0 2024-06-21 08:41:56,497 INFO [train.py:1028] (0/2) Epoch 20, batch 600, loss[loss=0.196, simple_loss=0.2425, pruned_loss=0.0748, over 13030.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2583, pruned_loss=0.07403, over 2459441.65 frames. ], batch size: 144, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:42:04,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=353530.8333333333, ans=0.2 2024-06-21 08:42:07,354 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:42:09,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0 2024-06-21 08:42:11,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=353549.1666666667, ans=0.0 2024-06-21 08:42:11,669 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 1.985e+02 2.100e+02 2.261e+02 2.880e+02, threshold=4.199e+02, percent-clipped=0.0 2024-06-21 08:42:13,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=353549.1666666667, ans=0.125 2024-06-21 08:42:16,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=353567.5, ans=0.1 2024-06-21 08:42:17,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=353567.5, ans=0.0 2024-06-21 08:42:28,432 INFO [train.py:1028] (0/2) Epoch 20, batch 650, loss[loss=0.2041, simple_loss=0.2685, pruned_loss=0.06981, over 13233.00 frames. 
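Note on grad_scale: 32.0 / 64.0 in the batch records: this is the dynamic loss-scaling factor of fp16 mixed-precision training, which backs off on overflow and doubles after a long enough run of finite gradients (hence the jump from 32.0 to 64.0 around batch 10100 of epoch 19). The standard PyTorch AMP pattern, with model, optimizer, and batch as placeholders:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # manages the dynamic grad_scale

    def train_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # run the forward pass in fp16
            loss = model(batch)
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(optimizer)             # unscales grads; skips step on inf/nan
        scaler.update()                    # backs off or grows the scale factor
        return loss.detach()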
], tot_loss[loss=0.2035, simple_loss=0.2588, pruned_loss=0.07412, over 2490937.34 frames. ], batch size: 59, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:42:40,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=353622.5, ans=0.0 2024-06-21 08:42:42,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=353622.5, ans=0.125 2024-06-21 08:42:47,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=353640.8333333333, ans=0.025 2024-06-21 08:42:54,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.10 vs. limit=10.0 2024-06-21 08:43:06,239 INFO [train.py:1028] (0/2) Epoch 20, batch 700, loss[loss=0.2094, simple_loss=0.2648, pruned_loss=0.07698, over 13372.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2586, pruned_loss=0.0741, over 2514280.70 frames. ], batch size: 46, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:43:15,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=353714.1666666667, ans=0.1 2024-06-21 08:43:17,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.54 vs. limit=15.0 2024-06-21 08:43:19,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=353732.5, ans=0.2 2024-06-21 08:43:20,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353732.5, ans=0.1 2024-06-21 08:43:21,351 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.058e+02 2.232e+02 2.397e+02 4.050e+02, threshold=4.465e+02, percent-clipped=0.0 2024-06-21 08:43:38,022 INFO [train.py:1028] (0/2) Epoch 20, batch 750, loss[loss=0.1816, simple_loss=0.2438, pruned_loss=0.05975, over 13239.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2593, pruned_loss=0.07439, over 2528881.73 frames. ], batch size: 63, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:43:40,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=353787.5, ans=0.0 2024-06-21 08:43:44,518 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.54 vs. limit=15.0 2024-06-21 08:43:52,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353824.1666666667, ans=0.1 2024-06-21 08:43:53,263 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:44:03,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=353860.8333333333, ans=0.0 2024-06-21 08:44:09,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=353879.1666666667, ans=0.125 2024-06-21 08:44:09,948 INFO [train.py:1028] (0/2) Epoch 20, batch 800, loss[loss=0.1835, simple_loss=0.246, pruned_loss=0.06049, over 12840.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2588, pruned_loss=0.07431, over 2541015.55 frames. 
], batch size: 36, lr: 2.94e-03, grad_scale: 64.0 2024-06-21 08:44:11,906 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:44:21,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2024-06-21 08:44:25,285 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.033e+02 2.142e+02 2.349e+02 3.193e+02, threshold=4.284e+02, percent-clipped=0.0 2024-06-21 08:44:27,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=353915.8333333333, ans=0.1 2024-06-21 08:44:30,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=353934.1666666667, ans=0.125 2024-06-21 08:44:35,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=353934.1666666667, ans=0.125 2024-06-21 08:44:39,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=353952.5, ans=0.125 2024-06-21 08:44:41,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=353952.5, ans=0.125 2024-06-21 08:44:46,907 INFO [train.py:1028] (0/2) Epoch 20, batch 850, loss[loss=0.1997, simple_loss=0.2539, pruned_loss=0.07272, over 13171.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2585, pruned_loss=0.0741, over 2551217.18 frames. ], batch size: 95, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:44:47,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=353970.8333333333, ans=0.125 2024-06-21 08:44:48,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=353970.8333333333, ans=0.0 2024-06-21 08:44:49,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=22.5 2024-06-21 08:44:59,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=353989.1666666667, ans=0.0 2024-06-21 08:45:12,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.17 vs. limit=15.0 2024-06-21 08:45:15,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=354044.1666666667, ans=0.0 2024-06-21 08:45:21,577 INFO [train.py:1028] (0/2) Epoch 20, batch 900, loss[loss=0.1855, simple_loss=0.2475, pruned_loss=0.06175, over 12970.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2586, pruned_loss=0.07432, over 2556118.09 frames. ], batch size: 36, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:45:24,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=354062.5, ans=0.2 2024-06-21 08:45:32,653 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.28 vs. 
limit=22.5 2024-06-21 08:45:36,683 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.997e+02 2.099e+02 2.263e+02 3.423e+02, threshold=4.199e+02, percent-clipped=0.0 2024-06-21 08:45:36,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=354099.1666666667, ans=0.0 2024-06-21 08:45:52,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=354135.8333333333, ans=0.2 2024-06-21 08:45:53,627 INFO [train.py:1028] (0/2) Epoch 20, batch 950, loss[loss=0.2013, simple_loss=0.2598, pruned_loss=0.07142, over 12869.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2591, pruned_loss=0.07428, over 2558734.83 frames. ], batch size: 39, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:45:57,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=354154.1666666667, ans=0.025 2024-06-21 08:45:57,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=354154.1666666667, ans=0.125 2024-06-21 08:45:57,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354154.1666666667, ans=0.1 2024-06-21 08:46:00,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=354172.5, ans=0.0 2024-06-21 08:46:09,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=354190.8333333333, ans=0.125 2024-06-21 08:46:10,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354190.8333333333, ans=0.1 2024-06-21 08:46:11,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=354190.8333333333, ans=0.1 2024-06-21 08:46:11,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=354190.8333333333, ans=0.2 2024-06-21 08:46:20,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=354227.5, ans=0.125 2024-06-21 08:46:25,925 INFO [train.py:1028] (0/2) Epoch 20, batch 1000, loss[loss=0.2004, simple_loss=0.2613, pruned_loss=0.06977, over 13321.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2583, pruned_loss=0.07407, over 2560939.88 frames. 
], batch size: 49, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:46:32,371 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:46:33,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=354245.8333333333, ans=0.2 2024-06-21 08:46:44,641 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.041e+02 2.142e+02 2.410e+02 3.076e+02, threshold=4.285e+02, percent-clipped=0.0 2024-06-21 08:46:53,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=354300.8333333333, ans=0.025 2024-06-21 08:46:54,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354300.8333333333, ans=0.1 2024-06-21 08:47:04,471 INFO [train.py:1028] (0/2) Epoch 20, batch 1050, loss[loss=0.1981, simple_loss=0.2553, pruned_loss=0.07047, over 13149.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2592, pruned_loss=0.07431, over 2565002.46 frames. ], batch size: 77, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:47:05,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.75 vs. limit=22.5 2024-06-21 08:47:06,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=354337.5, ans=0.125 2024-06-21 08:47:10,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354355.8333333333, ans=0.1 2024-06-21 08:47:12,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=354355.8333333333, ans=0.015 2024-06-21 08:47:22,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=354374.1666666667, ans=0.0 2024-06-21 08:47:30,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=354410.8333333333, ans=0.1 2024-06-21 08:47:36,893 INFO [train.py:1028] (0/2) Epoch 20, batch 1100, loss[loss=0.2025, simple_loss=0.2591, pruned_loss=0.07295, over 13217.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2594, pruned_loss=0.0743, over 2569746.55 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:47:46,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.48 vs. limit=22.5 2024-06-21 08:47:52,615 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.054e+02 2.192e+02 2.330e+02 2.871e+02, threshold=4.383e+02, percent-clipped=0.0 2024-06-21 08:47:57,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=354484.1666666667, ans=0.125 2024-06-21 08:47:58,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=354484.1666666667, ans=0.125 2024-06-21 08:48:09,293 INFO [train.py:1028] (0/2) Epoch 20, batch 1150, loss[loss=0.2274, simple_loss=0.2844, pruned_loss=0.08523, over 13292.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2598, pruned_loss=0.07461, over 2571030.75 frames. 
], batch size: 52, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:48:12,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=354520.8333333333, ans=0.0 2024-06-21 08:48:17,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.62 vs. limit=22.5 2024-06-21 08:48:19,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=354539.1666666667, ans=0.0 2024-06-21 08:48:19,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.62 vs. limit=10.0 2024-06-21 08:48:21,369 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2024-06-21 08:48:31,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=354575.8333333333, ans=0.2 2024-06-21 08:48:33,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=354575.8333333333, ans=0.04949747468305833 2024-06-21 08:48:44,384 INFO [train.py:1028] (0/2) Epoch 20, batch 1200, loss[loss=0.1978, simple_loss=0.2502, pruned_loss=0.07264, over 13137.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2602, pruned_loss=0.07518, over 2574114.25 frames. ], batch size: 77, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:48:58,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=354630.8333333333, ans=0.125 2024-06-21 08:49:02,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.97 vs. limit=22.5 2024-06-21 08:49:03,448 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.081e+02 2.258e+02 2.489e+02 3.694e+02, threshold=4.517e+02, percent-clipped=0.0 2024-06-21 08:49:10,994 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.43 vs. limit=5.0 2024-06-21 08:49:19,780 INFO [train.py:1028] (0/2) Epoch 20, batch 1250, loss[loss=0.2153, simple_loss=0.267, pruned_loss=0.0818, over 13195.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2599, pruned_loss=0.075, over 2583738.43 frames. 
], batch size: 112, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:49:21,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354704.1666666667, ans=0.1 2024-06-21 08:49:25,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=354704.1666666667, ans=0.125 2024-06-21 08:49:35,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=354740.8333333333, ans=0.125 2024-06-21 08:49:38,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=354759.1666666667, ans=0.025 2024-06-21 08:49:44,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=354759.1666666667, ans=0.025 2024-06-21 08:49:51,992 INFO [train.py:1028] (0/2) Epoch 20, batch 1300, loss[loss=0.2052, simple_loss=0.2574, pruned_loss=0.07652, over 12795.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2603, pruned_loss=0.07521, over 2583777.20 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:49:54,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.67 vs. limit=15.0 2024-06-21 08:49:55,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=354795.8333333333, ans=0.0 2024-06-21 08:49:57,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=354795.8333333333, ans=0.1 2024-06-21 08:49:57,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=354814.1666666667, ans=0.0 2024-06-21 08:50:00,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354814.1666666667, ans=0.1 2024-06-21 08:50:04,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0 2024-06-21 08:50:07,573 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.031e+02 2.138e+02 2.259e+02 3.212e+02, threshold=4.275e+02, percent-clipped=0.0 2024-06-21 08:50:16,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=354850.8333333333, ans=0.1 2024-06-21 08:50:18,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2024-06-21 08:50:21,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=354869.1666666667, ans=0.125 2024-06-21 08:50:21,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=354869.1666666667, ans=0.125 2024-06-21 08:50:23,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=354869.1666666667, ans=0.0 2024-06-21 08:50:25,017 INFO [train.py:1028] (0/2) Epoch 20, batch 1350, loss[loss=0.1901, simple_loss=0.2519, pruned_loss=0.06418, over 13153.00 frames. 
], tot_loss[loss=0.2046, simple_loss=0.2596, pruned_loss=0.07481, over 2585579.77 frames. ], batch size: 59, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:50:25,379 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.48 vs. limit=12.0 2024-06-21 08:50:38,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=354924.1666666667, ans=0.125 2024-06-21 08:50:51,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=354942.5, ans=0.125 2024-06-21 08:50:52,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354942.5, ans=0.1 2024-06-21 08:50:58,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354960.8333333333, ans=0.1 2024-06-21 08:50:58,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.30 vs. limit=22.5 2024-06-21 08:50:59,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=354960.8333333333, ans=0.0 2024-06-21 08:51:04,197 INFO [train.py:1028] (0/2) Epoch 20, batch 1400, loss[loss=0.2072, simple_loss=0.2633, pruned_loss=0.07559, over 12884.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2593, pruned_loss=0.07503, over 2586558.36 frames. ], batch size: 26, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:51:07,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=354979.1666666667, ans=0.125 2024-06-21 08:51:07,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=354979.1666666667, ans=0.0 2024-06-21 08:51:08,449 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.58 vs. limit=15.0 2024-06-21 08:51:08,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=354979.1666666667, ans=0.0 2024-06-21 08:51:19,651 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.050e+02 2.150e+02 2.262e+02 2.982e+02, threshold=4.301e+02, percent-clipped=0.0 2024-06-21 08:51:19,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=355015.8333333333, ans=0.125 2024-06-21 08:51:22,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.72 vs. limit=15.0 2024-06-21 08:51:34,200 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=8.340e-02 2024-06-21 08:51:36,785 INFO [train.py:1028] (0/2) Epoch 20, batch 1450, loss[loss=0.2098, simple_loss=0.263, pruned_loss=0.07828, over 13093.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2589, pruned_loss=0.07487, over 2588032.46 frames. 
], batch size: 121, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:51:38,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=355070.8333333333, ans=0.1 2024-06-21 08:51:41,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=355070.8333333333, ans=0.125 2024-06-21 08:51:42,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=355070.8333333333, ans=0.125 2024-06-21 08:51:44,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=355089.1666666667, ans=0.125 2024-06-21 08:51:49,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=355107.5, ans=0.0 2024-06-21 08:51:52,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=355107.5, ans=0.125 2024-06-21 08:52:05,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.61 vs. limit=22.5 2024-06-21 08:52:05,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=355144.1666666667, ans=0.2 2024-06-21 08:52:09,287 INFO [train.py:1028] (0/2) Epoch 20, batch 1500, loss[loss=0.2041, simple_loss=0.2599, pruned_loss=0.07417, over 13199.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2587, pruned_loss=0.07514, over 2590162.32 frames. ], batch size: 83, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:52:20,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.87 vs. limit=10.0 2024-06-21 08:52:20,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0 2024-06-21 08:52:24,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0 2024-06-21 08:52:25,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.113e+02 2.227e+02 2.437e+02 3.305e+02, threshold=4.455e+02, percent-clipped=0.0 2024-06-21 08:52:27,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=355199.1666666667, ans=0.125 2024-06-21 08:52:35,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2024-06-21 08:52:36,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=355217.5, ans=0.125 2024-06-21 08:52:44,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=355254.1666666667, ans=0.125 2024-06-21 08:52:44,878 INFO [train.py:1028] (0/2) Epoch 20, batch 1550, loss[loss=0.197, simple_loss=0.2447, pruned_loss=0.07461, over 13017.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2593, pruned_loss=0.07567, over 2585747.64 frames. 
], batch size: 102, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:52:45,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=355254.1666666667, ans=0.125 2024-06-21 08:52:59,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=355272.5, ans=0.125 2024-06-21 08:53:04,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.30 vs. limit=10.0 2024-06-21 08:53:10,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355309.1666666667, ans=0.1 2024-06-21 08:53:13,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=355327.5, ans=0.0 2024-06-21 08:53:16,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=355327.5, ans=0.125 2024-06-21 08:53:16,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=355327.5, ans=0.125 2024-06-21 08:53:20,712 INFO [train.py:1028] (0/2) Epoch 20, batch 1600, loss[loss=0.2267, simple_loss=0.2863, pruned_loss=0.08353, over 13156.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2601, pruned_loss=0.07575, over 2580122.46 frames. ], batch size: 77, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:53:20,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=355345.8333333333, ans=0.125 2024-06-21 08:53:23,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=25.67 vs. limit=22.5 2024-06-21 08:53:29,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=12.0 2024-06-21 08:53:30,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=355364.1666666667, ans=0.125 2024-06-21 08:53:35,776 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.765e+02 2.018e+02 2.149e+02 2.335e+02 2.816e+02, threshold=4.297e+02, percent-clipped=0.0 2024-06-21 08:53:40,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=355400.8333333333, ans=0.125 2024-06-21 08:53:45,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=355400.8333333333, ans=0.125 2024-06-21 08:53:49,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=355419.1666666667, ans=0.0 2024-06-21 08:53:52,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=355437.5, ans=0.0 2024-06-21 08:53:52,668 INFO [train.py:1028] (0/2) Epoch 20, batch 1650, loss[loss=0.2121, simple_loss=0.265, pruned_loss=0.07961, over 13199.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2607, pruned_loss=0.07599, over 2576743.03 frames. 
], batch size: 95, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:53:56,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=355437.5, ans=0.2 2024-06-21 08:53:59,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.91 vs. limit=22.5 2024-06-21 08:54:05,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.37 vs. limit=15.0 2024-06-21 08:54:11,459 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:54:15,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.89 vs. limit=15.0 2024-06-21 08:54:15,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355492.5, ans=0.1 2024-06-21 08:54:17,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=355492.5, ans=0.0 2024-06-21 08:54:19,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=355510.8333333333, ans=0.0 2024-06-21 08:54:20,341 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2024-06-21 08:54:20,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=355510.8333333333, ans=0.04949747468305833 2024-06-21 08:54:21,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=355510.8333333333, ans=0.125 2024-06-21 08:54:25,660 INFO [train.py:1028] (0/2) Epoch 20, batch 1700, loss[loss=0.2471, simple_loss=0.294, pruned_loss=0.1001, over 12457.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2612, pruned_loss=0.07594, over 2581065.23 frames. 
], batch size: 25, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:54:30,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=355529.1666666667, ans=0.125 2024-06-21 08:54:30,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=355529.1666666667, ans=0.125 2024-06-21 08:54:30,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=355529.1666666667, ans=0.125 2024-06-21 08:54:35,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355547.5, ans=0.1 2024-06-21 08:54:39,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=355547.5, ans=0.125 2024-06-21 08:54:44,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 1.992e+02 2.090e+02 2.291e+02 3.156e+02, threshold=4.180e+02, percent-clipped=0.0 2024-06-21 08:54:44,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=355565.8333333333, ans=0.0 2024-06-21 08:54:51,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=355584.1666666667, ans=0.2 2024-06-21 08:54:53,609 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:54:54,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=355584.1666666667, ans=0.2 2024-06-21 08:55:03,589 INFO [train.py:1028] (0/2) Epoch 20, batch 1750, loss[loss=0.1938, simple_loss=0.2565, pruned_loss=0.06551, over 12326.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.261, pruned_loss=0.07581, over 2582896.36 frames. ], batch size: 22, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:55:06,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=355620.8333333333, ans=0.125 2024-06-21 08:55:16,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=355657.5, ans=0.125 2024-06-21 08:55:28,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=355675.8333333333, ans=0.0 2024-06-21 08:55:28,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=355675.8333333333, ans=0.125 2024-06-21 08:55:30,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=355694.1666666667, ans=0.125 2024-06-21 08:55:32,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=355694.1666666667, ans=0.125 2024-06-21 08:55:35,758 INFO [train.py:1028] (0/2) Epoch 20, batch 1800, loss[loss=0.2348, simple_loss=0.2904, pruned_loss=0.08961, over 13209.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.261, pruned_loss=0.07587, over 2582939.13 frames. 
], batch size: 67, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:55:37,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355712.5, ans=0.1 2024-06-21 08:55:37,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=355712.5, ans=0.025 2024-06-21 08:55:41,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=355712.5, ans=0.125 2024-06-21 08:55:45,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2024-06-21 08:55:48,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.21 vs. limit=22.5 2024-06-21 08:55:51,342 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.090e+02 2.199e+02 2.361e+02 3.493e+02, threshold=4.398e+02, percent-clipped=0.0 2024-06-21 08:55:54,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2024-06-21 08:55:56,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=355767.5, ans=0.0 2024-06-21 08:56:04,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=355785.8333333333, ans=0.0 2024-06-21 08:56:08,532 INFO [train.py:1028] (0/2) Epoch 20, batch 1850, loss[loss=0.2085, simple_loss=0.2574, pruned_loss=0.07979, over 13204.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2607, pruned_loss=0.07572, over 2583934.08 frames. ], batch size: 83, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:56:10,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=355804.1666666667, ans=0.0 2024-06-21 08:56:15,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=355822.5, ans=0.0 2024-06-21 08:56:22,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=355840.8333333333, ans=0.125 2024-06-21 08:56:22,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=355840.8333333333, ans=0.125 2024-06-21 08:56:31,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=355859.1666666667, ans=0.0 2024-06-21 08:56:37,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.68 vs. 
limit=22.5 2024-06-21 08:56:39,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=355877.5, ans=0.2 2024-06-21 08:56:39,892 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:56:42,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=355895.8333333333, ans=0.125 2024-06-21 08:56:43,291 INFO [train.py:1028] (0/2) Epoch 20, batch 1900, loss[loss=0.194, simple_loss=0.249, pruned_loss=0.06948, over 13173.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.26, pruned_loss=0.07569, over 2585864.41 frames. ], batch size: 95, lr: 2.93e-03, grad_scale: 64.0 2024-06-21 08:56:52,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.74 vs. limit=12.0 2024-06-21 08:57:02,544 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.046e+02 2.127e+02 2.298e+02 2.982e+02, threshold=4.254e+02, percent-clipped=0.0 2024-06-21 08:57:06,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=355950.8333333333, ans=0.125 2024-06-21 08:57:07,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=355950.8333333333, ans=0.125 2024-06-21 08:57:09,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=355950.8333333333, ans=0.2 2024-06-21 08:57:09,342 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 08:57:16,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=355969.1666666667, ans=0.0 2024-06-21 08:57:18,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.31 vs. limit=15.0 2024-06-21 08:57:19,093 INFO [train.py:1028] (0/2) Epoch 20, batch 1950, loss[loss=0.2034, simple_loss=0.2684, pruned_loss=0.06921, over 13279.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2598, pruned_loss=0.07567, over 2591730.78 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:57:32,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=356024.1666666667, ans=0.0 2024-06-21 08:57:38,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=356042.5, ans=0.0 2024-06-21 08:57:42,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=356042.5, ans=0.09899494936611666 2024-06-21 08:57:43,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=356042.5, ans=10.0 2024-06-21 08:57:49,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356060.8333333333, ans=0.1 2024-06-21 08:57:51,752 INFO [train.py:1028] (0/2) Epoch 20, batch 2000, loss[loss=0.204, simple_loss=0.2678, pruned_loss=0.07007, over 12680.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2599, pruned_loss=0.07598, over 2586741.79 frames. 
], batch size: 22, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:57:57,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=356097.5, ans=0.0 2024-06-21 08:58:03,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2024-06-21 08:58:07,241 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.026e+02 2.133e+02 2.230e+02 3.147e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 08:58:08,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=356115.8333333333, ans=0.125 2024-06-21 08:58:26,822 INFO [train.py:1028] (0/2) Epoch 20, batch 2050, loss[loss=0.2085, simple_loss=0.2724, pruned_loss=0.07224, over 12507.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2599, pruned_loss=0.07596, over 2582409.92 frames. ], batch size: 29, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:58:27,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=15.0 2024-06-21 08:59:02,427 INFO [train.py:1028] (0/2) Epoch 20, batch 2100, loss[loss=0.2152, simple_loss=0.2674, pruned_loss=0.08144, over 13227.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2605, pruned_loss=0.07594, over 2584851.24 frames. ], batch size: 59, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:59:11,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.53 vs. limit=15.0 2024-06-21 08:59:18,068 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 1.999e+02 2.147e+02 2.325e+02 2.808e+02, threshold=4.294e+02, percent-clipped=0.0 2024-06-21 08:59:23,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=356317.5, ans=0.125 2024-06-21 08:59:30,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=356335.8333333333, ans=0.0 2024-06-21 08:59:34,719 INFO [train.py:1028] (0/2) Epoch 20, batch 2150, loss[loss=0.2, simple_loss=0.2616, pruned_loss=0.06923, over 13259.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2602, pruned_loss=0.07538, over 2588023.13 frames. ], batch size: 52, lr: 2.93e-03, grad_scale: 128.0 2024-06-21 08:59:37,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.11 vs. limit=22.5 2024-06-21 08:59:47,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=356372.5, ans=0.1 2024-06-21 08:59:50,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=356390.8333333333, ans=0.0 2024-06-21 08:59:59,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=356409.1666666667, ans=0.125 2024-06-21 09:00:01,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.65 vs. 
limit=12.0 2024-06-21 09:00:01,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.66 vs. limit=15.0 2024-06-21 09:00:01,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=356427.5, ans=0.125 2024-06-21 09:00:02,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=356427.5, ans=0.125 2024-06-21 09:00:04,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=356427.5, ans=0.0 2024-06-21 09:00:07,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=356445.8333333333, ans=0.1 2024-06-21 09:00:08,198 INFO [train.py:1028] (0/2) Epoch 20, batch 2200, loss[loss=0.222, simple_loss=0.2678, pruned_loss=0.08806, over 13179.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2608, pruned_loss=0.07566, over 2587436.88 frames. ], batch size: 83, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:00:15,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=356464.1666666667, ans=10.0 2024-06-21 09:00:24,258 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.042e+02 2.153e+02 2.369e+02 3.097e+02, threshold=4.306e+02, percent-clipped=0.0 2024-06-21 09:00:33,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356500.8333333333, ans=0.1 2024-06-21 09:00:39,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.79 vs. limit=10.0 2024-06-21 09:00:45,086 INFO [train.py:1028] (0/2) Epoch 20, batch 2250, loss[loss=0.1855, simple_loss=0.2465, pruned_loss=0.06222, over 13283.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2604, pruned_loss=0.07583, over 2586257.84 frames. ], batch size: 63, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:00:54,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=356555.8333333333, ans=0.0 2024-06-21 09:00:55,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=356555.8333333333, ans=0.125 2024-06-21 09:00:55,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.63 vs. limit=22.5 2024-06-21 09:00:56,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=356555.8333333333, ans=0.0 2024-06-21 09:01:16,111 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.657e-01 2024-06-21 09:01:22,547 INFO [train.py:1028] (0/2) Epoch 20, batch 2300, loss[loss=0.1738, simple_loss=0.2281, pruned_loss=0.05975, over 12848.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2605, pruned_loss=0.07563, over 2580788.34 frames. 
], batch size: 33, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:01:35,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=356665.8333333333, ans=0.0 2024-06-21 09:01:38,476 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.053e+02 2.182e+02 2.448e+02 3.072e+02, threshold=4.365e+02, percent-clipped=0.0 2024-06-21 09:01:44,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=356684.1666666667, ans=0.125 2024-06-21 09:01:48,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=356702.5, ans=0.125 2024-06-21 09:01:55,881 INFO [train.py:1028] (0/2) Epoch 20, batch 2350, loss[loss=0.193, simple_loss=0.2508, pruned_loss=0.06755, over 13213.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.26, pruned_loss=0.07546, over 2584954.98 frames. ], batch size: 67, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:02:00,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.75 vs. limit=12.0 2024-06-21 09:02:11,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=356757.5, ans=0.0 2024-06-21 09:02:28,968 INFO [train.py:1028] (0/2) Epoch 20, batch 2400, loss[loss=0.1985, simple_loss=0.2607, pruned_loss=0.06819, over 13372.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2597, pruned_loss=0.07544, over 2587642.42 frames. ], batch size: 46, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:02:36,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=356812.5, ans=0.125 2024-06-21 09:02:41,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. limit=15.0 2024-06-21 09:02:45,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=356849.1666666667, ans=0.07 2024-06-21 09:02:47,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=356849.1666666667, ans=0.125 2024-06-21 09:02:48,498 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.028e+02 2.159e+02 2.269e+02 2.784e+02, threshold=4.319e+02, percent-clipped=0.0 2024-06-21 09:02:48,652 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.625e-01 2024-06-21 09:02:50,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=356849.1666666667, ans=0.07 2024-06-21 09:03:05,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.60 vs. limit=15.0 2024-06-21 09:03:08,291 INFO [train.py:1028] (0/2) Epoch 20, batch 2450, loss[loss=0.1936, simple_loss=0.2397, pruned_loss=0.07376, over 13252.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2588, pruned_loss=0.07546, over 2584271.68 frames. 
], batch size: 63, lr: 2.92e-03, grad_scale: 128.0 2024-06-21 09:03:11,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=356904.1666666667, ans=0.2 2024-06-21 09:03:12,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=356904.1666666667, ans=0.125 2024-06-21 09:03:16,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356922.5, ans=0.1 2024-06-21 09:03:17,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=356922.5, ans=0.125 2024-06-21 09:03:21,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.35 vs. limit=22.5 2024-06-21 09:03:26,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=356940.8333333333, ans=0.5 2024-06-21 09:03:26,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356940.8333333333, ans=0.1 2024-06-21 09:03:28,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=356959.1666666667, ans=0.125 2024-06-21 09:03:35,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2024-06-21 09:03:40,893 INFO [train.py:1028] (0/2) Epoch 20, batch 2500, loss[loss=0.1872, simple_loss=0.2405, pruned_loss=0.06693, over 13233.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2581, pruned_loss=0.07504, over 2587304.15 frames. 
], batch size: 83, lr: 2.92e-03, grad_scale: 128.0
2024-06-21 09:03:47,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=357014.1666666667, ans=0.0
2024-06-21 09:03:48,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357014.1666666667, ans=0.0
2024-06-21 09:03:48,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=357014.1666666667, ans=0.125
2024-06-21 09:03:50,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=357014.1666666667, ans=0.125
2024-06-21 09:03:51,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=357014.1666666667, ans=0.07
2024-06-21 09:03:55,752 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.544e-03
2024-06-21 09:03:56,348 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.068e+02 2.198e+02 2.397e+02 3.315e+02, threshold=4.395e+02, percent-clipped=0.0
2024-06-21 09:03:58,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=357032.5, ans=0.5
2024-06-21 09:04:04,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=357050.8333333333, ans=0.125
2024-06-21 09:04:11,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0
2024-06-21 09:04:13,269 INFO [train.py:1028] (0/2) Epoch 20, batch 2550, loss[loss=0.2213, simple_loss=0.2796, pruned_loss=0.08148, over 12700.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.257, pruned_loss=0.07476, over 2587134.91 frames. ], batch size: 22, lr: 2.92e-03, grad_scale: 128.0
2024-06-21 09:04:26,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.60 vs. limit=22.5
2024-06-21 09:04:29,433 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:04:42,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=357142.5, ans=0.0
2024-06-21 09:04:51,728 INFO [train.py:1028] (0/2) Epoch 20, batch 2600, loss[loss=0.1908, simple_loss=0.253, pruned_loss=0.06429, over 13256.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2551, pruned_loss=0.07399, over 2586877.63 frames. ], batch size: 52, lr: 2.92e-03, grad_scale: 128.0
2024-06-21 09:05:07,333 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.667e+02 2.027e+02 2.159e+02 2.358e+02 3.160e+02, threshold=4.319e+02, percent-clipped=0.0
2024-06-21 09:05:08,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=357215.8333333333, ans=0.025
2024-06-21 09:05:10,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=357234.1666666667, ans=0.125
2024-06-21 09:05:13,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=357234.1666666667, ans=10.0
2024-06-21 09:05:22,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0
2024-06-21 09:05:24,342 INFO [train.py:1028] (0/2) Epoch 20, batch 2650, loss[loss=0.1996, simple_loss=0.2463, pruned_loss=0.07641, over 13053.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2538, pruned_loss=0.07355, over 2587885.00 frames. ], batch size: 144, lr: 2.92e-03, grad_scale: 128.0
2024-06-21 09:05:38,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=357307.5, ans=0.0
2024-06-21 09:05:47,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=357325.8333333333, ans=0.125
2024-06-21 09:05:50,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=357344.1666666667, ans=0.0
2024-06-21 09:05:50,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=357344.1666666667, ans=0.0
2024-06-21 09:05:57,198 INFO [train.py:1028] (0/2) Epoch 20, batch 2700, loss[loss=0.2018, simple_loss=0.2483, pruned_loss=0.07763, over 13210.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2526, pruned_loss=0.07339, over 2585724.92 frames. ], batch size: 89, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:06:02,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=357362.5, ans=0.125
2024-06-21 09:06:09,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=357380.8333333333, ans=0.125
2024-06-21 09:06:13,726 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.010e+02 2.109e+02 2.292e+02 2.711e+02, threshold=4.218e+02, percent-clipped=0.0
2024-06-21 09:06:25,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=357417.5, ans=0.95
2024-06-21 09:06:29,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.85 vs. limit=15.0
2024-06-21 09:06:33,778 INFO [train.py:1028] (0/2) Epoch 20, batch 2750, loss[loss=0.2095, simple_loss=0.2595, pruned_loss=0.07969, over 13280.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2523, pruned_loss=0.07325, over 2583140.15 frames. ], batch size: 43, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:06:38,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=357454.1666666667, ans=0.07
2024-06-21 09:06:38,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=357454.1666666667, ans=0.125
2024-06-21 09:06:41,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=357454.1666666667, ans=0.125
2024-06-21 09:06:41,614 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:06:43,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=357472.5, ans=0.125
2024-06-21 09:06:44,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=357472.5, ans=0.2
2024-06-21 09:06:54,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=357490.8333333333, ans=0.025
2024-06-21 09:06:55,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357490.8333333333, ans=0.1
2024-06-21 09:06:56,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=357509.1666666667, ans=0.025
2024-06-21 09:07:09,910 INFO [train.py:1028] (0/2) Epoch 20, batch 2800, loss[loss=0.2119, simple_loss=0.2561, pruned_loss=0.08389, over 11046.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2518, pruned_loss=0.07315, over 2580856.54 frames. ], batch size: 303, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:07:22,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=357582.5, ans=0.0
2024-06-21 09:07:22,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=357582.5, ans=0.05
2024-06-21 09:07:25,693 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.006e+02 2.169e+02 2.325e+02 2.896e+02, threshold=4.339e+02, percent-clipped=0.0
2024-06-21 09:07:32,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357600.8333333333, ans=0.1
2024-06-21 09:07:32,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=357600.8333333333, ans=0.125
2024-06-21 09:07:41,956 INFO [train.py:1028] (0/2) Epoch 20, batch 2850, loss[loss=0.1712, simple_loss=0.2294, pruned_loss=0.0565, over 13275.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.251, pruned_loss=0.07304, over 2578763.13 frames. ], batch size: 49, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:07:47,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=357637.5, ans=0.125
2024-06-21 09:07:50,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=357655.8333333333, ans=0.0
2024-06-21 09:07:51,816 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0
2024-06-21 09:07:52,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=357655.8333333333, ans=0.2
2024-06-21 09:07:52,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=357655.8333333333, ans=0.125
2024-06-21 09:07:52,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=357655.8333333333, ans=0.07
2024-06-21 09:07:57,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.91 vs. limit=15.0
2024-06-21 09:07:58,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=357674.1666666667, ans=0.07
2024-06-21 09:08:09,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=15.0
2024-06-21 09:08:16,658 INFO [train.py:1028] (0/2) Epoch 20, batch 2900, loss[loss=0.1707, simple_loss=0.2245, pruned_loss=0.05843, over 13131.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2495, pruned_loss=0.07253, over 2586298.17 frames. ], batch size: 55, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:08:18,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.58 vs. limit=15.0
2024-06-21 09:08:18,821 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0
2024-06-21 09:08:33,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=357765.8333333333, ans=0.0
2024-06-21 09:08:35,854 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.029e+02 2.261e+02 2.450e+02 3.381e+02, threshold=4.523e+02, percent-clipped=0.0
2024-06-21 09:08:41,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.01 vs. limit=22.5
2024-06-21 09:08:44,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=357784.1666666667, ans=0.125
2024-06-21 09:08:52,254 INFO [train.py:1028] (0/2) Epoch 20, batch 2950, loss[loss=0.1968, simple_loss=0.2517, pruned_loss=0.07092, over 13291.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2497, pruned_loss=0.07244, over 2580334.86 frames. ], batch size: 43, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:08:58,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=357839.1666666667, ans=0.0
2024-06-21 09:08:58,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0
2024-06-21 09:09:10,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=357857.5, ans=0.125
2024-06-21 09:09:10,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=357857.5, ans=0.125
2024-06-21 09:09:20,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0
2024-06-21 09:09:22,051 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=15.0
2024-06-21 09:09:22,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.50 vs. limit=12.0
2024-06-21 09:09:26,054 INFO [train.py:1028] (0/2) Epoch 20, batch 3000, loss[loss=0.1895, simple_loss=0.244, pruned_loss=0.06752, over 13169.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2485, pruned_loss=0.07227, over 2578294.75 frames. ], batch size: 59, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:09:26,055 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 09:09:33,923 INFO [train.py:1060] (0/2) Epoch 20, validation: loss=0.1864, simple_loss=0.2507, pruned_loss=0.06101, over 351949.00 frames.
2024-06-21 09:09:33,923 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-21 09:09:36,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=357912.5, ans=0.0
2024-06-21 09:09:36,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=357912.5, ans=0.125
2024-06-21 09:09:49,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357949.1666666667, ans=0.1
2024-06-21 09:09:50,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=357949.1666666667, ans=0.125
2024-06-21 09:09:50,593 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 2.000e+02 2.115e+02 2.305e+02 3.438e+02, threshold=4.230e+02, percent-clipped=0.0
2024-06-21 09:09:52,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=357949.1666666667, ans=0.125
2024-06-21 09:09:52,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.63 vs. limit=10.0
2024-06-21 09:09:52,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=357949.1666666667, ans=0.035
2024-06-21 09:09:54,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=357967.5, ans=0.0
2024-06-21 09:09:56,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=357967.5, ans=0.125
2024-06-21 09:09:56,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357967.5, ans=0.1
2024-06-21 09:09:57,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=357967.5, ans=0.0
2024-06-21 09:10:01,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=357967.5, ans=0.2
2024-06-21 09:10:02,593 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:10:10,153 INFO [train.py:1028] (0/2) Epoch 20, batch 3050, loss[loss=0.2033, simple_loss=0.2596, pruned_loss=0.07354, over 13288.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2481, pruned_loss=0.07229, over 2578081.11 frames. ], batch size: 46, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:10:11,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.74 vs. limit=22.5
2024-06-21 09:10:22,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=358022.5, ans=0.125
2024-06-21 09:10:24,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=358040.8333333333, ans=0.0
2024-06-21 09:10:26,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=358040.8333333333, ans=0.04949747468305833
2024-06-21 09:10:26,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=358040.8333333333, ans=0.0
2024-06-21 09:10:37,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358059.1666666667, ans=0.1
2024-06-21 09:10:46,120 INFO [train.py:1028] (0/2) Epoch 20, batch 3100, loss[loss=0.1931, simple_loss=0.2381, pruned_loss=0.07404, over 13082.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2468, pruned_loss=0.07162, over 2578148.48 frames. ], batch size: 144, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:10:47,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358095.8333333333, ans=0.1
2024-06-21 09:10:50,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=358095.8333333333, ans=0.0
2024-06-21 09:10:51,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=358095.8333333333, ans=0.125
2024-06-21 09:10:51,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=358095.8333333333, ans=0.125
2024-06-21 09:10:52,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358114.1666666667, ans=0.1
2024-06-21 09:11:02,532 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.009e+02 2.183e+02 2.380e+02 3.057e+02, threshold=4.366e+02, percent-clipped=0.0
2024-06-21 09:11:06,011 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=5.322e-03
2024-06-21 09:11:14,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=358169.1666666667, ans=0.125
2024-06-21 09:11:14,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.54 vs. limit=15.0
2024-06-21 09:11:18,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=358169.1666666667, ans=0.2
2024-06-21 09:11:19,245 INFO [train.py:1028] (0/2) Epoch 20, batch 3150, loss[loss=0.197, simple_loss=0.2429, pruned_loss=0.07557, over 12934.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2461, pruned_loss=0.07111, over 2579641.39 frames. ], batch size: 158, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:11:25,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=358205.8333333333, ans=0.125
2024-06-21 09:11:43,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=358242.5, ans=0.07
2024-06-21 09:11:50,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358260.8333333333, ans=0.1
2024-06-21 09:11:51,423 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:11:51,858 INFO [train.py:1028] (0/2) Epoch 20, batch 3200, loss[loss=0.187, simple_loss=0.2402, pruned_loss=0.0669, over 13123.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2454, pruned_loss=0.07086, over 2581349.22 frames. ], batch size: 55, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:11:59,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0
2024-06-21 09:12:09,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=358315.8333333333, ans=0.1
2024-06-21 09:12:10,954 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 1.992e+02 2.108e+02 2.252e+02 2.950e+02, threshold=4.215e+02, percent-clipped=0.0
2024-06-21 09:12:12,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=358315.8333333333, ans=0.125
2024-06-21 09:12:13,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=358334.1666666667, ans=0.125
2024-06-21 09:12:16,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=358334.1666666667, ans=0.125
2024-06-21 09:12:18,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=358334.1666666667, ans=0.125
2024-06-21 09:12:19,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=358352.5, ans=0.125
2024-06-21 09:12:26,739 INFO [train.py:1028] (0/2) Epoch 20, batch 3250, loss[loss=0.1859, simple_loss=0.2375, pruned_loss=0.0672, over 13181.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2455, pruned_loss=0.07107, over 2585282.15 frames. ], batch size: 72, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:12:41,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=358389.1666666667, ans=0.0
2024-06-21 09:12:48,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=358407.5, ans=0.0
2024-06-21 09:12:51,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=358425.8333333333, ans=0.125
2024-06-21 09:12:52,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=358425.8333333333, ans=0.125
2024-06-21 09:13:03,229 INFO [train.py:1028] (0/2) Epoch 20, batch 3300, loss[loss=0.2227, simple_loss=0.2653, pruned_loss=0.09009, over 12686.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2452, pruned_loss=0.07084, over 2580333.54 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:13:03,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=358462.5, ans=0.125
2024-06-21 09:13:07,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=358462.5, ans=0.0
2024-06-21 09:13:15,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=358499.1666666667, ans=0.125
2024-06-21 09:13:18,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=358499.1666666667, ans=0.125
2024-06-21 09:13:19,348 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 1.998e+02 2.151e+02 2.365e+02 3.190e+02, threshold=4.303e+02, percent-clipped=0.0
2024-06-21 09:13:22,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=358517.5, ans=0.025
2024-06-21 09:13:26,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=358517.5, ans=0.125
2024-06-21 09:13:35,479 INFO [train.py:1028] (0/2) Epoch 20, batch 3350, loss[loss=0.1902, simple_loss=0.2392, pruned_loss=0.07066, over 12925.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2445, pruned_loss=0.0707, over 2575801.37 frames. ], batch size: 158, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:13:36,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=358554.1666666667, ans=0.125
2024-06-21 09:14:04,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=15.0
2024-06-21 09:14:12,790 INFO [train.py:1028] (0/2) Epoch 20, batch 3400, loss[loss=0.2, simple_loss=0.2543, pruned_loss=0.0729, over 12710.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.244, pruned_loss=0.07072, over 2574575.39 frames. ], batch size: 22, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:14:13,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0
2024-06-21 09:14:16,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=358645.8333333333, ans=0.0
2024-06-21 09:14:16,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=358645.8333333333, ans=0.125
2024-06-21 09:14:18,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=358645.8333333333, ans=0.0
2024-06-21 09:14:29,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0
2024-06-21 09:14:32,099 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 1.966e+02 2.093e+02 2.290e+02 3.085e+02, threshold=4.187e+02, percent-clipped=0.0
2024-06-21 09:14:48,753 INFO [train.py:1028] (0/2) Epoch 20, batch 3450, loss[loss=0.1912, simple_loss=0.2359, pruned_loss=0.07323, over 12843.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2433, pruned_loss=0.07031, over 2575373.67 frames. ], batch size: 177, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:14:53,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=358737.5, ans=0.2
2024-06-21 09:15:02,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=358774.1666666667, ans=0.0
2024-06-21 09:15:03,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358774.1666666667, ans=0.1
2024-06-21 09:15:12,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0
2024-06-21 09:15:21,834 INFO [train.py:1028] (0/2) Epoch 20, batch 3500, loss[loss=0.1954, simple_loss=0.2456, pruned_loss=0.0726, over 12845.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2432, pruned_loss=0.07016, over 2574478.67 frames. ], batch size: 33, lr: 2.92e-03, grad_scale: 64.0
2024-06-21 09:15:31,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=358847.5, ans=0.125
2024-06-21 09:15:35,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=358865.8333333333, ans=0.125
2024-06-21 09:15:37,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=15.0
2024-06-21 09:15:38,661 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 1.926e+02 2.032e+02 2.224e+02 2.906e+02, threshold=4.063e+02, percent-clipped=0.0
2024-06-21 09:15:40,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=358865.8333333333, ans=0.2
2024-06-21 09:15:42,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=358884.1666666667, ans=0.125
2024-06-21 09:15:46,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=358884.1666666667, ans=0.2
2024-06-21 09:15:46,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=358884.1666666667, ans=0.2
2024-06-21 09:15:58,579 INFO [train.py:1028] (0/2) Epoch 20, batch 3550, loss[loss=0.1809, simple_loss=0.232, pruned_loss=0.06492, over 13121.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2424, pruned_loss=0.06975, over 2576836.45 frames. ], batch size: 95, lr: 2.91e-03, grad_scale: 64.0
2024-06-21 09:16:04,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=358939.1666666667, ans=0.2
2024-06-21 09:16:05,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0
2024-06-21 09:16:34,128 INFO [train.py:1028] (0/2) Epoch 20, batch 3600, loss[loss=0.2067, simple_loss=0.2554, pruned_loss=0.07894, over 13064.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2414, pruned_loss=0.06942, over 2579852.04 frames. ], batch size: 48, lr: 2.91e-03, grad_scale: 64.0
2024-06-21 09:16:39,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=359012.5, ans=0.125
2024-06-21 09:16:42,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=359030.8333333333, ans=0.025
2024-06-21 09:16:43,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.81 vs. limit=15.0
2024-06-21 09:16:50,383 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 1.957e+02 2.064e+02 2.266e+02 3.230e+02, threshold=4.128e+02, percent-clipped=0.0
2024-06-21 09:17:04,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=359085.8333333333, ans=0.125
2024-06-21 09:17:06,651 INFO [train.py:1028] (0/2) Epoch 20, batch 3650, loss[loss=0.1702, simple_loss=0.2133, pruned_loss=0.06354, over 13040.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2416, pruned_loss=0.06907, over 2578335.58 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 64.0
2024-06-21 09:17:09,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=359104.1666666667, ans=0.025
2024-06-21 09:17:22,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=15.0
2024-06-21 09:17:28,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=359159.1666666667, ans=0.125
2024-06-21 09:17:37,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=359177.5, ans=0.025
2024-06-21 09:17:39,107 INFO [train.py:1028] (0/2) Epoch 20, batch 3700, loss[loss=0.1966, simple_loss=0.2553, pruned_loss=0.06894, over 13238.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2411, pruned_loss=0.0691, over 2583318.52 frames. ], batch size: 72, lr: 2.91e-03, grad_scale: 64.0
2024-06-21 09:17:49,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=359214.1666666667, ans=0.0
2024-06-21 09:17:49,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=359214.1666666667, ans=0.2
2024-06-21 09:17:58,667 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.718e+02 1.938e+02 2.020e+02 2.165e+02 2.733e+02, threshold=4.040e+02, percent-clipped=0.0
2024-06-21 09:18:06,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=15.0
2024-06-21 09:18:15,078 INFO [train.py:1028] (0/2) Epoch 20, batch 3750, loss[loss=0.18, simple_loss=0.2362, pruned_loss=0.06187, over 12746.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2406, pruned_loss=0.06886, over 2585633.35 frames. ], batch size: 22, lr: 2.91e-03, grad_scale: 64.0
2024-06-21 09:18:33,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=15.0
2024-06-21 09:18:33,582 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-196000.pt
2024-06-21 09:18:39,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=359324.1666666667, ans=0.0
2024-06-21 09:18:55,358 INFO [train.py:1028] (0/2) Epoch 20, batch 3800, loss[loss=0.19, simple_loss=0.2424, pruned_loss=0.06876, over 13192.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2401, pruned_loss=0.06845, over 2583861.77 frames. ], batch size: 83, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:19:11,995 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 1.947e+02 2.134e+02 2.264e+02 3.069e+02, threshold=4.268e+02, percent-clipped=0.0
2024-06-21 09:19:20,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=359434.1666666667, ans=0.125
2024-06-21 09:19:22,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=359452.5, ans=0.0
2024-06-21 09:19:23,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=359452.5, ans=0.1
2024-06-21 09:19:28,036 INFO [train.py:1028] (0/2) Epoch 20, batch 3850, loss[loss=0.2004, simple_loss=0.2434, pruned_loss=0.07867, over 13013.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2401, pruned_loss=0.06829, over 2584331.55 frames. ], batch size: 144, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:19:29,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=359470.8333333333, ans=15.0
2024-06-21 09:19:38,660 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.06 vs. limit=15.0
2024-06-21 09:19:44,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=359507.5, ans=0.125
2024-06-21 09:19:56,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=359544.1666666667, ans=0.0
2024-06-21 09:20:00,826 INFO [train.py:1028] (0/2) Epoch 20, batch 3900, loss[loss=0.1817, simple_loss=0.2261, pruned_loss=0.06863, over 13219.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2392, pruned_loss=0.06792, over 2587634.93 frames. ], batch size: 83, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:20:06,587 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=12.0
2024-06-21 09:20:07,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=359580.8333333333, ans=0.125
2024-06-21 09:20:18,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=359599.1666666667, ans=0.1
2024-06-21 09:20:20,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359599.1666666667, ans=0.1
2024-06-21 09:20:21,258 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.974e+02 2.120e+02 2.265e+02 2.992e+02, threshold=4.240e+02, percent-clipped=0.0
2024-06-21 09:20:24,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=359617.5, ans=0.0
2024-06-21 09:20:28,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=359617.5, ans=0.07
2024-06-21 09:20:32,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0
2024-06-21 09:20:33,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=359635.8333333333, ans=0.95
2024-06-21 09:20:37,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0
2024-06-21 09:20:37,129 INFO [train.py:1028] (0/2) Epoch 20, batch 3950, loss[loss=0.198, simple_loss=0.2412, pruned_loss=0.07733, over 13129.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2388, pruned_loss=0.06763, over 2589080.27 frames. ], batch size: 132, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:20:37,355 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:20:50,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=359672.5, ans=0.2
2024-06-21 09:21:05,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=359709.1666666667, ans=0.125
2024-06-21 09:21:05,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=359709.1666666667, ans=0.0
2024-06-21 09:21:07,823 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:21:13,722 INFO [train.py:1028] (0/2) Epoch 20, batch 4000, loss[loss=0.2069, simple_loss=0.2529, pruned_loss=0.08042, over 12901.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2388, pruned_loss=0.06812, over 2583576.16 frames. ], batch size: 39, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:21:15,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.47 vs. limit=22.5
2024-06-21 09:21:22,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=359764.1666666667, ans=0.025
2024-06-21 09:21:31,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 1.941e+02 2.111e+02 2.278e+02 3.436e+02, threshold=4.223e+02, percent-clipped=0.0
2024-06-21 09:21:35,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=359800.8333333333, ans=0.0
2024-06-21 09:21:41,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=359819.1666666667, ans=0.09899494936611666
2024-06-21 09:21:44,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=359819.1666666667, ans=0.0
2024-06-21 09:21:46,968 INFO [train.py:1028] (0/2) Epoch 20, batch 4050, loss[loss=0.2112, simple_loss=0.2464, pruned_loss=0.08794, over 11108.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2383, pruned_loss=0.06801, over 2581285.71 frames. ], batch size: 304, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:21:49,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=359837.5, ans=0.025
2024-06-21 09:21:58,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=359855.8333333333, ans=0.2
2024-06-21 09:21:59,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=359874.1666666667, ans=0.125
2024-06-21 09:22:02,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=359874.1666666667, ans=0.1
2024-06-21 09:22:05,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=359892.5, ans=0.0
2024-06-21 09:22:13,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=359892.5, ans=0.125
2024-06-21 09:22:13,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=359892.5, ans=0.125
2024-06-21 09:22:16,639 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:22:20,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=359910.8333333333, ans=0.125
2024-06-21 09:22:22,691 INFO [train.py:1028] (0/2) Epoch 20, batch 4100, loss[loss=0.1861, simple_loss=0.2359, pruned_loss=0.06812, over 13073.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2386, pruned_loss=0.0683, over 2578236.48 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:22:25,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=359929.1666666667, ans=0.125
2024-06-21 09:22:28,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=359929.1666666667, ans=0.125
2024-06-21 09:22:33,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=359947.5, ans=0.125
2024-06-21 09:22:43,111 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 1.979e+02 2.099e+02 2.305e+02 2.832e+02, threshold=4.197e+02, percent-clipped=0.0
2024-06-21 09:22:46,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=359984.1666666667, ans=0.125
2024-06-21 09:22:49,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=359984.1666666667, ans=0.125
2024-06-21 09:22:58,851 INFO [train.py:1028] (0/2) Epoch 20, batch 4150, loss[loss=0.1909, simple_loss=0.2418, pruned_loss=0.07005, over 13127.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2385, pruned_loss=0.06838, over 2576691.75 frames. ], batch size: 55, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:22:59,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=360020.8333333333, ans=0.125
2024-06-21 09:23:00,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=360020.8333333333, ans=0.125
2024-06-21 09:23:05,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=360039.1666666667, ans=0.0
2024-06-21 09:23:22,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0
2024-06-21 09:23:31,833 INFO [train.py:1028] (0/2) Epoch 20, batch 4200, loss[loss=0.1849, simple_loss=0.226, pruned_loss=0.07187, over 13047.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2379, pruned_loss=0.0684, over 2578986.36 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:23:32,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360112.5, ans=0.1
2024-06-21 09:23:39,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.47 vs. limit=12.0
2024-06-21 09:23:48,600 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.960e+02 2.062e+02 2.237e+02 3.205e+02, threshold=4.125e+02, percent-clipped=0.0
2024-06-21 09:23:53,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.65 vs. limit=12.0
2024-06-21 09:23:57,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=360185.8333333333, ans=0.125
2024-06-21 09:24:03,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.63 vs. limit=22.5
2024-06-21 09:24:04,229 INFO [train.py:1028] (0/2) Epoch 20, batch 4250, loss[loss=0.1849, simple_loss=0.2444, pruned_loss=0.06268, over 13255.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2377, pruned_loss=0.06822, over 2580556.81 frames. ], batch size: 46, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:24:13,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=360222.5, ans=0.125
2024-06-21 09:24:19,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=360222.5, ans=0.125
2024-06-21 09:24:23,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=360240.8333333333, ans=0.125
2024-06-21 09:24:23,892 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.35 vs. limit=15.0
2024-06-21 09:24:31,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=360259.1666666667, ans=0.0
2024-06-21 09:24:32,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=360259.1666666667, ans=0.125
2024-06-21 09:24:40,617 INFO [train.py:1028] (0/2) Epoch 20, batch 4300, loss[loss=0.1663, simple_loss=0.222, pruned_loss=0.05525, over 13240.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2374, pruned_loss=0.06807, over 2580325.14 frames. ], batch size: 59, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:24:48,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360295.8333333333, ans=0.1
2024-06-21 09:24:55,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=360314.1666666667, ans=0.025
2024-06-21 09:24:55,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360314.1666666667, ans=0.1
2024-06-21 09:24:56,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=360332.5, ans=0.025
2024-06-21 09:24:59,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=15.0
2024-06-21 09:25:00,451 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 1.979e+02 2.066e+02 2.282e+02 3.069e+02, threshold=4.132e+02, percent-clipped=0.0
2024-06-21 09:25:04,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=360350.8333333333, ans=0.2
2024-06-21 09:25:14,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=22.5
2024-06-21 09:25:15,558 INFO [train.py:1028] (0/2) Epoch 20, batch 4350, loss[loss=0.1906, simple_loss=0.2454, pruned_loss=0.0679, over 13160.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2371, pruned_loss=0.06794, over 2584535.29 frames. ], batch size: 59, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:25:26,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=360405.8333333333, ans=0.0
2024-06-21 09:25:42,939 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 09:25:46,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.65 vs. limit=15.0
2024-06-21 09:25:48,123 INFO [train.py:1028] (0/2) Epoch 20, batch 4400, loss[loss=0.1959, simple_loss=0.2416, pruned_loss=0.07514, over 13198.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2367, pruned_loss=0.06786, over 2584968.43 frames. ], batch size: 83, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:25:49,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=360479.1666666667, ans=0.0
2024-06-21 09:25:52,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=360479.1666666667, ans=0.04949747468305833
2024-06-21 09:25:54,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=360497.5, ans=0.125
2024-06-21 09:26:03,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=12.0
2024-06-21 09:26:04,346 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 1.980e+02 2.151e+02 2.336e+02 3.377e+02, threshold=4.301e+02, percent-clipped=0.0
2024-06-21 09:26:08,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360515.8333333333, ans=0.125
2024-06-21 09:26:09,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=360534.1666666667, ans=0.025
2024-06-21 09:26:11,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=360534.1666666667, ans=0.07
2024-06-21 09:26:13,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360534.1666666667, ans=0.1
2024-06-21 09:26:15,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=360534.1666666667, ans=0.0
2024-06-21 09:26:17,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=360552.5, ans=0.125
2024-06-21 09:26:19,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=360552.5, ans=0.0
2024-06-21 09:26:23,415 INFO [train.py:1028] (0/2) Epoch 20, batch 4450, loss[loss=0.1764, simple_loss=0.2306, pruned_loss=0.0611, over 12950.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2372, pruned_loss=0.06796, over 2579440.69 frames. ], batch size: 33, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:26:26,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=360570.8333333333, ans=0.2
2024-06-21 09:26:31,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5
2024-06-21 09:26:51,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=360625.8333333333, ans=0.125
2024-06-21 09:26:56,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=360644.1666666667, ans=0.1
2024-06-21 09:26:58,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.03 vs. limit=12.0
2024-06-21 09:26:59,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=360644.1666666667, ans=0.125
2024-06-21 09:26:59,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=360644.1666666667, ans=0.125
2024-06-21 09:27:01,227 INFO [train.py:1028] (0/2) Epoch 20, batch 4500, loss[loss=0.1769, simple_loss=0.2288, pruned_loss=0.06255, over 13217.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2365, pruned_loss=0.06788, over 2583987.37 frames. ], batch size: 89, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:27:08,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=360680.8333333333, ans=0.125
2024-06-21 09:27:10,800 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.25 vs. limit=10.0
2024-06-21 09:27:17,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.89 vs. limit=22.5
2024-06-21 09:27:19,128 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.926e+02 2.034e+02 2.197e+02 2.807e+02, threshold=4.069e+02, percent-clipped=0.0
2024-06-21 09:27:21,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=360717.5, ans=0.0
2024-06-21 09:27:31,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=360735.8333333333, ans=0.04949747468305833
2024-06-21 09:27:33,979 INFO [train.py:1028] (0/2) Epoch 20, batch 4550, loss[loss=0.1758, simple_loss=0.2312, pruned_loss=0.06019, over 13298.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2366, pruned_loss=0.06787, over 2588228.70 frames. ], batch size: 52, lr: 2.91e-03, grad_scale: 16.0
2024-06-21 09:27:36,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=360754.1666666667, ans=0.125
2024-06-21 09:27:42,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=360772.5, ans=0.125
2024-06-21 09:28:04,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=360827.5, ans=0.125
2024-06-21 09:28:04,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=360827.5, ans=0.125
2024-06-21 09:28:11,150 INFO [train.py:1028] (0/2) Epoch 20, batch 4600, loss[loss=0.2159, simple_loss=0.2589, pruned_loss=0.08647, over 12441.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2366, pruned_loss=0.06796, over 2583657.85 frames. ], batch size: 202, lr: 2.91e-03, grad_scale: 16.0
2024-06-21 09:28:12,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.77 vs. limit=22.5
2024-06-21 09:28:14,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=360845.8333333333, ans=0.07
2024-06-21 09:28:28,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=360882.5, ans=0.05
2024-06-21 09:28:28,495 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 2.026e+02 2.145e+02 2.422e+02 3.065e+02, threshold=4.289e+02, percent-clipped=0.0
2024-06-21 09:28:29,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=360882.5, ans=0.125
2024-06-21 09:28:34,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=360900.8333333333, ans=15.0
2024-06-21 09:28:46,600 INFO [train.py:1028] (0/2) Epoch 20, batch 4650, loss[loss=0.1859, simple_loss=0.2268, pruned_loss=0.07253, over 13087.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2362, pruned_loss=0.06778, over 2586166.92 frames. ], batch size: 132, lr: 2.91e-03, grad_scale: 16.0
2024-06-21 09:28:46,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=360937.5, ans=0.125
2024-06-21 09:28:53,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=360955.8333333333, ans=0.0
2024-06-21 09:29:01,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=360974.1666666667, ans=0.2
2024-06-21 09:29:01,730 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0
2024-06-21 09:29:04,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=360974.1666666667, ans=0.0
2024-06-21 09:29:19,189 INFO [train.py:1028] (0/2) Epoch 20, batch 4700, loss[loss=0.1708, simple_loss=0.2301, pruned_loss=0.05577, over 13036.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2364, pruned_loss=0.06773, over 2581991.54 frames. ], batch size: 26, lr: 2.91e-03, grad_scale: 16.0
2024-06-21 09:29:22,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=361029.1666666667, ans=0.0
2024-06-21 09:29:25,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0
2024-06-21 09:29:34,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0
2024-06-21 09:29:36,313 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 1.991e+02 2.163e+02 2.422e+02 3.211e+02, threshold=4.325e+02, percent-clipped=0.0
2024-06-21 09:29:41,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=361084.1666666667, ans=0.125
2024-06-21 09:29:42,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=361084.1666666667, ans=0.0
2024-06-21 09:29:42,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=361084.1666666667, ans=0.04949747468305833
2024-06-21 09:29:51,772 INFO [train.py:1028] (0/2) Epoch 20, batch 4750, loss[loss=0.2107, simple_loss=0.2588, pruned_loss=0.08128, over 12538.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2364, pruned_loss=0.06808, over 2578765.58 frames. ], batch size: 202, lr: 2.91e-03, grad_scale: 16.0
2024-06-21 09:29:56,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=361120.8333333333, ans=15.0
2024-06-21 09:30:00,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.02 vs. limit=22.5
2024-06-21 09:30:20,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=361175.8333333333, ans=0.5
2024-06-21 09:30:21,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=361194.1666666667, ans=0.0
2024-06-21 09:30:24,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=361194.1666666667, ans=0.0
2024-06-21 09:30:28,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361194.1666666667, ans=0.1
2024-06-21 09:30:29,594 INFO [train.py:1028] (0/2) Epoch 20, batch 4800, loss[loss=0.1795, simple_loss=0.2292, pruned_loss=0.0649, over 13268.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.236, pruned_loss=0.0681, over 2575232.14 frames. ], batch size: 63, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:30:32,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.58 vs. limit=15.0
2024-06-21 09:30:45,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.91 vs. limit=15.0
2024-06-21 09:30:53,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.713e+02 1.956e+02 2.139e+02 2.366e+02 3.039e+02, threshold=4.278e+02, percent-clipped=0.0
2024-06-21 09:30:54,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=361249.1666666667, ans=0.125
2024-06-21 09:30:54,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=361249.1666666667, ans=22.5
2024-06-21 09:30:55,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=361267.5, ans=0.0
2024-06-21 09:31:02,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=361285.8333333333, ans=0.2
2024-06-21 09:31:04,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=361285.8333333333, ans=0.0
2024-06-21 09:31:04,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=361285.8333333333, ans=0.125
2024-06-21 09:31:09,270 INFO [train.py:1028] (0/2) Epoch 20, batch 4850, loss[loss=0.1763, simple_loss=0.2263, pruned_loss=0.06318, over 13247.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2354, pruned_loss=0.06761, over 2572502.19 frames. ], batch size: 89, lr: 2.91e-03, grad_scale: 32.0
2024-06-21 09:31:16,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=361322.5, ans=0.0
2024-06-21 09:31:16,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=361322.5, ans=0.1
2024-06-21 09:31:18,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.19 vs. limit=15.0
2024-06-21 09:31:19,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=361322.5, ans=0.0
2024-06-21 09:31:21,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=361322.5, ans=0.125
2024-06-21 09:31:27,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=361340.8333333333, ans=0.0
2024-06-21 09:31:37,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=361377.5, ans=0.125
2024-06-21 09:31:41,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=361377.5, ans=0.125
2024-06-21 09:31:44,279 INFO [train.py:1028] (0/2) Epoch 20, batch 4900, loss[loss=0.1942, simple_loss=0.2494, pruned_loss=0.06948, over 13170.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2358, pruned_loss=0.06781, over 2572753.00 frames. ], batch size: 59, lr: 2.90e-03, grad_scale: 32.0
2024-06-21 09:31:50,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361414.1666666667, ans=0.1
2024-06-21 09:32:07,292 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 1.960e+02 2.104e+02 2.344e+02 3.133e+02, threshold=4.209e+02, percent-clipped=0.0
2024-06-21 09:32:07,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=361432.5, ans=0.125
2024-06-21 09:32:09,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=361450.8333333333, ans=0.0
2024-06-21 09:32:16,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=361469.1666666667, ans=0.0
2024-06-21 09:32:16,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0
2024-06-21 09:32:21,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=361469.1666666667, ans=0.0
2024-06-21 09:32:21,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=361487.5, ans=0.125
2024-06-21 09:32:22,344 INFO [train.py:1028] (0/2) Epoch 20, batch 4950, loss[loss=0.2005, simple_loss=0.2433, pruned_loss=0.07883, over 10831.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2354, pruned_loss=0.06777, over 2567448.02 frames. ], batch size: 303, lr: 2.90e-03, grad_scale: 16.0
2024-06-21 09:32:27,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=361487.5, ans=0.0
2024-06-21 09:32:33,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=361505.8333333333, ans=0.0
2024-06-21 09:32:35,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=361524.1666666667, ans=0.2
2024-06-21 09:32:53,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=361560.8333333333, ans=0.125
2024-06-21 09:32:53,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=361560.8333333333, ans=0.125
2024-06-21 09:32:58,458 INFO [train.py:1028] (0/2) Epoch 20, batch 5000, loss[loss=0.1858, simple_loss=0.229, pruned_loss=0.07128, over 13172.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2345, pruned_loss=0.06732, over 2572093.23 frames. ], batch size: 95, lr: 2.90e-03, grad_scale: 16.0
2024-06-21 09:33:02,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=361579.1666666667, ans=0.0
2024-06-21 09:33:03,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=361579.1666666667, ans=0.0
2024-06-21 09:33:07,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.50 vs. limit=22.5
2024-06-21 09:33:07,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=361597.5, ans=0.125
2024-06-21 09:33:07,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.65 vs. limit=22.5
2024-06-21 09:33:15,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=361615.8333333333, ans=0.0
2024-06-21 09:33:16,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=361615.8333333333, ans=0.125
2024-06-21 09:33:17,823 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 1.929e+02 2.036e+02 2.247e+02 3.575e+02, threshold=4.073e+02, percent-clipped=0.0
2024-06-21 09:33:32,410 INFO [train.py:1028] (0/2) Epoch 20, batch 5050, loss[loss=0.1888, simple_loss=0.2412, pruned_loss=0.06818, over 12984.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2349, pruned_loss=0.06738, over 2571221.42 frames. ], batch size: 36, lr: 2.90e-03, grad_scale: 16.0
2024-06-21 09:33:36,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=361670.8333333333, ans=0.0
2024-06-21 09:33:44,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.37 vs. limit=15.0
2024-06-21 09:33:57,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=361725.8333333333, ans=0.05
2024-06-21 09:34:09,247 INFO [train.py:1028] (0/2) Epoch 20, batch 5100, loss[loss=0.1807, simple_loss=0.2383, pruned_loss=0.06155, over 12960.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2352, pruned_loss=0.06773, over 2567633.89 frames. ], batch size: 39, lr: 2.90e-03, grad_scale: 16.0
2024-06-21 09:34:11,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=22.5
2024-06-21 09:34:27,935 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 1.920e+02 2.021e+02 2.178e+02 2.707e+02, threshold=4.042e+02, percent-clipped=0.0
2024-06-21 09:34:34,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=361835.8333333333, ans=0.125
2024-06-21 09:34:45,236 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=15.0
2024-06-21 09:34:45,383 INFO [train.py:1028] (0/2) Epoch 20, batch 5150, loss[loss=0.1678, simple_loss=0.2129, pruned_loss=0.06139, over 13135.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.235, pruned_loss=0.0676, over 2570431.50 frames. ], batch size: 132, lr: 2.90e-03, grad_scale: 16.0
2024-06-21 09:34:52,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=361872.5, ans=0.125
2024-06-21 09:34:57,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=361872.5, ans=0.125
2024-06-21 09:35:18,440 INFO [train.py:1028] (0/2) Epoch 20, batch 5200, loss[loss=0.1771, simple_loss=0.2215, pruned_loss=0.06635, over 13181.00 frames.
], tot_loss[loss=0.1848, simple_loss=0.2347, pruned_loss=0.06745, over 2573111.48 frames. ], batch size: 95, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:35:27,781 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0 2024-06-21 09:35:36,674 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 1.965e+02 2.076e+02 2.260e+02 3.205e+02, threshold=4.151e+02, percent-clipped=0.0 2024-06-21 09:35:39,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=362000.8333333333, ans=0.125 2024-06-21 09:35:44,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=362019.1666666667, ans=0.0 2024-06-21 09:35:44,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=362019.1666666667, ans=0.125 2024-06-21 09:35:51,179 INFO [train.py:1028] (0/2) Epoch 20, batch 5250, loss[loss=0.1776, simple_loss=0.2327, pruned_loss=0.06121, over 13327.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2351, pruned_loss=0.06765, over 2569637.98 frames. ], batch size: 52, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:35:53,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362037.5, ans=0.1 2024-06-21 09:35:53,387 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:35:58,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2024-06-21 09:36:00,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=362055.8333333333, ans=0.0 2024-06-21 09:36:19,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=362092.5, ans=0.0 2024-06-21 09:36:22,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=362092.5, ans=0.125 2024-06-21 09:36:28,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=362110.8333333333, ans=0.0 2024-06-21 09:36:30,025 INFO [train.py:1028] (0/2) Epoch 20, batch 5300, loss[loss=0.1797, simple_loss=0.2307, pruned_loss=0.06435, over 13020.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2345, pruned_loss=0.06726, over 2566503.15 frames. 
], batch size: 144, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:36:32,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=362129.1666666667, ans=0.125 2024-06-21 09:36:32,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=362129.1666666667, ans=0.125 2024-06-21 09:36:52,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 1.930e+02 1.996e+02 2.162e+02 2.822e+02, threshold=3.992e+02, percent-clipped=0.0 2024-06-21 09:36:56,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=362184.1666666667, ans=0.2 2024-06-21 09:36:56,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362184.1666666667, ans=0.1 2024-06-21 09:36:56,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=362184.1666666667, ans=0.125 2024-06-21 09:36:58,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=362184.1666666667, ans=0.1 2024-06-21 09:37:05,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=362202.5, ans=0.0 2024-06-21 09:37:05,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362202.5, ans=0.1 2024-06-21 09:37:06,926 INFO [train.py:1028] (0/2) Epoch 20, batch 5350, loss[loss=0.1727, simple_loss=0.2316, pruned_loss=0.05688, over 12276.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2345, pruned_loss=0.06735, over 2574348.90 frames. ], batch size: 18, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:37:33,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=362294.1666666667, ans=0.0 2024-06-21 09:37:37,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2024-06-21 09:37:39,324 INFO [train.py:1028] (0/2) Epoch 20, batch 5400, loss[loss=0.2129, simple_loss=0.252, pruned_loss=0.08691, over 12242.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2345, pruned_loss=0.06752, over 2567960.52 frames. ], batch size: 240, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:37:43,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2024-06-21 09:37:53,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2024-06-21 09:38:02,647 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 1.962e+02 2.110e+02 2.280e+02 2.839e+02, threshold=4.221e+02, percent-clipped=0.0 2024-06-21 09:38:06,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362367.5, ans=0.1 2024-06-21 09:38:14,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=362385.8333333333, ans=0.125 2024-06-21 09:38:17,512 INFO [train.py:1028] (0/2) Epoch 20, batch 5450, loss[loss=0.181, simple_loss=0.2304, pruned_loss=0.06578, over 12907.00 frames. 
], tot_loss[loss=0.1848, simple_loss=0.2348, pruned_loss=0.06739, over 2572144.24 frames. ], batch size: 26, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:38:24,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=362422.5, ans=0.0 2024-06-21 09:38:31,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=362440.8333333333, ans=0.0 2024-06-21 09:38:37,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=362440.8333333333, ans=0.125 2024-06-21 09:38:41,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=362459.1666666667, ans=0.0 2024-06-21 09:38:42,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=362459.1666666667, ans=0.125 2024-06-21 09:38:51,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=362477.5, ans=0.0 2024-06-21 09:38:54,293 INFO [train.py:1028] (0/2) Epoch 20, batch 5500, loss[loss=0.213, simple_loss=0.2524, pruned_loss=0.08678, over 12098.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2345, pruned_loss=0.06742, over 2563848.65 frames. ], batch size: 240, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:38:55,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=362495.8333333333, ans=0.0 2024-06-21 09:39:03,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=362514.1666666667, ans=0.5 2024-06-21 09:39:06,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=362514.1666666667, ans=0.0 2024-06-21 09:39:13,056 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.664e+02 1.990e+02 2.133e+02 2.331e+02 2.975e+02, threshold=4.266e+02, percent-clipped=0.0 2024-06-21 09:39:22,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=362569.1666666667, ans=0.025 2024-06-21 09:39:25,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=362569.1666666667, ans=0.0 2024-06-21 09:39:25,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=362569.1666666667, ans=0.025 2024-06-21 09:39:27,595 INFO [train.py:1028] (0/2) Epoch 20, batch 5550, loss[loss=0.1878, simple_loss=0.2349, pruned_loss=0.07033, over 13211.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2337, pruned_loss=0.06693, over 2567251.89 frames. 
], batch size: 43, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:39:37,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=362605.8333333333, ans=0.025 2024-06-21 09:39:48,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=362642.5, ans=0.125 2024-06-21 09:39:49,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=362642.5, ans=0.125 2024-06-21 09:39:52,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=362642.5, ans=0.0 2024-06-21 09:39:52,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=362642.5, ans=0.0 2024-06-21 09:39:54,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.69 vs. limit=15.0 2024-06-21 09:39:54,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=362660.8333333333, ans=0.05 2024-06-21 09:40:00,423 INFO [train.py:1028] (0/2) Epoch 20, batch 5600, loss[loss=0.1747, simple_loss=0.2248, pruned_loss=0.06235, over 13216.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2332, pruned_loss=0.06655, over 2569929.86 frames. ], batch size: 89, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:40:13,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=362697.5, ans=0.07 2024-06-21 09:40:22,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 1.910e+02 2.001e+02 2.140e+02 3.055e+02, threshold=4.002e+02, percent-clipped=0.0 2024-06-21 09:40:26,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=362734.1666666667, ans=0.2 2024-06-21 09:40:27,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362734.1666666667, ans=0.1 2024-06-21 09:40:29,031 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:40:36,608 INFO [train.py:1028] (0/2) Epoch 20, batch 5650, loss[loss=0.1904, simple_loss=0.2388, pruned_loss=0.071, over 12556.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2338, pruned_loss=0.06676, over 2574403.87 frames. ], batch size: 202, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:40:37,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=362770.8333333333, ans=0.0 2024-06-21 09:40:40,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=362770.8333333333, ans=0.025 2024-06-21 09:41:00,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362825.8333333333, ans=0.1 2024-06-21 09:41:13,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.10 vs. limit=10.0 2024-06-21 09:41:13,451 INFO [train.py:1028] (0/2) Epoch 20, batch 5700, loss[loss=0.1745, simple_loss=0.2296, pruned_loss=0.05972, over 13258.00 frames. 
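The dense "ScheduledFloat: name=..., batch_count=..., ans=..." records are regularizer hyperparameters (dropout probabilities, skip rates, balancer limits) evaluated as functions of how many batches have been trained; "ans" is the value in effect at the given batch_count. A piecewise-linear schedule is one simple way to get this behavior; the sketch below is an assumption-level illustration, not the ScheduledFloat class from icefall's zipformer code:

```python
# Hedged sketch of a batch-count-dependent scheduled hyperparameter.
class PiecewiseLinearFloat:
    def __init__(self, *points: tuple[float, float]):
        # (batch_count, value) pairs; values are held constant outside the range
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        raise AssertionError("unreachable for sorted points")

# A skip-rate that anneals from 0.5 to 0.0 over the first 50k batches would
# report ans=0.0 at the batch counts (~361k-366k) seen in this stretch of log:
skip_rate = PiecewiseLinearFloat((0.0, 0.5), (50000.0, 0.0))
assert skip_rate(362000.0) == 0.0
```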
], tot_loss[loss=0.1831, simple_loss=0.2332, pruned_loss=0.06649, over 2578164.18 frames. ], batch size: 63, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:41:13,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=362862.5, ans=0.0 2024-06-21 09:41:16,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362862.5, ans=0.1 2024-06-21 09:41:16,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=362862.5, ans=0.025 2024-06-21 09:41:21,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=362880.8333333333, ans=0.125 2024-06-21 09:41:29,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=362899.1666666667, ans=0.125 2024-06-21 09:41:30,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=362899.1666666667, ans=0.0 2024-06-21 09:41:31,449 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.774e+02 1.955e+02 2.086e+02 2.304e+02 2.978e+02, threshold=4.172e+02, percent-clipped=0.0 2024-06-21 09:41:36,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=362917.5, ans=0.025 2024-06-21 09:41:37,616 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.37 vs. limit=15.0 2024-06-21 09:41:37,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=362917.5, ans=0.125 2024-06-21 09:41:41,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=362935.8333333333, ans=0.0 2024-06-21 09:41:45,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=362954.1666666667, ans=0.125 2024-06-21 09:41:45,581 INFO [train.py:1028] (0/2) Epoch 20, batch 5750, loss[loss=0.1858, simple_loss=0.2312, pruned_loss=0.07015, over 12724.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2343, pruned_loss=0.06705, over 2580017.03 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:41:47,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=362954.1666666667, ans=0.125 2024-06-21 09:42:02,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=15.0 2024-06-21 09:42:04,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=362990.8333333333, ans=0.125 2024-06-21 09:42:16,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=363027.5, ans=0.0 2024-06-21 09:42:21,506 INFO [train.py:1028] (0/2) Epoch 20, batch 5800, loss[loss=0.2121, simple_loss=0.2569, pruned_loss=0.08366, over 12771.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2363, pruned_loss=0.06839, over 2578461.63 frames. 
], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:42:24,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=363045.8333333333, ans=0.125 2024-06-21 09:42:28,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=363064.1666666667, ans=0.125 2024-06-21 09:42:43,271 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.062e+02 2.209e+02 2.485e+02 3.394e+02, threshold=4.418e+02, percent-clipped=0.0 2024-06-21 09:42:46,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=363100.8333333333, ans=0.0 2024-06-21 09:42:52,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=363119.1666666667, ans=0.09899494936611666 2024-06-21 09:42:58,169 INFO [train.py:1028] (0/2) Epoch 20, batch 5850, loss[loss=0.2076, simple_loss=0.2555, pruned_loss=0.07983, over 12552.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2379, pruned_loss=0.06909, over 2576767.73 frames. ], batch size: 202, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:43:01,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=30.74 vs. limit=22.5 2024-06-21 09:43:05,840 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:43:11,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=363174.1666666667, ans=0.125 2024-06-21 09:43:18,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=363192.5, ans=0.0 2024-06-21 09:43:21,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=363192.5, ans=0.0 2024-06-21 09:43:28,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=363210.8333333333, ans=0.0 2024-06-21 09:43:31,189 INFO [train.py:1028] (0/2) Epoch 20, batch 5900, loss[loss=0.1777, simple_loss=0.2252, pruned_loss=0.06506, over 13071.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2398, pruned_loss=0.06977, over 2576573.04 frames. 
], batch size: 121, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:43:32,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363229.1666666667, ans=0.125 2024-06-21 09:43:43,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=363247.5, ans=0.125 2024-06-21 09:43:48,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=363265.8333333333, ans=0.0 2024-06-21 09:43:49,828 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 1.999e+02 2.139e+02 2.355e+02 3.485e+02, threshold=4.278e+02, percent-clipped=0.0 2024-06-21 09:43:51,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363284.1666666667, ans=0.125 2024-06-21 09:44:08,171 INFO [train.py:1028] (0/2) Epoch 20, batch 5950, loss[loss=0.1822, simple_loss=0.2303, pruned_loss=0.0671, over 13093.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2409, pruned_loss=0.06994, over 2580939.68 frames. ], batch size: 121, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:44:19,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=363339.1666666667, ans=0.125 2024-06-21 09:44:20,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=15.0 2024-06-21 09:44:44,805 INFO [train.py:1028] (0/2) Epoch 20, batch 6000, loss[loss=0.2327, simple_loss=0.2753, pruned_loss=0.0951, over 12218.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2419, pruned_loss=0.07031, over 2574970.85 frames. ], batch size: 240, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:44:44,806 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 09:44:52,766 INFO [train.py:1060] (0/2) Epoch 20, validation: loss=0.1874, simple_loss=0.2514, pruned_loss=0.06175, over 351949.00 frames. 2024-06-21 09:44:52,766 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 09:44:53,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=363412.5, ans=10.0 2024-06-21 09:44:59,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=363430.8333333333, ans=0.125 2024-06-21 09:44:59,536 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:45:06,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=363449.1666666667, ans=0.0 2024-06-21 09:45:11,466 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.069e+02 2.236e+02 2.466e+02 3.790e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 09:45:21,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2024-06-21 09:45:23,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.54 vs. 
limit=10.0 2024-06-21 09:45:26,223 INFO [train.py:1028] (0/2) Epoch 20, batch 6050, loss[loss=0.2034, simple_loss=0.2638, pruned_loss=0.07147, over 12928.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2438, pruned_loss=0.07082, over 2577517.60 frames. ], batch size: 39, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:45:27,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=363504.1666666667, ans=0.2 2024-06-21 09:45:30,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.91 vs. limit=10.0 2024-06-21 09:45:39,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=363540.8333333333, ans=0.04949747468305833 2024-06-21 09:45:49,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=363559.1666666667, ans=0.2 2024-06-21 09:45:55,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=363577.5, ans=0.0 2024-06-21 09:45:59,951 INFO [train.py:1028] (0/2) Epoch 20, batch 6100, loss[loss=0.1926, simple_loss=0.2445, pruned_loss=0.07037, over 13132.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2452, pruned_loss=0.07149, over 2578745.10 frames. ], batch size: 121, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:46:03,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=363595.8333333333, ans=0.2 2024-06-21 09:46:12,761 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.48 vs. limit=22.5 2024-06-21 09:46:23,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.764e+02 2.015e+02 2.173e+02 2.355e+02 4.142e+02, threshold=4.346e+02, percent-clipped=0.0 2024-06-21 09:46:24,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=363650.8333333333, ans=0.1 2024-06-21 09:46:29,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.04 vs. limit=15.0 2024-06-21 09:46:31,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=363669.1666666667, ans=0.04949747468305833 2024-06-21 09:46:36,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=363669.1666666667, ans=0.0 2024-06-21 09:46:38,124 INFO [train.py:1028] (0/2) Epoch 20, batch 6150, loss[loss=0.2029, simple_loss=0.2468, pruned_loss=0.07953, over 10792.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2468, pruned_loss=0.07215, over 2576557.33 frames. 
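Interleaved with training, train.py runs a periodic validation pass, visible above at batch 6000 as the "Computing validation loss" / "validation: loss=0.1874 ... over 351949.00 frames" pair. A generic no-grad sketch of such a pass, with an assumed batch layout and a hypothetical model signature (the real loop in train.py differs in detail):

```python
# Sketch of a frame-weighted validation pass; `batch["features"]` and the
# `model(feats) -> (loss, num_frames)` signature are assumptions for
# illustration, not icefall's actual interfaces.
import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, device) -> tuple[float, float]:
    model.eval()
    loss_sum, frames = 0.0, 0.0
    for batch in dev_loader:
        feats = batch["features"].to(device)   # assumed batch layout
        loss, num_frames = model(feats)        # hypothetical signature
        loss_sum += float(loss) * float(num_frames)
        frames += float(num_frames)
    model.train()
    return loss_sum / max(frames, 1.0), frames
```

Weighting by frame count, as in the "over 351949.00 frames" report, makes the validation loss independent of how utterances happen to be batched.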
], batch size: 303, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:46:41,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363687.5, ans=0.125 2024-06-21 09:46:50,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=363705.8333333333, ans=0.125 2024-06-21 09:46:51,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=363705.8333333333, ans=0.0 2024-06-21 09:46:53,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2024-06-21 09:46:55,548 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:47:13,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.45 vs. limit=6.0 2024-06-21 09:47:14,735 INFO [train.py:1028] (0/2) Epoch 20, batch 6200, loss[loss=0.2135, simple_loss=0.2742, pruned_loss=0.07638, over 13284.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2484, pruned_loss=0.07269, over 2573360.87 frames. ], batch size: 89, lr: 2.90e-03, grad_scale: 32.0 2024-06-21 09:47:26,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=363797.5, ans=0.0 2024-06-21 09:47:33,341 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.066e+02 2.295e+02 2.576e+02 4.288e+02, threshold=4.590e+02, percent-clipped=0.0 2024-06-21 09:47:48,056 INFO [train.py:1028] (0/2) Epoch 20, batch 6250, loss[loss=0.217, simple_loss=0.271, pruned_loss=0.0815, over 13196.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2497, pruned_loss=0.07331, over 2566214.55 frames. ], batch size: 83, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:47:49,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363870.8333333333, ans=0.125 2024-06-21 09:47:49,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=363870.8333333333, ans=0.125 2024-06-21 09:47:56,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=363889.1666666667, ans=0.125 2024-06-21 09:48:10,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=363925.8333333333, ans=22.5 2024-06-21 09:48:19,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363944.1666666667, ans=0.125 2024-06-21 09:48:23,685 INFO [train.py:1028] (0/2) Epoch 20, batch 6300, loss[loss=0.2002, simple_loss=0.2613, pruned_loss=0.06949, over 11417.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2512, pruned_loss=0.07382, over 2561433.96 frames. 
], batch size: 16, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:48:24,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=363962.5, ans=0.2 2024-06-21 09:48:29,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=363980.8333333333, ans=0.0 2024-06-21 09:48:30,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-21 09:48:30,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.47 vs. limit=22.5 2024-06-21 09:48:34,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=363980.8333333333, ans=0.0 2024-06-21 09:48:42,318 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.710e+02 2.091e+02 2.323e+02 2.620e+02 4.679e+02, threshold=4.647e+02, percent-clipped=1.0 2024-06-21 09:48:42,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=363999.1666666667, ans=0.2 2024-06-21 09:48:43,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364017.5, ans=0.1 2024-06-21 09:48:43,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=364017.5, ans=0.0 2024-06-21 09:48:50,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=364017.5, ans=0.125 2024-06-21 09:48:57,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=364035.8333333333, ans=0.125 2024-06-21 09:49:00,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=364054.1666666667, ans=0.0 2024-06-21 09:49:01,037 INFO [train.py:1028] (0/2) Epoch 20, batch 6350, loss[loss=0.2472, simple_loss=0.2936, pruned_loss=0.1004, over 12519.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2525, pruned_loss=0.07382, over 2570539.15 frames. ], batch size: 202, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:49:05,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=364054.1666666667, ans=0.0 2024-06-21 09:49:07,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=364072.5, ans=0.125 2024-06-21 09:49:11,666 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.15 vs. limit=15.0 2024-06-21 09:49:18,694 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:49:33,713 INFO [train.py:1028] (0/2) Epoch 20, batch 6400, loss[loss=0.1735, simple_loss=0.2302, pruned_loss=0.05834, over 13248.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2544, pruned_loss=0.07452, over 2572846.05 frames. 
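The "Whitening: name=..., num_groups=..., num_channels=..., metric=M vs. limit=L" records (e.g. metric=9.84 vs. limit=15.0 and metric=20.47 vs. limit=22.5 above) compare a per-module statistic of the activation covariance against a limit; the Zipformer uses such constraints to keep feature covariances close to isotropic. One standard metric with this flavor is the ratio of the mean squared eigenvalue to the squared mean eigenvalue, which equals 1.0 for a perfectly white covariance and grows as a few directions dominate; whether scaling.py computes exactly this formula is an assumption, so treat the sketch as illustrative:

```python
# Hedged sketch of a covariance-whitening diagnostic and its hinge-style
# penalty; the exact formula in icefall's scaling.py may differ.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one whitening group."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]                  # (C, C) covariance
    eigs = torch.linalg.eigvalsh(cov).clamp(min=1e-20)
    # 1.0 for an isotropic ("white") covariance; larger otherwise.
    return (eigs ** 2).mean() / eigs.mean() ** 2

def whitening_penalty(x: torch.Tensor, limit: float) -> torch.Tensor:
    # Penalize only the excess over the limit, so the constraint goes
    # inactive once the activations are sufficiently decorrelated.
    return torch.relu(whitening_metric(x) - limit)

x = torch.randn(2000, 384)   # already-white activations, 384 channels as above
print(f"metric={whitening_metric(x):.2f} vs. limit=15.0")  # metric near 1.0
```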
], batch size: 67, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:49:37,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364145.8333333333, ans=0.125 2024-06-21 09:49:37,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=364145.8333333333, ans=0.0 2024-06-21 09:49:45,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=364164.1666666667, ans=0.0 2024-06-21 09:49:46,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=364182.5, ans=0.04949747468305833 2024-06-21 09:49:47,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=364182.5, ans=0.0 2024-06-21 09:49:52,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.063e+02 2.243e+02 2.479e+02 3.217e+02, threshold=4.485e+02, percent-clipped=0.0 2024-06-21 09:49:56,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364200.8333333333, ans=0.1 2024-06-21 09:49:57,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2024-06-21 09:49:58,184 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:50:03,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=364219.1666666667, ans=0.125 2024-06-21 09:50:04,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=364219.1666666667, ans=0.125 2024-06-21 09:50:06,531 INFO [train.py:1028] (0/2) Epoch 20, batch 6450, loss[loss=0.2252, simple_loss=0.2784, pruned_loss=0.08602, over 12546.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.255, pruned_loss=0.07441, over 2579289.56 frames. ], batch size: 202, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:50:07,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=364237.5, ans=0.125 2024-06-21 09:50:16,513 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.66 vs. limit=22.5 2024-06-21 09:50:22,906 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.41 vs. limit=15.0 2024-06-21 09:50:30,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.22 vs. limit=15.0 2024-06-21 09:50:39,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=364310.8333333333, ans=0.125 2024-06-21 09:50:41,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364310.8333333333, ans=0.1 2024-06-21 09:50:44,390 INFO [train.py:1028] (0/2) Epoch 20, batch 6500, loss[loss=0.2289, simple_loss=0.2671, pruned_loss=0.09536, over 10684.00 frames. 
], tot_loss[loss=0.203, simple_loss=0.2565, pruned_loss=0.0748, over 2582767.68 frames. ], batch size: 303, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:50:48,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=364329.1666666667, ans=0.125 2024-06-21 09:50:55,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=364347.5, ans=0.125 2024-06-21 09:51:06,097 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.114e+02 2.223e+02 2.494e+02 3.269e+02, threshold=4.445e+02, percent-clipped=0.0 2024-06-21 09:51:07,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=364384.1666666667, ans=0.2 2024-06-21 09:51:15,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=364402.5, ans=0.125 2024-06-21 09:51:20,549 INFO [train.py:1028] (0/2) Epoch 20, batch 6550, loss[loss=0.1682, simple_loss=0.2296, pruned_loss=0.05345, over 12546.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2576, pruned_loss=0.0749, over 2586386.13 frames. ], batch size: 22, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:51:41,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=364475.8333333333, ans=0.125 2024-06-21 09:51:46,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=364494.1666666667, ans=0.2 2024-06-21 09:51:49,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.53 vs. limit=6.0 2024-06-21 09:51:52,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364512.5, ans=0.1 2024-06-21 09:51:53,276 INFO [train.py:1028] (0/2) Epoch 20, batch 6600, loss[loss=0.1819, simple_loss=0.2447, pruned_loss=0.05956, over 13240.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.258, pruned_loss=0.07513, over 2588845.82 frames. ], batch size: 72, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:51:54,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=364512.5, ans=0.125 2024-06-21 09:52:07,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=364549.1666666667, ans=0.125 2024-06-21 09:52:12,279 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.095e+02 2.268e+02 2.540e+02 3.528e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-21 09:52:14,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=364567.5, ans=0.125 2024-06-21 09:52:24,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=364585.8333333333, ans=0.09899494936611666 2024-06-21 09:52:26,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364585.8333333333, ans=0.1 2024-06-21 09:52:29,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.11 vs. 
limit=12.0 2024-06-21 09:52:31,184 INFO [train.py:1028] (0/2) Epoch 20, batch 6650, loss[loss=0.2225, simple_loss=0.2701, pruned_loss=0.08747, over 12895.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2595, pruned_loss=0.07577, over 2582714.28 frames. ], batch size: 158, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:52:44,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=364640.8333333333, ans=0.04949747468305833 2024-06-21 09:52:57,674 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2024-06-21 09:53:07,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=364695.8333333333, ans=0.125 2024-06-21 09:53:08,139 INFO [train.py:1028] (0/2) Epoch 20, batch 6700, loss[loss=0.213, simple_loss=0.2657, pruned_loss=0.08017, over 12835.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.262, pruned_loss=0.07711, over 2582685.76 frames. ], batch size: 177, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:53:13,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=364695.8333333333, ans=0.0 2024-06-21 09:53:17,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=364714.1666666667, ans=0.2 2024-06-21 09:53:25,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=364732.5, ans=0.025 2024-06-21 09:53:26,942 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.099e+02 2.245e+02 2.533e+02 3.822e+02, threshold=4.490e+02, percent-clipped=0.0 2024-06-21 09:53:27,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=364732.5, ans=0.125 2024-06-21 09:53:41,360 INFO [train.py:1028] (0/2) Epoch 20, batch 6750, loss[loss=0.228, simple_loss=0.2712, pruned_loss=0.09245, over 12320.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2628, pruned_loss=0.07762, over 2577963.47 frames. ], batch size: 241, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:53:42,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=364787.5, ans=0.125 2024-06-21 09:53:55,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=364824.1666666667, ans=0.125 2024-06-21 09:53:58,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=364824.1666666667, ans=0.125 2024-06-21 09:54:12,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=364860.8333333333, ans=0.2 2024-06-21 09:54:14,129 INFO [train.py:1028] (0/2) Epoch 20, batch 6800, loss[loss=0.2076, simple_loss=0.2663, pruned_loss=0.07442, over 13305.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2641, pruned_loss=0.07775, over 2580321.76 frames. ], batch size: 67, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:54:28,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. 
limit=15.0 2024-06-21 09:54:32,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=364915.8333333333, ans=0.0 2024-06-21 09:54:32,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=364915.8333333333, ans=0.025 2024-06-21 09:54:35,714 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.103e+02 2.246e+02 2.436e+02 3.776e+02, threshold=4.493e+02, percent-clipped=0.0 2024-06-21 09:54:50,545 INFO [train.py:1028] (0/2) Epoch 20, batch 6850, loss[loss=0.2169, simple_loss=0.2826, pruned_loss=0.07557, over 13282.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2649, pruned_loss=0.078, over 2584532.03 frames. ], batch size: 63, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:54:59,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=364989.1666666667, ans=0.0 2024-06-21 09:55:10,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=365007.5, ans=0.0 2024-06-21 09:55:17,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365025.8333333333, ans=0.1 2024-06-21 09:55:18,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=365025.8333333333, ans=0.0 2024-06-21 09:55:26,608 INFO [train.py:1028] (0/2) Epoch 20, batch 6900, loss[loss=0.2283, simple_loss=0.2849, pruned_loss=0.08582, over 13319.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.266, pruned_loss=0.07854, over 2586385.90 frames. ], batch size: 49, lr: 2.89e-03, grad_scale: 32.0 2024-06-21 09:55:28,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=365062.5, ans=0.125 2024-06-21 09:55:29,282 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:55:32,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.55 vs. 
limit=15.0 2024-06-21 09:55:38,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365080.8333333333, ans=0.125 2024-06-21 09:55:39,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=365099.1666666667, ans=0.125 2024-06-21 09:55:44,608 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.109e+02 2.233e+02 2.426e+02 3.290e+02, threshold=4.467e+02, percent-clipped=0.0 2024-06-21 09:55:45,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=365117.5, ans=0.125 2024-06-21 09:55:54,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=365135.8333333333, ans=0.125 2024-06-21 09:55:56,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=365135.8333333333, ans=0.1 2024-06-21 09:55:58,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=365135.8333333333, ans=0.125 2024-06-21 09:55:59,291 INFO [train.py:1028] (0/2) Epoch 20, batch 6950, loss[loss=0.218, simple_loss=0.2692, pruned_loss=0.08344, over 11428.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.266, pruned_loss=0.0782, over 2581320.54 frames. ], batch size: 16, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:56:02,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=365154.1666666667, ans=0.2 2024-06-21 09:56:05,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. limit=6.0 2024-06-21 09:56:05,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=365172.5, ans=0.125 2024-06-21 09:56:07,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=365172.5, ans=0.125 2024-06-21 09:56:11,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=365190.8333333333, ans=0.0 2024-06-21 09:56:23,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=365209.1666666667, ans=0.0 2024-06-21 09:56:31,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=365227.5, ans=0.125 2024-06-21 09:56:35,537 INFO [train.py:1028] (0/2) Epoch 20, batch 7000, loss[loss=0.1994, simple_loss=0.2587, pruned_loss=0.07008, over 12980.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2657, pruned_loss=0.07779, over 2576495.64 frames. 
], batch size: 158, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:56:38,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=365245.8333333333, ans=0.025 2024-06-21 09:56:38,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=365245.8333333333, ans=0.05 2024-06-21 09:56:38,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=365245.8333333333, ans=0.0 2024-06-21 09:56:39,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=365245.8333333333, ans=0.125 2024-06-21 09:56:52,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2024-06-21 09:56:54,404 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.802e+02 2.092e+02 2.205e+02 2.416e+02 3.281e+02, threshold=4.410e+02, percent-clipped=0.0 2024-06-21 09:57:14,379 INFO [train.py:1028] (0/2) Epoch 20, batch 7050, loss[loss=0.2311, simple_loss=0.2821, pruned_loss=0.09004, over 12739.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2667, pruned_loss=0.07815, over 2582862.87 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:57:18,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=365337.5, ans=0.125 2024-06-21 09:57:28,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=365374.1666666667, ans=0.125 2024-06-21 09:57:30,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=365374.1666666667, ans=0.0 2024-06-21 09:57:32,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=365374.1666666667, ans=0.0 2024-06-21 09:57:46,922 INFO [train.py:1028] (0/2) Epoch 20, batch 7100, loss[loss=0.2147, simple_loss=0.2673, pruned_loss=0.08108, over 13186.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2668, pruned_loss=0.07833, over 2575900.06 frames. 
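Each "train.py:1028" record pairs a per-batch loss ("loss[... over N frames ]") with a running aggregate ("tot_loss[... over ~2.57e6 frames ]"). The fractional frame totals (e.g. 2575900.06 above) are consistent with frame-weighted sums that are exponentially decayed rather than reset, though that decay is an assumption here; the sketch below is one plausible reading, not train.py itself:

```python
# Hedged sketch: frame-weighted running losses with exponential decay,
# one plausible way to produce fractional totals like "over 2575900.06 frames".
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.weighted_loss = 0.0
        self.frames = 0.0    # decayed frame count; becomes fractional over time

    def update(self, batch_loss: float, batch_frames: int) -> None:
        self.weighted_loss = self.decay * self.weighted_loss + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.weighted_loss / max(self.frames, 1.0)

tot = RunningLoss()
for batch_loss, batch_frames in [(0.19, 13000), (0.21, 12500)]:  # toy values
    tot.update(batch_loss, batch_frames)
print(f"tot_loss[loss={tot.value:.4f}, over {tot.frames:.2f} frames. ]")
```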
], batch size: 112, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:57:50,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=365429.1666666667, ans=0.125 2024-06-21 09:57:50,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=365429.1666666667, ans=0.125 2024-06-21 09:57:56,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365447.5, ans=0.125 2024-06-21 09:57:57,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=365447.5, ans=0.125 2024-06-21 09:58:04,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=365465.8333333333, ans=0.125 2024-06-21 09:58:05,317 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.127e+02 2.309e+02 2.474e+02 3.679e+02, threshold=4.619e+02, percent-clipped=0.0 2024-06-21 09:58:09,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=365484.1666666667, ans=0.0 2024-06-21 09:58:12,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.17 vs. limit=22.5 2024-06-21 09:58:19,848 INFO [train.py:1028] (0/2) Epoch 20, batch 7150, loss[loss=0.243, simple_loss=0.2867, pruned_loss=0.0996, over 12518.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2671, pruned_loss=0.07848, over 2573706.56 frames. ], batch size: 202, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:58:21,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=365520.8333333333, ans=0.2 2024-06-21 09:58:25,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=365520.8333333333, ans=0.1 2024-06-21 09:58:31,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=365539.1666666667, ans=0.125 2024-06-21 09:58:44,551 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:58:51,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=365594.1666666667, ans=0.125 2024-06-21 09:58:53,585 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 09:58:55,995 INFO [train.py:1028] (0/2) Epoch 20, batch 7200, loss[loss=0.241, simple_loss=0.2981, pruned_loss=0.09193, over 13207.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2684, pruned_loss=0.07867, over 2578521.92 frames. ], batch size: 112, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:58:58,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=365612.5, ans=0.125 2024-06-21 09:58:59,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=365612.5, ans=0.2 2024-06-21 09:59:06,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. 
limit=15.0 2024-06-21 09:59:17,430 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.134e+02 2.298e+02 2.582e+02 4.025e+02, threshold=4.597e+02, percent-clipped=0.0 2024-06-21 09:59:20,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=365667.5, ans=0.025 2024-06-21 09:59:25,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=365685.8333333333, ans=0.0 2024-06-21 09:59:28,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=365685.8333333333, ans=0.025 2024-06-21 09:59:29,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=365685.8333333333, ans=0.2 2024-06-21 09:59:32,236 INFO [train.py:1028] (0/2) Epoch 20, batch 7250, loss[loss=0.1867, simple_loss=0.2512, pruned_loss=0.06108, over 13001.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2691, pruned_loss=0.07867, over 2578605.77 frames. ], batch size: 36, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 09:59:38,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=365722.5, ans=0.125 2024-06-21 09:59:42,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365722.5, ans=0.1 2024-06-21 09:59:45,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2024-06-21 09:59:56,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=365759.1666666667, ans=0.125 2024-06-21 09:59:56,872 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-21 09:59:59,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=365777.5, ans=0.1 2024-06-21 10:00:03,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=365777.5, ans=0.0 2024-06-21 10:00:04,850 INFO [train.py:1028] (0/2) Epoch 20, batch 7300, loss[loss=0.2054, simple_loss=0.2642, pruned_loss=0.07329, over 12868.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2703, pruned_loss=0.07923, over 2577785.85 frames. 
], batch size: 36, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:00:07,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=365795.8333333333, ans=0.125 2024-06-21 10:00:15,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=365814.1666666667, ans=10.0 2024-06-21 10:00:18,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=365832.5, ans=0.125 2024-06-21 10:00:23,785 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.125e+02 2.306e+02 2.527e+02 3.412e+02, threshold=4.613e+02, percent-clipped=0.0 2024-06-21 10:00:26,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=365850.8333333333, ans=0.125 2024-06-21 10:00:38,805 INFO [train.py:1028] (0/2) Epoch 20, batch 7350, loss[loss=0.241, simple_loss=0.3032, pruned_loss=0.08942, over 13282.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.271, pruned_loss=0.07957, over 2580882.12 frames. ], batch size: 46, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:00:48,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=365905.8333333333, ans=0.125 2024-06-21 10:00:52,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=365905.8333333333, ans=0.0 2024-06-21 10:01:09,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=365960.8333333333, ans=0.07 2024-06-21 10:01:14,506 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.23 vs. limit=6.0 2024-06-21 10:01:16,001 INFO [train.py:1028] (0/2) Epoch 20, batch 7400, loss[loss=0.2035, simple_loss=0.2691, pruned_loss=0.06893, over 13273.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2706, pruned_loss=0.07929, over 2586658.16 frames. ], batch size: 63, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:01:28,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=365997.5, ans=0.125 2024-06-21 10:01:28,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=365997.5, ans=0.0 2024-06-21 10:01:38,866 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.173e+02 2.370e+02 2.613e+02 3.530e+02, threshold=4.740e+02, percent-clipped=0.0 2024-06-21 10:01:47,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=366052.5, ans=0.0 2024-06-21 10:01:53,630 INFO [train.py:1028] (0/2) Epoch 20, batch 7450, loss[loss=0.1982, simple_loss=0.2525, pruned_loss=0.07194, over 12592.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.27, pruned_loss=0.07891, over 2579767.68 frames. ], batch size: 29, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:01:58,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.39 vs. 
limit=15.0 2024-06-21 10:02:03,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=366089.1666666667, ans=0.2 2024-06-21 10:02:11,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=366107.5, ans=0.0 2024-06-21 10:02:17,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=366125.8333333333, ans=0.09899494936611666 2024-06-21 10:02:18,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.06 vs. limit=22.5 2024-06-21 10:02:19,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=366125.8333333333, ans=0.0 2024-06-21 10:02:23,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=366144.1666666667, ans=0.125 2024-06-21 10:02:27,367 INFO [train.py:1028] (0/2) Epoch 20, batch 7500, loss[loss=0.2366, simple_loss=0.2786, pruned_loss=0.09731, over 10696.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2711, pruned_loss=0.07952, over 2577217.97 frames. ], batch size: 304, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:02:27,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366162.5, ans=0.1 2024-06-21 10:02:30,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=366162.5, ans=0.125 2024-06-21 10:02:32,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=366162.5, ans=0.05 2024-06-21 10:02:49,099 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.107e+02 2.234e+02 2.363e+02 3.209e+02, threshold=4.469e+02, percent-clipped=0.0 2024-06-21 10:02:51,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=366217.5, ans=0.2 2024-06-21 10:02:52,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=366217.5, ans=0.025 2024-06-21 10:02:55,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366217.5, ans=0.1 2024-06-21 10:02:57,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=366235.8333333333, ans=0.125 2024-06-21 10:03:03,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.02 vs. limit=15.0 2024-06-21 10:03:03,839 INFO [train.py:1028] (0/2) Epoch 20, batch 7550, loss[loss=0.2074, simple_loss=0.2568, pruned_loss=0.07894, over 12947.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.272, pruned_loss=0.08017, over 2577356.21 frames. ], batch size: 158, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:03:05,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=366254.1666666667, ans=0.125 2024-06-21 10:03:27,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.10 vs. 
limit=22.5 2024-06-21 10:03:32,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=366309.1666666667, ans=0.0 2024-06-21 10:03:32,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=366309.1666666667, ans=10.0 2024-06-21 10:03:38,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=366327.5, ans=0.05 2024-06-21 10:03:40,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366345.8333333333, ans=0.1 2024-06-21 10:03:40,619 INFO [train.py:1028] (0/2) Epoch 20, batch 7600, loss[loss=0.227, simple_loss=0.2758, pruned_loss=0.08917, over 13217.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2732, pruned_loss=0.08078, over 2577561.59 frames. ], batch size: 83, lr: 2.89e-03, grad_scale: 64.0 2024-06-21 10:03:47,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=366364.1666666667, ans=0.0 2024-06-21 10:03:52,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=366364.1666666667, ans=0.0 2024-06-21 10:03:52,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=366364.1666666667, ans=0.125 2024-06-21 10:03:58,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=366382.5, ans=15.0 2024-06-21 10:03:59,186 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.211e+02 2.418e+02 2.659e+02 3.492e+02, threshold=4.837e+02, percent-clipped=0.0 2024-06-21 10:04:04,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.42 vs. limit=15.0 2024-06-21 10:04:13,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.73 vs. limit=15.0 2024-06-21 10:04:13,908 INFO [train.py:1028] (0/2) Epoch 20, batch 7650, loss[loss=0.2147, simple_loss=0.2655, pruned_loss=0.08195, over 12976.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2732, pruned_loss=0.08083, over 2573467.59 frames. ], batch size: 33, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:04:27,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2024-06-21 10:04:34,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=366492.5, ans=0.125 2024-06-21 10:04:41,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.94 vs. 
limit=15.0 2024-06-21 10:04:43,500 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=366492.5, ans=0.125 2024-06-21 10:04:43,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=366492.5, ans=0.04949747468305833 2024-06-21 10:04:48,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=366510.8333333333, ans=0.125 2024-06-21 10:04:51,322 INFO [train.py:1028] (0/2) Epoch 20, batch 7700, loss[loss=0.2111, simple_loss=0.278, pruned_loss=0.07212, over 13250.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2738, pruned_loss=0.08102, over 2570079.23 frames. ], batch size: 63, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:04:52,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=366529.1666666667, ans=0.0 2024-06-21 10:05:00,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=366547.5, ans=0.2 2024-06-21 10:05:07,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=366565.8333333333, ans=0.125 2024-06-21 10:05:09,149 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.174e+02 2.359e+02 2.625e+02 3.754e+02, threshold=4.718e+02, percent-clipped=0.0 2024-06-21 10:05:18,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=366584.1666666667, ans=0.125 2024-06-21 10:05:26,866 INFO [train.py:1028] (0/2) Epoch 20, batch 7750, loss[loss=0.1992, simple_loss=0.2613, pruned_loss=0.0685, over 13255.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2742, pruned_loss=0.08135, over 2574214.96 frames. ], batch size: 72, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:05:35,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366639.1666666667, ans=0.1 2024-06-21 10:05:35,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=366639.1666666667, ans=0.125 2024-06-21 10:05:37,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=366639.1666666667, ans=0.125 2024-06-21 10:05:43,228 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-200000.pt 2024-06-21 10:05:52,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=366675.8333333333, ans=0.0 2024-06-21 10:05:58,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=366694.1666666667, ans=0.125 2024-06-21 10:06:05,669 INFO [train.py:1028] (0/2) Epoch 20, batch 7800, loss[loss=0.2122, simple_loss=0.2701, pruned_loss=0.07713, over 13183.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2747, pruned_loss=0.08131, over 2578873.61 frames. ], batch size: 95, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:06:07,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.20 vs. 
limit=15.0 2024-06-21 10:06:09,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=366712.5, ans=0.125 2024-06-21 10:06:09,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=366712.5, ans=0.0 2024-06-21 10:06:17,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=366730.8333333333, ans=0.025 2024-06-21 10:06:17,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=366730.8333333333, ans=0.125 2024-06-21 10:06:24,959 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.170e+02 2.348e+02 2.597e+02 3.485e+02, threshold=4.696e+02, percent-clipped=0.0 2024-06-21 10:06:36,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366785.8333333333, ans=0.1 2024-06-21 10:06:42,970 INFO [train.py:1028] (0/2) Epoch 20, batch 7850, loss[loss=0.2091, simple_loss=0.2772, pruned_loss=0.07049, over 11140.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2757, pruned_loss=0.08184, over 2571833.77 frames. ], batch size: 16, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:06:49,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=366822.5, ans=0.0 2024-06-21 10:06:53,449 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2024-06-21 10:07:09,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=366877.5, ans=0.0 2024-06-21 10:07:19,212 INFO [train.py:1028] (0/2) Epoch 20, batch 7900, loss[loss=0.2138, simple_loss=0.273, pruned_loss=0.07733, over 13172.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2762, pruned_loss=0.08238, over 2572207.51 frames. ], batch size: 77, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:07:19,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=366895.8333333333, ans=0.125 2024-06-21 10:07:23,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.72 vs. limit=15.0 2024-06-21 10:07:28,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366914.1666666667, ans=0.1 2024-06-21 10:07:38,046 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.166e+02 2.375e+02 2.582e+02 3.560e+02, threshold=4.751e+02, percent-clipped=0.0 2024-06-21 10:07:44,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.03 vs. limit=22.5 2024-06-21 10:07:47,226 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.14 vs. 
limit=6.0 2024-06-21 10:07:48,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=366969.1666666667, ans=0.125 2024-06-21 10:07:52,814 INFO [train.py:1028] (0/2) Epoch 20, batch 7950, loss[loss=0.221, simple_loss=0.274, pruned_loss=0.08402, over 10636.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2762, pruned_loss=0.08208, over 2575484.72 frames. ], batch size: 304, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:07:54,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=366987.5, ans=0.125 2024-06-21 10:07:57,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=366987.5, ans=0.025 2024-06-21 10:07:57,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=366987.5, ans=0.05 2024-06-21 10:08:02,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=367005.8333333333, ans=0.125 2024-06-21 10:08:07,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=367024.1666666667, ans=0.025 2024-06-21 10:08:26,369 INFO [train.py:1028] (0/2) Epoch 20, batch 8000, loss[loss=0.2059, simple_loss=0.2701, pruned_loss=0.0708, over 12536.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2771, pruned_loss=0.08247, over 2572666.33 frames. ], batch size: 29, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:08:27,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2024-06-21 10:08:39,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.18 vs. limit=15.0 2024-06-21 10:08:44,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=367115.8333333333, ans=0.0 2024-06-21 10:08:48,012 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.749e+02 2.192e+02 2.342e+02 2.556e+02 3.161e+02, threshold=4.685e+02, percent-clipped=0.0 2024-06-21 10:09:00,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=367152.5, ans=0.125 2024-06-21 10:09:02,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=367170.8333333333, ans=0.125 2024-06-21 10:09:02,877 INFO [train.py:1028] (0/2) Epoch 20, batch 8050, loss[loss=0.2274, simple_loss=0.2875, pruned_loss=0.08363, over 13225.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2771, pruned_loss=0.08231, over 2571870.20 frames. 
], batch size: 83, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:09:04,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=367170.8333333333, ans=0.0 2024-06-21 10:09:05,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=367170.8333333333, ans=0.125 2024-06-21 10:09:06,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=367170.8333333333, ans=0.125 2024-06-21 10:09:07,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=367170.8333333333, ans=0.0 2024-06-21 10:09:12,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=367189.1666666667, ans=0.125 2024-06-21 10:09:28,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=367225.8333333333, ans=0.0 2024-06-21 10:09:29,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=367225.8333333333, ans=0.125 2024-06-21 10:09:32,119 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:09:35,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2024-06-21 10:09:38,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=367244.1666666667, ans=0.0 2024-06-21 10:09:40,876 INFO [train.py:1028] (0/2) Epoch 20, batch 8100, loss[loss=0.2303, simple_loss=0.2866, pruned_loss=0.08699, over 13128.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2776, pruned_loss=0.08256, over 2575949.54 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:09:45,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=367262.5, ans=0.0 2024-06-21 10:09:49,174 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.62 vs. limit=12.0 2024-06-21 10:09:59,139 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.165e+02 2.309e+02 2.498e+02 3.259e+02, threshold=4.617e+02, percent-clipped=0.0 2024-06-21 10:10:03,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=367317.5, ans=0.0 2024-06-21 10:10:13,613 INFO [train.py:1028] (0/2) Epoch 20, batch 8150, loss[loss=0.225, simple_loss=0.2769, pruned_loss=0.0865, over 13083.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2778, pruned_loss=0.08226, over 2578559.46 frames. ], batch size: 121, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:10:14,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.88 vs. 
limit=10.0 2024-06-21 10:10:20,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=367372.5, ans=0.0 2024-06-21 10:10:21,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=367372.5, ans=0.0 2024-06-21 10:10:36,544 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.56 vs. limit=12.0 2024-06-21 10:10:42,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=367427.5, ans=0.0 2024-06-21 10:10:49,094 INFO [train.py:1028] (0/2) Epoch 20, batch 8200, loss[loss=0.2375, simple_loss=0.2931, pruned_loss=0.09093, over 13130.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.278, pruned_loss=0.08231, over 2582012.35 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:10:50,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2024-06-21 10:10:57,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=367464.1666666667, ans=0.0 2024-06-21 10:11:06,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=367482.5, ans=0.0 2024-06-21 10:11:07,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=367482.5, ans=0.0 2024-06-21 10:11:07,730 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.150e+02 2.286e+02 2.575e+02 3.362e+02, threshold=4.572e+02, percent-clipped=0.0 2024-06-21 10:11:09,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=367500.8333333333, ans=0.05 2024-06-21 10:11:13,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=367500.8333333333, ans=0.2 2024-06-21 10:11:25,415 INFO [train.py:1028] (0/2) Epoch 20, batch 8250, loss[loss=0.2146, simple_loss=0.2777, pruned_loss=0.07579, over 13224.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2779, pruned_loss=0.08181, over 2582763.14 frames. ], batch size: 52, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:11:25,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=367537.5, ans=0.125 2024-06-21 10:11:29,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=367537.5, ans=0.2 2024-06-21 10:11:53,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=367610.8333333333, ans=0.025 2024-06-21 10:12:01,610 INFO [train.py:1028] (0/2) Epoch 20, batch 8300, loss[loss=0.2331, simple_loss=0.2888, pruned_loss=0.0887, over 12960.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2773, pruned_loss=0.08146, over 2580077.86 frames. 
], batch size: 102, lr: 2.88e-03, grad_scale: 64.0 2024-06-21 10:12:01,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=367629.1666666667, ans=0.125 2024-06-21 10:12:23,883 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.208e+02 2.306e+02 2.518e+02 3.260e+02, threshold=4.612e+02, percent-clipped=0.0 2024-06-21 10:12:27,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=367684.1666666667, ans=0.0 2024-06-21 10:12:33,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=367702.5, ans=0.025 2024-06-21 10:12:33,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=367702.5, ans=15.0 2024-06-21 10:12:36,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2024-06-21 10:12:37,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=367702.5, ans=0.1 2024-06-21 10:12:40,478 INFO [train.py:1028] (0/2) Epoch 20, batch 8350, loss[loss=0.224, simple_loss=0.2763, pruned_loss=0.08587, over 13196.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2764, pruned_loss=0.08078, over 2580011.35 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:13:15,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=367794.1666666667, ans=0.0 2024-06-21 10:13:18,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=367794.1666666667, ans=0.125 2024-06-21 10:13:18,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.45 vs. limit=15.0 2024-06-21 10:13:21,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=367794.1666666667, ans=0.025 2024-06-21 10:13:23,269 INFO [train.py:1028] (0/2) Epoch 20, batch 8400, loss[loss=0.1877, simple_loss=0.248, pruned_loss=0.0637, over 12853.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2762, pruned_loss=0.08074, over 2577030.49 frames. 
], batch size: 39, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:13:28,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=367812.5, ans=0.2 2024-06-21 10:13:38,147 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.878e-02 2024-06-21 10:13:48,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=367849.1666666667, ans=0.0 2024-06-21 10:13:50,106 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.216e+02 2.349e+02 2.557e+02 3.174e+02, threshold=4.697e+02, percent-clipped=0.0 2024-06-21 10:13:52,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=367867.5, ans=0.125 2024-06-21 10:13:55,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=367867.5, ans=0.0 2024-06-21 10:14:00,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=15.0 2024-06-21 10:14:05,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=367904.1666666667, ans=0.2 2024-06-21 10:14:05,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=367904.1666666667, ans=0.0 2024-06-21 10:14:06,549 INFO [train.py:1028] (0/2) Epoch 20, batch 8450, loss[loss=0.2297, simple_loss=0.2879, pruned_loss=0.08577, over 13170.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2774, pruned_loss=0.08128, over 2578708.14 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:14:20,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=367922.5, ans=0.0 2024-06-21 10:14:21,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=367922.5, ans=0.125 2024-06-21 10:14:45,911 INFO [train.py:1028] (0/2) Epoch 20, batch 8500, loss[loss=0.2021, simple_loss=0.2582, pruned_loss=0.07296, over 12541.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2784, pruned_loss=0.08164, over 2576119.72 frames. ], batch size: 29, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:14:55,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.91 vs. limit=15.0 2024-06-21 10:15:08,561 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.229e+02 2.358e+02 2.557e+02 3.464e+02, threshold=4.717e+02, percent-clipped=0.0 2024-06-21 10:15:15,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.71 vs. limit=15.0 2024-06-21 10:15:18,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.47 vs. limit=22.5 2024-06-21 10:15:28,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=368087.5, ans=0.125 2024-06-21 10:15:29,376 INFO [train.py:1028] (0/2) Epoch 20, batch 8550, loss[loss=0.2116, simple_loss=0.273, pruned_loss=0.07505, over 12569.00 frames. 
], tot_loss[loss=0.2206, simple_loss=0.2782, pruned_loss=0.0815, over 2574972.52 frames. ], batch size: 22, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:15:30,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=368087.5, ans=0.025 2024-06-21 10:15:36,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=368105.8333333333, ans=0.0 2024-06-21 10:15:38,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2024-06-21 10:15:48,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=368124.1666666667, ans=0.125 2024-06-21 10:16:05,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2024-06-21 10:16:09,543 INFO [train.py:1028] (0/2) Epoch 20, batch 8600, loss[loss=0.2056, simple_loss=0.2647, pruned_loss=0.07321, over 13125.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2789, pruned_loss=0.08168, over 2573723.71 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:16:21,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=368197.5, ans=0.125 2024-06-21 10:16:21,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=368197.5, ans=0.0 2024-06-21 10:16:37,753 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.206e+02 2.450e+02 2.702e+02 3.750e+02, threshold=4.899e+02, percent-clipped=0.0 2024-06-21 10:16:39,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=368234.1666666667, ans=0.125 2024-06-21 10:16:41,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=368234.1666666667, ans=0.125 2024-06-21 10:16:43,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.70 vs. limit=10.0 2024-06-21 10:16:45,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=368234.1666666667, ans=0.5 2024-06-21 10:16:47,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368252.5, ans=0.1 2024-06-21 10:16:55,028 INFO [train.py:1028] (0/2) Epoch 20, batch 8650, loss[loss=0.2271, simple_loss=0.2803, pruned_loss=0.08695, over 13047.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2792, pruned_loss=0.08182, over 2576582.66 frames. ], batch size: 102, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:17:05,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368289.1666666667, ans=0.1 2024-06-21 10:17:18,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368325.8333333333, ans=0.1 2024-06-21 10:17:26,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. 
limit=22.5 2024-06-21 10:17:35,224 INFO [train.py:1028] (0/2) Epoch 20, batch 8700, loss[loss=0.203, simple_loss=0.2712, pruned_loss=0.06733, over 13206.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2789, pruned_loss=0.08184, over 2572836.47 frames. ], batch size: 59, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:17:52,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=368380.8333333333, ans=0.125 2024-06-21 10:17:52,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=368380.8333333333, ans=0.125 2024-06-21 10:18:02,652 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.123e+02 2.265e+02 2.534e+02 3.402e+02, threshold=4.530e+02, percent-clipped=0.0 2024-06-21 10:18:03,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368417.5, ans=0.1 2024-06-21 10:18:08,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=368417.5, ans=0.125 2024-06-21 10:18:19,910 INFO [train.py:1028] (0/2) Epoch 20, batch 8750, loss[loss=0.2396, simple_loss=0.2862, pruned_loss=0.09647, over 13075.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2785, pruned_loss=0.08182, over 2569447.79 frames. ], batch size: 121, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:18:24,034 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:18:40,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368490.8333333333, ans=0.125 2024-06-21 10:18:54,230 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:18:55,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=368527.5, ans=0.125 2024-06-21 10:18:58,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.35 vs. limit=15.0 2024-06-21 10:19:03,792 INFO [train.py:1028] (0/2) Epoch 20, batch 8800, loss[loss=0.2153, simple_loss=0.2897, pruned_loss=0.07044, over 13246.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2785, pruned_loss=0.08202, over 2574331.26 frames. ], batch size: 72, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:19:04,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2024-06-21 10:19:06,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=368545.8333333333, ans=0.0 2024-06-21 10:19:07,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.71 vs. 
limit=22.5 2024-06-21 10:19:27,039 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.176e+02 2.338e+02 2.533e+02 3.840e+02, threshold=4.677e+02, percent-clipped=0.0 2024-06-21 10:19:30,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=368600.8333333333, ans=0.025 2024-06-21 10:19:37,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=368619.1666666667, ans=0.0 2024-06-21 10:19:44,421 INFO [train.py:1028] (0/2) Epoch 20, batch 8850, loss[loss=0.2544, simple_loss=0.3075, pruned_loss=0.1006, over 12471.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2794, pruned_loss=0.08266, over 2563809.32 frames. ], batch size: 202, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:19:44,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=368637.5, ans=0.0 2024-06-21 10:19:48,404 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.51 vs. limit=6.0 2024-06-21 10:20:05,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=368674.1666666667, ans=0.0 2024-06-21 10:20:06,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=368674.1666666667, ans=0.125 2024-06-21 10:20:15,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=368692.5, ans=0.0 2024-06-21 10:20:16,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=368692.5, ans=0.125 2024-06-21 10:20:19,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=15.0 2024-06-21 10:20:21,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=368710.8333333333, ans=0.125 2024-06-21 10:20:23,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=12.0 2024-06-21 10:20:27,118 INFO [train.py:1028] (0/2) Epoch 20, batch 8900, loss[loss=0.2203, simple_loss=0.277, pruned_loss=0.08175, over 12971.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2804, pruned_loss=0.08314, over 2562002.59 frames. ], batch size: 33, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:20:45,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=368765.8333333333, ans=0.125 2024-06-21 10:20:49,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=368765.8333333333, ans=0.125 2024-06-21 10:20:49,709 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.197e+02 2.369e+02 2.565e+02 3.635e+02, threshold=4.737e+02, percent-clipped=0.0 2024-06-21 10:21:08,749 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:21:10,912 INFO [train.py:1028] (0/2) Epoch 20, batch 8950, loss[loss=0.2396, simple_loss=0.2908, pruned_loss=0.0942, over 12521.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.2805, pruned_loss=0.08301, over 2562825.31 frames. ], batch size: 202, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:21:41,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2024-06-21 10:21:51,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=368894.1666666667, ans=0.07 2024-06-21 10:21:52,471 INFO [train.py:1028] (0/2) Epoch 20, batch 9000, loss[loss=0.2102, simple_loss=0.2699, pruned_loss=0.07525, over 13310.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2805, pruned_loss=0.08259, over 2569570.64 frames. ], batch size: 46, lr: 2.88e-03, grad_scale: 32.0 2024-06-21 10:21:52,472 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 10:22:01,281 INFO [train.py:1060] (0/2) Epoch 20, validation: loss=0.1881, simple_loss=0.2521, pruned_loss=0.06207, over 351949.00 frames. 2024-06-21 10:22:01,282 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 10:22:02,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368912.5, ans=0.125 2024-06-21 10:22:02,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=368912.5, ans=0.125 2024-06-21 10:22:12,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368930.8333333333, ans=0.1 2024-06-21 10:22:23,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=368949.1666666667, ans=0.5 2024-06-21 10:22:24,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=368949.1666666667, ans=0.07 2024-06-21 10:22:26,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=368949.1666666667, ans=0.5 2024-06-21 10:22:28,268 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.175e+02 2.301e+02 2.539e+02 3.214e+02, threshold=4.602e+02, percent-clipped=0.0 2024-06-21 10:22:28,459 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:22:31,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368967.5, ans=0.125 2024-06-21 10:22:44,387 INFO [train.py:1028] (0/2) Epoch 20, batch 9050, loss[loss=0.216, simple_loss=0.2762, pruned_loss=0.07794, over 10715.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2817, pruned_loss=0.08326, over 2568334.11 frames. ], batch size: 16, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:22:47,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=369004.1666666667, ans=0.0 2024-06-21 10:23:00,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2024-06-21 10:23:09,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.27 vs. 
limit=22.5 2024-06-21 10:23:22,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2024-06-21 10:23:23,780 INFO [train.py:1028] (0/2) Epoch 20, batch 9100, loss[loss=0.2171, simple_loss=0.2841, pruned_loss=0.0751, over 13253.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2814, pruned_loss=0.0832, over 2568541.83 frames. ], batch size: 72, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:23:46,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.173e+02 2.328e+02 2.501e+02 3.981e+02, threshold=4.655e+02, percent-clipped=0.0 2024-06-21 10:23:52,078 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:24:02,421 INFO [train.py:1028] (0/2) Epoch 20, batch 9150, loss[loss=0.1983, simple_loss=0.2667, pruned_loss=0.06494, over 13182.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2811, pruned_loss=0.08311, over 2569565.30 frames. ], batch size: 77, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:24:04,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=369187.5, ans=0.125 2024-06-21 10:24:08,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=369187.5, ans=0.125 2024-06-21 10:24:08,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=369187.5, ans=0.0 2024-06-21 10:24:39,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369260.8333333333, ans=0.125 2024-06-21 10:24:43,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=369260.8333333333, ans=0.125 2024-06-21 10:24:45,067 INFO [train.py:1028] (0/2) Epoch 20, batch 9200, loss[loss=0.2121, simple_loss=0.2758, pruned_loss=0.07424, over 12918.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2811, pruned_loss=0.08263, over 2572454.97 frames. ], batch size: 36, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:24:48,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=369279.1666666667, ans=0.125 2024-06-21 10:24:51,281 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:24:54,010 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=12.0 2024-06-21 10:24:54,761 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.11 vs. limit=12.0 2024-06-21 10:24:55,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.19 vs. limit=15.0 2024-06-21 10:24:55,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.81 vs. limit=22.5 2024-06-21 10:24:59,028 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.94 vs. 
limit=12.0 2024-06-21 10:25:06,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=369315.8333333333, ans=0.125 2024-06-21 10:25:07,136 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.169e+02 2.306e+02 2.454e+02 3.336e+02, threshold=4.613e+02, percent-clipped=0.0 2024-06-21 10:25:20,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.35 vs. limit=6.0 2024-06-21 10:25:22,809 INFO [train.py:1028] (0/2) Epoch 20, batch 9250, loss[loss=0.219, simple_loss=0.2752, pruned_loss=0.08144, over 13255.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2815, pruned_loss=0.08259, over 2573480.59 frames. ], batch size: 67, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:25:27,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=369370.8333333333, ans=0.2 2024-06-21 10:25:33,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=369389.1666666667, ans=0.0 2024-06-21 10:25:38,965 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:25:39,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=369407.5, ans=0.0 2024-06-21 10:25:39,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.23 vs. limit=15.0 2024-06-21 10:25:48,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=369425.8333333333, ans=0.125 2024-06-21 10:25:49,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=369425.8333333333, ans=0.0 2024-06-21 10:25:55,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=369444.1666666667, ans=0.0 2024-06-21 10:26:00,960 INFO [train.py:1028] (0/2) Epoch 20, batch 9300, loss[loss=0.1798, simple_loss=0.2376, pruned_loss=0.061, over 12959.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2809, pruned_loss=0.08215, over 2569451.53 frames. ], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:26:01,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.75 vs. limit=15.0 2024-06-21 10:26:02,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.11 vs. limit=22.5 2024-06-21 10:26:09,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=369480.8333333333, ans=0.125 2024-06-21 10:26:11,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=369480.8333333333, ans=0.125 2024-06-21 10:26:12,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.46 vs. 
limit=15.0 2024-06-21 10:26:22,978 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.157e+02 2.326e+02 2.504e+02 4.032e+02, threshold=4.651e+02, percent-clipped=0.0 2024-06-21 10:26:26,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=22.5 2024-06-21 10:26:26,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2024-06-21 10:26:32,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=369535.8333333333, ans=0.07 2024-06-21 10:26:34,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=369535.8333333333, ans=0.125 2024-06-21 10:26:35,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.94 vs. limit=10.0 2024-06-21 10:26:38,418 INFO [train.py:1028] (0/2) Epoch 20, batch 9350, loss[loss=0.2518, simple_loss=0.3051, pruned_loss=0.09924, over 12607.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.281, pruned_loss=0.08227, over 2566830.99 frames. ], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:26:45,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2024-06-21 10:26:55,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2024-06-21 10:27:12,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=369627.5, ans=0.125 2024-06-21 10:27:16,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369627.5, ans=0.1 2024-06-21 10:27:18,739 INFO [train.py:1028] (0/2) Epoch 20, batch 9400, loss[loss=0.2248, simple_loss=0.2809, pruned_loss=0.08431, over 13278.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2818, pruned_loss=0.0829, over 2567254.40 frames. ], batch size: 52, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:27:23,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369645.8333333333, ans=0.125 2024-06-21 10:27:23,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=369645.8333333333, ans=0.025 2024-06-21 10:27:27,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=369664.1666666667, ans=0.125 2024-06-21 10:27:35,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.94 vs. 
limit=22.5 2024-06-21 10:27:35,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=369682.5, ans=0.125 2024-06-21 10:27:40,364 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.216e+02 2.364e+02 2.531e+02 3.150e+02, threshold=4.728e+02, percent-clipped=0.0 2024-06-21 10:27:54,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=369719.1666666667, ans=0.0 2024-06-21 10:27:55,936 INFO [train.py:1028] (0/2) Epoch 20, batch 9450, loss[loss=0.2084, simple_loss=0.271, pruned_loss=0.07292, over 12642.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2821, pruned_loss=0.08301, over 2568030.23 frames. ], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:27:56,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=369737.5, ans=0.2 2024-06-21 10:28:00,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.48 vs. limit=12.0 2024-06-21 10:28:13,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369774.1666666667, ans=0.1 2024-06-21 10:28:13,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=369774.1666666667, ans=0.04949747468305833 2024-06-21 10:28:25,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=369810.8333333333, ans=0.125 2024-06-21 10:28:29,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=15.0 2024-06-21 10:28:32,187 INFO [train.py:1028] (0/2) Epoch 20, batch 9500, loss[loss=0.2229, simple_loss=0.2837, pruned_loss=0.0811, over 13287.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.282, pruned_loss=0.0829, over 2576849.48 frames. ], batch size: 43, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:28:41,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=369847.5, ans=0.0 2024-06-21 10:28:53,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=369865.8333333333, ans=0.0 2024-06-21 10:28:55,588 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.209e+02 2.389e+02 2.587e+02 3.204e+02, threshold=4.778e+02, percent-clipped=0.0 2024-06-21 10:28:57,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=369884.1666666667, ans=0.125 2024-06-21 10:28:58,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=369884.1666666667, ans=0.04949747468305833 2024-06-21 10:29:01,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=369884.1666666667, ans=6.0 2024-06-21 10:29:04,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=369902.5, ans=0.0 2024-06-21 10:29:04,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.28 vs. 
limit=15.0 2024-06-21 10:29:10,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=369920.8333333333, ans=0.04949747468305833 2024-06-21 10:29:10,813 INFO [train.py:1028] (0/2) Epoch 20, batch 9550, loss[loss=0.1798, simple_loss=0.2474, pruned_loss=0.05611, over 12932.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2815, pruned_loss=0.08273, over 2572457.44 frames. ], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:29:11,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=369920.8333333333, ans=0.07 2024-06-21 10:29:15,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=369920.8333333333, ans=0.2 2024-06-21 10:29:21,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.48 vs. limit=22.5 2024-06-21 10:29:43,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.06 vs. limit=15.0 2024-06-21 10:29:47,832 INFO [train.py:1028] (0/2) Epoch 20, batch 9600, loss[loss=0.2332, simple_loss=0.2811, pruned_loss=0.09267, over 10514.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2814, pruned_loss=0.08239, over 2570786.42 frames. ], batch size: 303, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:29:48,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=370012.5, ans=0.1 2024-06-21 10:29:48,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=370012.5, ans=0.125 2024-06-21 10:29:58,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=370030.8333333333, ans=0.125 2024-06-21 10:30:08,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=370049.1666666667, ans=0.125 2024-06-21 10:30:11,730 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.269e+02 2.410e+02 2.615e+02 3.633e+02, threshold=4.820e+02, percent-clipped=0.0 2024-06-21 10:30:13,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=370067.5, ans=0.0 2024-06-21 10:30:18,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=370067.5, ans=0.2 2024-06-21 10:30:22,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.78 vs. limit=15.0 2024-06-21 10:30:25,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.52 vs. limit=15.0 2024-06-21 10:30:27,364 INFO [train.py:1028] (0/2) Epoch 20, batch 9650, loss[loss=0.2051, simple_loss=0.2604, pruned_loss=0.07495, over 13114.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2813, pruned_loss=0.08299, over 2562017.10 frames. 
], batch size: 132, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:30:28,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=370104.1666666667, ans=0.1 2024-06-21 10:30:47,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=370140.8333333333, ans=0.0 2024-06-21 10:30:52,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=370159.1666666667, ans=0.0 2024-06-21 10:30:55,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=370159.1666666667, ans=0.2 2024-06-21 10:30:59,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=370177.5, ans=0.125 2024-06-21 10:31:04,564 INFO [train.py:1028] (0/2) Epoch 20, batch 9700, loss[loss=0.2193, simple_loss=0.2671, pruned_loss=0.08581, over 13052.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2804, pruned_loss=0.08269, over 2558064.51 frames. ], batch size: 144, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:31:06,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370195.8333333333, ans=0.1 2024-06-21 10:31:11,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=370214.1666666667, ans=0.2 2024-06-21 10:31:14,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=370214.1666666667, ans=0.0 2024-06-21 10:31:23,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=370232.5, ans=0.1 2024-06-21 10:31:24,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=370232.5, ans=0.0 2024-06-21 10:31:27,540 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.204e+02 2.336e+02 2.569e+02 4.894e+02, threshold=4.672e+02, percent-clipped=1.0 2024-06-21 10:31:27,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=370250.8333333333, ans=0.0 2024-06-21 10:31:31,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.46 vs. limit=15.0 2024-06-21 10:31:42,622 INFO [train.py:1028] (0/2) Epoch 20, batch 9750, loss[loss=0.2094, simple_loss=0.2642, pruned_loss=0.07731, over 13072.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.279, pruned_loss=0.08199, over 2554085.16 frames. ], batch size: 132, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:31:50,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=370305.8333333333, ans=0.2 2024-06-21 10:31:55,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=370305.8333333333, ans=0.125 2024-06-21 10:31:58,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.84 vs. 
limit=15.0 2024-06-21 10:31:59,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=370324.1666666667, ans=0.125 2024-06-21 10:32:03,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=370324.1666666667, ans=0.125 2024-06-21 10:32:04,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=370342.5, ans=0.2 2024-06-21 10:32:19,892 INFO [train.py:1028] (0/2) Epoch 20, batch 9800, loss[loss=0.2239, simple_loss=0.2845, pruned_loss=0.08165, over 12968.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2781, pruned_loss=0.08129, over 2546834.64 frames. ], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:32:24,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=370379.1666666667, ans=0.125 2024-06-21 10:32:33,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=370397.5, ans=0.2 2024-06-21 10:32:42,458 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.137e+02 2.320e+02 2.495e+02 3.768e+02, threshold=4.640e+02, percent-clipped=0.0 2024-06-21 10:32:53,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=370452.5, ans=0.0 2024-06-21 10:32:54,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=370452.5, ans=0.125 2024-06-21 10:32:58,011 INFO [train.py:1028] (0/2) Epoch 20, batch 9850, loss[loss=0.2266, simple_loss=0.2764, pruned_loss=0.08843, over 12949.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2779, pruned_loss=0.08082, over 2539110.24 frames. ], batch size: 102, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:33:07,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0 2024-06-21 10:33:09,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=370489.1666666667, ans=0.125 2024-06-21 10:33:19,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2024-06-21 10:33:36,064 INFO [train.py:1028] (0/2) Epoch 20, batch 9900, loss[loss=0.2153, simple_loss=0.2787, pruned_loss=0.07596, over 12945.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2775, pruned_loss=0.08121, over 2532486.02 frames. 
], batch size: 39, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:33:48,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=370580.8333333333, ans=0.2 2024-06-21 10:33:49,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=370580.8333333333, ans=0.125 2024-06-21 10:33:52,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=370599.1666666667, ans=0.125 2024-06-21 10:33:57,085 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.195e+02 2.313e+02 2.531e+02 3.404e+02, threshold=4.625e+02, percent-clipped=0.0 2024-06-21 10:33:57,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=370617.5, ans=0.125 2024-06-21 10:34:03,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=370617.5, ans=0.0 2024-06-21 10:34:03,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2024-06-21 10:34:14,583 INFO [train.py:1028] (0/2) Epoch 20, batch 9950, loss[loss=0.2634, simple_loss=0.3296, pruned_loss=0.09856, over 12762.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2772, pruned_loss=0.08147, over 2528066.05 frames. ], batch size: 29, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:34:16,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=370654.1666666667, ans=0.0 2024-06-21 10:34:17,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=370654.1666666667, ans=0.1 2024-06-21 10:34:21,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=370672.5, ans=0.05 2024-06-21 10:34:21,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=370672.5, ans=0.125 2024-06-21 10:34:34,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=370690.8333333333, ans=0.025 2024-06-21 10:34:34,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=370690.8333333333, ans=0.125 2024-06-21 10:34:51,741 INFO [train.py:1028] (0/2) Epoch 20, batch 10000, loss[loss=0.2445, simple_loss=0.3049, pruned_loss=0.09204, over 12829.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2778, pruned_loss=0.08206, over 2490805.75 frames. 
], batch size: 23, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:34:53,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=370745.8333333333, ans=0.125 2024-06-21 10:34:55,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=370745.8333333333, ans=0.2 2024-06-21 10:34:55,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=370745.8333333333, ans=0.125 2024-06-21 10:35:01,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=370764.1666666667, ans=0.125 2024-06-21 10:35:12,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.74 vs. limit=15.0 2024-06-21 10:35:14,334 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.167e+02 2.318e+02 2.481e+02 3.372e+02, threshold=4.637e+02, percent-clipped=0.0 2024-06-21 10:35:29,750 INFO [train.py:1028] (0/2) Epoch 20, batch 10050, loss[loss=0.2348, simple_loss=0.2903, pruned_loss=0.08962, over 12662.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2779, pruned_loss=0.08297, over 2447937.85 frames. ], batch size: 22, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:35:43,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=370874.1666666667, ans=0.025 2024-06-21 10:35:44,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2024-06-21 10:35:56,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=370892.5, ans=0.125 2024-06-21 10:36:05,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2024-06-21 10:36:06,634 INFO [train.py:1028] (0/2) Epoch 20, batch 10100, loss[loss=0.2298, simple_loss=0.2834, pruned_loss=0.08813, over 12013.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2775, pruned_loss=0.08239, over 2428124.87 frames. ], batch size: 18, lr: 2.87e-03, grad_scale: 32.0 2024-06-21 10:36:22,509 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-20.pt 2024-06-21 10:38:39,302 INFO [train.py:1028] (0/2) Epoch 21, batch 0, loss[loss=0.1952, simple_loss=0.2516, pruned_loss=0.0694, over 12938.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2516, pruned_loss=0.0694, over 12938.00 frames. ], batch size: 36, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:38:39,303 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 10:38:47,540 INFO [train.py:1060] (0/2) Epoch 21, validation: loss=0.1888, simple_loss=0.2532, pruned_loss=0.06218, over 351949.00 frames. 
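The loss fields in these train.py records decompose as a weighted sum: in every record, loss = 0.5 * simple_loss + pruned_loss (for the epoch-21 validation record just above, 0.5 * 0.2532 + 0.06218 = 0.1888). A minimal sketch of that combination, with the 0.5 weight inferred from the logged numbers rather than read from the training code:

```python
# Pruned-transducer style loss combination matching the logged fields.
# The 0.5 simple-loss weight is inferred from records such as
# "loss=0.1888, simple_loss=0.2532, pruned_loss=0.06218".
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    return simple_loss_scale * simple_loss + pruned_loss

# Epoch 21 validation record above:
assert abs(combined_loss(0.2532, 0.06218) - 0.1888) < 5e-4
```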
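The tot_loss[... over N frames] figures behave like an exponentially decayed, frame-weighted average rather than a whole-epoch mean: at Epoch 21, batch 0 the total equals the single batch (12938 frames), it then grows through 575024.48 frames at batch 50 and 1017601.30 at batch 100, and late in epoch 20 it levels off near 2.57e6 frames, i.e. roughly 200 batches' worth at ~12.8k frames per batch. A sketch of such an accumulator, with the ~200-batch window inferred from those totals rather than taken from the code:

```python
# Exponentially decayed, frame-weighted running loss that reproduces the
# "tot_loss[... over N frames]" behaviour.  The 200-batch window is
# inferred from the logged frame totals, not read from the code.
class DecayedLoss:
    def __init__(self, window: int = 200) -> None:
        self.decay = 1.0 - 1.0 / window
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```

At a steady ~12.8k frames per batch this accumulator saturates at about 200 * 12.8k = 2.56e6 frames, matching the late-epoch-20 records, and reaches roughly 5.8e5 frames 50 batches into a fresh epoch, matching the epoch-21 ones.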
2024-06-21 10:38:47,541 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 10:39:00,494 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.102e+02 2.260e+02 2.518e+02 3.443e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 10:39:00,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=370978.6666666667, ans=0.2 2024-06-21 10:39:14,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=371015.3333333333, ans=0.1 2024-06-21 10:39:20,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.65 vs. limit=15.0 2024-06-21 10:39:23,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2024-06-21 10:39:30,786 INFO [train.py:1028] (0/2) Epoch 21, batch 50, loss[loss=0.1905, simple_loss=0.2473, pruned_loss=0.06683, over 12601.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2609, pruned_loss=0.07688, over 575024.48 frames. ], batch size: 29, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:39:32,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=371052.0, ans=0.0 2024-06-21 10:39:35,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=371052.0, ans=0.0 2024-06-21 10:39:51,973 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.23 vs. limit=15.0 2024-06-21 10:39:52,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=371088.6666666667, ans=0.025 2024-06-21 10:40:05,031 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.46 vs. limit=15.0 2024-06-21 10:40:08,395 INFO [train.py:1028] (0/2) Epoch 21, batch 100, loss[loss=0.2087, simple_loss=0.2698, pruned_loss=0.07374, over 13237.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2582, pruned_loss=0.075, over 1017601.30 frames. ], batch size: 46, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:40:17,098 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.045e+02 2.155e+02 2.378e+02 3.028e+02, threshold=4.310e+02, percent-clipped=0.0 2024-06-21 10:40:33,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=371180.3333333333, ans=0.2 2024-06-21 10:40:34,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=371198.6666666667, ans=0.0 2024-06-21 10:40:36,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=371198.6666666667, ans=0.125 2024-06-21 10:40:45,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371217.0, ans=0.125 2024-06-21 10:40:46,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.49 vs. 
limit=22.5 2024-06-21 10:40:47,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.76 vs. limit=12.0 2024-06-21 10:40:49,876 INFO [train.py:1028] (0/2) Epoch 21, batch 150, loss[loss=0.2134, simple_loss=0.2801, pruned_loss=0.07331, over 12669.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2569, pruned_loss=0.07306, over 1365454.92 frames. ], batch size: 29, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:40:59,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=371253.6666666667, ans=0.125 2024-06-21 10:41:09,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=371272.0, ans=0.025 2024-06-21 10:41:14,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=371272.0, ans=0.0 2024-06-21 10:41:15,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=371290.3333333333, ans=0.125 2024-06-21 10:41:16,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=371290.3333333333, ans=0.125 2024-06-21 10:41:18,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=371290.3333333333, ans=0.0 2024-06-21 10:41:20,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.76 vs. limit=15.0 2024-06-21 10:41:25,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=371308.6666666667, ans=0.125 2024-06-21 10:41:27,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=371308.6666666667, ans=0.0 2024-06-21 10:41:31,085 INFO [train.py:1028] (0/2) Epoch 21, batch 200, loss[loss=0.2232, simple_loss=0.2736, pruned_loss=0.08642, over 12525.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.258, pruned_loss=0.07359, over 1635141.49 frames. ], batch size: 202, lr: 2.80e-03, grad_scale: 32.0 2024-06-21 10:41:40,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.036e+02 2.192e+02 2.419e+02 3.710e+02, threshold=4.385e+02, percent-clipped=0.0 2024-06-21 10:41:40,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371345.3333333333, ans=0.1 2024-06-21 10:41:58,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=371382.0, ans=0.2 2024-06-21 10:41:58,868 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:42:05,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=371400.3333333333, ans=0.0 2024-06-21 10:42:08,759 INFO [train.py:1028] (0/2) Epoch 21, batch 250, loss[loss=0.2082, simple_loss=0.2543, pruned_loss=0.08105, over 13049.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2575, pruned_loss=0.0735, over 1846279.60 frames. 
], batch size: 144, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:42:10,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=371418.6666666667, ans=0.025 2024-06-21 10:42:12,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.17 vs. limit=22.5 2024-06-21 10:42:18,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=371437.0, ans=0.0 2024-06-21 10:42:29,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.92 vs. limit=12.0 2024-06-21 10:42:32,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2024-06-21 10:42:34,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.10 vs. limit=15.0 2024-06-21 10:42:47,574 INFO [train.py:1028] (0/2) Epoch 21, batch 300, loss[loss=0.1973, simple_loss=0.2474, pruned_loss=0.07363, over 13169.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2575, pruned_loss=0.07356, over 2009722.51 frames. ], batch size: 112, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:42:58,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=371528.6666666667, ans=0.0 2024-06-21 10:43:00,406 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.089e+02 2.218e+02 2.407e+02 3.250e+02, threshold=4.437e+02, percent-clipped=0.0 2024-06-21 10:43:02,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=371528.6666666667, ans=0.2 2024-06-21 10:43:02,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2024-06-21 10:43:08,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371547.0, ans=0.1 2024-06-21 10:43:16,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371565.3333333333, ans=0.125 2024-06-21 10:43:17,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.00 vs. limit=15.0 2024-06-21 10:43:17,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=371565.3333333333, ans=0.0 2024-06-21 10:43:21,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=371583.6666666667, ans=0.125 2024-06-21 10:43:23,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=371583.6666666667, ans=0.125 2024-06-21 10:43:32,240 INFO [train.py:1028] (0/2) Epoch 21, batch 350, loss[loss=0.2042, simple_loss=0.2588, pruned_loss=0.07478, over 12970.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2572, pruned_loss=0.07344, over 2139082.64 frames. 
], batch size: 33, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:43:50,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=371638.6666666667, ans=0.125 2024-06-21 10:43:51,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=371638.6666666667, ans=0.2 2024-06-21 10:44:01,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371657.0, ans=0.1 2024-06-21 10:44:02,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=371657.0, ans=0.0 2024-06-21 10:44:04,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=371675.3333333333, ans=0.125 2024-06-21 10:44:10,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2024-06-21 10:44:11,461 INFO [train.py:1028] (0/2) Epoch 21, batch 400, loss[loss=0.1723, simple_loss=0.2349, pruned_loss=0.0549, over 13245.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2572, pruned_loss=0.07319, over 2239470.63 frames. ], batch size: 63, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:44:11,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.63 vs. limit=10.0 2024-06-21 10:44:16,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371693.6666666667, ans=0.125 2024-06-21 10:44:20,927 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.074e+02 2.205e+02 2.382e+02 3.244e+02, threshold=4.410e+02, percent-clipped=0.0 2024-06-21 10:44:36,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371748.6666666667, ans=0.1 2024-06-21 10:44:49,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=371785.3333333333, ans=0.0 2024-06-21 10:44:49,990 INFO [train.py:1028] (0/2) Epoch 21, batch 450, loss[loss=0.1903, simple_loss=0.2509, pruned_loss=0.06486, over 13229.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2569, pruned_loss=0.07284, over 2314193.56 frames. ], batch size: 67, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:45:04,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=371822.0, ans=0.0 2024-06-21 10:45:08,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=371822.0, ans=0.0 2024-06-21 10:45:29,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2024-06-21 10:45:31,753 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:45:32,290 INFO [train.py:1028] (0/2) Epoch 21, batch 500, loss[loss=0.2003, simple_loss=0.251, pruned_loss=0.07481, over 13155.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2574, pruned_loss=0.07296, over 2375684.74 frames. 
], batch size: 121, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:45:41,057 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.191e+02 2.322e+02 2.585e+02 3.301e+02, threshold=4.645e+02, percent-clipped=0.0 2024-06-21 10:45:43,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.64 vs. limit=22.5 2024-06-21 10:46:03,291 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.48 vs. limit=22.5 2024-06-21 10:46:07,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=371950.3333333333, ans=0.125 2024-06-21 10:46:13,484 INFO [train.py:1028] (0/2) Epoch 21, batch 550, loss[loss=0.2005, simple_loss=0.2466, pruned_loss=0.07721, over 12932.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2569, pruned_loss=0.07284, over 2420382.12 frames. ], batch size: 158, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:46:36,577 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=1.297e-01 2024-06-21 10:46:48,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=372042.0, ans=0.125 2024-06-21 10:46:48,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=372042.0, ans=0.1 2024-06-21 10:46:51,869 INFO [train.py:1028] (0/2) Epoch 21, batch 600, loss[loss=0.1906, simple_loss=0.2457, pruned_loss=0.06771, over 13032.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2577, pruned_loss=0.07319, over 2458717.35 frames. ], batch size: 144, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:46:56,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=372060.3333333333, ans=0.1 2024-06-21 10:47:01,078 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.042e+02 2.175e+02 2.322e+02 3.019e+02, threshold=4.350e+02, percent-clipped=0.0 2024-06-21 10:47:18,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=372115.3333333333, ans=0.0 2024-06-21 10:47:18,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=372115.3333333333, ans=0.125 2024-06-21 10:47:25,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=372133.6666666667, ans=0.1 2024-06-21 10:47:31,108 INFO [train.py:1028] (0/2) Epoch 21, batch 650, loss[loss=0.1906, simple_loss=0.2554, pruned_loss=0.06283, over 13182.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2577, pruned_loss=0.07267, over 2489291.51 frames. 
], batch size: 59, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:47:37,025 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:47:40,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=372170.3333333333, ans=0.125 2024-06-21 10:47:42,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=372170.3333333333, ans=0.125 2024-06-21 10:47:43,158 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2024-06-21 10:47:45,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=372170.3333333333, ans=0.0 2024-06-21 10:47:54,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=372188.6666666667, ans=0.125 2024-06-21 10:48:03,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=372207.0, ans=15.0 2024-06-21 10:48:03,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=372207.0, ans=0.125 2024-06-21 10:48:05,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=372225.3333333333, ans=0.0 2024-06-21 10:48:06,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=372225.3333333333, ans=0.2 2024-06-21 10:48:13,357 INFO [train.py:1028] (0/2) Epoch 21, batch 700, loss[loss=0.2101, simple_loss=0.2636, pruned_loss=0.07825, over 13270.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2578, pruned_loss=0.07312, over 2511648.02 frames. ], batch size: 46, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:48:22,402 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.047e+02 2.178e+02 2.374e+02 3.167e+02, threshold=4.357e+02, percent-clipped=0.0 2024-06-21 10:48:39,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=372298.6666666667, ans=0.125 2024-06-21 10:48:55,436 INFO [train.py:1028] (0/2) Epoch 21, batch 750, loss[loss=0.2073, simple_loss=0.2636, pruned_loss=0.07549, over 13253.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2578, pruned_loss=0.0728, over 2527428.73 frames. ], batch size: 63, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:49:34,956 INFO [train.py:1028] (0/2) Epoch 21, batch 800, loss[loss=0.1896, simple_loss=0.2515, pruned_loss=0.06379, over 12916.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.258, pruned_loss=0.07286, over 2539735.30 frames. 
], batch size: 36, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:49:40,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=372427.0, ans=0.125 2024-06-21 10:49:41,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=372427.0, ans=0.0 2024-06-21 10:49:44,228 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.067e+02 2.212e+02 2.434e+02 3.333e+02, threshold=4.425e+02, percent-clipped=0.0 2024-06-21 10:49:58,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.64 vs. limit=15.0 2024-06-21 10:50:14,344 INFO [train.py:1028] (0/2) Epoch 21, batch 850, loss[loss=0.2095, simple_loss=0.26, pruned_loss=0.07951, over 13181.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2569, pruned_loss=0.07244, over 2549941.29 frames. ], batch size: 95, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:50:21,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=372518.6666666667, ans=0.0 2024-06-21 10:50:24,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=372518.6666666667, ans=0.125 2024-06-21 10:50:26,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=372537.0, ans=0.125 2024-06-21 10:50:46,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=372573.6666666667, ans=0.025 2024-06-21 10:50:48,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=372592.0, ans=0.2 2024-06-21 10:50:59,580 INFO [train.py:1028] (0/2) Epoch 21, batch 900, loss[loss=0.1816, simple_loss=0.2444, pruned_loss=0.05943, over 12876.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2565, pruned_loss=0.07263, over 2555773.09 frames. ], batch size: 36, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:51:00,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=372610.3333333333, ans=0.0 2024-06-21 10:51:00,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=372610.3333333333, ans=0.2 2024-06-21 10:51:08,046 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.034e+02 2.156e+02 2.310e+02 3.373e+02, threshold=4.312e+02, percent-clipped=0.0 2024-06-21 10:51:08,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372628.6666666667, ans=0.1 2024-06-21 10:51:19,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=372647.0, ans=0.125 2024-06-21 10:51:29,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=372683.6666666667, ans=0.0 2024-06-21 10:51:31,927 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. 
limit=15.0 2024-06-21 10:51:33,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=372683.6666666667, ans=0.07 2024-06-21 10:51:37,598 INFO [train.py:1028] (0/2) Epoch 21, batch 950, loss[loss=0.2119, simple_loss=0.2678, pruned_loss=0.07806, over 12898.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2567, pruned_loss=0.07266, over 2559264.95 frames. ], batch size: 39, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:51:38,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=372702.0, ans=0.125 2024-06-21 10:51:48,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=372720.3333333333, ans=0.125 2024-06-21 10:51:48,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.25 vs. limit=22.5 2024-06-21 10:51:50,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.51 vs. limit=15.0 2024-06-21 10:51:57,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=372738.6666666667, ans=0.1 2024-06-21 10:51:58,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=372738.6666666667, ans=0.125 2024-06-21 10:52:04,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=372757.0, ans=0.125 2024-06-21 10:52:04,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=372757.0, ans=0.2 2024-06-21 10:52:04,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=372757.0, ans=0.0 2024-06-21 10:52:15,204 INFO [train.py:1028] (0/2) Epoch 21, batch 1000, loss[loss=0.2201, simple_loss=0.2746, pruned_loss=0.08282, over 13295.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2569, pruned_loss=0.07317, over 2562169.02 frames. ], batch size: 49, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:52:24,409 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.113e+02 2.225e+02 2.411e+02 3.273e+02, threshold=4.449e+02, percent-clipped=0.0 2024-06-21 10:52:49,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=372867.0, ans=0.2 2024-06-21 10:52:57,316 INFO [train.py:1028] (0/2) Epoch 21, batch 1050, loss[loss=0.1953, simple_loss=0.2525, pruned_loss=0.06907, over 13200.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2574, pruned_loss=0.07313, over 2564713.89 frames. ], batch size: 77, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:53:15,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=372922.0, ans=0.125 2024-06-21 10:53:19,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=372922.0, ans=0.035 2024-06-21 10:53:26,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.13 vs. 
limit=22.5 2024-06-21 10:53:32,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=372958.6666666667, ans=0.125 2024-06-21 10:53:39,240 INFO [train.py:1028] (0/2) Epoch 21, batch 1100, loss[loss=0.1975, simple_loss=0.2497, pruned_loss=0.07269, over 13254.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2579, pruned_loss=0.07349, over 2569348.95 frames. ], batch size: 52, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:53:48,551 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.024e+02 2.166e+02 2.326e+02 3.171e+02, threshold=4.332e+02, percent-clipped=0.0 2024-06-21 10:53:53,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=372995.3333333333, ans=0.125 2024-06-21 10:53:54,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=373013.6666666667, ans=0.125 2024-06-21 10:54:05,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=373032.0, ans=0.0 2024-06-21 10:54:12,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=373050.3333333333, ans=0.125 2024-06-21 10:54:13,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=373050.3333333333, ans=0.2 2024-06-21 10:54:18,078 INFO [train.py:1028] (0/2) Epoch 21, batch 1150, loss[loss=0.2037, simple_loss=0.2537, pruned_loss=0.07684, over 13264.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2582, pruned_loss=0.07388, over 2570186.77 frames. ], batch size: 52, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:54:19,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373068.6666666667, ans=0.1 2024-06-21 10:54:23,570 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2024-06-21 10:54:26,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=373087.0, ans=0.125 2024-06-21 10:54:28,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=373087.0, ans=0.125 2024-06-21 10:54:34,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=373105.3333333333, ans=0.2 2024-06-21 10:54:35,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=373105.3333333333, ans=0.1 2024-06-21 10:54:38,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=373105.3333333333, ans=0.125 2024-06-21 10:54:45,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=373123.6666666667, ans=0.0 2024-06-21 10:54:52,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. 
limit=6.0 2024-06-21 10:54:56,241 INFO [train.py:1028] (0/2) Epoch 21, batch 1200, loss[loss=0.2057, simple_loss=0.2674, pruned_loss=0.07197, over 13122.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2584, pruned_loss=0.07406, over 2572777.08 frames. ], batch size: 77, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:54:56,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.38 vs. limit=22.5 2024-06-21 10:55:05,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=373160.3333333333, ans=0.0 2024-06-21 10:55:05,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.78 vs. limit=22.5 2024-06-21 10:55:08,517 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.069e+02 2.224e+02 2.492e+02 3.349e+02, threshold=4.447e+02, percent-clipped=0.0 2024-06-21 10:55:12,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=373178.6666666667, ans=0.0 2024-06-21 10:55:17,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.65 vs. limit=15.0 2024-06-21 10:55:18,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=373197.0, ans=0.125 2024-06-21 10:55:20,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=373197.0, ans=0.0 2024-06-21 10:55:34,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=373233.6666666667, ans=0.125 2024-06-21 10:55:40,015 INFO [train.py:1028] (0/2) Epoch 21, batch 1250, loss[loss=0.1821, simple_loss=0.2379, pruned_loss=0.06316, over 13154.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2583, pruned_loss=0.07387, over 2582631.80 frames. ], batch size: 112, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:55:48,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=373270.3333333333, ans=0.125 2024-06-21 10:55:56,969 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:56:16,739 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.41 vs. limit=15.0 2024-06-21 10:56:18,771 INFO [train.py:1028] (0/2) Epoch 21, batch 1300, loss[loss=0.233, simple_loss=0.2759, pruned_loss=0.09509, over 12727.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2579, pruned_loss=0.07359, over 2583134.21 frames. ], batch size: 176, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:56:21,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=373343.6666666667, ans=0.0 2024-06-21 10:56:27,317 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.060e+02 2.196e+02 2.424e+02 3.258e+02, threshold=4.391e+02, percent-clipped=0.0 2024-06-21 10:56:35,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.26 vs. 
limit=22.5 2024-06-21 10:56:40,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=373398.6666666667, ans=0.0 2024-06-21 10:56:43,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2024-06-21 10:56:45,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=373398.6666666667, ans=0.02 2024-06-21 10:56:55,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=373417.0, ans=0.025 2024-06-21 10:56:56,678 INFO [train.py:1028] (0/2) Epoch 21, batch 1350, loss[loss=0.2201, simple_loss=0.2823, pruned_loss=0.07896, over 13204.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2584, pruned_loss=0.07344, over 2584969.47 frames. ], batch size: 59, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:56:57,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.69 vs. limit=22.5 2024-06-21 10:57:08,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=373453.6666666667, ans=0.0 2024-06-21 10:57:10,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=373453.6666666667, ans=0.1 2024-06-21 10:57:22,545 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.76 vs. limit=15.0 2024-06-21 10:57:32,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=15.0 2024-06-21 10:57:37,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=373527.0, ans=0.125 2024-06-21 10:57:38,418 INFO [train.py:1028] (0/2) Epoch 21, batch 1400, loss[loss=0.2019, simple_loss=0.2599, pruned_loss=0.07192, over 12564.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2577, pruned_loss=0.07325, over 2586687.30 frames. 
], batch size: 25, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:57:38,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=373527.0, ans=0.1 2024-06-21 10:57:43,513 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:57:47,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=373545.3333333333, ans=0.125 2024-06-21 10:57:47,729 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.099e+02 2.236e+02 2.384e+02 2.942e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 10:58:06,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=373582.0, ans=0.125 2024-06-21 10:58:14,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=373600.3333333333, ans=0.0 2024-06-21 10:58:17,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=373600.3333333333, ans=0.125 2024-06-21 10:58:19,476 INFO [train.py:1028] (0/2) Epoch 21, batch 1450, loss[loss=0.1914, simple_loss=0.2483, pruned_loss=0.06729, over 13117.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2579, pruned_loss=0.07361, over 2586181.22 frames. ], batch size: 121, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:58:23,548 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.74 vs. limit=15.0 2024-06-21 10:58:29,201 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 10:58:34,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=8.0 2024-06-21 10:58:41,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=373655.3333333333, ans=0.125 2024-06-21 10:58:41,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2024-06-21 10:58:51,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=373692.0, ans=10.0 2024-06-21 10:58:52,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=373692.0, ans=0.125 2024-06-21 10:58:58,121 INFO [train.py:1028] (0/2) Epoch 21, batch 1500, loss[loss=0.2205, simple_loss=0.2618, pruned_loss=0.08963, over 13213.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2579, pruned_loss=0.07378, over 2588867.48 frames. 
], batch size: 83, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:59:06,856 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.096e+02 2.231e+02 2.416e+02 2.924e+02, threshold=4.462e+02, percent-clipped=0.0 2024-06-21 10:59:21,682 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.142e+01 2024-06-21 10:59:24,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373765.3333333333, ans=0.1 2024-06-21 10:59:25,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=373765.3333333333, ans=0.07 2024-06-21 10:59:26,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=373765.3333333333, ans=0.125 2024-06-21 10:59:27,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=373765.3333333333, ans=22.5 2024-06-21 10:59:31,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=373783.6666666667, ans=0.125 2024-06-21 10:59:36,287 INFO [train.py:1028] (0/2) Epoch 21, batch 1550, loss[loss=0.2172, simple_loss=0.2646, pruned_loss=0.08491, over 12965.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.258, pruned_loss=0.07376, over 2584669.43 frames. ], batch size: 102, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 10:59:49,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2024-06-21 10:59:55,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=373838.6666666667, ans=0.125 2024-06-21 11:00:22,168 INFO [train.py:1028] (0/2) Epoch 21, batch 1600, loss[loss=0.2162, simple_loss=0.27, pruned_loss=0.08121, over 13188.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2581, pruned_loss=0.07378, over 2580287.79 frames. ], batch size: 77, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 11:00:31,781 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.068e+02 2.220e+02 2.423e+02 3.644e+02, threshold=4.441e+02, percent-clipped=0.0 2024-06-21 11:00:42,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. limit=6.0 2024-06-21 11:00:54,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.99 vs. limit=12.0 2024-06-21 11:00:56,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=373967.0, ans=0.125 2024-06-21 11:01:01,740 INFO [train.py:1028] (0/2) Epoch 21, batch 1650, loss[loss=0.2117, simple_loss=0.2547, pruned_loss=0.08433, over 13173.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2583, pruned_loss=0.07409, over 2577328.55 frames. ], batch size: 95, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 11:01:02,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.08 vs. 
limit=15.0 2024-06-21 11:01:07,567 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-204000.pt 2024-06-21 11:01:15,000 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.87 vs. limit=12.0 2024-06-21 11:01:16,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=374003.6666666667, ans=0.2 2024-06-21 11:01:17,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.72 vs. limit=15.0 2024-06-21 11:01:25,329 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.48 vs. limit=22.5 2024-06-21 11:01:25,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=374022.0, ans=0.1 2024-06-21 11:01:39,395 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.74 vs. limit=15.0 2024-06-21 11:01:47,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=374077.0, ans=0.0 2024-06-21 11:01:47,549 INFO [train.py:1028] (0/2) Epoch 21, batch 1700, loss[loss=0.2034, simple_loss=0.2636, pruned_loss=0.07159, over 12853.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2593, pruned_loss=0.07436, over 2581883.05 frames. ], batch size: 26, lr: 2.79e-03, grad_scale: 64.0 2024-06-21 11:01:49,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=374077.0, ans=0.0 2024-06-21 11:01:49,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.08 vs. limit=15.0 2024-06-21 11:01:50,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.52 vs. limit=15.0 2024-06-21 11:01:52,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=374077.0, ans=0.2 2024-06-21 11:01:53,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=374077.0, ans=0.04949747468305833 2024-06-21 11:01:57,507 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.105e+02 2.230e+02 2.399e+02 3.282e+02, threshold=4.461e+02, percent-clipped=0.0 2024-06-21 11:01:57,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=374095.3333333333, ans=0.035 2024-06-21 11:01:59,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=374095.3333333333, ans=10.0 2024-06-21 11:02:05,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=374113.6666666667, ans=0.025 2024-06-21 11:02:10,586 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. 
limit=15.0 2024-06-21 11:02:19,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=374132.0, ans=0.1 2024-06-21 11:02:30,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=374168.6666666667, ans=0.0 2024-06-21 11:02:30,599 INFO [train.py:1028] (0/2) Epoch 21, batch 1750, loss[loss=0.2144, simple_loss=0.275, pruned_loss=0.07692, over 12518.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2592, pruned_loss=0.07412, over 2582943.04 frames. ], batch size: 22, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:02:33,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=374168.6666666667, ans=0.2 2024-06-21 11:02:33,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=374168.6666666667, ans=0.0 2024-06-21 11:02:40,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=374187.0, ans=0.125 2024-06-21 11:02:59,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=374223.6666666667, ans=0.125 2024-06-21 11:03:01,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=374223.6666666667, ans=0.125 2024-06-21 11:03:07,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=374242.0, ans=0.07 2024-06-21 11:03:07,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=374242.0, ans=0.125 2024-06-21 11:03:12,292 INFO [train.py:1028] (0/2) Epoch 21, batch 1800, loss[loss=0.1909, simple_loss=0.2546, pruned_loss=0.06361, over 13196.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2588, pruned_loss=0.07383, over 2583304.72 frames. ], batch size: 67, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:03:15,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=374260.3333333333, ans=0.125 2024-06-21 11:03:16,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=374260.3333333333, ans=0.05 2024-06-21 11:03:21,061 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.145e+02 2.352e+02 2.614e+02 3.267e+02, threshold=4.704e+02, percent-clipped=0.0 2024-06-21 11:03:22,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2024-06-21 11:03:38,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.53 vs. limit=12.0 2024-06-21 11:03:40,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=374315.3333333333, ans=0.0 2024-06-21 11:03:50,167 INFO [train.py:1028] (0/2) Epoch 21, batch 1850, loss[loss=0.1976, simple_loss=0.2542, pruned_loss=0.07051, over 13186.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2591, pruned_loss=0.074, over 2583818.85 frames. 
], batch size: 83, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:04:09,604 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2024-06-21 11:04:27,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.45 vs. limit=15.0 2024-06-21 11:04:28,871 INFO [train.py:1028] (0/2) Epoch 21, batch 1900, loss[loss=0.195, simple_loss=0.2506, pruned_loss=0.0697, over 13169.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2585, pruned_loss=0.07381, over 2586178.51 frames. ], batch size: 95, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:04:30,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2024-06-21 11:04:42,227 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.101e+02 2.277e+02 2.472e+02 4.132e+02, threshold=4.555e+02, percent-clipped=0.0 2024-06-21 11:04:59,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.55 vs. limit=15.0 2024-06-21 11:05:02,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=374498.6666666667, ans=0.1 2024-06-21 11:05:11,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=374517.0, ans=0.125 2024-06-21 11:05:15,542 INFO [train.py:1028] (0/2) Epoch 21, batch 1950, loss[loss=0.1896, simple_loss=0.2542, pruned_loss=0.06255, over 13231.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2575, pruned_loss=0.07362, over 2592383.94 frames. ], batch size: 52, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:05:25,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=374553.6666666667, ans=0.1 2024-06-21 11:05:35,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=374572.0, ans=0.125 2024-06-21 11:05:40,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=374590.3333333333, ans=0.125 2024-06-21 11:05:42,978 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:05:44,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.55 vs. limit=15.0 2024-06-21 11:05:53,732 INFO [train.py:1028] (0/2) Epoch 21, batch 2000, loss[loss=0.2193, simple_loss=0.2785, pruned_loss=0.08005, over 12639.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2574, pruned_loss=0.07378, over 2588076.42 frames. ], batch size: 22, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:05:58,250 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2024-06-21 11:06:03,044 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.033e+02 2.147e+02 2.345e+02 3.000e+02, threshold=4.295e+02, percent-clipped=0.0 2024-06-21 11:06:17,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.55 vs. limit=22.5 2024-06-21 11:06:20,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=374682.0, ans=0.1 2024-06-21 11:06:29,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.08 vs. limit=15.0 2024-06-21 11:06:33,938 INFO [train.py:1028] (0/2) Epoch 21, batch 2050, loss[loss=0.2028, simple_loss=0.2631, pruned_loss=0.07127, over 12579.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2577, pruned_loss=0.07395, over 2584324.35 frames. ], batch size: 29, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:06:38,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=374718.6666666667, ans=0.125 2024-06-21 11:06:38,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=374718.6666666667, ans=0.125 2024-06-21 11:06:39,015 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2024-06-21 11:06:41,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=374737.0, ans=0.2 2024-06-21 11:07:11,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=15.0 2024-06-21 11:07:13,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=374792.0, ans=0.125 2024-06-21 11:07:16,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=374810.3333333333, ans=0.2 2024-06-21 11:07:17,355 INFO [train.py:1028] (0/2) Epoch 21, batch 2100, loss[loss=0.2043, simple_loss=0.2652, pruned_loss=0.07172, over 13222.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2584, pruned_loss=0.07398, over 2586346.89 frames. ], batch size: 59, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:07:25,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=374828.6666666667, ans=0.0 2024-06-21 11:07:26,974 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.064e+02 2.188e+02 2.367e+02 3.745e+02, threshold=4.376e+02, percent-clipped=0.0 2024-06-21 11:07:40,132 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=15.0 2024-06-21 11:07:42,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=374847.0, ans=0.025 2024-06-21 11:07:59,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.33 vs. 
limit=22.5 2024-06-21 11:07:59,829 INFO [train.py:1028] (0/2) Epoch 21, batch 2150, loss[loss=0.1796, simple_loss=0.2448, pruned_loss=0.0572, over 13279.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2584, pruned_loss=0.07374, over 2588983.39 frames. ], batch size: 52, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:08:12,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=374920.3333333333, ans=0.125 2024-06-21 11:08:23,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=374938.6666666667, ans=0.0 2024-06-21 11:08:24,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=374957.0, ans=0.09899494936611666 2024-06-21 11:08:27,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=374957.0, ans=0.125 2024-06-21 11:08:30,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=374957.0, ans=0.1 2024-06-21 11:08:40,005 INFO [train.py:1028] (0/2) Epoch 21, batch 2200, loss[loss=0.1971, simple_loss=0.2448, pruned_loss=0.07465, over 13170.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2592, pruned_loss=0.07405, over 2588854.12 frames. ], batch size: 83, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:08:43,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=374993.6666666667, ans=0.125 2024-06-21 11:08:49,008 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.108e+02 2.260e+02 2.490e+02 3.168e+02, threshold=4.519e+02, percent-clipped=0.0 2024-06-21 11:08:50,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2024-06-21 11:08:53,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2024-06-21 11:08:53,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.12 vs. limit=15.0 2024-06-21 11:09:04,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.89 vs. limit=10.0 2024-06-21 11:09:16,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=375067.0, ans=0.0 2024-06-21 11:09:18,897 INFO [train.py:1028] (0/2) Epoch 21, batch 2250, loss[loss=0.195, simple_loss=0.2588, pruned_loss=0.06555, over 13264.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2593, pruned_loss=0.07423, over 2587737.63 frames. 
], batch size: 63, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:09:18,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=375085.3333333333, ans=0.025 2024-06-21 11:09:31,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=375103.6666666667, ans=0.0 2024-06-21 11:09:31,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=375103.6666666667, ans=0.125 2024-06-21 11:10:01,097 INFO [train.py:1028] (0/2) Epoch 21, batch 2300, loss[loss=0.2053, simple_loss=0.2667, pruned_loss=0.07196, over 12863.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2595, pruned_loss=0.07412, over 2582342.48 frames. ], batch size: 33, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:10:04,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.44 vs. limit=22.5 2024-06-21 11:10:13,505 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.074e+02 2.185e+02 2.372e+02 3.265e+02, threshold=4.371e+02, percent-clipped=0.0 2024-06-21 11:10:14,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.64 vs. limit=15.0 2024-06-21 11:10:25,680 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2024-06-21 11:10:30,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=375232.0, ans=0.125 2024-06-21 11:10:33,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=375232.0, ans=0.95 2024-06-21 11:10:37,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=375250.3333333333, ans=0.125 2024-06-21 11:10:39,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=375250.3333333333, ans=0.07 2024-06-21 11:10:42,722 INFO [train.py:1028] (0/2) Epoch 21, batch 2350, loss[loss=0.2046, simple_loss=0.2643, pruned_loss=0.07242, over 13180.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2596, pruned_loss=0.07433, over 2586024.82 frames. 
], batch size: 67, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:10:42,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=375268.6666666667, ans=0.035 2024-06-21 11:10:44,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=375268.6666666667, ans=0.2 2024-06-21 11:10:46,169 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:10:58,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=375305.3333333333, ans=0.035 2024-06-21 11:11:09,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=375323.6666666667, ans=0.0 2024-06-21 11:11:10,861 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.75 vs. limit=10.0 2024-06-21 11:11:15,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=375342.0, ans=0.125 2024-06-21 11:11:15,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=375342.0, ans=0.125 2024-06-21 11:11:22,496 INFO [train.py:1028] (0/2) Epoch 21, batch 2400, loss[loss=0.1894, simple_loss=0.2467, pruned_loss=0.06606, over 13321.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2585, pruned_loss=0.07382, over 2588944.07 frames. ], batch size: 46, lr: 2.78e-03, grad_scale: 128.0 2024-06-21 11:11:24,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=375360.3333333333, ans=0.125 2024-06-21 11:11:28,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375360.3333333333, ans=0.1 2024-06-21 11:11:31,505 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.061e+02 2.233e+02 2.357e+02 3.164e+02, threshold=4.467e+02, percent-clipped=0.0 2024-06-21 11:11:53,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=375415.3333333333, ans=0.125 2024-06-21 11:12:03,307 INFO [train.py:1028] (0/2) Epoch 21, batch 2450, loss[loss=0.173, simple_loss=0.2328, pruned_loss=0.05662, over 13281.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2576, pruned_loss=0.07385, over 2584398.41 frames. ], batch size: 63, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:12:05,291 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=12.72 vs. limit=15.0 2024-06-21 11:12:07,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=375452.0, ans=0.0 2024-06-21 11:12:18,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=375470.3333333333, ans=0.025 2024-06-21 11:12:19,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.53 vs. 
limit=22.5 2024-06-21 11:12:20,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=375470.3333333333, ans=0.1 2024-06-21 11:12:24,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=375488.6666666667, ans=0.125 2024-06-21 11:12:25,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2024-06-21 11:12:35,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=375507.0, ans=0.125 2024-06-21 11:12:37,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=375525.3333333333, ans=0.125 2024-06-21 11:12:45,279 INFO [train.py:1028] (0/2) Epoch 21, batch 2500, loss[loss=0.1881, simple_loss=0.2401, pruned_loss=0.06806, over 13218.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2566, pruned_loss=0.07331, over 2587699.59 frames. ], batch size: 83, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:12:55,376 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.055e+02 2.206e+02 2.433e+02 2.903e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 11:12:57,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=375562.0, ans=0.0 2024-06-21 11:13:22,462 INFO [train.py:1028] (0/2) Epoch 21, batch 2550, loss[loss=0.1906, simple_loss=0.2502, pruned_loss=0.06543, over 12576.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2558, pruned_loss=0.07305, over 2586823.11 frames. ], batch size: 22, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:13:24,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375635.3333333333, ans=0.1 2024-06-21 11:13:33,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.11 vs. limit=22.5 2024-06-21 11:13:40,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=375672.0, ans=0.0 2024-06-21 11:13:46,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. limit=6.0 2024-06-21 11:13:49,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=375690.3333333333, ans=0.125 2024-06-21 11:14:02,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=375708.6666666667, ans=22.5 2024-06-21 11:14:04,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=375708.6666666667, ans=0.125 2024-06-21 11:14:07,061 INFO [train.py:1028] (0/2) Epoch 21, batch 2600, loss[loss=0.1869, simple_loss=0.2448, pruned_loss=0.06446, over 13325.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2543, pruned_loss=0.0727, over 2586236.03 frames. 
], batch size: 52, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:14:17,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2024-06-21 11:14:26,381 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.061e+02 2.159e+02 2.335e+02 2.992e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-21 11:14:32,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=375763.6666666667, ans=0.0 2024-06-21 11:14:51,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2024-06-21 11:14:52,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=375782.0, ans=0.0 2024-06-21 11:14:54,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=375782.0, ans=0.2 2024-06-21 11:14:55,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=375782.0, ans=0.09899494936611666 2024-06-21 11:15:00,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=375800.3333333333, ans=0.1 2024-06-21 11:15:10,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375818.6666666667, ans=0.1 2024-06-21 11:15:10,848 INFO [train.py:1028] (0/2) Epoch 21, batch 2650, loss[loss=0.1864, simple_loss=0.2293, pruned_loss=0.07171, over 13049.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.253, pruned_loss=0.07208, over 2586952.91 frames. ], batch size: 144, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:15:19,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=375818.6666666667, ans=0.125 2024-06-21 11:15:22,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.15 vs. limit=15.0 2024-06-21 11:15:49,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=375892.0, ans=0.125 2024-06-21 11:15:51,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=375892.0, ans=0.125 2024-06-21 11:15:57,765 INFO [train.py:1028] (0/2) Epoch 21, batch 2700, loss[loss=0.1922, simple_loss=0.2466, pruned_loss=0.06887, over 13235.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2523, pruned_loss=0.07218, over 2583217.55 frames. ], batch size: 89, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:16:08,568 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.735e+02 2.047e+02 2.185e+02 2.378e+02 3.062e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-21 11:16:19,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2024-06-21 11:16:30,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.71 vs. 
limit=15.0 2024-06-21 11:16:45,564 INFO [train.py:1028] (0/2) Epoch 21, batch 2750, loss[loss=0.1993, simple_loss=0.2478, pruned_loss=0.07543, over 13362.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2516, pruned_loss=0.07185, over 2579580.26 frames. ], batch size: 43, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:17:06,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.18 vs. limit=15.0 2024-06-21 11:17:13,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=376038.6666666667, ans=0.125 2024-06-21 11:17:15,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=376038.6666666667, ans=0.125 2024-06-21 11:17:35,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=376057.0, ans=0.2 2024-06-21 11:17:51,680 INFO [train.py:1028] (0/2) Epoch 21, batch 2800, loss[loss=0.2019, simple_loss=0.2471, pruned_loss=0.07838, over 10971.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2507, pruned_loss=0.07162, over 2579296.59 frames. ], batch size: 304, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:17:58,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=376093.6666666667, ans=0.125 2024-06-21 11:18:04,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.066e+02 2.250e+02 2.456e+02 4.507e+02, threshold=4.501e+02, percent-clipped=1.0 2024-06-21 11:18:18,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=376130.3333333333, ans=0.0 2024-06-21 11:18:42,574 INFO [train.py:1028] (0/2) Epoch 21, batch 2850, loss[loss=0.2053, simple_loss=0.2595, pruned_loss=0.07553, over 13252.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.25, pruned_loss=0.07163, over 2577144.95 frames. ], batch size: 49, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:18:55,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=376203.6666666667, ans=0.05 2024-06-21 11:19:03,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.73 vs. limit=15.0 2024-06-21 11:19:14,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=376240.3333333333, ans=0.125 2024-06-21 11:19:17,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=376240.3333333333, ans=0.125 2024-06-21 11:19:29,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=376258.6666666667, ans=0.0 2024-06-21 11:19:30,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=376277.0, ans=0.05 2024-06-21 11:19:30,875 INFO [train.py:1028] (0/2) Epoch 21, batch 2900, loss[loss=0.1813, simple_loss=0.2325, pruned_loss=0.06503, over 13106.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.248, pruned_loss=0.07095, over 2585849.28 frames. 
], batch size: 55, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:19:44,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=376295.3333333333, ans=0.125 2024-06-21 11:19:45,856 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.033e+02 2.161e+02 2.416e+02 3.249e+02, threshold=4.321e+02, percent-clipped=0.0 2024-06-21 11:19:47,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=376295.3333333333, ans=0.025 2024-06-21 11:19:57,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=376313.6666666667, ans=0.025 2024-06-21 11:20:27,809 INFO [train.py:1028] (0/2) Epoch 21, batch 2950, loss[loss=0.1945, simple_loss=0.2433, pruned_loss=0.07282, over 13239.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2473, pruned_loss=0.07057, over 2581267.51 frames. ], batch size: 43, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:21:06,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=376423.6666666667, ans=0.125 2024-06-21 11:21:11,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=376423.6666666667, ans=0.125 2024-06-21 11:21:24,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=376442.0, ans=0.09899494936611666 2024-06-21 11:21:27,925 INFO [train.py:1028] (0/2) Epoch 21, batch 3000, loss[loss=0.199, simple_loss=0.2533, pruned_loss=0.07232, over 13209.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2462, pruned_loss=0.07019, over 2579748.00 frames. ], batch size: 59, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:21:27,926 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 11:21:34,431 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([2.9339, 2.6253, 2.6356, 2.5115, 2.3900, 2.6161, 2.5751, 2.3990], device='cuda:0') 2024-06-21 11:21:36,421 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.6655, 3.6246, 3.0981, 3.3872], device='cuda:0') 2024-06-21 11:21:40,208 INFO [train.py:1060] (0/2) Epoch 21, validation: loss=0.1865, simple_loss=0.2506, pruned_loss=0.06124, over 351949.00 frames. 2024-06-21 11:21:40,209 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 11:21:42,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=376460.3333333333, ans=0.0 2024-06-21 11:21:54,027 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.016e+02 2.141e+02 2.320e+02 3.810e+02, threshold=4.281e+02, percent-clipped=0.0 2024-06-21 11:22:33,172 INFO [train.py:1028] (0/2) Epoch 21, batch 3050, loss[loss=0.1928, simple_loss=0.2463, pruned_loss=0.06966, over 13261.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2458, pruned_loss=0.07033, over 2579065.00 frames. 
], batch size: 46, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:23:10,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=376607.0, ans=0.025 2024-06-21 11:23:16,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.22 vs. limit=15.0 2024-06-21 11:23:17,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=376607.0, ans=0.0 2024-06-21 11:23:18,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=376625.3333333333, ans=0.125 2024-06-21 11:23:26,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=376625.3333333333, ans=0.0 2024-06-21 11:23:28,270 INFO [train.py:1028] (0/2) Epoch 21, batch 3100, loss[loss=0.1745, simple_loss=0.2178, pruned_loss=0.06563, over 13059.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2446, pruned_loss=0.06977, over 2579654.57 frames. ], batch size: 144, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:23:47,663 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.042e+02 2.176e+02 2.465e+02 3.504e+02, threshold=4.352e+02, percent-clipped=0.0 2024-06-21 11:23:50,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.79 vs. limit=22.5 2024-06-21 11:23:51,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=376662.0, ans=0.125 2024-06-21 11:23:57,006 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2024-06-21 11:24:01,421 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 11:24:26,710 INFO [train.py:1028] (0/2) Epoch 21, batch 3150, loss[loss=0.1826, simple_loss=0.2309, pruned_loss=0.06718, over 12964.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2436, pruned_loss=0.06933, over 2581721.55 frames. ], batch size: 158, lr: 2.78e-03, grad_scale: 64.0 2024-06-21 11:24:30,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=376735.3333333333, ans=0.025 2024-06-21 11:24:37,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376753.6666666667, ans=0.1 2024-06-21 11:24:40,486 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2024-06-21 11:24:49,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.51 vs. limit=22.5 2024-06-21 11:25:17,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=376808.6666666667, ans=0.1 2024-06-21 11:25:20,021 INFO [train.py:1028] (0/2) Epoch 21, batch 3200, loss[loss=0.1867, simple_loss=0.2386, pruned_loss=0.06738, over 13180.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2435, pruned_loss=0.06914, over 2581675.59 frames. 
], batch size: 55, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:25:33,976 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.040e+02 2.208e+02 2.423e+02 3.162e+02, threshold=4.417e+02, percent-clipped=0.0 2024-06-21 11:25:47,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=376863.6666666667, ans=0.125 2024-06-21 11:25:49,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=376863.6666666667, ans=0.125 2024-06-21 11:25:50,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2024-06-21 11:25:57,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=376882.0, ans=0.2 2024-06-21 11:25:58,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2024-06-21 11:26:10,424 INFO [train.py:1028] (0/2) Epoch 21, batch 3250, loss[loss=0.1867, simple_loss=0.2477, pruned_loss=0.06287, over 13240.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2436, pruned_loss=0.06922, over 2586114.97 frames. ], batch size: 72, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:26:17,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=376918.6666666667, ans=0.025 2024-06-21 11:26:44,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=376955.3333333333, ans=0.125 2024-06-21 11:27:00,454 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2024-06-21 11:27:21,198 INFO [train.py:1028] (0/2) Epoch 21, batch 3300, loss[loss=0.2036, simple_loss=0.2531, pruned_loss=0.07704, over 12765.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2428, pruned_loss=0.0689, over 2583108.24 frames. ], batch size: 176, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:27:23,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.05 vs. limit=10.0 2024-06-21 11:27:34,068 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 1.982e+02 2.127e+02 2.237e+02 3.220e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-21 11:27:40,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=377047.0, ans=0.2 2024-06-21 11:27:58,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=377083.6666666667, ans=0.125 2024-06-21 11:28:03,800 INFO [train.py:1028] (0/2) Epoch 21, batch 3350, loss[loss=0.2008, simple_loss=0.2514, pruned_loss=0.07507, over 12925.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2424, pruned_loss=0.06911, over 2576949.11 frames. 
], batch size: 158, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:28:06,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=377102.0, ans=0.025 2024-06-21 11:28:18,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=377120.3333333333, ans=0.125 2024-06-21 11:28:26,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=377138.6666666667, ans=0.0 2024-06-21 11:28:27,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=377138.6666666667, ans=0.05 2024-06-21 11:28:41,891 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.771e+01 2024-06-21 11:28:55,983 INFO [train.py:1028] (0/2) Epoch 21, batch 3400, loss[loss=0.1829, simple_loss=0.2422, pruned_loss=0.06176, over 12595.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2418, pruned_loss=0.06886, over 2575096.25 frames. ], batch size: 22, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:29:17,028 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 1.997e+02 2.093e+02 2.252e+02 3.299e+02, threshold=4.187e+02, percent-clipped=0.0 2024-06-21 11:29:35,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=377248.6666666667, ans=0.5 2024-06-21 11:29:41,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=377248.6666666667, ans=0.125 2024-06-21 11:29:53,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=377267.0, ans=0.125 2024-06-21 11:30:03,738 INFO [train.py:1028] (0/2) Epoch 21, batch 3450, loss[loss=0.1909, simple_loss=0.2434, pruned_loss=0.06921, over 12831.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2414, pruned_loss=0.06886, over 2575514.23 frames. ], batch size: 177, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:30:23,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377303.6666666667, ans=0.1 2024-06-21 11:30:58,262 INFO [train.py:1028] (0/2) Epoch 21, batch 3500, loss[loss=0.1779, simple_loss=0.2304, pruned_loss=0.06268, over 12938.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2411, pruned_loss=0.0684, over 2575022.75 frames. 
], batch size: 33, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:31:09,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=377395.3333333333, ans=0.125 2024-06-21 11:31:10,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=377395.3333333333, ans=0.0 2024-06-21 11:31:12,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 1.995e+02 2.135e+02 2.298e+02 2.834e+02, threshold=4.270e+02, percent-clipped=0.0 2024-06-21 11:31:16,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=377395.3333333333, ans=0.025 2024-06-21 11:31:31,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=377432.0, ans=0.0 2024-06-21 11:31:34,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=377432.0, ans=0.0 2024-06-21 11:31:36,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=377432.0, ans=0.04949747468305833 2024-06-21 11:31:41,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.60 vs. limit=22.5 2024-06-21 11:31:49,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=377450.3333333333, ans=0.2 2024-06-21 11:31:51,515 INFO [train.py:1028] (0/2) Epoch 21, batch 3550, loss[loss=0.1934, simple_loss=0.2439, pruned_loss=0.07146, over 13166.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2413, pruned_loss=0.0686, over 2575857.72 frames. ], batch size: 95, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:32:13,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=377505.3333333333, ans=0.1 2024-06-21 11:32:22,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=377523.6666666667, ans=0.0 2024-06-21 11:32:27,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.37 vs. limit=10.0 2024-06-21 11:32:47,337 INFO [train.py:1028] (0/2) Epoch 21, batch 3600, loss[loss=0.1594, simple_loss=0.2143, pruned_loss=0.05231, over 13312.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2402, pruned_loss=0.06802, over 2580029.62 frames. 
], batch size: 49, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:32:57,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=377578.6666666667, ans=0.125 2024-06-21 11:33:02,461 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.649e+02 2.011e+02 2.156e+02 2.362e+02 3.391e+02, threshold=4.311e+02, percent-clipped=0.0 2024-06-21 11:33:16,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=377597.0, ans=0.2 2024-06-21 11:33:21,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=377597.0, ans=0.125 2024-06-21 11:33:32,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=377615.3333333333, ans=0.125 2024-06-21 11:33:43,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0 2024-06-21 11:33:43,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2024-06-21 11:33:46,607 INFO [train.py:1028] (0/2) Epoch 21, batch 3650, loss[loss=0.1715, simple_loss=0.2221, pruned_loss=0.06046, over 13068.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2394, pruned_loss=0.06752, over 2578786.86 frames. ], batch size: 102, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:33:56,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=377670.3333333333, ans=0.0 2024-06-21 11:34:00,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=377670.3333333333, ans=0.2 2024-06-21 11:34:00,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=377670.3333333333, ans=0.0 2024-06-21 11:34:05,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=377688.6666666667, ans=0.1 2024-06-21 11:34:07,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=377688.6666666667, ans=0.05 2024-06-21 11:34:09,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=377688.6666666667, ans=0.125 2024-06-21 11:34:14,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=377707.0, ans=0.2 2024-06-21 11:34:18,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.04 vs. limit=22.5 2024-06-21 11:34:32,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=377725.3333333333, ans=0.125 2024-06-21 11:34:35,455 INFO [train.py:1028] (0/2) Epoch 21, batch 3700, loss[loss=0.1708, simple_loss=0.2309, pruned_loss=0.05538, over 13225.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.239, pruned_loss=0.06729, over 2584246.83 frames. 
], batch size: 72, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:34:35,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=377743.6666666667, ans=0.2 2024-06-21 11:34:46,336 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 1.973e+02 2.105e+02 2.278e+02 3.267e+02, threshold=4.211e+02, percent-clipped=0.0 2024-06-21 11:35:05,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=377817.0, ans=0.0 2024-06-21 11:35:09,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=377817.0, ans=0.0 2024-06-21 11:35:15,006 INFO [train.py:1028] (0/2) Epoch 21, batch 3750, loss[loss=0.2166, simple_loss=0.2711, pruned_loss=0.08106, over 12646.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2391, pruned_loss=0.06742, over 2586592.16 frames. ], batch size: 22, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:35:20,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.95 vs. limit=15.0 2024-06-21 11:35:21,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2024-06-21 11:35:31,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=377853.6666666667, ans=0.0 2024-06-21 11:35:51,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.31 vs. limit=15.0 2024-06-21 11:35:52,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.45 vs. limit=10.0 2024-06-21 11:36:21,235 INFO [train.py:1028] (0/2) Epoch 21, batch 3800, loss[loss=0.189, simple_loss=0.2448, pruned_loss=0.0666, over 13180.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2389, pruned_loss=0.06727, over 2584840.86 frames. ], batch size: 83, lr: 2.77e-03, grad_scale: 64.0 2024-06-21 11:36:23,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=377927.0, ans=0.025 2024-06-21 11:36:28,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=377927.0, ans=0.125 2024-06-21 11:36:32,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=377945.3333333333, ans=0.0 2024-06-21 11:36:33,503 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.73 vs. limit=22.5 2024-06-21 11:36:34,639 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 1.971e+02 2.075e+02 2.219e+02 2.925e+02, threshold=4.151e+02, percent-clipped=0.0 2024-06-21 11:36:39,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.05 vs. 
2024-06-21 11:36:41,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=377963.6666666667, ans=0.05
2024-06-21 11:36:49,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=377963.6666666667, ans=0.125
2024-06-21 11:36:49,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=377963.6666666667, ans=0.125
2024-06-21 11:36:56,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=377982.0, ans=0.0
2024-06-21 11:37:05,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=378000.3333333333, ans=0.2
2024-06-21 11:37:10,567 INFO [train.py:1028] (0/2) Epoch 21, batch 3850, loss[loss=0.1785, simple_loss=0.23, pruned_loss=0.06348, over 13120.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2384, pruned_loss=0.06692, over 2584274.10 frames. ], batch size: 144, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:37:29,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=378037.0, ans=0.125
2024-06-21 11:37:33,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378055.3333333333, ans=0.1
2024-06-21 11:37:38,546 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.62 vs. limit=15.0
2024-06-21 11:37:39,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=378055.3333333333, ans=0.0
2024-06-21 11:37:42,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.26 vs. limit=15.0
2024-06-21 11:37:42,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=378073.6666666667, ans=0.125
2024-06-21 11:37:55,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=378092.0, ans=0.0
2024-06-21 11:38:01,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=378092.0, ans=0.0
2024-06-21 11:38:03,118 INFO [train.py:1028] (0/2) Epoch 21, batch 3900, loss[loss=0.1887, simple_loss=0.2391, pruned_loss=0.06919, over 13176.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2385, pruned_loss=0.06724, over 2586997.45 frames. ], batch size: 83, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:38:10,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=378110.3333333333, ans=0.125
2024-06-21 11:38:11,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0
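The scaling.py:1023 Whitening lines track how far a module's output covariance is from being "white" (proportional to the identity); a metric above its limit triggers a corrective gradient penalty on that module. The formula below is a plausible reconstruction of such a metric from the logged values, not a copy of scaling.py: with channel covariance C over d channels, metric = d * tr(C^2) / tr(C)^2, which equals 1 for a perfectly white signal and grows as variance concentrates in a few directions.

    # Sketch of a whitening metric: ~1.0 when the channel covariance is a
    # multiple of the identity, larger when energy concentrates in a few
    # directions. icefall's exact normalization may differ.
    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels)
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # (C, C) channel covariance
        d = cov.shape[0]
        # cov is symmetric, so tr(C^2) == sum of squared entries
        return (d * (cov * cov).sum() / cov.trace() ** 2).item()

    white = torch.randn(1000, 384)
    print(whitening_metric(white))            # near 1 (sampling noise inflates it)
    print(whitening_metric(white * torch.linspace(0.1, 3.0, 384)))  # larger

On this reading, a line like "metric=14.62 vs. limit=15.0" is a module drifting close to its whitening constraint, while "metric=23.04 vs. limit=22.5" is one that has just crossed it and will be pushed back.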
2024-06-21 11:38:16,232 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.783e+02 1.960e+02 2.140e+02 2.304e+02 3.178e+02, threshold=4.280e+02, percent-clipped=0.0
2024-06-21 11:38:21,974 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0
2024-06-21 11:38:24,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=378147.0, ans=0.5
2024-06-21 11:38:32,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=378165.3333333333, ans=0.0
2024-06-21 11:38:40,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0
2024-06-21 11:38:46,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=378183.6666666667, ans=0.2
2024-06-21 11:38:50,699 INFO [train.py:1028] (0/2) Epoch 21, batch 3950, loss[loss=0.1933, simple_loss=0.2377, pruned_loss=0.07443, over 13076.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2382, pruned_loss=0.06658, over 2588608.10 frames. ], batch size: 132, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:39:03,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=378220.3333333333, ans=0.1
2024-06-21 11:39:28,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378257.0, ans=0.1
2024-06-21 11:39:48,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=378275.3333333333, ans=0.125
2024-06-21 11:39:54,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=378275.3333333333, ans=0.125
2024-06-21 11:39:56,559 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 11:39:58,582 INFO [train.py:1028] (0/2) Epoch 21, batch 4000, loss[loss=0.1792, simple_loss=0.2378, pruned_loss=0.0603, over 12877.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.238, pruned_loss=0.06678, over 2583607.68 frames. ], batch size: 39, lr: 2.77e-03, grad_scale: 64.0
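In the train.py:1028 records, loss[...] is measured on the current batch alone, while tot_loss[...] is an aggregate weighted by the number of acoustic frames in each batch, which is why it moves so slowly from record to record. A small sketch of frame-weighted aggregation of this kind; the class and its names are illustrative, and icefall's running statistics additionally decay old batches, which this simpler version omits:

    # Sketch: fold per-batch losses into a frame-weighted running average,
    # similar in spirit to how tot_loss[...] evolves in this log.
    class RunningLoss:
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float):
            # batch_loss is already normalized per frame
            self.loss_sum += batch_loss * batch_frames
            self.frames += batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tot = RunningLoss()
    tot.update(0.1715, 13068.0)   # numbers from the batch 3650 record above
    tot.update(0.1792, 12877.0)   # and from the batch 4000 record
    print(f"tot_loss={tot.value:.4f} over {tot.frames:.2f} frames")

Weighting by frames rather than by batch keeps long utterances from being underrepresented, since batches here vary from ~20 to ~300 cuts.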
2024-06-21 11:40:03,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=378293.6666666667, ans=0.125
2024-06-21 11:40:09,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=378312.0, ans=0.125
2024-06-21 11:40:12,276 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 1.997e+02 2.118e+02 2.321e+02 2.856e+02, threshold=4.236e+02, percent-clipped=0.0
2024-06-21 11:40:20,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378330.3333333333, ans=0.125
2024-06-21 11:40:23,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=378330.3333333333, ans=0.2
2024-06-21 11:40:29,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=378348.6666666667, ans=0.125
2024-06-21 11:40:48,516 INFO [train.py:1028] (0/2) Epoch 21, batch 4050, loss[loss=0.1873, simple_loss=0.234, pruned_loss=0.07034, over 11095.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2377, pruned_loss=0.06664, over 2580898.83 frames. ], batch size: 303, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:40:49,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=378385.3333333333, ans=0.2
2024-06-21 11:41:01,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=378403.6666666667, ans=0.2
2024-06-21 11:41:10,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378422.0, ans=0.1
2024-06-21 11:41:15,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0
2024-06-21 11:41:20,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=378440.3333333333, ans=0.2
2024-06-21 11:41:30,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378458.6666666667, ans=0.1
2024-06-21 11:41:33,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=378458.6666666667, ans=0.125
2024-06-21 11:41:39,241 INFO [train.py:1028] (0/2) Epoch 21, batch 4100, loss[loss=0.1847, simple_loss=0.2351, pruned_loss=0.06709, over 13051.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2375, pruned_loss=0.0669, over 2576753.59 frames. ], batch size: 102, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:41:42,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=378477.0, ans=0.025
2024-06-21 11:41:42,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.52 vs. limit=15.0
2024-06-21 11:41:51,656 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.007e+02 2.113e+02 2.348e+02 3.014e+02, threshold=4.226e+02, percent-clipped=0.0
2024-06-21 11:42:07,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378513.6666666667, ans=0.125
2024-06-21 11:42:09,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=378532.0, ans=0.125
2024-06-21 11:42:30,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=378550.3333333333, ans=0.125
2024-06-21 11:42:33,848 INFO [train.py:1028] (0/2) Epoch 21, batch 4150, loss[loss=0.1812, simple_loss=0.2375, pruned_loss=0.06246, over 13154.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2371, pruned_loss=0.06686, over 2575418.30 frames. ], batch size: 55, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:42:49,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0
2024-06-21 11:42:59,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378605.3333333333, ans=0.1
2024-06-21 11:43:01,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=378623.6666666667, ans=0.09899494936611666
2024-06-21 11:43:03,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=378623.6666666667, ans=0.125
2024-06-21 11:43:06,334 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.17 vs. limit=15.0
2024-06-21 11:43:09,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.08 vs. limit=15.0
2024-06-21 11:43:22,122 INFO [train.py:1028] (0/2) Epoch 21, batch 4200, loss[loss=0.1808, simple_loss=0.2223, pruned_loss=0.0696, over 13035.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2365, pruned_loss=0.06661, over 2577431.61 frames. ], batch size: 102, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:43:24,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=378660.3333333333, ans=0.0
2024-06-21 11:43:35,564 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.657e+02 1.929e+02 2.064e+02 2.185e+02 2.833e+02, threshold=4.129e+02, percent-clipped=0.0
2024-06-21 11:43:51,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=378697.0, ans=15.0
2024-06-21 11:44:10,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=378733.6666666667, ans=0.125
2024-06-21 11:44:12,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=378733.6666666667, ans=0.0
2024-06-21 11:44:15,210 INFO [train.py:1028] (0/2) Epoch 21, batch 4250, loss[loss=0.1689, simple_loss=0.2263, pruned_loss=0.05573, over 13262.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2361, pruned_loss=0.06643, over 2581631.69 frames. ], batch size: 46, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:44:36,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=378788.6666666667, ans=0.07
2024-06-21 11:44:39,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=378788.6666666667, ans=0.125
2024-06-21 11:44:40,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=378788.6666666667, ans=0.025
2024-06-21 11:44:44,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=378807.0, ans=0.0
2024-06-21 11:44:49,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=378807.0, ans=0.125
2024-06-21 11:44:52,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.47 vs. limit=12.0
2024-06-21 11:44:55,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=378825.3333333333, ans=0.0
2024-06-21 11:44:58,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=378825.3333333333, ans=0.2
2024-06-21 11:44:59,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=378825.3333333333, ans=0.125
2024-06-21 11:45:00,641 INFO [train.py:1028] (0/2) Epoch 21, batch 4300, loss[loss=0.1673, simple_loss=0.2225, pruned_loss=0.05602, over 13196.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2353, pruned_loss=0.06623, over 2582217.24 frames. ], batch size: 59, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:45:21,289 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 1.993e+02 2.074e+02 2.276e+02 4.053e+02, threshold=4.148e+02, percent-clipped=0.0
2024-06-21 11:45:46,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=378898.6666666667, ans=0.125
2024-06-21 11:46:02,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=378917.0, ans=0.125
2024-06-21 11:46:06,532 INFO [train.py:1028] (0/2) Epoch 21, batch 4350, loss[loss=0.1654, simple_loss=0.2226, pruned_loss=0.05414, over 13225.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.235, pruned_loss=0.06616, over 2587091.85 frames. ], batch size: 59, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:46:11,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=378935.3333333333, ans=0.0
2024-06-21 11:46:17,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.42 vs. limit=15.0
2024-06-21 11:46:24,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=378953.6666666667, ans=0.125
2024-06-21 11:46:32,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=378972.0, ans=0.0
2024-06-21 11:46:41,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0
2024-06-21 11:46:57,160 INFO [train.py:1028] (0/2) Epoch 21, batch 4400, loss[loss=0.1669, simple_loss=0.2174, pruned_loss=0.05815, over 13201.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2347, pruned_loss=0.06619, over 2587093.87 frames. ], batch size: 83, lr: 2.77e-03, grad_scale: 64.0
2024-06-21 11:47:09,794 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.654e+02 2.025e+02 2.119e+02 2.257e+02 3.901e+02, threshold=4.239e+02, percent-clipped=0.0
2024-06-21 11:47:19,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=379063.6666666667, ans=0.125
2024-06-21 11:47:22,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=379063.6666666667, ans=0.1
2024-06-21 11:47:35,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=379082.0, ans=0.125
2024-06-21 11:47:36,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379082.0, ans=0.1
2024-06-21 11:47:37,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=379082.0, ans=0.125
2024-06-21 11:47:47,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379100.3333333333, ans=0.1
2024-06-21 11:47:49,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=379100.3333333333, ans=0.0
2024-06-21 11:47:51,666 INFO [train.py:1028] (0/2) Epoch 21, batch 4450, loss[loss=0.1852, simple_loss=0.2443, pruned_loss=0.06307, over 13010.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2354, pruned_loss=0.06651, over 2582631.80 frames. ], batch size: 33, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:47:51,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=379118.6666666667, ans=0.125
2024-06-21 11:47:56,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=15.0
2024-06-21 11:47:57,166 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 11:48:35,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=379173.6666666667, ans=0.2
2024-06-21 11:48:53,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=15.0
2024-06-21 11:48:54,361 INFO [train.py:1028] (0/2) Epoch 21, batch 4500, loss[loss=0.1847, simple_loss=0.2354, pruned_loss=0.06703, over 13220.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2349, pruned_loss=0.06623, over 2587273.32 frames. ], batch size: 89, lr: 2.77e-03, grad_scale: 128.0
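Note that grad_scale doubles from 64.0 to 128.0 at batch 4450 and drops back to 64.0 a few hundred batches later. This is the usual dynamic fp16 loss-scaling behavior: the scale grows while steps stay finite and is halved when an overflow is detected. A generic sketch with PyTorch's GradScaler; the growth interval and factors here are PyTorch defaults, not values read from this run, and a CUDA device is assumed:

    # Sketch: dynamic fp16 loss scaling. GradScaler multiplies the loss by
    # the current scale, checks the resulting grads for inf/NaN, and adapts
    # the scale: halve on overflow, double after a run of clean steps.
    import torch

    model = torch.nn.Linear(80, 500).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(init_scale=64.0, growth_interval=2000)

    for step in range(10):
        x = torch.randn(16, 80, device="cuda")
        with torch.cuda.amp.autocast():
            loss = model(x).pow(2).mean()
        opt.zero_grad()
        scaler.scale(loss).backward()  # backward on scale * loss
        scaler.step(opt)               # unscales grads, skips step on overflow
        scaler.update()                # grow or shrink the scale
        print(step, scaler.get_scale())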
2024-06-21 11:49:04,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=379228.6666666667, ans=0.125
2024-06-21 11:49:07,963 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 1.953e+02 2.087e+02 2.290e+02 2.681e+02, threshold=4.175e+02, percent-clipped=0.0
2024-06-21 11:49:17,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379247.0, ans=0.1
2024-06-21 11:49:45,361 INFO [train.py:1028] (0/2) Epoch 21, batch 4550, loss[loss=0.1739, simple_loss=0.2194, pruned_loss=0.0642, over 13276.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2347, pruned_loss=0.06618, over 2590172.29 frames. ], batch size: 52, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:49:51,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=379302.0, ans=0.125
2024-06-21 11:49:58,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=379320.3333333333, ans=0.125
2024-06-21 11:50:06,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379338.6666666667, ans=0.1
2024-06-21 11:50:17,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=379357.0, ans=0.2
2024-06-21 11:50:22,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=379375.3333333333, ans=0.125
2024-06-21 11:50:32,382 INFO [train.py:1028] (0/2) Epoch 21, batch 4600, loss[loss=0.2155, simple_loss=0.2581, pruned_loss=0.08648, over 12489.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2352, pruned_loss=0.06646, over 2586136.56 frames. ], batch size: 202, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:50:42,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=379412.0, ans=10.0
2024-06-21 11:50:45,300 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.724e+02 1.978e+02 2.104e+02 2.299e+02 3.681e+02, threshold=4.208e+02, percent-clipped=0.0
2024-06-21 11:51:00,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=379430.3333333333, ans=0.0
2024-06-21 11:51:13,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0
2024-06-21 11:51:14,187 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.33 vs. limit=22.5
2024-06-21 11:51:22,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379485.3333333333, ans=0.125
2024-06-21 11:51:22,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=379485.3333333333, ans=0.2
2024-06-21 11:51:23,500 INFO [train.py:1028] (0/2) Epoch 21, batch 4650, loss[loss=0.1704, simple_loss=0.2174, pruned_loss=0.06173, over 13105.00 frames. ], tot_loss[loss=0.184, simple_loss=0.235, pruned_loss=0.06649, over 2588913.52 frames. ], batch size: 132, lr: 2.77e-03, grad_scale: 128.0
2024-06-21 11:51:23,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=379485.3333333333, ans=0.125
2024-06-21 11:51:28,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=379485.3333333333, ans=0.125
2024-06-21 11:52:01,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=379522.0, ans=0.0
2024-06-21 11:52:04,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=379522.0, ans=0.2
2024-06-21 11:52:21,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=379558.6666666667, ans=0.0
2024-06-21 11:52:26,474 INFO [train.py:1028] (0/2) Epoch 21, batch 4700, loss[loss=0.1897, simple_loss=0.2436, pruned_loss=0.06786, over 12826.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2352, pruned_loss=0.06655, over 2584936.30 frames. ], batch size: 26, lr: 2.76e-03, grad_scale: 128.0
2024-06-21 11:52:34,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=379577.0, ans=0.125
2024-06-21 11:52:40,407 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.021e+02 2.119e+02 2.216e+02 3.354e+02, threshold=4.238e+02, percent-clipped=0.0
2024-06-21 11:52:50,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=15.0
2024-06-21 11:52:52,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=379613.6666666667, ans=0.125
2024-06-21 11:52:55,751 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.450e+01
2024-06-21 11:53:04,208 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. limit=15.0
2024-06-21 11:53:08,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=379650.3333333333, ans=0.0
2024-06-21 11:53:17,941 INFO [train.py:1028] (0/2) Epoch 21, batch 4750, loss[loss=0.196, simple_loss=0.2476, pruned_loss=0.07219, over 12536.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2349, pruned_loss=0.06655, over 2581522.21 frames. ], batch size: 202, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:53:24,638 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0
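The learning rate ticks down from 2.77e-03 to 2.76e-03 at batch 4700 here, and continues drifting down slowly. Zipformer recipes typically use an Eden-style scheduler in which the LR decays as a -0.25 power in both the batch index and the epoch; the formula below is a reconstruction under that assumption (with common default hyperparameters), not a copy of this run's scheduler, which also applies warmup and reference-duration corrections not modeled here:

    # Sketch of an Eden-style schedule: smooth power-law decay in both
    # batch count and epoch. base_lr, lr_batches, lr_epochs mirror the
    # usual hyperparameter names; treat the exact formula as an assumption.
    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # Late in training both factors change very slowly: across ~4000 batches
    # the LR moves by well under 1%, consistent with the tiny drift in the log.
    print(eden_lr(0.035, 378000, 21.0))
    print(eden_lr(0.035, 382000, 21.0))

Also visible just above: grad_scale halves from 128.0 back to 64.0 at batch 4750, the overflow-recovery half of the dynamic loss-scaling behavior sketched earlier.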
2024-06-21 11:53:33,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=379687.0, ans=0.0
2024-06-21 11:53:37,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=15.0
2024-06-21 11:53:53,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=379723.6666666667, ans=0.125
2024-06-21 11:54:00,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=379742.0, ans=0.125
2024-06-21 11:54:00,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.33 vs. limit=15.0
2024-06-21 11:54:05,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=379742.0, ans=0.125
2024-06-21 11:54:07,403 INFO [train.py:1028] (0/2) Epoch 21, batch 4800, loss[loss=0.1592, simple_loss=0.2119, pruned_loss=0.05321, over 13291.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2344, pruned_loss=0.06645, over 2577116.52 frames. ], batch size: 63, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:54:15,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=379778.6666666667, ans=0.035
2024-06-21 11:54:22,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.001e+02 2.155e+02 2.349e+02 3.104e+02, threshold=4.310e+02, percent-clipped=0.0
2024-06-21 11:54:28,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=379797.0, ans=0.025
2024-06-21 11:55:04,015 INFO [train.py:1028] (0/2) Epoch 21, batch 4850, loss[loss=0.191, simple_loss=0.2472, pruned_loss=0.06742, over 13301.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2345, pruned_loss=0.06611, over 2574733.01 frames. ], batch size: 89, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:55:23,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=379870.3333333333, ans=0.125
2024-06-21 11:55:29,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5
2024-06-21 11:55:35,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=379888.6666666667, ans=22.5
2024-06-21 11:55:39,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=379907.0, ans=0.0
2024-06-21 11:55:49,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=379925.3333333333, ans=0.025
2024-06-21 11:55:56,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379925.3333333333, ans=0.0
2024-06-21 11:55:59,163 INFO [train.py:1028] (0/2) Epoch 21, batch 4900, loss[loss=0.1799, simple_loss=0.2292, pruned_loss=0.0653, over 13191.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2343, pruned_loss=0.06618, over 2574743.53 frames. ], batch size: 59, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:56:07,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=379962.0, ans=15.0
2024-06-21 11:56:08,839 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 1.960e+02 2.125e+02 2.332e+02 3.281e+02, threshold=4.250e+02, percent-clipped=0.0
2024-06-21 11:56:11,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=379962.0, ans=0.0
2024-06-21 11:56:17,611 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.164e+01
2024-06-21 11:56:34,639 INFO [train.py:1028] (0/2) Epoch 21, batch 4950, loss[loss=0.1909, simple_loss=0.2312, pruned_loss=0.07526, over 10921.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2345, pruned_loss=0.06662, over 2568944.31 frames. ], batch size: 303, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:57:07,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.35 vs. limit=12.0
2024-06-21 11:57:12,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=380108.6666666667, ans=0.2
2024-06-21 11:57:27,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380108.6666666667, ans=0.1
2024-06-21 11:57:31,180 INFO [train.py:1028] (0/2) Epoch 21, batch 5000, loss[loss=0.1867, simple_loss=0.2267, pruned_loss=0.07335, over 13181.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2344, pruned_loss=0.06662, over 2572901.59 frames. ], batch size: 95, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:57:54,196 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.008e+02 2.122e+02 2.300e+02 4.136e+02, threshold=4.244e+02, percent-clipped=0.0
2024-06-21 11:57:55,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=380145.3333333333, ans=0.0
2024-06-21 11:58:10,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=380163.6666666667, ans=0.125
2024-06-21 11:58:19,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.78 vs. limit=15.0
2024-06-21 11:58:22,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=380200.3333333333, ans=0.125
2024-06-21 11:58:31,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380218.6666666667, ans=0.1
2024-06-21 11:58:31,989 INFO [train.py:1028] (0/2) Epoch 21, batch 5050, loss[loss=0.1758, simple_loss=0.2278, pruned_loss=0.06189, over 13286.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2343, pruned_loss=0.06646, over 2572717.90 frames. ], batch size: 37, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:58:48,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=380237.0, ans=0.0
2024-06-21 11:58:54,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=380255.3333333333, ans=0.125
2024-06-21 11:58:59,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=380255.3333333333, ans=0.125
2024-06-21 11:59:16,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=380292.0, ans=0.0
2024-06-21 11:59:17,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=380292.0, ans=0.125
2024-06-21 11:59:20,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380292.0, ans=0.1
2024-06-21 11:59:24,672 INFO [train.py:1028] (0/2) Epoch 21, batch 5100, loss[loss=0.1938, simple_loss=0.2512, pruned_loss=0.06821, over 12908.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2345, pruned_loss=0.06679, over 2568607.64 frames. ], batch size: 39, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 11:59:29,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=380310.3333333333, ans=0.125
2024-06-21 11:59:39,472 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 1.969e+02 2.185e+02 2.408e+02 3.009e+02, threshold=4.369e+02, percent-clipped=0.0
2024-06-21 11:59:56,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=380365.3333333333, ans=0.0
2024-06-21 11:59:57,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.40 vs. limit=22.5
2024-06-21 12:00:04,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=380383.6666666667, ans=0.125
2024-06-21 12:00:08,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=380383.6666666667, ans=0.125
2024-06-21 12:00:11,163 INFO [train.py:1028] (0/2) Epoch 21, batch 5150, loss[loss=0.1732, simple_loss=0.2155, pruned_loss=0.06549, over 13085.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2343, pruned_loss=0.06671, over 2569578.89 frames. ], batch size: 132, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:00:28,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=380420.3333333333, ans=0.125
2024-06-21 12:00:35,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=380420.3333333333, ans=0.2
2024-06-21 12:00:46,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=380457.0, ans=0.1
2024-06-21 12:00:54,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=380457.0, ans=0.125
2024-06-21 12:00:56,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=380475.3333333333, ans=0.05
2024-06-21 12:00:56,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=380475.3333333333, ans=0.0
2024-06-21 12:00:59,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=15.0
2024-06-21 12:01:07,137 INFO [train.py:1028] (0/2) Epoch 21, batch 5200, loss[loss=0.1843, simple_loss=0.2269, pruned_loss=0.07084, over 13153.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.234, pruned_loss=0.06627, over 2573525.61 frames. ], batch size: 95, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:01:15,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=380493.6666666667, ans=0.0
2024-06-21 12:01:17,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=380512.0, ans=0.125
2024-06-21 12:01:21,399 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 1.959e+02 2.069e+02 2.215e+02 3.282e+02, threshold=4.139e+02, percent-clipped=0.0
2024-06-21 12:01:24,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=380512.0, ans=10.0
2024-06-21 12:01:31,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=380530.3333333333, ans=0.125
2024-06-21 12:01:35,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=380530.3333333333, ans=0.09899494936611666
2024-06-21 12:01:41,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=380548.6666666667, ans=0.0
2024-06-21 12:02:00,257 INFO [train.py:1028] (0/2) Epoch 21, batch 5250, loss[loss=0.1767, simple_loss=0.2307, pruned_loss=0.06136, over 13313.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2341, pruned_loss=0.06632, over 2569930.93 frames. ], batch size: 52, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:02:51,739 INFO [train.py:1028] (0/2) Epoch 21, batch 5300, loss[loss=0.1882, simple_loss=0.2326, pruned_loss=0.07189, over 13045.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2337, pruned_loss=0.06624, over 2567570.41 frames. ], batch size: 144, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:02:57,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=380677.0, ans=0.2
2024-06-21 12:03:03,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=380695.3333333333, ans=0.0
2024-06-21 12:03:05,828 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.748e+02 1.999e+02 2.124e+02 2.266e+02 3.967e+02, threshold=4.247e+02, percent-clipped=0.0
2024-06-21 12:03:07,799 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.23 vs. limit=6.0
2024-06-21 12:03:26,820 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0
2024-06-21 12:03:29,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=380732.0, ans=0.125
2024-06-21 12:03:49,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=380750.3333333333, ans=0.125
2024-06-21 12:03:52,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=380750.3333333333, ans=0.0
2024-06-21 12:03:59,441 INFO [train.py:1028] (0/2) Epoch 21, batch 5350, loss[loss=0.148, simple_loss=0.2121, pruned_loss=0.04191, over 11577.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2326, pruned_loss=0.06575, over 2574443.32 frames. ], batch size: 17, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:04:17,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=380787.0, ans=0.2
2024-06-21 12:04:39,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.49 vs. limit=10.0
2024-06-21 12:04:50,333 INFO [train.py:1028] (0/2) Epoch 21, batch 5400, loss[loss=0.2036, simple_loss=0.2437, pruned_loss=0.08176, over 12235.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2332, pruned_loss=0.06644, over 2567965.37 frames. ], batch size: 240, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:04:55,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=380860.3333333333, ans=0.0
2024-06-21 12:05:03,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=380878.6666666667, ans=0.125
2024-06-21 12:05:04,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=380878.6666666667, ans=0.125
2024-06-21 12:05:05,197 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.022e+02 2.138e+02 2.285e+02 3.160e+02, threshold=4.276e+02, percent-clipped=0.0
2024-06-21 12:05:12,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=380897.0, ans=0.09899494936611666
2024-06-21 12:05:31,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=380915.3333333333, ans=0.0
2024-06-21 12:05:41,645 INFO [train.py:1028] (0/2) Epoch 21, batch 5450, loss[loss=0.192, simple_loss=0.2537, pruned_loss=0.06516, over 12353.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2333, pruned_loss=0.06598, over 2572244.44 frames. ], batch size: 25, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:05:54,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=380970.3333333333, ans=0.0
2024-06-21 12:06:00,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=380988.6666666667, ans=0.125
2024-06-21 12:06:03,201 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.39 vs. limit=15.0
2024-06-21 12:06:30,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=381025.3333333333, ans=0.0
2024-06-21 12:06:34,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.00 vs. limit=22.5
2024-06-21 12:06:35,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=381043.6666666667, ans=0.0
2024-06-21 12:06:35,791 INFO [train.py:1028] (0/2) Epoch 21, batch 5500, loss[loss=0.2086, simple_loss=0.2468, pruned_loss=0.08518, over 12250.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2335, pruned_loss=0.06605, over 2564577.78 frames. ], batch size: 241, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:06:49,357 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 1.947e+02 2.057e+02 2.209e+02 2.931e+02, threshold=4.113e+02, percent-clipped=0.0
2024-06-21 12:06:53,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=381062.0, ans=0.2
2024-06-21 12:06:59,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=381080.3333333333, ans=0.2
2024-06-21 12:07:06,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.72 vs. limit=10.0
2024-06-21 12:07:12,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=381098.6666666667, ans=10.0
2024-06-21 12:07:16,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381117.0, ans=0.1
2024-06-21 12:07:27,939 INFO [train.py:1028] (0/2) Epoch 21, batch 5550, loss[loss=0.1846, simple_loss=0.2365, pruned_loss=0.06641, over 13302.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2334, pruned_loss=0.0658, over 2567881.73 frames. ], batch size: 43, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:07:33,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=381135.3333333333, ans=0.025
2024-06-21 12:07:44,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381153.6666666667, ans=0.1
2024-06-21 12:07:45,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=381153.6666666667, ans=0.125
2024-06-21 12:07:45,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381153.6666666667, ans=0.125
2024-06-21 12:08:12,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=381208.6666666667, ans=0.125
2024-06-21 12:08:14,082 INFO [train.py:1028] (0/2) Epoch 21, batch 5600, loss[loss=0.1675, simple_loss=0.2141, pruned_loss=0.06045, over 13246.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2324, pruned_loss=0.06544, over 2570741.81 frames. ], batch size: 89, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:08:15,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=381227.0, ans=0.07
2024-06-21 12:08:16,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381227.0, ans=0.1
2024-06-21 12:08:19,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=381227.0, ans=0.125
2024-06-21 12:08:20,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=381227.0, ans=0.125
2024-06-21 12:08:24,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.60 vs. limit=15.0
2024-06-21 12:08:27,058 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.689e+02 1.967e+02 2.061e+02 2.231e+02 2.900e+02, threshold=4.122e+02, percent-clipped=0.0
2024-06-21 12:08:56,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=381300.3333333333, ans=0.0
2024-06-21 12:08:59,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=381300.3333333333, ans=0.125
2024-06-21 12:09:03,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.42 vs. limit=12.0
2024-06-21 12:09:04,953 INFO [train.py:1028] (0/2) Epoch 21, batch 5650, loss[loss=0.1964, simple_loss=0.2394, pruned_loss=0.0767, over 12600.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.233, pruned_loss=0.06578, over 2575795.08 frames. ], batch size: 202, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:09:08,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=381318.6666666667, ans=0.125
2024-06-21 12:09:11,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.92 vs. limit=15.0
2024-06-21 12:09:12,234 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-208000.pt
2024-06-21 12:10:05,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=381373.6666666667, ans=0.125
2024-06-21 12:10:18,618 INFO [train.py:1028] (0/2) Epoch 21, batch 5700, loss[loss=0.1718, simple_loss=0.2267, pruned_loss=0.05845, over 13261.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2329, pruned_loss=0.06587, over 2580216.64 frames. ], batch size: 63, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:10:22,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=381410.3333333333, ans=0.125
2024-06-21 12:10:29,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=381428.6666666667, ans=0.125
2024-06-21 12:10:33,054 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 1.980e+02 2.122e+02 2.311e+02 3.396e+02, threshold=4.244e+02, percent-clipped=0.0
2024-06-21 12:10:45,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381447.0, ans=0.125
2024-06-21 12:10:47,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=381447.0, ans=22.5
2024-06-21 12:10:53,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=381465.3333333333, ans=0.05
2024-06-21 12:11:11,729 INFO [train.py:1028] (0/2) Epoch 21, batch 5750, loss[loss=0.2003, simple_loss=0.2441, pruned_loss=0.07823, over 12740.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2339, pruned_loss=0.06614, over 2581276.51 frames. ], batch size: 176, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:11:50,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=381575.3333333333, ans=0.0
2024-06-21 12:11:56,419 INFO [train.py:1028] (0/2) Epoch 21, batch 5800, loss[loss=0.1811, simple_loss=0.2343, pruned_loss=0.06392, over 12679.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2349, pruned_loss=0.06689, over 2580245.49 frames. ], batch size: 176, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:11:59,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=381593.6666666667, ans=0.0
2024-06-21 12:12:07,144 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.004e+02 2.134e+02 2.278e+02 2.953e+02, threshold=4.268e+02, percent-clipped=0.0
2024-06-21 12:12:12,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.37 vs. limit=6.0
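The checkpoint.py:75 line above records a batch-indexed checkpoint (checkpoint-208000.pt) written mid-epoch; icefall saves one every fixed number of batches and prunes the oldest. A minimal sketch of that pattern, where the function name and the defaults (save every 4000 batches, keep the last 30) are typical values, assumed rather than read from this run:

    # Sketch: save a batch-indexed checkpoint every save_every_n batches and
    # keep only the most recent keep_last_k. File naming follows the
    # checkpoint-<batch>.pt convention visible in the log.
    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx: int,
                              exp_dir: Path, save_every_n: int = 4000,
                              keep_last_k: int = 30) -> None:
        if batch_idx == 0 or batch_idx % save_every_n != 0:
            return
        path = exp_dir / f"checkpoint-{batch_idx}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx,
            },
            path,
        )
        # prune older batch checkpoints, keeping the newest keep_last_k
        ckpts = sorted(
            exp_dir.glob("checkpoint-*.pt"),
            key=lambda p: int(p.stem.split("-")[1]),
        )
        for old in ckpts[:-keep_last_k]:
            old.unlink()

The ~50 s gap between the surrounding records (12:09:12 to 12:10:05) is consistent with rank 0 serializing a ~75M-parameter model plus optimizer state to disk.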
2024-06-21 12:12:15,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=381630.3333333333, ans=0.125
2024-06-21 12:12:57,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0
2024-06-21 12:12:57,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=381667.0, ans=0.0
2024-06-21 12:13:00,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381667.0, ans=0.1
2024-06-21 12:13:01,808 INFO [train.py:1028] (0/2) Epoch 21, batch 5850, loss[loss=0.2021, simple_loss=0.2485, pruned_loss=0.07784, over 12551.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2361, pruned_loss=0.0673, over 2578515.00 frames. ], batch size: 202, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:13:01,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=381685.3333333333, ans=0.0
2024-06-21 12:13:07,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.79 vs. limit=15.0
2024-06-21 12:13:16,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.22 vs. limit=22.5
2024-06-21 12:13:23,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=381722.0, ans=0.0
2024-06-21 12:13:32,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=381740.3333333333, ans=0.0
2024-06-21 12:13:34,111 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.07 vs. limit=22.5
2024-06-21 12:13:34,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=381740.3333333333, ans=0.0
2024-06-21 12:13:40,368 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.66 vs. limit=10.0
2024-06-21 12:13:45,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=381758.6666666667, ans=0.125
2024-06-21 12:13:48,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=381758.6666666667, ans=0.025
2024-06-21 12:13:51,640 INFO [train.py:1028] (0/2) Epoch 21, batch 5900, loss[loss=0.1891, simple_loss=0.2337, pruned_loss=0.07228, over 13100.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2384, pruned_loss=0.06793, over 2578777.11 frames. ], batch size: 121, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:13:52,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.53 vs. limit=10.0
2024-06-21 12:13:58,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=381777.0, ans=0.05
2024-06-21 12:14:03,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.032e+02 2.211e+02 2.378e+02 3.858e+02, threshold=4.421e+02, percent-clipped=0.0
2024-06-21 12:14:17,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.07 vs. limit=22.5
2024-06-21 12:14:41,250 INFO [train.py:1028] (0/2) Epoch 21, batch 5950, loss[loss=0.1745, simple_loss=0.2276, pruned_loss=0.06069, over 13072.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2397, pruned_loss=0.06847, over 2582680.91 frames. ], batch size: 121, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:14:48,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=381868.6666666667, ans=10.0
2024-06-21 12:14:50,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=381887.0, ans=0.025
2024-06-21 12:14:52,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=381887.0, ans=0.2
2024-06-21 12:14:53,354 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 12:15:00,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381905.3333333333, ans=0.125
2024-06-21 12:15:00,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0
2024-06-21 12:15:05,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381905.3333333333, ans=0.1
2024-06-21 12:15:18,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=381923.6666666667, ans=0.0
2024-06-21 12:15:22,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=381942.0, ans=0.0
2024-06-21 12:15:23,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=381942.0, ans=0.125
2024-06-21 12:15:23,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.13 vs. limit=22.5
2024-06-21 12:15:29,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381942.0, ans=0.1
2024-06-21 12:15:31,308 INFO [train.py:1028] (0/2) Epoch 21, batch 6000, loss[loss=0.2128, simple_loss=0.2613, pruned_loss=0.08212, over 12322.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2409, pruned_loss=0.06885, over 2575129.55 frames. ], batch size: 240, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:15:31,311 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 12:15:50,220 INFO [train.py:1060] (0/2) Epoch 21, validation: loss=0.1876, simple_loss=0.2514, pruned_loss=0.06191, over 351949.00 frames.
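At batch 6000 the trainer pauses to compute validation loss over the held-out dev set (the same 351949 frames every time it runs), then resumes training. A condensed sketch of such a loop; model, valid_loader and compute_loss are stand-ins for the real objects, and icefall's actual validation routine differs in detail:

    # Sketch: frame-weighted validation pass, run every valid_interval
    # batches. compute_loss is assumed to return (summed_loss, num_frames)
    # for one batch.
    import torch

    @torch.no_grad()
    def validate(model, valid_loader, compute_loss):
        was_training = model.training
        model.eval()
        total_loss, total_frames = 0.0, 0.0
        for batch in valid_loader:
            loss_sum, frames = compute_loss(model, batch)
            total_loss += float(loss_sum)
            total_frames += float(frames)
        if was_training:
            model.train()
        return total_loss / max(total_frames, 1.0)

Because the dev set is fixed, successive "validation:" records are directly comparable across the run, unlike the per-batch training losses around them.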
2024-06-21 12:15:50,221 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-21 12:16:05,721 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.107e+02 2.296e+02 2.494e+02 3.406e+02, threshold=4.593e+02, percent-clipped=0.0
2024-06-21 12:16:11,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.85 vs. limit=15.0
2024-06-21 12:16:22,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=381997.0, ans=0.0
2024-06-21 12:16:35,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=382033.6666666667, ans=0.0
2024-06-21 12:16:36,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=382033.6666666667, ans=0.125
2024-06-21 12:16:42,741 INFO [train.py:1028] (0/2) Epoch 21, batch 6050, loss[loss=0.1941, simple_loss=0.2561, pruned_loss=0.06608, over 13114.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2429, pruned_loss=0.06945, over 2577629.43 frames. ], batch size: 40, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:16:56,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=382070.3333333333, ans=0.0
2024-06-21 12:17:00,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382088.6666666667, ans=0.1
2024-06-21 12:17:03,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=382088.6666666667, ans=0.0
2024-06-21 12:17:09,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=382107.0, ans=0.025
2024-06-21 12:17:30,749 INFO [train.py:1028] (0/2) Epoch 21, batch 6100, loss[loss=0.1942, simple_loss=0.2432, pruned_loss=0.0726, over 13057.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2443, pruned_loss=0.07003, over 2579222.29 frames. ], batch size: 121, lr: 2.76e-03, grad_scale: 64.0
2024-06-21 12:17:32,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=382143.6666666667, ans=0.125
2024-06-21 12:17:37,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0
2024-06-21 12:17:45,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.085e+02 2.221e+02 2.429e+02 3.595e+02, threshold=4.443e+02, percent-clipped=0.0
2024-06-21 12:17:46,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=382162.0, ans=0.0
2024-06-21 12:17:59,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=382180.3333333333, ans=0.125
2024-06-21 12:18:22,023 INFO [train.py:1028] (0/2) Epoch 21, batch 6150, loss[loss=0.1928, simple_loss=0.2411, pruned_loss=0.07222, over 10735.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2454, pruned_loss=0.07057, over 2578817.87 frames.
], batch size: 303, lr: 2.76e-03, grad_scale: 64.0 2024-06-21 12:18:37,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=382253.6666666667, ans=0.025 2024-06-21 12:18:41,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=382272.0, ans=0.125 2024-06-21 12:18:51,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=382272.0, ans=0.125 2024-06-21 12:18:57,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=382290.3333333333, ans=0.0 2024-06-21 12:19:09,076 INFO [train.py:1028] (0/2) Epoch 21, batch 6200, loss[loss=0.1941, simple_loss=0.2505, pruned_loss=0.06882, over 13226.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2469, pruned_loss=0.07111, over 2576343.70 frames. ], batch size: 89, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:19:21,987 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.153e+02 2.390e+02 2.627e+02 3.668e+02, threshold=4.780e+02, percent-clipped=0.0 2024-06-21 12:19:30,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.23 vs. limit=22.5 2024-06-21 12:19:53,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten.whitening_limit, batch_count=382400.3333333333, ans=15.0 2024-06-21 12:20:01,022 INFO [train.py:1028] (0/2) Epoch 21, batch 6250, loss[loss=0.1889, simple_loss=0.2312, pruned_loss=0.07331, over 13176.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2477, pruned_loss=0.07132, over 2568767.32 frames. ], batch size: 83, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:20:07,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2024-06-21 12:20:14,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=382437.0, ans=0.125 2024-06-21 12:20:16,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=382437.0, ans=0.025 2024-06-21 12:20:22,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=382455.3333333333, ans=0.125 2024-06-21 12:20:30,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=382455.3333333333, ans=0.125 2024-06-21 12:20:52,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=12.0 2024-06-21 12:20:52,370 INFO [train.py:1028] (0/2) Epoch 21, batch 6300, loss[loss=0.1863, simple_loss=0.2386, pruned_loss=0.06699, over 12077.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.249, pruned_loss=0.07194, over 2564673.27 frames. 
], batch size: 17, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:20:53,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=382510.3333333333, ans=0.2 2024-06-21 12:20:53,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382510.3333333333, ans=0.1 2024-06-21 12:21:01,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=382528.6666666667, ans=0.125 2024-06-21 12:21:02,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2024-06-21 12:21:06,495 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.108e+02 2.236e+02 2.563e+02 3.165e+02, threshold=4.471e+02, percent-clipped=0.0 2024-06-21 12:21:07,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.28 vs. limit=10.0 2024-06-21 12:21:31,079 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. limit=6.0 2024-06-21 12:21:32,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.56 vs. limit=22.5 2024-06-21 12:22:00,098 INFO [train.py:1028] (0/2) Epoch 21, batch 6350, loss[loss=0.231, simple_loss=0.2807, pruned_loss=0.09065, over 12451.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2502, pruned_loss=0.07206, over 2574038.19 frames. ], batch size: 202, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:22:05,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=382602.0, ans=0.0 2024-06-21 12:22:39,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=382657.0, ans=0.0 2024-06-21 12:22:48,787 INFO [train.py:1028] (0/2) Epoch 21, batch 6400, loss[loss=0.1902, simple_loss=0.25, pruned_loss=0.06522, over 13334.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2517, pruned_loss=0.07256, over 2575213.15 frames. ], batch size: 67, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:22:53,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=382693.6666666667, ans=0.125 2024-06-21 12:23:03,415 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.216e+02 2.378e+02 2.690e+02 3.374e+02, threshold=4.755e+02, percent-clipped=0.0 2024-06-21 12:23:05,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=382712.0, ans=0.125 2024-06-21 12:23:22,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=382748.6666666667, ans=0.0 2024-06-21 12:23:32,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=382767.0, ans=0.025 2024-06-21 12:23:39,501 INFO [train.py:1028] (0/2) Epoch 21, batch 6450, loss[loss=0.2182, simple_loss=0.2679, pruned_loss=0.08421, over 12531.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.253, pruned_loss=0.07289, over 2581089.77 frames. 
], batch size: 202, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:23:53,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=382803.6666666667, ans=0.2 2024-06-21 12:23:56,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.00 vs. limit=15.0 2024-06-21 12:24:30,928 INFO [train.py:1028] (0/2) Epoch 21, batch 6500, loss[loss=0.2185, simple_loss=0.2603, pruned_loss=0.08834, over 10835.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2548, pruned_loss=0.07337, over 2584207.79 frames. ], batch size: 304, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:24:31,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=382877.0, ans=15.0 2024-06-21 12:24:42,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=382895.3333333333, ans=0.125 2024-06-21 12:24:44,408 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.148e+02 2.299e+02 2.526e+02 3.259e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 12:25:16,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=382932.0, ans=0.125 2024-06-21 12:25:20,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.22 vs. limit=15.0 2024-06-21 12:25:21,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=382932.0, ans=0.025 2024-06-21 12:25:36,999 INFO [train.py:1028] (0/2) Epoch 21, batch 6550, loss[loss=0.1849, simple_loss=0.2446, pruned_loss=0.06261, over 12658.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2556, pruned_loss=0.07352, over 2589049.02 frames. ], batch size: 22, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:25:43,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=382968.6666666667, ans=0.125 2024-06-21 12:25:46,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=382987.0, ans=0.125 2024-06-21 12:25:52,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=382987.0, ans=0.07 2024-06-21 12:25:59,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=383005.3333333333, ans=0.07 2024-06-21 12:26:09,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=383023.6666666667, ans=10.0 2024-06-21 12:26:10,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=383023.6666666667, ans=0.125 2024-06-21 12:26:13,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2024-06-21 12:26:27,453 INFO [train.py:1028] (0/2) Epoch 21, batch 6600, loss[loss=0.1954, simple_loss=0.2556, pruned_loss=0.06756, over 13263.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2559, pruned_loss=0.07367, over 2590568.73 frames. 
], batch size: 72, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:26:29,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=383060.3333333333, ans=0.2 2024-06-21 12:26:30,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2024-06-21 12:26:39,443 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.101e+02 2.248e+02 2.521e+02 3.532e+02, threshold=4.497e+02, percent-clipped=0.0 2024-06-21 12:26:41,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383078.6666666667, ans=0.1 2024-06-21 12:26:42,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=383078.6666666667, ans=6.0 2024-06-21 12:26:44,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=383097.0, ans=0.125 2024-06-21 12:26:53,731 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.79 vs. limit=22.5 2024-06-21 12:26:57,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=383115.3333333333, ans=0.0 2024-06-21 12:27:04,071 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.57 vs. limit=15.0 2024-06-21 12:27:12,949 INFO [train.py:1028] (0/2) Epoch 21, batch 6650, loss[loss=0.2159, simple_loss=0.2704, pruned_loss=0.08069, over 12941.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2577, pruned_loss=0.07454, over 2584888.40 frames. ], batch size: 158, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:27:23,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.38 vs. limit=15.0 2024-06-21 12:27:23,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=383170.3333333333, ans=0.05 2024-06-21 12:27:25,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=383170.3333333333, ans=0.0 2024-06-21 12:27:34,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.57 vs. limit=22.5 2024-06-21 12:27:52,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383225.3333333333, ans=0.1 2024-06-21 12:28:03,537 INFO [train.py:1028] (0/2) Epoch 21, batch 6700, loss[loss=0.2231, simple_loss=0.2671, pruned_loss=0.08956, over 12696.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2584, pruned_loss=0.07487, over 2583800.93 frames. 
], batch size: 176, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:28:33,608 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.167e+02 2.292e+02 2.521e+02 3.060e+02, threshold=4.585e+02, percent-clipped=0.0 2024-06-21 12:28:36,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=383262.0, ans=0.125 2024-06-21 12:29:07,369 INFO [train.py:1028] (0/2) Epoch 21, batch 6750, loss[loss=0.261, simple_loss=0.3058, pruned_loss=0.1082, over 12146.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2592, pruned_loss=0.07535, over 2577741.11 frames. ], batch size: 240, lr: 2.75e-03, grad_scale: 128.0 2024-06-21 12:29:12,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=383335.3333333333, ans=0.0 2024-06-21 12:29:12,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383335.3333333333, ans=0.1 2024-06-21 12:29:13,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=383335.3333333333, ans=0.5 2024-06-21 12:29:19,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383353.6666666667, ans=0.1 2024-06-21 12:29:33,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=383372.0, ans=0.125 2024-06-21 12:29:40,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=383390.3333333333, ans=0.0 2024-06-21 12:29:40,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=383390.3333333333, ans=0.0 2024-06-21 12:29:46,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=383390.3333333333, ans=0.125 2024-06-21 12:29:50,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=383408.6666666667, ans=0.025 2024-06-21 12:29:59,853 INFO [train.py:1028] (0/2) Epoch 21, batch 6800, loss[loss=0.2, simple_loss=0.2631, pruned_loss=0.06846, over 13211.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2615, pruned_loss=0.07605, over 2579837.54 frames. ], batch size: 67, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:30:14,503 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.177e+02 2.338e+02 2.571e+02 3.506e+02, threshold=4.676e+02, percent-clipped=0.0 2024-06-21 12:30:49,562 INFO [train.py:1028] (0/2) Epoch 21, batch 6850, loss[loss=0.2085, simple_loss=0.2704, pruned_loss=0.07329, over 13299.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2617, pruned_loss=0.07541, over 2583387.65 frames. ], batch size: 63, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:30:55,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. 
limit=10.0 2024-06-21 12:30:56,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383518.6666666667, ans=0.1 2024-06-21 12:30:57,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=383518.6666666667, ans=0.125 2024-06-21 12:31:09,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383555.3333333333, ans=0.1 2024-06-21 12:31:10,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=383555.3333333333, ans=0.125 2024-06-21 12:31:15,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=383555.3333333333, ans=0.125 2024-06-21 12:31:42,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.72 vs. limit=15.0 2024-06-21 12:31:43,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=383592.0, ans=0.125 2024-06-21 12:31:44,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=383592.0, ans=0.125 2024-06-21 12:31:45,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=383592.0, ans=0.0 2024-06-21 12:31:51,800 INFO [train.py:1028] (0/2) Epoch 21, batch 6900, loss[loss=0.2165, simple_loss=0.2652, pruned_loss=0.08395, over 13259.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2622, pruned_loss=0.0756, over 2584956.41 frames. ], batch size: 49, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:32:07,543 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.153e+02 2.349e+02 2.609e+02 3.576e+02, threshold=4.698e+02, percent-clipped=0.0 2024-06-21 12:32:11,306 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.855e+01 2024-06-21 12:32:13,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=383647.0, ans=0.125 2024-06-21 12:32:19,392 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.161e+01 2024-06-21 12:32:22,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=383665.3333333333, ans=0.0 2024-06-21 12:32:38,870 INFO [train.py:1028] (0/2) Epoch 21, batch 6950, loss[loss=0.2278, simple_loss=0.2844, pruned_loss=0.08562, over 11203.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2629, pruned_loss=0.07565, over 2579199.85 frames. ], batch size: 17, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:32:39,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=383702.0, ans=0.0 2024-06-21 12:32:50,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=383720.3333333333, ans=0.0 2024-06-21 12:33:12,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.77 vs. 
limit=15.0 2024-06-21 12:33:25,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2024-06-21 12:33:29,370 INFO [train.py:1028] (0/2) Epoch 21, batch 7000, loss[loss=0.2079, simple_loss=0.261, pruned_loss=0.0774, over 12989.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2633, pruned_loss=0.07552, over 2575964.94 frames. ], batch size: 158, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:33:32,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=383793.6666666667, ans=0.125 2024-06-21 12:33:33,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=383793.6666666667, ans=0.125 2024-06-21 12:33:44,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=383812.0, ans=0.125 2024-06-21 12:33:45,601 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.184e+02 2.336e+02 2.566e+02 3.383e+02, threshold=4.673e+02, percent-clipped=0.0 2024-06-21 12:33:54,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2024-06-21 12:34:06,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=12.0 2024-06-21 12:34:08,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=383848.6666666667, ans=0.0 2024-06-21 12:34:22,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=383885.3333333333, ans=0.2 2024-06-21 12:34:29,907 INFO [train.py:1028] (0/2) Epoch 21, batch 7050, loss[loss=0.2348, simple_loss=0.2888, pruned_loss=0.09036, over 12751.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2645, pruned_loss=0.07595, over 2582945.68 frames. ], batch size: 176, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:34:54,632 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2024-06-21 12:34:55,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383922.0, ans=0.1 2024-06-21 12:35:16,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=383958.6666666667, ans=0.09899494936611666 2024-06-21 12:35:24,424 INFO [train.py:1028] (0/2) Epoch 21, batch 7100, loss[loss=0.2193, simple_loss=0.2798, pruned_loss=0.07935, over 13191.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2652, pruned_loss=0.07649, over 2574970.51 frames. ], batch size: 112, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:35:39,984 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.185e+02 2.359e+02 2.605e+02 3.350e+02, threshold=4.718e+02, percent-clipped=0.0 2024-06-21 12:35:41,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. 
limit=10.0 2024-06-21 12:35:47,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=384013.6666666667, ans=0.125 2024-06-21 12:35:54,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=384032.0, ans=0.0 2024-06-21 12:36:12,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=384050.3333333333, ans=0.125 2024-06-21 12:36:15,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=384068.6666666667, ans=0.025 2024-06-21 12:36:15,999 INFO [train.py:1028] (0/2) Epoch 21, batch 7150, loss[loss=0.2424, simple_loss=0.2917, pruned_loss=0.09657, over 12517.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2657, pruned_loss=0.07639, over 2574643.01 frames. ], batch size: 202, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:36:53,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=15.0 2024-06-21 12:36:58,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=384142.0, ans=0.0 2024-06-21 12:36:59,841 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2024-06-21 12:37:00,105 INFO [train.py:1028] (0/2) Epoch 21, batch 7200, loss[loss=0.2123, simple_loss=0.2712, pruned_loss=0.07676, over 13158.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2671, pruned_loss=0.07693, over 2578728.22 frames. ], batch size: 112, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:37:06,335 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.67 vs. limit=15.0 2024-06-21 12:37:07,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=384160.3333333333, ans=0.0 2024-06-21 12:37:16,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=384178.6666666667, ans=0.125 2024-06-21 12:37:17,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.152e+02 2.281e+02 2.462e+02 3.178e+02, threshold=4.562e+02, percent-clipped=0.0 2024-06-21 12:37:22,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.97 vs. limit=10.0 2024-06-21 12:37:32,619 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.72 vs. limit=15.0 2024-06-21 12:37:40,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=384215.3333333333, ans=0.125 2024-06-21 12:37:41,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.75 vs. 
limit=15.0 2024-06-21 12:37:59,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=384233.6666666667, ans=0.07 2024-06-21 12:38:01,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=384252.0, ans=0.125 2024-06-21 12:38:02,410 INFO [train.py:1028] (0/2) Epoch 21, batch 7250, loss[loss=0.1897, simple_loss=0.2505, pruned_loss=0.06446, over 12965.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2677, pruned_loss=0.07704, over 2580240.01 frames. ], batch size: 36, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:38:02,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=384252.0, ans=0.125 2024-06-21 12:38:02,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.56 vs. limit=15.0 2024-06-21 12:38:03,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=384252.0, ans=0.125 2024-06-21 12:38:15,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=384252.0, ans=0.125 2024-06-21 12:38:26,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=384288.6666666667, ans=0.125 2024-06-21 12:38:28,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. limit=6.0 2024-06-21 12:38:46,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.87 vs. limit=15.0 2024-06-21 12:38:51,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.78 vs. limit=22.5 2024-06-21 12:38:53,116 INFO [train.py:1028] (0/2) Epoch 21, batch 7300, loss[loss=0.2126, simple_loss=0.2809, pruned_loss=0.07214, over 12901.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2693, pruned_loss=0.07776, over 2580376.66 frames. 
], batch size: 36, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:38:54,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=384343.6666666667, ans=0.2 2024-06-21 12:38:59,547 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:39:10,090 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.160e+02 2.331e+02 2.479e+02 3.218e+02, threshold=4.662e+02, percent-clipped=0.0 2024-06-21 12:39:14,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=384362.0, ans=0.0 2024-06-21 12:39:16,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=384380.3333333333, ans=0.0 2024-06-21 12:39:17,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=384380.3333333333, ans=0.125 2024-06-21 12:39:22,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=384380.3333333333, ans=0.0 2024-06-21 12:39:46,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=384417.0, ans=0.125 2024-06-21 12:39:47,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=384435.3333333333, ans=0.2 2024-06-21 12:39:47,974 INFO [train.py:1028] (0/2) Epoch 21, batch 7350, loss[loss=0.2336, simple_loss=0.2897, pruned_loss=0.08878, over 13317.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2701, pruned_loss=0.07829, over 2582230.52 frames. ], batch size: 46, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:39:51,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=15.0 2024-06-21 12:39:51,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=384435.3333333333, ans=0.5 2024-06-21 12:39:55,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=384435.3333333333, ans=0.125 2024-06-21 12:39:59,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=384453.6666666667, ans=0.0 2024-06-21 12:40:03,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=384453.6666666667, ans=0.0 2024-06-21 12:40:06,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=384453.6666666667, ans=0.2 2024-06-21 12:40:14,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=384472.0, ans=0.0 2024-06-21 12:40:14,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384472.0, ans=0.1 2024-06-21 12:40:15,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. 
limit=15.0 2024-06-21 12:40:23,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384490.3333333333, ans=0.1 2024-06-21 12:40:30,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=384508.6666666667, ans=0.125 2024-06-21 12:40:34,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=384508.6666666667, ans=0.125 2024-06-21 12:40:37,628 INFO [train.py:1028] (0/2) Epoch 21, batch 7400, loss[loss=0.2307, simple_loss=0.2882, pruned_loss=0.08662, over 13246.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2693, pruned_loss=0.07752, over 2587845.93 frames. ], batch size: 63, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:40:49,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=384545.3333333333, ans=0.1 2024-06-21 12:40:53,472 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.156e+02 2.322e+02 2.547e+02 3.464e+02, threshold=4.644e+02, percent-clipped=0.0 2024-06-21 12:41:13,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=384563.6666666667, ans=0.125 2024-06-21 12:41:13,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=384563.6666666667, ans=0.125 2024-06-21 12:41:14,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=384563.6666666667, ans=0.0 2024-06-21 12:41:17,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=384582.0, ans=0.0 2024-06-21 12:41:35,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=384600.3333333333, ans=0.07 2024-06-21 12:41:41,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.35 vs. limit=15.0 2024-06-21 12:41:45,913 INFO [train.py:1028] (0/2) Epoch 21, batch 7450, loss[loss=0.2165, simple_loss=0.2711, pruned_loss=0.08091, over 12570.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2689, pruned_loss=0.07728, over 2581772.54 frames. ], batch size: 29, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:41:58,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=384637.0, ans=0.0 2024-06-21 12:42:09,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=384655.3333333333, ans=0.0 2024-06-21 12:42:15,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384673.6666666667, ans=0.0 2024-06-21 12:42:32,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=15.0 2024-06-21 12:42:36,901 INFO [train.py:1028] (0/2) Epoch 21, batch 7500, loss[loss=0.2197, simple_loss=0.2647, pruned_loss=0.08737, over 10754.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2696, pruned_loss=0.07762, over 2579549.36 frames. 
], batch size: 305, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:42:47,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=384728.6666666667, ans=0.2 2024-06-21 12:42:52,470 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.176e+02 2.354e+02 2.539e+02 3.656e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 12:43:25,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=384783.6666666667, ans=0.0 2024-06-21 12:43:28,112 INFO [train.py:1028] (0/2) Epoch 21, batch 7550, loss[loss=0.2197, simple_loss=0.2685, pruned_loss=0.08548, over 12930.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2701, pruned_loss=0.07813, over 2578494.24 frames. ], batch size: 158, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:43:29,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=384802.0, ans=0.125 2024-06-21 12:43:30,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=384802.0, ans=0.2 2024-06-21 12:43:33,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=384802.0, ans=0.0 2024-06-21 12:43:35,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=384802.0, ans=0.125 2024-06-21 12:43:38,791 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:43:40,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=384820.3333333333, ans=0.125 2024-06-21 12:43:44,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=384820.3333333333, ans=0.125 2024-06-21 12:44:15,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=384875.3333333333, ans=0.125 2024-06-21 12:44:31,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=384875.3333333333, ans=0.04949747468305833 2024-06-21 12:44:33,332 INFO [train.py:1028] (0/2) Epoch 21, batch 7600, loss[loss=0.2081, simple_loss=0.2654, pruned_loss=0.07545, over 13237.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2715, pruned_loss=0.07876, over 2577256.60 frames. ], batch size: 83, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:44:50,034 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.172e+02 2.401e+02 2.621e+02 3.887e+02, threshold=4.802e+02, percent-clipped=0.0 2024-06-21 12:45:01,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384930.3333333333, ans=0.1 2024-06-21 12:45:05,934 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.625e-03 2024-06-21 12:45:28,416 INFO [train.py:1028] (0/2) Epoch 21, batch 7650, loss[loss=0.2133, simple_loss=0.2691, pruned_loss=0.07881, over 12970.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2715, pruned_loss=0.07877, over 2573674.80 frames. 
], batch size: 33, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:45:29,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.77 vs. limit=22.5 2024-06-21 12:45:34,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=384985.3333333333, ans=0.125 2024-06-21 12:45:34,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384985.3333333333, ans=0.1 2024-06-21 12:45:36,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384985.3333333333, ans=0.0 2024-06-21 12:45:43,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=385003.6666666667, ans=0.025 2024-06-21 12:46:02,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=385040.3333333333, ans=0.1 2024-06-21 12:46:03,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2024-06-21 12:46:04,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=385040.3333333333, ans=0.2 2024-06-21 12:46:07,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=385040.3333333333, ans=0.125 2024-06-21 12:46:08,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=385040.3333333333, ans=0.1 2024-06-21 12:46:20,879 INFO [train.py:1028] (0/2) Epoch 21, batch 7700, loss[loss=0.216, simple_loss=0.2849, pruned_loss=0.07355, over 13243.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2717, pruned_loss=0.07894, over 2569886.36 frames. ], batch size: 63, lr: 2.75e-03, grad_scale: 64.0 2024-06-21 12:46:36,752 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.250e+02 2.473e+02 2.802e+02 4.511e+02, threshold=4.946e+02, percent-clipped=0.0 2024-06-21 12:46:42,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.79 vs. limit=6.0 2024-06-21 12:46:53,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=385132.0, ans=0.0 2024-06-21 12:47:15,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=385168.6666666667, ans=0.2 2024-06-21 12:47:16,358 INFO [train.py:1028] (0/2) Epoch 21, batch 7750, loss[loss=0.2346, simple_loss=0.2962, pruned_loss=0.08653, over 13234.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2726, pruned_loss=0.07971, over 2573774.34 frames. 
], batch size: 72, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:47:24,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=385168.6666666667, ans=0.125 2024-06-21 12:47:25,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=385168.6666666667, ans=0.025 2024-06-21 12:47:40,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=385187.0, ans=0.0 2024-06-21 12:47:48,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=385205.3333333333, ans=0.0 2024-06-21 12:48:06,054 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2024-06-21 12:48:17,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=385260.3333333333, ans=0.5 2024-06-21 12:48:17,896 INFO [train.py:1028] (0/2) Epoch 21, batch 7800, loss[loss=0.2022, simple_loss=0.2667, pruned_loss=0.0689, over 13185.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2731, pruned_loss=0.07967, over 2578372.27 frames. ], batch size: 95, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:48:21,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=385260.3333333333, ans=0.125 2024-06-21 12:48:33,734 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.219e+02 2.381e+02 2.638e+02 3.851e+02, threshold=4.762e+02, percent-clipped=0.0 2024-06-21 12:48:47,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=385297.0, ans=0.0 2024-06-21 12:49:08,341 INFO [train.py:1028] (0/2) Epoch 21, batch 7850, loss[loss=0.18, simple_loss=0.2451, pruned_loss=0.05744, over 11547.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2743, pruned_loss=0.0801, over 2573597.93 frames. ], batch size: 17, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:49:16,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=15.0 2024-06-21 12:49:18,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=385370.3333333333, ans=0.125 2024-06-21 12:49:35,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=385388.6666666667, ans=0.0 2024-06-21 12:49:39,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385407.0, ans=0.1 2024-06-21 12:49:47,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=385425.3333333333, ans=0.2 2024-06-21 12:49:51,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=385425.3333333333, ans=0.0 2024-06-21 12:49:58,028 INFO [train.py:1028] (0/2) Epoch 21, batch 7900, loss[loss=0.1992, simple_loss=0.2672, pruned_loss=0.06561, over 13145.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.274, pruned_loss=0.07992, over 2572603.05 frames. 
], batch size: 77, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:50:08,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=385443.6666666667, ans=0.1 2024-06-21 12:50:09,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=385443.6666666667, ans=0.1 2024-06-21 12:50:13,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=385462.0, ans=0.125 2024-06-21 12:50:17,926 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.264e+02 2.434e+02 2.713e+02 4.080e+02, threshold=4.868e+02, percent-clipped=0.0 2024-06-21 12:50:18,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=385462.0, ans=0.0 2024-06-21 12:50:20,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=385462.0, ans=0.0 2024-06-21 12:50:38,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2024-06-21 12:50:47,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=385498.6666666667, ans=0.0 2024-06-21 12:50:54,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=385517.0, ans=0.125 2024-06-21 12:50:55,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=385517.0, ans=0.95 2024-06-21 12:51:00,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=385535.3333333333, ans=0.125 2024-06-21 12:51:01,200 INFO [train.py:1028] (0/2) Epoch 21, batch 7950, loss[loss=0.2398, simple_loss=0.2815, pruned_loss=0.09904, over 10533.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.274, pruned_loss=0.07978, over 2574709.34 frames. ], batch size: 304, lr: 2.74e-03, grad_scale: 64.0 2024-06-21 12:51:24,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=385572.0, ans=0.0 2024-06-21 12:51:30,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.70 vs. limit=22.5 2024-06-21 12:51:42,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=385608.6666666667, ans=0.07 2024-06-21 12:51:49,031 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 12:51:51,239 INFO [train.py:1028] (0/2) Epoch 21, batch 8000, loss[loss=0.1961, simple_loss=0.2581, pruned_loss=0.06704, over 12686.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2746, pruned_loss=0.07973, over 2571940.66 frames. 
], batch size: 29, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:51:52,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=385627.0, ans=0.125
2024-06-21 12:52:02,746 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.214e+02 2.447e+02 2.667e+02 3.806e+02, threshold=4.894e+02, percent-clipped=0.0
2024-06-21 12:52:03,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=385645.3333333333, ans=0.0
2024-06-21 12:52:05,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=385645.3333333333, ans=0.2
2024-06-21 12:52:16,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0
2024-06-21 12:52:29,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=385700.3333333333, ans=0.0
2024-06-21 12:52:34,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=385700.3333333333, ans=0.0
2024-06-21 12:52:41,640 INFO [train.py:1028] (0/2) Epoch 21, batch 8050, loss[loss=0.2117, simple_loss=0.268, pruned_loss=0.07774, over 13197.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2742, pruned_loss=0.07991, over 2572094.53 frames. ], batch size: 83, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:52:49,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=385718.6666666667, ans=0.125
2024-06-21 12:52:51,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0
2024-06-21 12:52:56,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=385737.0, ans=0.0
2024-06-21 12:52:58,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=385737.0, ans=0.2
2024-06-21 12:53:07,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=385755.3333333333, ans=0.125
2024-06-21 12:53:27,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=385773.6666666667, ans=0.2
2024-06-21 12:53:44,322 INFO [train.py:1028] (0/2) Epoch 21, batch 8100, loss[loss=0.2118, simple_loss=0.274, pruned_loss=0.07476, over 13128.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2747, pruned_loss=0.08005, over 2576144.58 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:53:45,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=385810.3333333333, ans=0.125
2024-06-21 12:53:59,850 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.198e+02 2.340e+02 2.543e+02 3.469e+02, threshold=4.680e+02, percent-clipped=0.0
2024-06-21 12:54:24,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=385865.3333333333, ans=0.125
2024-06-21 12:54:27,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=385883.6666666667, ans=0.07
2024-06-21 12:54:39,433 INFO [train.py:1028] (0/2) Epoch 21, batch 8150, loss[loss=0.2249, simple_loss=0.2789, pruned_loss=0.08546, over 13076.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2754, pruned_loss=0.08001, over 2579389.58 frames. ], batch size: 121, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:54:49,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=385920.3333333333, ans=0.125
2024-06-21 12:55:03,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=385938.6666666667, ans=0.125
2024-06-21 12:55:09,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=385957.0, ans=0.125
2024-06-21 12:55:20,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=385975.3333333333, ans=0.125
2024-06-21 12:55:23,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=385975.3333333333, ans=0.0
2024-06-21 12:55:28,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=385975.3333333333, ans=0.125
2024-06-21 12:55:30,660 INFO [train.py:1028] (0/2) Epoch 21, batch 8200, loss[loss=0.2178, simple_loss=0.2768, pruned_loss=0.07938, over 13117.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2754, pruned_loss=0.07978, over 2582667.88 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:55:35,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0
2024-06-21 12:55:38,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=385993.6666666667, ans=0.0
2024-06-21 12:55:45,997 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.203e+02 2.319e+02 2.485e+02 3.518e+02, threshold=4.638e+02, percent-clipped=0.0
2024-06-21 12:55:49,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=386012.0, ans=0.0
2024-06-21 12:55:51,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=386030.3333333333, ans=0.125
2024-06-21 12:56:26,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=386067.0, ans=0.2
2024-06-21 12:56:28,069 INFO [train.py:1028] (0/2) Epoch 21, batch 8250, loss[loss=0.2097, simple_loss=0.2765, pruned_loss=0.07139, over 13243.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2763, pruned_loss=0.08037, over 2582708.43 frames. ], batch size: 52, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:56:45,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=386103.6666666667, ans=0.025
2024-06-21 12:57:04,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=386122.0, ans=0.2
2024-06-21 12:57:12,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=386140.3333333333, ans=0.04949747468305833
2024-06-21 12:57:14,427 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.15 vs. limit=22.5
2024-06-21 12:57:15,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.21 vs. limit=15.0
2024-06-21 12:57:19,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=386158.6666666667, ans=0.125
2024-06-21 12:57:19,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=386158.6666666667, ans=0.125
2024-06-21 12:57:24,052 INFO [train.py:1028] (0/2) Epoch 21, batch 8300, loss[loss=0.2286, simple_loss=0.2752, pruned_loss=0.09104, over 13073.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2751, pruned_loss=0.07989, over 2579996.08 frames. ], batch size: 102, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:57:35,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=386195.3333333333, ans=0.0
2024-06-21 12:57:38,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.47 vs. limit=15.0
2024-06-21 12:57:39,067 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.195e+02 2.336e+02 2.498e+02 2.937e+02, threshold=4.671e+02, percent-clipped=0.0
2024-06-21 12:57:39,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=386195.3333333333, ans=0.0
2024-06-21 12:57:48,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=386213.6666666667, ans=0.0
2024-06-21 12:58:14,980 INFO [train.py:1028] (0/2) Epoch 21, batch 8350, loss[loss=0.2143, simple_loss=0.2713, pruned_loss=0.07869, over 13227.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2752, pruned_loss=0.0797, over 2580288.78 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 64.0
2024-06-21 12:58:18,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=386268.6666666667, ans=0.2
2024-06-21 12:58:21,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=386287.0, ans=0.0
2024-06-21 12:58:23,544 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0
2024-06-21 12:58:26,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386287.0, ans=0.1
2024-06-21 12:58:26,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=386287.0, ans=0.025
2024-06-21 12:58:40,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=386323.6666666667, ans=0.125
2024-06-21 12:58:49,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=386323.6666666667, ans=0.125
2024-06-21 12:59:03,878 INFO [train.py:1028] (0/2) Epoch 21, batch 8400, loss[loss=0.2023, simple_loss=0.2625, pruned_loss=0.07109, over 12953.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2754, pruned_loss=0.07987, over 2576631.29 frames. ], batch size: 39, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 12:59:04,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=386360.3333333333, ans=0.0
2024-06-21 12:59:16,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=386378.6666666667, ans=0.125
2024-06-21 12:59:19,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=386378.6666666667, ans=0.0
2024-06-21 12:59:20,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.08 vs. limit=22.5
2024-06-21 12:59:20,298 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.196e+02 2.335e+02 2.513e+02 3.037e+02, threshold=4.669e+02, percent-clipped=0.0
2024-06-21 12:59:24,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=386397.0, ans=0.125
2024-06-21 12:59:24,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=386397.0, ans=0.025
2024-06-21 12:59:42,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=386415.3333333333, ans=0.125
2024-06-21 12:59:59,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0
2024-06-21 13:00:06,754 INFO [train.py:1028] (0/2) Epoch 21, batch 8450, loss[loss=0.2302, simple_loss=0.2879, pruned_loss=0.08627, over 13133.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2761, pruned_loss=0.08008, over 2579271.78 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:00:20,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=386470.3333333333, ans=0.1
2024-06-21 13:00:56,204 INFO [train.py:1028] (0/2) Epoch 21, batch 8500, loss[loss=0.2337, simple_loss=0.2898, pruned_loss=0.08876, over 12577.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2777, pruned_loss=0.08091, over 2577933.85 frames. ], batch size: 29, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:01:06,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0
2024-06-21 13:01:14,226 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.200e+02 2.357e+02 2.557e+02 3.307e+02, threshold=4.713e+02, percent-clipped=0.0
2024-06-21 13:01:25,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386580.3333333333, ans=0.1
2024-06-21 13:01:32,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.36 vs. limit=22.5
2024-06-21 13:01:37,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=386598.6666666667, ans=0.2
2024-06-21 13:01:52,036 INFO [train.py:1028] (0/2) Epoch 21, batch 8550, loss[loss=0.2378, simple_loss=0.2933, pruned_loss=0.0911, over 12394.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2769, pruned_loss=0.08054, over 2574417.32 frames. ], batch size: 22, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:02:20,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=386672.0, ans=0.0
2024-06-21 13:02:39,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=386708.6666666667, ans=0.125
2024-06-21 13:02:50,055 INFO [train.py:1028] (0/2) Epoch 21, batch 8600, loss[loss=0.208, simple_loss=0.2652, pruned_loss=0.07545, over 13149.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2777, pruned_loss=0.08056, over 2571926.86 frames. ], batch size: 112, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:02:52,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=386727.0, ans=0.125
2024-06-21 13:02:58,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=386727.0, ans=0.125
2024-06-21 13:03:06,632 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.186e+02 2.374e+02 2.608e+02 3.640e+02, threshold=4.749e+02, percent-clipped=0.0
2024-06-21 13:03:07,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=386745.3333333333, ans=0.125
2024-06-21 13:03:08,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=386745.3333333333, ans=0.125
2024-06-21 13:03:10,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=386763.6666666667, ans=0.1
2024-06-21 13:03:20,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=386763.6666666667, ans=0.025
2024-06-21 13:03:34,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0
2024-06-21 13:03:50,442 INFO [train.py:1028] (0/2) Epoch 21, batch 8650, loss[loss=0.2295, simple_loss=0.2883, pruned_loss=0.08537, over 13024.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.278, pruned_loss=0.08057, over 2574974.56 frames. ], batch size: 102, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:03:53,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=386818.6666666667, ans=0.125
2024-06-21 13:03:58,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0
2024-06-21 13:04:13,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=386855.3333333333, ans=0.125
2024-06-21 13:04:22,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=386873.6666666667, ans=0.125
2024-06-21 13:04:25,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=386892.0, ans=0.125
2024-06-21 13:04:32,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.47 vs. limit=22.5
2024-06-21 13:04:33,971 INFO [train.py:1028] (0/2) Epoch 21, batch 8700, loss[loss=0.2236, simple_loss=0.2851, pruned_loss=0.08104, over 13172.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2785, pruned_loss=0.08113, over 2572855.28 frames. ], batch size: 59, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:04:35,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=386910.3333333333, ans=0.1
2024-06-21 13:04:44,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0
2024-06-21 13:04:49,310 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.225e+02 2.459e+02 2.761e+02 4.390e+02, threshold=4.917e+02, percent-clipped=0.0
2024-06-21 13:04:56,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=386947.0, ans=0.2
2024-06-21 13:05:00,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=386947.0, ans=0.2
2024-06-21 13:05:28,170 INFO [train.py:1028] (0/2) Epoch 21, batch 8750, loss[loss=0.2232, simple_loss=0.2717, pruned_loss=0.08729, over 13118.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2782, pruned_loss=0.08109, over 2567727.89 frames. ], batch size: 121, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:05:29,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=387002.0, ans=0.125
2024-06-21 13:05:32,558 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 13:05:34,752 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.11 vs. limit=22.5
2024-06-21 13:06:02,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=387038.6666666667, ans=0.125
2024-06-21 13:06:02,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=387038.6666666667, ans=0.0
2024-06-21 13:06:05,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=387038.6666666667, ans=0.02
2024-06-21 13:06:21,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387075.3333333333, ans=0.1
2024-06-21 13:06:32,588 INFO [train.py:1028] (0/2) Epoch 21, batch 8800, loss[loss=0.2155, simple_loss=0.2775, pruned_loss=0.07679, over 13262.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2783, pruned_loss=0.08116, over 2572728.73 frames. ], batch size: 72, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:06:47,527 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.180e+02 2.301e+02 2.489e+02 3.222e+02, threshold=4.602e+02, percent-clipped=0.0
2024-06-21 13:06:56,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=387130.3333333333, ans=0.1
2024-06-21 13:07:04,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387148.6666666667, ans=0.1
2024-06-21 13:07:15,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=387167.0, ans=0.04949747468305833
2024-06-21 13:07:18,065 INFO [train.py:1028] (0/2) Epoch 21, batch 8850, loss[loss=0.238, simple_loss=0.2938, pruned_loss=0.09107, over 12627.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.278, pruned_loss=0.08124, over 2563399.83 frames. ], batch size: 202, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:07:18,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0
2024-06-21 13:07:21,126 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.66 vs. limit=22.5
2024-06-21 13:07:23,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0
2024-06-21 13:07:25,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=387185.3333333333, ans=0.0
2024-06-21 13:07:25,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=387185.3333333333, ans=0.125
2024-06-21 13:07:29,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=387203.6666666667, ans=0.125
2024-06-21 13:07:36,216 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.87 vs. limit=10.0
2024-06-21 13:07:43,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0
2024-06-21 13:07:47,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=387222.0, ans=0.125
2024-06-21 13:07:50,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387240.3333333333, ans=0.1
2024-06-21 13:08:02,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=387258.6666666667, ans=0.125
2024-06-21 13:08:12,653 INFO [train.py:1028] (0/2) Epoch 21, batch 8900, loss[loss=0.2156, simple_loss=0.2713, pruned_loss=0.07994, over 12849.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2781, pruned_loss=0.08126, over 2561455.90 frames. ], batch size: 33, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:08:30,049 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.224e+02 2.323e+02 2.535e+02 3.265e+02, threshold=4.646e+02, percent-clipped=0.0
2024-06-21 13:08:32,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=387295.3333333333, ans=0.125
2024-06-21 13:08:48,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=387313.6666666667, ans=0.2
2024-06-21 13:08:54,090 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0
2024-06-21 13:09:06,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.79 vs. limit=15.0
2024-06-21 13:09:11,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=387368.6666666667, ans=0.125
2024-06-21 13:09:11,564 INFO [train.py:1028] (0/2) Epoch 21, batch 8950, loss[loss=0.2349, simple_loss=0.2917, pruned_loss=0.08905, over 12492.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2786, pruned_loss=0.08094, over 2561062.83 frames. ], batch size: 202, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:09:29,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=387387.0, ans=0.125
2024-06-21 13:09:43,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=387405.3333333333, ans=0.0
2024-06-21 13:09:45,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=387405.3333333333, ans=0.025
2024-06-21 13:09:48,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=387405.3333333333, ans=0.125
2024-06-21 13:09:56,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=387423.6666666667, ans=0.0
2024-06-21 13:10:04,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0
2024-06-21 13:10:07,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0
2024-06-21 13:10:12,641 INFO [train.py:1028] (0/2) Epoch 21, batch 9000, loss[loss=0.2175, simple_loss=0.2755, pruned_loss=0.0797, over 13317.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2791, pruned_loss=0.08113, over 2566439.44 frames. ], batch size: 46, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:10:12,644 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 13:10:24,709 INFO [train.py:1060] (0/2) Epoch 21, validation: loss=0.1874, simple_loss=0.2512, pruned_loss=0.06183, over 351949.00 frames.
2024-06-21 13:10:24,709 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-21 13:10:29,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=387460.3333333333, ans=0.0
2024-06-21 13:10:37,285 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.186e+02 2.357e+02 2.472e+02 3.835e+02, threshold=4.713e+02, percent-clipped=0.0
2024-06-21 13:10:42,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=387497.0, ans=0.0
2024-06-21 13:10:51,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=387515.3333333333, ans=0.125
2024-06-21 13:10:57,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=387533.6666666667, ans=0.125
2024-06-21 13:11:06,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=387533.6666666667, ans=0.125
2024-06-21 13:11:08,518 INFO [train.py:1028] (0/2) Epoch 21, batch 9050, loss[loss=0.1776, simple_loss=0.2368, pruned_loss=0.05919, over 10785.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2799, pruned_loss=0.08157, over 2565398.33 frames. ], batch size: 16, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:11:37,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387588.6666666667, ans=0.1
2024-06-21 13:11:51,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=387625.3333333333, ans=0.2
2024-06-21 13:11:54,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=387625.3333333333, ans=0.2
2024-06-21 13:11:59,887 INFO [train.py:1028] (0/2) Epoch 21, batch 9100, loss[loss=0.2105, simple_loss=0.2749, pruned_loss=0.07307, over 13093.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2795, pruned_loss=0.08114, over 2567101.19 frames. ], batch size: 71, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:12:16,339 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.210e+02 2.354e+02 2.538e+02 3.319e+02, threshold=4.707e+02, percent-clipped=0.0
2024-06-21 13:12:22,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=387680.3333333333, ans=0.125
2024-06-21 13:12:31,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387698.6666666667, ans=0.1
2024-06-21 13:12:33,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=387698.6666666667, ans=0.0
2024-06-21 13:12:42,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=387717.0, ans=0.0
2024-06-21 13:12:47,344 INFO [train.py:1028] (0/2) Epoch 21, batch 9150, loss[loss=0.1983, simple_loss=0.2586, pruned_loss=0.069, over 13128.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2799, pruned_loss=0.08163, over 2567899.86 frames. ], batch size: 77, lr: 2.74e-03, grad_scale: 16.0
2024-06-21 13:12:55,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=387735.3333333333, ans=0.2
2024-06-21 13:13:02,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=387753.6666666667, ans=0.0
2024-06-21 13:13:21,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387790.3333333333, ans=0.1
2024-06-21 13:13:26,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387790.3333333333, ans=0.1
2024-06-21 13:13:40,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=387808.6666666667, ans=0.05
2024-06-21 13:13:45,469 INFO [train.py:1028] (0/2) Epoch 21, batch 9200, loss[loss=0.2119, simple_loss=0.2699, pruned_loss=0.07691, over 12965.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2797, pruned_loss=0.08099, over 2571244.58 frames. ], batch size: 36, lr: 2.74e-03, grad_scale: 32.0
2024-06-21 13:13:48,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=387827.0, ans=0.025
2024-06-21 13:13:55,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=387845.3333333333, ans=10.0
2024-06-21 13:14:02,832 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.182e+02 2.300e+02 2.486e+02 3.614e+02, threshold=4.600e+02, percent-clipped=0.0
2024-06-21 13:14:08,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=387863.6666666667, ans=0.125
2024-06-21 13:14:17,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=15.0
2024-06-21 13:14:30,572 INFO [train.py:1028] (0/2) Epoch 21, batch 9250, loss[loss=0.213, simple_loss=0.2816, pruned_loss=0.07222, over 13247.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2801, pruned_loss=0.08092, over 2572365.04 frames. ], batch size: 67, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:14:30,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=387918.6666666667, ans=0.125
2024-06-21 13:14:48,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=387955.3333333333, ans=0.125
2024-06-21 13:14:49,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=387955.3333333333, ans=0.0
2024-06-21 13:14:56,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=387973.6666666667, ans=0.125
2024-06-21 13:15:01,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=387973.6666666667, ans=0.2
2024-06-21 13:15:14,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=387992.0, ans=0.0
2024-06-21 13:15:15,914 INFO [train.py:1028] (0/2) Epoch 21, batch 9300, loss[loss=0.2184, simple_loss=0.2752, pruned_loss=0.08077, over 12946.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2796, pruned_loss=0.08045, over 2569491.37 frames. ], batch size: 39, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:15:25,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388028.6666666667, ans=0.1
2024-06-21 13:15:30,979 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.211e+02 2.421e+02 2.633e+02 3.486e+02, threshold=4.842e+02, percent-clipped=0.0
2024-06-21 13:15:39,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388047.0, ans=0.1
2024-06-21 13:15:41,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=388047.0, ans=0.0
2024-06-21 13:15:44,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=388065.3333333333, ans=0.0
2024-06-21 13:15:44,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=388065.3333333333, ans=0.125
2024-06-21 13:15:47,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=388065.3333333333, ans=0.125
2024-06-21 13:15:49,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.64 vs. limit=15.0
2024-06-21 13:15:56,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=15.0
2024-06-21 13:16:03,146 INFO [train.py:1028] (0/2) Epoch 21, batch 9350, loss[loss=0.2316, simple_loss=0.2901, pruned_loss=0.08655, over 12719.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2797, pruned_loss=0.08062, over 2567635.78 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:16:03,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=388102.0, ans=0.125
2024-06-21 13:16:08,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=388102.0, ans=0.015
2024-06-21 13:16:22,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=388138.6666666667, ans=0.0
2024-06-21 13:16:27,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=388138.6666666667, ans=0.0
2024-06-21 13:16:28,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.66 vs. limit=15.0
2024-06-21 13:16:36,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=388157.0, ans=0.0
2024-06-21 13:16:48,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=388193.6666666667, ans=0.0
2024-06-21 13:16:49,227 INFO [train.py:1028] (0/2) Epoch 21, batch 9400, loss[loss=0.225, simple_loss=0.2868, pruned_loss=0.08156, over 13270.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2798, pruned_loss=0.08089, over 2567314.65 frames. ], batch size: 52, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:17:04,822 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.205e+02 2.325e+02 2.536e+02 3.081e+02, threshold=4.649e+02, percent-clipped=0.0
2024-06-21 13:17:11,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=388230.3333333333, ans=0.125
2024-06-21 13:17:12,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.10 vs. limit=15.0
2024-06-21 13:17:13,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=388230.3333333333, ans=0.2
2024-06-21 13:17:18,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=388248.6666666667, ans=0.125
2024-06-21 13:17:21,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=388248.6666666667, ans=0.125
2024-06-21 13:17:27,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388267.0, ans=0.1
2024-06-21 13:17:37,570 INFO [train.py:1028] (0/2) Epoch 21, batch 9450, loss[loss=0.2078, simple_loss=0.2731, pruned_loss=0.07119, over 12459.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2808, pruned_loss=0.08151, over 2567101.03 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:18:01,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=388322.0, ans=0.125
2024-06-21 13:18:19,883 INFO [train.py:1028] (0/2) Epoch 21, batch 9500, loss[loss=0.2223, simple_loss=0.2733, pruned_loss=0.08566, over 13248.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2807, pruned_loss=0.08129, over 2576990.34 frames. ], batch size: 43, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:18:21,709 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=15.0
2024-06-21 13:18:32,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=388395.3333333333, ans=0.125
2024-06-21 13:18:35,497 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.195e+02 2.363e+02 2.468e+02 3.468e+02, threshold=4.726e+02, percent-clipped=0.0
2024-06-21 13:18:44,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=388413.6666666667, ans=0.025
2024-06-21 13:19:07,868 INFO [train.py:1028] (0/2) Epoch 21, batch 9550, loss[loss=0.2071, simple_loss=0.2674, pruned_loss=0.07343, over 12938.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2805, pruned_loss=0.08126, over 2570920.49 frames. ], batch size: 39, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:19:22,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=388487.0, ans=0.04949747468305833
2024-06-21 13:19:23,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=388487.0, ans=0.125
2024-06-21 13:19:30,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=388505.3333333333, ans=0.025
2024-06-21 13:19:37,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=388523.6666666667, ans=0.1
2024-06-21 13:19:43,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=388523.6666666667, ans=0.0
2024-06-21 13:19:51,075 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0
2024-06-21 13:19:56,006 INFO [train.py:1028] (0/2) Epoch 21, batch 9600, loss[loss=0.2328, simple_loss=0.2745, pruned_loss=0.09552, over 10616.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2807, pruned_loss=0.0814, over 2568969.27 frames. ], batch size: 303, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:19:56,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=388560.3333333333, ans=0.125
2024-06-21 13:19:56,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.27 vs. limit=22.5
2024-06-21 13:20:07,270 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.221e+02 2.371e+02 2.590e+02 3.110e+02, threshold=4.743e+02, percent-clipped=0.0
2024-06-21 13:20:30,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=388633.6666666667, ans=0.1
2024-06-21 13:20:35,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=388633.6666666667, ans=0.025
2024-06-21 13:20:38,453 INFO [train.py:1028] (0/2) Epoch 21, batch 9650, loss[loss=0.2094, simple_loss=0.2652, pruned_loss=0.07673, over 13095.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2804, pruned_loss=0.08168, over 2559786.64 frames. ], batch size: 132, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:20:41,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=388652.0, ans=0.0
2024-06-21 13:20:44,875 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-212000.pt
2024-06-21 13:20:51,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=388652.0, ans=0.125
2024-06-21 13:20:59,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=388670.3333333333, ans=0.07
2024-06-21 13:21:23,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=388725.3333333333, ans=0.0
2024-06-21 13:21:25,983 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 13:21:33,796 INFO [train.py:1028] (0/2) Epoch 21, batch 9700, loss[loss=0.2207, simple_loss=0.2712, pruned_loss=0.08507, over 13006.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2794, pruned_loss=0.08138, over 2555710.50 frames. ], batch size: 144, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:21:47,703 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.215e+02 2.393e+02 2.605e+02 3.985e+02, threshold=4.787e+02, percent-clipped=0.0
2024-06-21 13:21:56,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.82 vs. limit=15.0
2024-06-21 13:22:16,322 INFO [train.py:1028] (0/2) Epoch 21, batch 9750, loss[loss=0.2142, simple_loss=0.2611, pruned_loss=0.08369, over 13055.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2779, pruned_loss=0.08068, over 2551716.05 frames. ], batch size: 132, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:22:22,709 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 13:22:27,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=388853.6666666667, ans=0.0
2024-06-21 13:22:32,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=388853.6666666667, ans=0.125
2024-06-21 13:22:38,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=388872.0, ans=0.0
2024-06-21 13:22:40,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=388872.0, ans=0.0
2024-06-21 13:22:51,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=388890.3333333333, ans=0.0
2024-06-21 13:22:56,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=388908.6666666667, ans=0.125
2024-06-21 13:22:58,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=388908.6666666667, ans=0.05
2024-06-21 13:23:00,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=388927.0, ans=0.0
2024-06-21 13:23:01,567 INFO [train.py:1028] (0/2) Epoch 21, batch 9800, loss[loss=0.1765, simple_loss=0.2407, pruned_loss=0.05616, over 12928.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2778, pruned_loss=0.08042, over 2544602.87 frames. ], batch size: 39, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:23:02,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=388927.0, ans=0.0
2024-06-21 13:23:16,767 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 2.228e+02 2.421e+02 2.637e+02 3.419e+02, threshold=4.842e+02, percent-clipped=0.0
2024-06-21 13:23:25,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.45 vs. limit=10.0
2024-06-21 13:23:29,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=388982.0, ans=0.125
2024-06-21 13:23:35,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=388982.0, ans=0.125
2024-06-21 13:23:45,858 INFO [train.py:1028] (0/2) Epoch 21, batch 9850, loss[loss=0.1979, simple_loss=0.2529, pruned_loss=0.07146, over 13202.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2772, pruned_loss=0.08025, over 2538013.73 frames. ], batch size: 103, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:23:58,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=389037.0, ans=0.0
2024-06-21 13:24:29,803 INFO [train.py:1028] (0/2) Epoch 21, batch 9900, loss[loss=0.1929, simple_loss=0.2512, pruned_loss=0.06731, over 12976.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2764, pruned_loss=0.08016, over 2531839.55 frames. ], batch size: 39, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:24:34,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=389110.3333333333, ans=0.125
2024-06-21 13:24:39,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=389128.6666666667, ans=0.125
2024-06-21 13:24:44,373 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.281e+02 2.442e+02 2.681e+02 3.633e+02, threshold=4.884e+02, percent-clipped=0.0
2024-06-21 13:24:57,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389165.3333333333, ans=0.1
2024-06-21 13:25:03,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=389183.6666666667, ans=0.125
2024-06-21 13:25:13,859 INFO [train.py:1028] (0/2) Epoch 21, batch 9950, loss[loss=0.2382, simple_loss=0.3001, pruned_loss=0.08809, over 12695.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2758, pruned_loss=0.08047, over 2526898.36 frames. ], batch size: 29, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:25:15,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=389202.0, ans=0.025
2024-06-21 13:25:47,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.92 vs. limit=10.0
2024-06-21 13:25:53,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0
2024-06-21 13:25:59,630 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.10 vs. limit=6.0
2024-06-21 13:26:00,049 INFO [train.py:1028] (0/2) Epoch 21, batch 10000, loss[loss=0.2297, simple_loss=0.2875, pruned_loss=0.0859, over 12601.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.276, pruned_loss=0.08076, over 2488858.13 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:26:00,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=389293.6666666667, ans=0.07
2024-06-21 13:26:06,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=389293.6666666667, ans=0.125
2024-06-21 13:26:09,202 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 13:26:12,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=389312.0, ans=0.0
2024-06-21 13:26:15,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=389312.0, ans=0.0
2024-06-21 13:26:17,384 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.201e+02 2.333e+02 2.565e+02 3.575e+02, threshold=4.666e+02, percent-clipped=0.0
2024-06-21 13:26:27,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=389330.3333333333, ans=0.125
2024-06-21 13:26:35,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=389348.6666666667, ans=15.0
2024-06-21 13:26:40,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=389367.0, ans=0.0
2024-06-21 13:26:41,648 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.49 vs. limit=12.0
2024-06-21 13:26:42,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389367.0, ans=0.1
2024-06-21 13:26:47,657 INFO [train.py:1028] (0/2) Epoch 21, batch 10050, loss[loss=0.2147, simple_loss=0.2742, pruned_loss=0.07762, over 12454.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2758, pruned_loss=0.08134, over 2446394.17 frames. ], batch size: 22, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:26:50,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=389385.3333333333, ans=0.125
2024-06-21 13:26:56,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=389403.6666666667, ans=0.2
2024-06-21 13:27:33,823 INFO [train.py:1028] (0/2) Epoch 21, batch 10100, loss[loss=0.1912, simple_loss=0.2553, pruned_loss=0.06361, over 10949.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2748, pruned_loss=0.08049, over 2430617.59 frames. ], batch size: 16, lr: 2.73e-03, grad_scale: 32.0
2024-06-21 13:27:49,418 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-21.pt
2024-06-21 13:30:39,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.148e+02 2.288e+02 2.479e+02 3.366e+02, threshold=4.576e+02, percent-clipped=0.0
2024-06-21 13:30:39,860 INFO [train.py:1028] (0/2) Epoch 22, batch 0, loss[loss=0.192, simple_loss=0.2478, pruned_loss=0.06815, over 12882.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2478, pruned_loss=0.06815, over 12882.00 frames. ], batch size: 36, lr: 2.67e-03, grad_scale: 32.0
2024-06-21 13:30:39,863 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 13:30:56,623 INFO [train.py:1060] (0/2) Epoch 22, validation: loss=0.1886, simple_loss=0.2528, pruned_loss=0.06221, over 351949.00 frames.
2024-06-21 13:30:56,624 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB
2024-06-21 13:31:11,382 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.30 vs. limit=6.0
2024-06-21 13:31:19,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=389544.8333333333, ans=0.04949747468305833
2024-06-21 13:31:31,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=389563.1666666667, ans=0.0
2024-06-21 13:31:52,364 INFO [train.py:1028] (0/2) Epoch 22, batch 50, loss[loss=0.1852, simple_loss=0.2564, pruned_loss=0.05706, over 12682.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2576, pruned_loss=0.07447, over 574936.35 frames. ], batch size: 29, lr: 2.67e-03, grad_scale: 32.0
2024-06-21 13:31:53,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0
2024-06-21 13:31:54,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=389599.8333333333, ans=0.125
2024-06-21 13:31:56,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.86 vs. limit=22.5
2024-06-21 13:31:58,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=389599.8333333333, ans=0.125
2024-06-21 13:32:09,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=389636.5, ans=0.1
2024-06-21 13:32:13,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.67 vs. limit=15.0
2024-06-21 13:32:18,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=389654.8333333333, ans=0.0
2024-06-21 13:32:36,429 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.127e+02 2.240e+02 2.485e+02 3.030e+02, threshold=4.480e+02, percent-clipped=0.0
2024-06-21 13:32:36,463 INFO [train.py:1028] (0/2) Epoch 22, batch 100, loss[loss=0.2084, simple_loss=0.2723, pruned_loss=0.0722, over 13343.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2573, pruned_loss=0.07315, over 1017446.46 frames. ], batch size: 46, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:32:41,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=389691.5, ans=0.05
2024-06-21 13:32:41,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=389691.5, ans=0.125
2024-06-21 13:32:42,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=389691.5, ans=0.125
2024-06-21 13:32:44,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=389709.8333333333, ans=0.07
2024-06-21 13:32:51,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=389709.8333333333, ans=0.95
2024-06-21 13:32:55,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=389728.1666666667, ans=10.0
2024-06-21 13:32:57,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389728.1666666667, ans=0.1
2024-06-21 13:33:02,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=389728.1666666667, ans=0.125
2024-06-21 13:33:12,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389746.5, ans=0.125
2024-06-21 13:33:32,219 INFO [train.py:1028] (0/2) Epoch 22, batch 150, loss[loss=0.2003, simple_loss=0.2677, pruned_loss=0.06643, over 12680.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2572, pruned_loss=0.07228, over 1364518.29 frames. ], batch size: 29, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:34:16,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=389856.5, ans=0.125
2024-06-21 13:34:19,615 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.054e+02 2.155e+02 2.372e+02 3.572e+02, threshold=4.309e+02, percent-clipped=0.0
2024-06-21 13:34:19,644 INFO [train.py:1028] (0/2) Epoch 22, batch 200, loss[loss=0.2222, simple_loss=0.2729, pruned_loss=0.08572, over 12511.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2571, pruned_loss=0.07244, over 1634139.64 frames. ], batch size: 202, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:34:28,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=389893.1666666667, ans=0.0
2024-06-21 13:34:33,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=389893.1666666667, ans=0.2
2024-06-21 13:34:44,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=389911.5, ans=0.125
2024-06-21 13:34:58,516 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.07 vs. limit=10.0
2024-06-21 13:34:59,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=389929.8333333333, ans=0.5
2024-06-21 13:35:14,313 INFO [train.py:1028] (0/2) Epoch 22, batch 250, loss[loss=0.177, simple_loss=0.2252, pruned_loss=0.06436, over 13049.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2559, pruned_loss=0.07194, over 1845456.49 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:35:24,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=389984.8333333333, ans=0.1
2024-06-21 13:35:32,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=390003.1666666667, ans=0.0
2024-06-21 13:35:45,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=390021.5, ans=0.0
2024-06-21 13:35:59,552 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.072e+02 2.229e+02 2.443e+02 2.878e+02, threshold=4.458e+02, percent-clipped=0.0
2024-06-21 13:35:59,614 INFO [train.py:1028] (0/2) Epoch 22, batch 300, loss[loss=0.2177, simple_loss=0.2712, pruned_loss=0.08204, over 13191.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2565, pruned_loss=0.07259, over 2009173.51 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:36:13,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=390076.5, ans=0.0
2024-06-21 13:36:40,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390113.1666666667, ans=0.125
2024-06-21 13:36:40,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=390113.1666666667, ans=0.125
2024-06-21 13:36:53,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390149.8333333333, ans=0.1
2024-06-21 13:36:54,470 INFO [train.py:1028] (0/2) Epoch 22, batch 350, loss[loss=0.1998, simple_loss=0.2614, pruned_loss=0.06915, over 12856.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2562, pruned_loss=0.07259, over 2138527.80 frames. ], batch size: 33, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:36:56,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=390149.8333333333, ans=0.0
2024-06-21 13:37:11,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=390168.1666666667, ans=0.125
2024-06-21 13:37:18,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390186.5, ans=0.1
2024-06-21 13:37:29,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=390204.8333333333, ans=0.2
2024-06-21 13:37:42,614 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.056e+02 2.261e+02 2.519e+02 3.296e+02, threshold=4.522e+02, percent-clipped=0.0
2024-06-21 13:37:42,653 INFO [train.py:1028] (0/2) Epoch 22, batch 400, loss[loss=0.1994, simple_loss=0.2583, pruned_loss=0.07024, over 13238.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2553, pruned_loss=0.0718, over 2239283.97 frames. ], batch size: 63, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:37:45,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=390241.5, ans=0.025
2024-06-21 13:37:46,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=15.0
2024-06-21 13:37:46,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=390241.5, ans=0.0
2024-06-21 13:37:48,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0
2024-06-21 13:37:51,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=390259.8333333333, ans=0.2
2024-06-21 13:37:53,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=390259.8333333333, ans=0.125
2024-06-21 13:37:54,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=390259.8333333333, ans=0.0
2024-06-21 13:38:07,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390296.5, ans=0.1
2024-06-21 13:38:10,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390296.5, ans=0.1
2024-06-21 13:38:17,942 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 13:38:23,772 INFO [train.py:1028] (0/2) Epoch 22, batch 450, loss[loss=0.1829, simple_loss=0.2355, pruned_loss=0.0651, over 13180.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.255, pruned_loss=0.07173, over 2312956.60 frames. ], batch size: 67, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:38:26,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=390333.1666666667, ans=0.0
2024-06-21 13:38:36,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=390351.5, ans=0.125
2024-06-21 13:38:45,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.56 vs. limit=22.5
2024-06-21 13:38:51,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=390388.1666666667, ans=0.0
2024-06-21 13:38:58,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=390406.5, ans=0.125
2024-06-21 13:39:00,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=390406.5, ans=0.125
2024-06-21 13:39:03,369 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.88 vs. limit=15.0
2024-06-21 13:39:05,861 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.012e+02 2.134e+02 2.327e+02 2.700e+02, threshold=4.268e+02, percent-clipped=0.0
2024-06-21 13:39:05,920 INFO [train.py:1028] (0/2) Epoch 22, batch 500, loss[loss=0.1906, simple_loss=0.2429, pruned_loss=0.06916, over 13081.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2554, pruned_loss=0.07155, over 2375639.82 frames. ], batch size: 121, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:39:19,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=390424.8333333333, ans=0.0
2024-06-21 13:39:23,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=390443.1666666667, ans=0.0
2024-06-21 13:39:25,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.23 vs. limit=10.0
2024-06-21 13:39:25,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=390443.1666666667, ans=0.025
2024-06-21 13:39:29,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=390443.1666666667, ans=0.0
2024-06-21 13:39:35,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=390461.5, ans=0.125
2024-06-21 13:39:38,580 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.68 vs. limit=15.0
2024-06-21 13:39:39,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=390461.5, ans=0.0
2024-06-21 13:40:03,221 INFO [train.py:1028] (0/2) Epoch 22, batch 550, loss[loss=0.2015, simple_loss=0.2553, pruned_loss=0.07381, over 12971.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2549, pruned_loss=0.07117, over 2420850.66 frames. ], batch size: 158, lr: 2.66e-03, grad_scale: 32.0
2024-06-21 13:40:03,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=15.0
2024-06-21 13:40:05,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390516.5, ans=0.1
2024-06-21 13:40:07,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=390516.5, ans=0.125
2024-06-21 13:40:16,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=390534.8333333333, ans=0.125
2024-06-21 13:40:24,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.07 vs. limit=15.0
2024-06-21 13:40:35,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390571.5, ans=0.125
2024-06-21 13:40:48,732 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.089e+02 2.224e+02 2.432e+02 3.650e+02, threshold=4.448e+02, percent-clipped=0.0
2024-06-21 13:40:48,763 INFO [train.py:1028] (0/2) Epoch 22, batch 600, loss[loss=0.1734, simple_loss=0.2243, pruned_loss=0.0613, over 13097.00 frames.
], tot_loss[loss=0.1989, simple_loss=0.2552, pruned_loss=0.07133, over 2458440.63 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:40:56,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-21 13:41:14,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=390644.8333333333, ans=0.2 2024-06-21 13:41:15,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=390644.8333333333, ans=0.125 2024-06-21 13:41:29,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=390681.5, ans=0.2 2024-06-21 13:41:31,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.94 vs. limit=15.0 2024-06-21 13:41:32,604 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:41:38,790 INFO [train.py:1028] (0/2) Epoch 22, batch 650, loss[loss=0.1863, simple_loss=0.2454, pruned_loss=0.06358, over 13212.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2554, pruned_loss=0.07101, over 2489349.96 frames. ], batch size: 59, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:41:39,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0 2024-06-21 13:42:10,065 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:42:27,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=390773.1666666667, ans=0.1 2024-06-21 13:42:30,341 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.050e+02 2.149e+02 2.277e+02 2.831e+02, threshold=4.298e+02, percent-clipped=0.0 2024-06-21 13:42:30,378 INFO [train.py:1028] (0/2) Epoch 22, batch 700, loss[loss=0.1942, simple_loss=0.2533, pruned_loss=0.06759, over 13318.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2555, pruned_loss=0.0714, over 2511124.50 frames. ], batch size: 46, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:42:30,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=390791.5, ans=0.125 2024-06-21 13:42:33,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=390791.5, ans=0.125 2024-06-21 13:42:35,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2024-06-21 13:42:39,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=22.5 2024-06-21 13:42:48,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=390828.1666666667, ans=0.125 2024-06-21 13:42:50,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=390828.1666666667, ans=0.0 2024-06-21 13:43:10,029 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2024-06-21 13:43:10,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2024-06-21 13:43:16,225 INFO [train.py:1028] (0/2) Epoch 22, batch 750, loss[loss=0.1943, simple_loss=0.2577, pruned_loss=0.06542, over 13251.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2556, pruned_loss=0.07111, over 2525886.26 frames. ], batch size: 63, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:43:30,208 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2024-06-21 13:43:32,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2024-06-21 13:43:33,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=390901.5, ans=0.025 2024-06-21 13:43:52,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390938.1666666667, ans=0.1 2024-06-21 13:43:56,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=390938.1666666667, ans=0.125 2024-06-21 13:44:04,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=390956.5, ans=0.2 2024-06-21 13:44:11,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=390956.5, ans=0.0 2024-06-21 13:44:13,665 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.060e+02 2.153e+02 2.252e+02 2.846e+02, threshold=4.306e+02, percent-clipped=0.0 2024-06-21 13:44:13,695 INFO [train.py:1028] (0/2) Epoch 22, batch 800, loss[loss=0.2036, simple_loss=0.2622, pruned_loss=0.07249, over 12939.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.256, pruned_loss=0.07131, over 2538680.37 frames. ], batch size: 36, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:44:18,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=390974.8333333333, ans=0.125 2024-06-21 13:44:18,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.65 vs. 
limit=15.0 2024-06-21 13:44:25,449 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:44:28,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390993.1666666667, ans=0.1 2024-06-21 13:44:57,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=391048.1666666667, ans=0.0 2024-06-21 13:45:01,541 INFO [train.py:1028] (0/2) Epoch 22, batch 850, loss[loss=0.2021, simple_loss=0.2565, pruned_loss=0.07385, over 13152.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2554, pruned_loss=0.07116, over 2549127.44 frames. ], batch size: 95, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:45:01,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=391066.5, ans=0.125 2024-06-21 13:45:02,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2024-06-21 13:45:46,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=391139.8333333333, ans=0.0 2024-06-21 13:45:47,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=391139.8333333333, ans=0.125 2024-06-21 13:45:56,638 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.091e+02 2.185e+02 2.273e+02 3.063e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-21 13:45:56,670 INFO [train.py:1028] (0/2) Epoch 22, batch 900, loss[loss=0.1983, simple_loss=0.258, pruned_loss=0.06928, over 12899.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2548, pruned_loss=0.07092, over 2554691.68 frames. ], batch size: 36, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:46:02,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.79 vs. limit=15.0 2024-06-21 13:46:09,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2024-06-21 13:46:27,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.92 vs. limit=15.0 2024-06-21 13:46:51,923 INFO [train.py:1028] (0/2) Epoch 22, batch 950, loss[loss=0.2161, simple_loss=0.2741, pruned_loss=0.07908, over 12962.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2548, pruned_loss=0.07104, over 2558703.38 frames. ], batch size: 39, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:47:06,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.20 vs. 
limit=15.0 2024-06-21 13:47:12,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=391286.5, ans=0.0 2024-06-21 13:47:14,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=391286.5, ans=0.0 2024-06-21 13:47:28,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=391323.1666666667, ans=0.125 2024-06-21 13:47:34,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.38 vs. limit=10.0 2024-06-21 13:47:34,744 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.127e+02 2.289e+02 2.519e+02 3.615e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 13:47:34,774 INFO [train.py:1028] (0/2) Epoch 22, batch 1000, loss[loss=0.2182, simple_loss=0.2742, pruned_loss=0.08113, over 13249.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2549, pruned_loss=0.07139, over 2561064.00 frames. ], batch size: 49, lr: 2.66e-03, grad_scale: 32.0 2024-06-21 13:47:52,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=391359.8333333333, ans=0.0 2024-06-21 13:47:57,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=391378.1666666667, ans=0.0 2024-06-21 13:48:10,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.94 vs. limit=15.0 2024-06-21 13:48:20,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=391414.8333333333, ans=0.125 2024-06-21 13:48:31,761 INFO [train.py:1028] (0/2) Epoch 22, batch 1050, loss[loss=0.1809, simple_loss=0.2468, pruned_loss=0.05749, over 13186.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2556, pruned_loss=0.07155, over 2564296.47 frames. 
], batch size: 77, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:48:32,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=391433.1666666667, ans=0.025 2024-06-21 13:48:35,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=391433.1666666667, ans=0.125 2024-06-21 13:48:45,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=391451.5, ans=0.125 2024-06-21 13:48:53,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=391469.8333333333, ans=0.125 2024-06-21 13:48:55,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=391469.8333333333, ans=0.125 2024-06-21 13:48:56,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=391469.8333333333, ans=0.125 2024-06-21 13:48:57,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391469.8333333333, ans=0.1 2024-06-21 13:48:58,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=391488.1666666667, ans=0.125 2024-06-21 13:48:58,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=391488.1666666667, ans=0.025 2024-06-21 13:49:16,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=391506.5, ans=0.0 2024-06-21 13:49:22,078 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.066e+02 2.175e+02 2.354e+02 2.870e+02, threshold=4.350e+02, percent-clipped=0.0 2024-06-21 13:49:22,123 INFO [train.py:1028] (0/2) Epoch 22, batch 1100, loss[loss=0.1781, simple_loss=0.2384, pruned_loss=0.05884, over 13340.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2558, pruned_loss=0.07141, over 2569054.45 frames. ], batch size: 52, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:49:22,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=391524.8333333333, ans=0.125 2024-06-21 13:49:41,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.03 vs. limit=12.0 2024-06-21 13:50:03,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=391579.8333333333, ans=0.125 2024-06-21 13:50:11,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=391598.1666666667, ans=0.0 2024-06-21 13:50:13,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=391598.1666666667, ans=0.125 2024-06-21 13:50:17,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=391598.1666666667, ans=0.1 2024-06-21 13:50:20,750 INFO [train.py:1028] (0/2) Epoch 22, batch 1150, loss[loss=0.2111, simple_loss=0.2667, pruned_loss=0.07777, over 13296.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.256, pruned_loss=0.07147, over 2570634.75 frames. 
], batch size: 52, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:50:31,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2024-06-21 13:50:42,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=391653.1666666667, ans=0.0 2024-06-21 13:50:46,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=391671.5, ans=0.125 2024-06-21 13:50:48,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=391671.5, ans=0.02 2024-06-21 13:50:48,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=391671.5, ans=0.1 2024-06-21 13:51:01,028 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.121e+02 2.256e+02 2.420e+02 3.038e+02, threshold=4.512e+02, percent-clipped=0.0 2024-06-21 13:51:01,079 INFO [train.py:1028] (0/2) Epoch 22, batch 1200, loss[loss=0.1983, simple_loss=0.2616, pruned_loss=0.06743, over 13114.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2564, pruned_loss=0.07187, over 2573668.45 frames. ], batch size: 77, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:51:03,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.41 vs. limit=15.0 2024-06-21 13:51:10,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391726.5, ans=0.1 2024-06-21 13:51:18,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.33 vs. limit=6.0 2024-06-21 13:51:33,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=391744.8333333333, ans=0.125 2024-06-21 13:51:40,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2024-06-21 13:51:51,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=391781.5, ans=0.125 2024-06-21 13:51:57,723 INFO [train.py:1028] (0/2) Epoch 22, batch 1250, loss[loss=0.1914, simple_loss=0.2437, pruned_loss=0.0695, over 13163.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.256, pruned_loss=0.07179, over 2583791.35 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:52:25,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.54 vs. 
limit=22.5 2024-06-21 13:52:28,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=391854.8333333333, ans=0.125 2024-06-21 13:52:46,052 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:52:48,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=391873.1666666667, ans=0.09899494936611666 2024-06-21 13:52:50,165 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.093e+02 2.186e+02 2.401e+02 4.792e+02, threshold=4.372e+02, percent-clipped=1.0 2024-06-21 13:52:50,201 INFO [train.py:1028] (0/2) Epoch 22, batch 1300, loss[loss=0.2039, simple_loss=0.2541, pruned_loss=0.07688, over 12766.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2564, pruned_loss=0.07189, over 2583100.91 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:52:59,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391909.8333333333, ans=0.125 2024-06-21 13:53:01,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2024-06-21 13:53:03,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=391909.8333333333, ans=0.0 2024-06-21 13:53:03,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=391909.8333333333, ans=0.2 2024-06-21 13:53:11,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=391928.1666666667, ans=0.0 2024-06-21 13:53:12,236 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.16 vs. limit=22.5 2024-06-21 13:53:36,650 INFO [train.py:1028] (0/2) Epoch 22, batch 1350, loss[loss=0.1966, simple_loss=0.2605, pruned_loss=0.06632, over 13203.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2562, pruned_loss=0.07167, over 2585351.82 frames. ], batch size: 59, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:54:15,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=392038.1666666667, ans=0.125 2024-06-21 13:54:27,879 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.106e+02 2.264e+02 2.573e+02 3.379e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 13:54:27,918 INFO [train.py:1028] (0/2) Epoch 22, batch 1400, loss[loss=0.2191, simple_loss=0.2824, pruned_loss=0.07788, over 12460.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2567, pruned_loss=0.07168, over 2586902.05 frames. ], batch size: 25, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:54:31,814 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:54:52,198 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.96 vs. 
limit=22.5 2024-06-21 13:54:56,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=392111.5, ans=0.0 2024-06-21 13:55:07,454 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:55:16,057 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:55:21,046 INFO [train.py:1028] (0/2) Epoch 22, batch 1450, loss[loss=0.1852, simple_loss=0.2383, pruned_loss=0.06606, over 13114.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.257, pruned_loss=0.0719, over 2586587.79 frames. ], batch size: 121, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:55:21,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=392166.5, ans=0.05 2024-06-21 13:55:50,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=392203.1666666667, ans=0.025 2024-06-21 13:55:53,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=392203.1666666667, ans=0.125 2024-06-21 13:55:55,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=392221.5, ans=0.0 2024-06-21 13:56:01,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=22.5 2024-06-21 13:56:06,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=392239.8333333333, ans=0.125 2024-06-21 13:56:15,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=392258.1666666667, ans=0.2 2024-06-21 13:56:16,062 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 2.099e+02 2.190e+02 2.333e+02 3.382e+02, threshold=4.381e+02, percent-clipped=0.0 2024-06-21 13:56:16,097 INFO [train.py:1028] (0/2) Epoch 22, batch 1500, loss[loss=0.2186, simple_loss=0.2689, pruned_loss=0.08421, over 13199.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2572, pruned_loss=0.07218, over 2589029.65 frames. ], batch size: 83, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:56:22,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=392258.1666666667, ans=0.0 2024-06-21 13:56:23,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=392258.1666666667, ans=0.2 2024-06-21 13:56:48,999 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.57 vs. limit=22.5 2024-06-21 13:57:00,701 INFO [train.py:1028] (0/2) Epoch 22, batch 1550, loss[loss=0.2034, simple_loss=0.2495, pruned_loss=0.0786, over 12999.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.257, pruned_loss=0.07239, over 2583818.92 frames. 
], batch size: 102, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:57:07,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=392349.8333333333, ans=0.125 2024-06-21 13:57:28,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=392386.5, ans=0.125 2024-06-21 13:57:34,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=15.0 2024-06-21 13:57:44,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392423.1666666667, ans=0.1 2024-06-21 13:57:53,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=392423.1666666667, ans=0.1 2024-06-21 13:57:54,454 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.118e+02 2.220e+02 2.401e+02 3.384e+02, threshold=4.440e+02, percent-clipped=0.0 2024-06-21 13:57:54,485 INFO [train.py:1028] (0/2) Epoch 22, batch 1600, loss[loss=0.1697, simple_loss=0.2296, pruned_loss=0.05492, over 13191.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2572, pruned_loss=0.07219, over 2579220.31 frames. ], batch size: 77, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:57:54,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=392441.5, ans=0.05 2024-06-21 13:57:55,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=392441.5, ans=0.125 2024-06-21 13:58:06,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.47 vs. limit=6.0 2024-06-21 13:58:18,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=392478.1666666667, ans=0.0 2024-06-21 13:58:25,659 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 13:58:35,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392514.8333333333, ans=0.1 2024-06-21 13:58:43,465 INFO [train.py:1028] (0/2) Epoch 22, batch 1650, loss[loss=0.2027, simple_loss=0.2508, pruned_loss=0.07727, over 13145.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2575, pruned_loss=0.07269, over 2575843.45 frames. ], batch size: 95, lr: 2.66e-03, grad_scale: 64.0 2024-06-21 13:58:46,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=392533.1666666667, ans=0.125 2024-06-21 13:59:01,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=392551.5, ans=0.09899494936611666 2024-06-21 13:59:14,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.81 vs. 
limit=15.0 2024-06-21 13:59:31,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=392606.5, ans=0.07 2024-06-21 13:59:36,105 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.067e+02 2.186e+02 2.405e+02 3.278e+02, threshold=4.372e+02, percent-clipped=0.0 2024-06-21 13:59:36,137 INFO [train.py:1028] (0/2) Epoch 22, batch 1700, loss[loss=0.1737, simple_loss=0.238, pruned_loss=0.05476, over 12824.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2571, pruned_loss=0.0722, over 2580637.41 frames. ], batch size: 26, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 13:59:49,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=392643.1666666667, ans=0.2 2024-06-21 14:00:18,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=392698.1666666667, ans=0.0 2024-06-21 14:00:21,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=392698.1666666667, ans=0.025 2024-06-21 14:00:22,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=392698.1666666667, ans=0.0 2024-06-21 14:00:24,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=392698.1666666667, ans=0.0 2024-06-21 14:00:26,555 INFO [train.py:1028] (0/2) Epoch 22, batch 1750, loss[loss=0.2011, simple_loss=0.2593, pruned_loss=0.07142, over 12596.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2578, pruned_loss=0.07261, over 2581996.06 frames. ], batch size: 22, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:00:55,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.62 vs. limit=22.5 2024-06-21 14:01:02,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.10 vs. limit=22.5 2024-06-21 14:01:04,760 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.88 vs. limit=10.0 2024-06-21 14:01:11,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392789.8333333333, ans=0.1 2024-06-21 14:01:15,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.104e+02 2.227e+02 2.447e+02 3.112e+02, threshold=4.454e+02, percent-clipped=0.0 2024-06-21 14:01:15,184 INFO [train.py:1028] (0/2) Epoch 22, batch 1800, loss[loss=0.1943, simple_loss=0.252, pruned_loss=0.06824, over 13229.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2578, pruned_loss=0.07263, over 2582497.53 frames. 
], batch size: 67, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:01:38,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=392826.5, ans=0.125 2024-06-21 14:01:47,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=392844.8333333333, ans=0.125 2024-06-21 14:01:55,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392863.1666666667, ans=0.1 2024-06-21 14:02:03,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=392881.5, ans=0.0 2024-06-21 14:02:09,136 INFO [train.py:1028] (0/2) Epoch 22, batch 1850, loss[loss=0.2026, simple_loss=0.2548, pruned_loss=0.07525, over 13236.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2577, pruned_loss=0.07264, over 2583341.27 frames. ], batch size: 83, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:02:10,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=392899.8333333333, ans=0.015 2024-06-21 14:02:24,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=392918.1666666667, ans=0.2 2024-06-21 14:02:36,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=392954.8333333333, ans=0.125 2024-06-21 14:02:43,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=392954.8333333333, ans=0.125 2024-06-21 14:02:57,383 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.857e+02 2.103e+02 2.236e+02 2.430e+02 3.214e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 14:02:57,414 INFO [train.py:1028] (0/2) Epoch 22, batch 1900, loss[loss=0.2037, simple_loss=0.2538, pruned_loss=0.07686, over 13109.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2575, pruned_loss=0.07245, over 2585909.83 frames. ], batch size: 95, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:03:11,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=393009.8333333333, ans=0.125 2024-06-21 14:03:12,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393009.8333333333, ans=0.1 2024-06-21 14:03:13,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=393009.8333333333, ans=0.125 2024-06-21 14:03:30,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=393028.1666666667, ans=10.0 2024-06-21 14:03:38,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=393046.5, ans=0.125 2024-06-21 14:03:51,467 INFO [train.py:1028] (0/2) Epoch 22, batch 1950, loss[loss=0.1921, simple_loss=0.2581, pruned_loss=0.06308, over 13224.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2572, pruned_loss=0.07266, over 2591168.44 frames. 
], batch size: 52, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:03:57,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=393083.1666666667, ans=0.09899494936611666 2024-06-21 14:04:09,888 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.91 vs. limit=10.0 2024-06-21 14:04:23,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=393119.8333333333, ans=0.0 2024-06-21 14:04:25,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393119.8333333333, ans=0.1 2024-06-21 14:04:31,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=393138.1666666667, ans=0.0 2024-06-21 14:04:37,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=393138.1666666667, ans=0.125 2024-06-21 14:04:45,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393156.5, ans=0.1 2024-06-21 14:04:46,967 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.073e+02 2.224e+02 2.384e+02 3.007e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-21 14:04:47,011 INFO [train.py:1028] (0/2) Epoch 22, batch 2000, loss[loss=0.1856, simple_loss=0.2471, pruned_loss=0.06205, over 12352.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2571, pruned_loss=0.07278, over 2587334.15 frames. ], batch size: 22, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:04:47,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=393174.8333333333, ans=0.0 2024-06-21 14:04:50,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393174.8333333333, ans=0.1 2024-06-21 14:04:55,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=393193.1666666667, ans=0.1 2024-06-21 14:04:57,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=393193.1666666667, ans=0.2 2024-06-21 14:04:58,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=393193.1666666667, ans=0.125 2024-06-21 14:05:06,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=393211.5, ans=0.0 2024-06-21 14:05:13,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=393211.5, ans=0.0 2024-06-21 14:05:19,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=393229.8333333333, ans=0.125 2024-06-21 14:05:19,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=393229.8333333333, ans=0.0 2024-06-21 14:05:20,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=393229.8333333333, ans=0.125 2024-06-21 14:05:23,260 INFO [scaling.py:1023] (0/2) Whitening: 
name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.19 vs. limit=22.5 2024-06-21 14:05:33,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393248.1666666667, ans=0.1 2024-06-21 14:05:36,665 INFO [train.py:1028] (0/2) Epoch 22, batch 2050, loss[loss=0.2054, simple_loss=0.2641, pruned_loss=0.07337, over 12572.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2567, pruned_loss=0.07263, over 2581580.49 frames. ], batch size: 29, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:05:40,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=393266.5, ans=0.0 2024-06-21 14:05:44,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=393266.5, ans=0.0 2024-06-21 14:05:49,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393284.8333333333, ans=0.1 2024-06-21 14:05:57,389 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.11 vs. limit=15.0 2024-06-21 14:06:09,109 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=15.0 2024-06-21 14:06:10,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=393321.5, ans=0.125 2024-06-21 14:06:17,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.88 vs. limit=22.5 2024-06-21 14:06:19,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=393339.8333333333, ans=0.125 2024-06-21 14:06:28,604 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.144e+02 2.329e+02 2.514e+02 3.732e+02, threshold=4.657e+02, percent-clipped=0.0 2024-06-21 14:06:28,644 INFO [train.py:1028] (0/2) Epoch 22, batch 2100, loss[loss=0.1841, simple_loss=0.2475, pruned_loss=0.06034, over 13183.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2568, pruned_loss=0.07218, over 2583981.33 frames. ], batch size: 59, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:06:43,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=393376.5, ans=0.125 2024-06-21 14:06:45,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=393394.8333333333, ans=0.025 2024-06-21 14:06:46,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=393394.8333333333, ans=15.0 2024-06-21 14:06:48,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=393394.8333333333, ans=0.125 2024-06-21 14:07:07,384 INFO [train.py:1028] (0/2) Epoch 22, batch 2150, loss[loss=0.1849, simple_loss=0.2476, pruned_loss=0.06105, over 13284.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.257, pruned_loss=0.07208, over 2586759.41 frames. 
], batch size: 52, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:07:16,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.40 vs. limit=22.5 2024-06-21 14:07:17,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=393468.1666666667, ans=0.125 2024-06-21 14:07:45,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=393504.8333333333, ans=0.125 2024-06-21 14:07:51,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=393504.8333333333, ans=0.125 2024-06-21 14:08:03,255 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.046e+02 2.163e+02 2.307e+02 2.802e+02, threshold=4.327e+02, percent-clipped=0.0 2024-06-21 14:08:03,299 INFO [train.py:1028] (0/2) Epoch 22, batch 2200, loss[loss=0.1941, simple_loss=0.2489, pruned_loss=0.06971, over 13239.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2571, pruned_loss=0.0723, over 2586470.02 frames. ], batch size: 83, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:08:04,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393541.5, ans=0.125 2024-06-21 14:08:04,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=393541.5, ans=0.125 2024-06-21 14:08:04,601 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:08:08,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=393541.5, ans=0.0 2024-06-21 14:08:15,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.49 vs. limit=22.5 2024-06-21 14:08:16,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=393559.8333333333, ans=0.125 2024-06-21 14:08:26,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=393578.1666666667, ans=0.125 2024-06-21 14:08:38,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=393596.5, ans=0.0 2024-06-21 14:08:55,757 INFO [train.py:1028] (0/2) Epoch 22, batch 2250, loss[loss=0.2007, simple_loss=0.258, pruned_loss=0.07164, over 13248.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2567, pruned_loss=0.07237, over 2585037.72 frames. ], batch size: 63, lr: 2.65e-03, grad_scale: 64.0 2024-06-21 14:08:56,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=393633.1666666667, ans=0.0 2024-06-21 14:09:12,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=393651.5, ans=0.2 2024-06-21 14:09:28,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=393688.1666666667, ans=0.0 2024-06-21 14:09:38,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. 
limit=12.0 2024-06-21 14:09:52,267 INFO [train.py:1028] (0/2) Epoch 22, batch 2300, loss[loss=0.2005, simple_loss=0.2615, pruned_loss=0.06979, over 12918.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2566, pruned_loss=0.07213, over 2580065.29 frames. ], batch size: 33, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:09:53,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.169e+02 2.348e+02 2.620e+02 3.934e+02, threshold=4.696e+02, percent-clipped=0.0 2024-06-21 14:10:25,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=393779.8333333333, ans=0.125 2024-06-21 14:10:26,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=393779.8333333333, ans=0.125 2024-06-21 14:10:28,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=393779.8333333333, ans=0.07 2024-06-21 14:10:29,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=393779.8333333333, ans=0.125 2024-06-21 14:10:44,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=393798.1666666667, ans=0.2 2024-06-21 14:10:45,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=393798.1666666667, ans=0.125 2024-06-21 14:10:46,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.87 vs. limit=15.0 2024-06-21 14:10:46,808 INFO [train.py:1028] (0/2) Epoch 22, batch 2350, loss[loss=0.2026, simple_loss=0.2566, pruned_loss=0.07428, over 13257.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2572, pruned_loss=0.07216, over 2583506.01 frames. ], batch size: 67, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:10:57,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=393834.8333333333, ans=0.0 2024-06-21 14:11:02,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=393834.8333333333, ans=0.125 2024-06-21 14:11:16,415 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:11:22,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=393871.5, ans=0.0 2024-06-21 14:11:36,397 INFO [train.py:1028] (0/2) Epoch 22, batch 2400, loss[loss=0.1986, simple_loss=0.2516, pruned_loss=0.07287, over 13317.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2568, pruned_loss=0.07218, over 2587077.80 frames. 
], batch size: 46, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:11:37,501 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.120e+02 2.247e+02 2.430e+02 3.498e+02, threshold=4.493e+02, percent-clipped=0.0 2024-06-21 14:11:46,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=393926.5, ans=0.125 2024-06-21 14:11:47,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=393926.5, ans=0.125 2024-06-21 14:11:54,983 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2024-06-21 14:11:56,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393944.8333333333, ans=0.1 2024-06-21 14:12:09,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=393963.1666666667, ans=0.0 2024-06-21 14:12:13,732 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:12:26,670 INFO [train.py:1028] (0/2) Epoch 22, batch 2450, loss[loss=0.1955, simple_loss=0.2516, pruned_loss=0.06966, over 13291.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2559, pruned_loss=0.07217, over 2582626.82 frames. ], batch size: 63, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:12:26,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=393999.8333333333, ans=0.125 2024-06-21 14:12:34,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=393999.8333333333, ans=0.0 2024-06-21 14:12:37,994 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.08 vs. limit=15.0 2024-06-21 14:13:12,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2024-06-21 14:13:27,802 INFO [train.py:1028] (0/2) Epoch 22, batch 2500, loss[loss=0.1901, simple_loss=0.2431, pruned_loss=0.06852, over 13182.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.255, pruned_loss=0.07195, over 2585686.46 frames. 
], batch size: 83, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:13:28,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.121e+02 2.247e+02 2.474e+02 3.241e+02, threshold=4.495e+02, percent-clipped=0.0 2024-06-21 14:13:33,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394091.5, ans=0.1 2024-06-21 14:13:37,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=394091.5, ans=15.0 2024-06-21 14:13:43,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=394109.8333333333, ans=0.125 2024-06-21 14:14:03,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=394146.5, ans=0.0 2024-06-21 14:14:19,470 INFO [train.py:1028] (0/2) Epoch 22, batch 2550, loss[loss=0.1618, simple_loss=0.2245, pruned_loss=0.04954, over 12476.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2533, pruned_loss=0.07124, over 2585953.59 frames. ], batch size: 22, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:14:22,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394183.1666666667, ans=0.1 2024-06-21 14:14:35,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.51 vs. limit=15.0 2024-06-21 14:14:36,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394201.5, ans=0.1 2024-06-21 14:14:37,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2024-06-21 14:14:41,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=394219.8333333333, ans=15.0 2024-06-21 14:14:49,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394238.1666666667, ans=0.125 2024-06-21 14:14:51,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=394238.1666666667, ans=0.125 2024-06-21 14:14:53,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394238.1666666667, ans=0.1 2024-06-21 14:14:55,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=394256.5, ans=0.025 2024-06-21 14:15:04,438 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:15:04,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=15.0 2024-06-21 14:15:06,055 INFO [train.py:1028] (0/2) Epoch 22, batch 2600, loss[loss=0.1858, simple_loss=0.2407, pruned_loss=0.06552, over 13239.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2521, pruned_loss=0.0712, over 2585189.83 frames. 
], batch size: 52, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:15:07,135 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.095e+02 2.221e+02 2.432e+02 3.036e+02, threshold=4.442e+02, percent-clipped=0.0 2024-06-21 14:15:13,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394274.8333333333, ans=0.1 2024-06-21 14:15:20,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=394293.1666666667, ans=0.0 2024-06-21 14:15:27,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=394293.1666666667, ans=0.05 2024-06-21 14:15:27,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=394293.1666666667, ans=0.0 2024-06-21 14:15:41,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=394329.8333333333, ans=0.0 2024-06-21 14:15:44,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=394329.8333333333, ans=0.0 2024-06-21 14:15:45,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=394329.8333333333, ans=0.1 2024-06-21 14:15:54,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=394348.1666666667, ans=0.0 2024-06-21 14:16:02,875 INFO [train.py:1028] (0/2) Epoch 22, batch 2650, loss[loss=0.1924, simple_loss=0.2347, pruned_loss=0.07501, over 13082.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2509, pruned_loss=0.0707, over 2585487.10 frames. ], batch size: 144, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:16:24,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=394384.8333333333, ans=0.125 2024-06-21 14:16:28,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=394403.1666666667, ans=0.05 2024-06-21 14:16:54,946 INFO [train.py:1028] (0/2) Epoch 22, batch 2700, loss[loss=0.1904, simple_loss=0.2371, pruned_loss=0.07188, over 13213.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2497, pruned_loss=0.07051, over 2583209.52 frames. ], batch size: 89, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:16:55,985 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.072e+02 2.234e+02 2.406e+02 3.537e+02, threshold=4.468e+02, percent-clipped=0.0 2024-06-21 14:17:12,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.05 vs. limit=22.5 2024-06-21 14:17:17,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.55 vs. 
limit=15.0 2024-06-21 14:17:22,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=394494.8333333333, ans=0.025 2024-06-21 14:17:34,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394513.1666666667, ans=0.125 2024-06-21 14:17:46,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=394531.5, ans=0.125 2024-06-21 14:17:47,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.44 vs. limit=22.5 2024-06-21 14:17:48,517 INFO [train.py:1028] (0/2) Epoch 22, batch 2750, loss[loss=0.2073, simple_loss=0.2544, pruned_loss=0.08008, over 13265.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.249, pruned_loss=0.07011, over 2580489.69 frames. ], batch size: 43, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:17:53,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=394549.8333333333, ans=0.5 2024-06-21 14:17:54,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=394549.8333333333, ans=0.125 2024-06-21 14:17:56,334 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.74 vs. limit=22.5 2024-06-21 14:18:06,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=394568.1666666667, ans=0.125 2024-06-21 14:18:08,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=394568.1666666667, ans=0.05 2024-06-21 14:18:16,932 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:18:38,186 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.87 vs. limit=15.0 2024-06-21 14:18:41,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=394623.1666666667, ans=0.025 2024-06-21 14:18:46,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394623.1666666667, ans=0.1 2024-06-21 14:18:50,150 INFO [train.py:1028] (0/2) Epoch 22, batch 2800, loss[loss=0.2064, simple_loss=0.2504, pruned_loss=0.08123, over 10913.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2489, pruned_loss=0.07038, over 2578675.10 frames. 
], batch size: 304, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:18:50,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=394641.5, ans=0.2 2024-06-21 14:18:51,208 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.043e+02 2.186e+02 2.388e+02 3.843e+02, threshold=4.371e+02, percent-clipped=0.0 2024-06-21 14:18:51,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=394641.5, ans=0.125 2024-06-21 14:19:26,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=394696.5, ans=0.0 2024-06-21 14:19:45,072 INFO [train.py:1028] (0/2) Epoch 22, batch 2850, loss[loss=0.1836, simple_loss=0.2405, pruned_loss=0.06336, over 13241.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2478, pruned_loss=0.07001, over 2577267.73 frames. ], batch size: 49, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:19:56,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2024-06-21 14:19:59,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=394751.5, ans=0.125 2024-06-21 14:20:02,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=394769.8333333333, ans=0.125 2024-06-21 14:20:15,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=394788.1666666667, ans=0.0 2024-06-21 14:20:29,832 INFO [train.py:1028] (0/2) Epoch 22, batch 2900, loss[loss=0.1916, simple_loss=0.249, pruned_loss=0.06712, over 13157.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2458, pruned_loss=0.06939, over 2585997.94 frames. ], batch size: 55, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:20:31,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.070e+02 2.182e+02 2.430e+02 3.200e+02, threshold=4.363e+02, percent-clipped=0.0 2024-06-21 14:21:12,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=394879.8333333333, ans=0.0 2024-06-21 14:21:14,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.25 vs. limit=15.0 2024-06-21 14:21:28,614 INFO [train.py:1028] (0/2) Epoch 22, batch 2950, loss[loss=0.1762, simple_loss=0.2298, pruned_loss=0.06127, over 13250.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2454, pruned_loss=0.0693, over 2580879.16 frames. ], batch size: 43, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:21:30,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=394916.5, ans=0.125 2024-06-21 14:21:47,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=394934.8333333333, ans=0.0 2024-06-21 14:21:55,565 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.42 vs. 
limit=10.0 2024-06-21 14:22:20,115 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:22:25,253 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:22:28,126 INFO [train.py:1028] (0/2) Epoch 22, batch 3000, loss[loss=0.2046, simple_loss=0.2572, pruned_loss=0.076, over 13211.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2443, pruned_loss=0.0687, over 2578950.09 frames. ], batch size: 59, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:22:28,127 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 14:22:35,547 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.3275, 2.4753, 3.9107, 3.6742], device='cuda:0') 2024-06-21 14:22:39,688 INFO [train.py:1060] (0/2) Epoch 22, validation: loss=0.187, simple_loss=0.2507, pruned_loss=0.06171, over 351949.00 frames. 2024-06-21 14:22:39,690 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 14:22:40,657 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.063e+02 2.181e+02 2.421e+02 3.145e+02, threshold=4.362e+02, percent-clipped=0.0 2024-06-21 14:22:43,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=395008.1666666667, ans=10.0 2024-06-21 14:22:53,021 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.23 vs. limit=15.0 2024-06-21 14:23:12,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.55 vs. limit=10.0 2024-06-21 14:23:12,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=395063.1666666667, ans=0.035 2024-06-21 14:23:15,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=395063.1666666667, ans=0.0 2024-06-21 14:23:20,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=395063.1666666667, ans=0.0 2024-06-21 14:23:30,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=395099.8333333333, ans=0.1 2024-06-21 14:23:32,023 INFO [train.py:1028] (0/2) Epoch 22, batch 3050, loss[loss=0.1731, simple_loss=0.2313, pruned_loss=0.05738, over 13251.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2438, pruned_loss=0.06867, over 2577809.94 frames. ], batch size: 46, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:23:35,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=395099.8333333333, ans=0.125 2024-06-21 14:23:41,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=395118.1666666667, ans=0.05 2024-06-21 14:23:47,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=395118.1666666667, ans=0.0 2024-06-21 14:23:54,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=15.0 2024-06-21 14:24:06,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=395154.8333333333, ans=0.0 2024-06-21 14:24:08,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=395154.8333333333, ans=0.2 2024-06-21 14:24:18,570 INFO [train.py:1028] (0/2) Epoch 22, batch 3100, loss[loss=0.1948, simple_loss=0.243, pruned_loss=0.07337, over 13058.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2436, pruned_loss=0.06844, over 2578183.69 frames. ], batch size: 144, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:24:19,486 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 2.067e+02 2.202e+02 2.371e+02 3.498e+02, threshold=4.404e+02, percent-clipped=0.0 2024-06-21 14:24:36,602 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:24:38,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=395209.8333333333, ans=0.125 2024-06-21 14:24:46,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2024-06-21 14:25:12,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=395283.1666666667, ans=0.025 2024-06-21 14:25:20,475 INFO [train.py:1028] (0/2) Epoch 22, batch 3150, loss[loss=0.1785, simple_loss=0.2269, pruned_loss=0.06504, over 12981.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2422, pruned_loss=0.06787, over 2579012.26 frames. ], batch size: 158, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:25:41,397 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:25:41,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=395319.8333333333, ans=0.2 2024-06-21 14:25:56,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=395338.1666666667, ans=0.125 2024-06-21 14:25:58,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=395338.1666666667, ans=0.0 2024-06-21 14:25:59,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=395356.5, ans=0.125 2024-06-21 14:26:05,831 INFO [train.py:1028] (0/2) Epoch 22, batch 3200, loss[loss=0.1694, simple_loss=0.2307, pruned_loss=0.0541, over 13175.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2421, pruned_loss=0.06772, over 2578541.60 frames. 
], batch size: 55, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:26:06,463 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.050e+02 2.200e+02 2.430e+02 3.427e+02, threshold=4.401e+02, percent-clipped=0.0 2024-06-21 14:26:07,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=395374.8333333333, ans=0.0 2024-06-21 14:26:09,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=395374.8333333333, ans=0.125 2024-06-21 14:26:21,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=395411.5, ans=0.125 2024-06-21 14:26:24,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=395429.8333333333, ans=0.0 2024-06-21 14:26:26,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=395429.8333333333, ans=0.0 2024-06-21 14:26:34,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=395448.1666666667, ans=0.125 2024-06-21 14:26:34,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=395448.1666666667, ans=0.125 2024-06-21 14:26:34,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=395448.1666666667, ans=0.125 2024-06-21 14:26:37,212 INFO [train.py:1028] (0/2) Epoch 22, batch 3250, loss[loss=0.1719, simple_loss=0.2294, pruned_loss=0.05725, over 13286.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2414, pruned_loss=0.06774, over 2583807.98 frames. ], batch size: 72, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:26:40,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=395466.5, ans=0.125 2024-06-21 14:26:44,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=395484.8333333333, ans=0.2 2024-06-21 14:26:49,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=15.0 2024-06-21 14:27:12,573 INFO [train.py:1028] (0/2) Epoch 22, batch 3300, loss[loss=0.2065, simple_loss=0.2523, pruned_loss=0.08037, over 12826.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2415, pruned_loss=0.06758, over 2581018.34 frames. ], batch size: 176, lr: 2.65e-03, grad_scale: 32.0 2024-06-21 14:27:13,151 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.023e+02 2.169e+02 2.334e+02 3.159e+02, threshold=4.339e+02, percent-clipped=0.0 2024-06-21 14:27:17,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. 
limit=15.0 2024-06-21 14:27:18,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=395576.5, ans=0.125 2024-06-21 14:27:27,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=395576.5, ans=0.125 2024-06-21 14:27:40,820 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.31 vs. limit=15.0 2024-06-21 14:27:41,430 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.659e+01 2024-06-21 14:27:47,428 INFO [train.py:1028] (0/2) Epoch 22, batch 3350, loss[loss=0.1847, simple_loss=0.2315, pruned_loss=0.06891, over 12909.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2411, pruned_loss=0.06779, over 2576553.47 frames. ], batch size: 158, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:28:10,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=395704.8333333333, ans=0.025 2024-06-21 14:28:13,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=395723.1666666667, ans=0.2 2024-06-21 14:28:20,094 INFO [train.py:1028] (0/2) Epoch 22, batch 3400, loss[loss=0.2023, simple_loss=0.2574, pruned_loss=0.07364, over 12768.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2411, pruned_loss=0.06797, over 2575793.36 frames. ], batch size: 22, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:28:20,750 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.750e+02 2.063e+02 2.251e+02 2.541e+02 3.891e+02, threshold=4.503e+02, percent-clipped=0.0 2024-06-21 14:28:27,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=395759.8333333333, ans=0.0 2024-06-21 14:28:34,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=395778.1666666667, ans=0.025 2024-06-21 14:28:46,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=395796.5, ans=0.125 2024-06-21 14:28:55,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.14 vs. limit=22.5 2024-06-21 14:28:55,669 INFO [train.py:1028] (0/2) Epoch 22, batch 3450, loss[loss=0.1976, simple_loss=0.2464, pruned_loss=0.07441, over 12714.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2404, pruned_loss=0.06776, over 2576030.54 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:28:55,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=395833.1666666667, ans=0.07 2024-06-21 14:29:23,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=395888.1666666667, ans=0.0 2024-06-21 14:29:25,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=395906.5, ans=0.125 2024-06-21 14:29:31,957 INFO [train.py:1028] (0/2) Epoch 22, batch 3500, loss[loss=0.1729, simple_loss=0.2227, pruned_loss=0.06158, over 12872.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2402, pruned_loss=0.06762, over 2573955.85 frames. 
], batch size: 33, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:29:32,661 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.053e+02 2.178e+02 2.386e+02 2.848e+02, threshold=4.356e+02, percent-clipped=0.0 2024-06-21 14:29:38,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=395943.1666666667, ans=0.0 2024-06-21 14:29:41,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=395943.1666666667, ans=0.125 2024-06-21 14:29:50,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=395961.5, ans=0.125 2024-06-21 14:29:56,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=395979.8333333333, ans=0.0 2024-06-21 14:29:59,037 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-216000.pt 2024-06-21 14:30:10,779 INFO [train.py:1028] (0/2) Epoch 22, batch 3550, loss[loss=0.1756, simple_loss=0.2236, pruned_loss=0.06375, over 13152.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2393, pruned_loss=0.06718, over 2574890.47 frames. ], batch size: 95, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:30:20,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=396034.8333333333, ans=0.125 2024-06-21 14:30:23,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=396034.8333333333, ans=0.125 2024-06-21 14:30:25,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=396053.1666666667, ans=0.1 2024-06-21 14:30:28,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=396053.1666666667, ans=0.125 2024-06-21 14:30:28,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396053.1666666667, ans=0.125 2024-06-21 14:30:28,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=396053.1666666667, ans=0.025 2024-06-21 14:30:30,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.06 vs. limit=15.0 2024-06-21 14:30:32,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=396071.5, ans=0.125 2024-06-21 14:30:33,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=396071.5, ans=0.2 2024-06-21 14:30:36,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=396071.5, ans=0.0 2024-06-21 14:30:39,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2024-06-21 14:30:44,123 INFO [train.py:1028] (0/2) Epoch 22, batch 3600, loss[loss=0.1722, simple_loss=0.2279, pruned_loss=0.05825, over 13280.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2386, pruned_loss=0.06686, over 2578135.79 frames. 
], batch size: 49, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:30:44,922 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 2.052e+02 2.156e+02 2.454e+02 3.434e+02, threshold=4.313e+02, percent-clipped=0.0 2024-06-21 14:30:45,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=396108.1666666667, ans=0.125 2024-06-21 14:30:46,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=396108.1666666667, ans=0.125 2024-06-21 14:30:51,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=396126.5, ans=0.0 2024-06-21 14:30:57,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=396126.5, ans=0.025 2024-06-21 14:30:57,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.35 vs. limit=15.0 2024-06-21 14:31:06,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.09 vs. limit=15.0 2024-06-21 14:31:17,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2024-06-21 14:31:17,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=396181.5, ans=0.125 2024-06-21 14:31:19,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=396181.5, ans=0.1 2024-06-21 14:31:21,084 INFO [train.py:1028] (0/2) Epoch 22, batch 3650, loss[loss=0.1944, simple_loss=0.2425, pruned_loss=0.0732, over 13082.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2386, pruned_loss=0.06661, over 2576380.88 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:31:21,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=396199.8333333333, ans=0.025 2024-06-21 14:31:29,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=396199.8333333333, ans=0.125 2024-06-21 14:31:30,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=396218.1666666667, ans=0.125 2024-06-21 14:31:34,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2024-06-21 14:31:38,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396236.5, ans=0.1 2024-06-21 14:31:56,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=396273.1666666667, ans=15.0 2024-06-21 14:31:57,786 INFO [train.py:1028] (0/2) Epoch 22, batch 3700, loss[loss=0.1994, simple_loss=0.2529, pruned_loss=0.07297, over 13244.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2378, pruned_loss=0.06607, over 2582145.67 frames. 
], batch size: 72, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:31:58,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2024-06-21 14:31:58,404 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.021e+02 2.178e+02 2.348e+02 3.318e+02, threshold=4.356e+02, percent-clipped=0.0 2024-06-21 14:31:59,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396291.5, ans=0.125 2024-06-21 14:32:00,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.04 vs. limit=10.0 2024-06-21 14:32:15,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=396328.1666666667, ans=0.07 2024-06-21 14:32:16,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.13 vs. limit=15.0 2024-06-21 14:32:28,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=396364.8333333333, ans=12.0 2024-06-21 14:32:31,074 INFO [train.py:1028] (0/2) Epoch 22, batch 3750, loss[loss=0.1787, simple_loss=0.2349, pruned_loss=0.06124, over 12607.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.237, pruned_loss=0.0657, over 2585270.25 frames. ], batch size: 22, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:32:34,262 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.69 vs. limit=22.5 2024-06-21 14:32:35,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=396383.1666666667, ans=0.125 2024-06-21 14:32:35,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=396383.1666666667, ans=0.0 2024-06-21 14:32:54,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=396438.1666666667, ans=0.07 2024-06-21 14:32:58,608 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.67 vs. limit=12.0 2024-06-21 14:33:04,190 INFO [train.py:1028] (0/2) Epoch 22, batch 3800, loss[loss=0.175, simple_loss=0.2227, pruned_loss=0.06367, over 13233.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2371, pruned_loss=0.06561, over 2583520.57 frames. ], batch size: 83, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:33:04,811 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.021e+02 2.176e+02 2.352e+02 3.040e+02, threshold=4.353e+02, percent-clipped=0.0 2024-06-21 14:33:26,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=396511.5, ans=0.125 2024-06-21 14:33:40,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=396548.1666666667, ans=0.125 2024-06-21 14:33:44,969 INFO [train.py:1028] (0/2) Epoch 22, batch 3850, loss[loss=0.1745, simple_loss=0.2191, pruned_loss=0.06498, over 13054.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2374, pruned_loss=0.06593, over 2583819.98 frames. 
], batch size: 144, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:33:48,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=396566.5, ans=0.1 2024-06-21 14:33:51,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=396584.8333333333, ans=0.1 2024-06-21 14:34:01,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.28 vs. limit=15.0 2024-06-21 14:34:04,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.82 vs. limit=22.5 2024-06-21 14:34:05,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=396621.5, ans=0.1 2024-06-21 14:34:14,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396639.8333333333, ans=0.1 2024-06-21 14:34:17,267 INFO [train.py:1028] (0/2) Epoch 22, batch 3900, loss[loss=0.1834, simple_loss=0.2387, pruned_loss=0.06399, over 13222.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2371, pruned_loss=0.06598, over 2587024.32 frames. ], batch size: 83, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:34:17,838 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.024e+02 2.128e+02 2.346e+02 3.309e+02, threshold=4.255e+02, percent-clipped=0.0 2024-06-21 14:34:28,043 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.32 vs. limit=15.0 2024-06-21 14:34:29,203 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.09 vs. limit=15.0 2024-06-21 14:34:33,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=396694.8333333333, ans=0.125 2024-06-21 14:34:37,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=396713.1666666667, ans=0.125 2024-06-21 14:34:40,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=396713.1666666667, ans=0.125 2024-06-21 14:34:42,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=396731.5, ans=0.125 2024-06-21 14:34:48,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=396731.5, ans=0.125 2024-06-21 14:34:49,808 INFO [train.py:1028] (0/2) Epoch 22, batch 3950, loss[loss=0.1913, simple_loss=0.2311, pruned_loss=0.07575, over 13108.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.236, pruned_loss=0.06541, over 2587750.27 frames. 
], batch size: 132, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:35:08,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=396804.8333333333, ans=0.1 2024-06-21 14:35:09,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=396804.8333333333, ans=0.2 2024-06-21 14:35:09,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=396804.8333333333, ans=0.0 2024-06-21 14:35:15,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=396804.8333333333, ans=0.0 2024-06-21 14:35:18,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=396823.1666666667, ans=0.0 2024-06-21 14:35:25,415 INFO [train.py:1028] (0/2) Epoch 22, batch 4000, loss[loss=0.1914, simple_loss=0.2492, pruned_loss=0.06678, over 12917.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2354, pruned_loss=0.06527, over 2582915.62 frames. ], batch size: 39, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:35:25,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396841.5, ans=0.0 2024-06-21 14:35:26,124 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 1.984e+02 2.134e+02 2.275e+02 3.227e+02, threshold=4.269e+02, percent-clipped=0.0 2024-06-21 14:35:31,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=396841.5, ans=0.125 2024-06-21 14:35:31,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=396841.5, ans=0.125 2024-06-21 14:35:33,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=396859.8333333333, ans=0.0 2024-06-21 14:35:34,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=396859.8333333333, ans=0.0 2024-06-21 14:35:56,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=396914.8333333333, ans=0.125 2024-06-21 14:36:02,756 INFO [train.py:1028] (0/2) Epoch 22, batch 4050, loss[loss=0.2068, simple_loss=0.2458, pruned_loss=0.08387, over 11053.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2351, pruned_loss=0.06533, over 2580474.57 frames. 
], batch size: 304, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:36:09,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396951.5, ans=0.1 2024-06-21 14:36:15,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=396969.8333333333, ans=0.2 2024-06-21 14:36:20,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=396969.8333333333, ans=0.0 2024-06-21 14:36:23,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=396988.1666666667, ans=0.125 2024-06-21 14:36:25,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=396988.1666666667, ans=0.125 2024-06-21 14:36:35,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397024.8333333333, ans=0.1 2024-06-21 14:36:35,751 INFO [train.py:1028] (0/2) Epoch 22, batch 4100, loss[loss=0.1857, simple_loss=0.2283, pruned_loss=0.07152, over 13065.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2355, pruned_loss=0.06561, over 2577449.80 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:36:36,339 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 2.025e+02 2.145e+02 2.317e+02 2.980e+02, threshold=4.289e+02, percent-clipped=0.0 2024-06-21 14:36:40,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=397024.8333333333, ans=0.1 2024-06-21 14:36:49,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=397061.5, ans=0.125 2024-06-21 14:36:49,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=397061.5, ans=0.0 2024-06-21 14:36:58,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.64 vs. limit=12.0 2024-06-21 14:36:58,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=397079.8333333333, ans=0.2 2024-06-21 14:37:00,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=397079.8333333333, ans=0.125 2024-06-21 14:37:00,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.63 vs. limit=22.5 2024-06-21 14:37:06,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=397098.1666666667, ans=0.0 2024-06-21 14:37:08,982 INFO [train.py:1028] (0/2) Epoch 22, batch 4150, loss[loss=0.1813, simple_loss=0.2375, pruned_loss=0.06254, over 13132.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2349, pruned_loss=0.06502, over 2576840.76 frames. 
], batch size: 55, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:37:11,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=397116.5, ans=0.0 2024-06-21 14:37:14,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397116.5, ans=0.1 2024-06-21 14:37:21,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=397134.8333333333, ans=0.2 2024-06-21 14:37:30,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=397153.1666666667, ans=0.125 2024-06-21 14:37:38,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=397171.5, ans=0.2 2024-06-21 14:37:44,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=22.5 2024-06-21 14:37:47,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=397189.8333333333, ans=0.125 2024-06-21 14:37:51,439 INFO [train.py:1028] (0/2) Epoch 22, batch 4200, loss[loss=0.1773, simple_loss=0.229, pruned_loss=0.06286, over 13083.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2343, pruned_loss=0.06494, over 2579859.47 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:37:52,077 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 1.971e+02 2.084e+02 2.260e+02 3.127e+02, threshold=4.167e+02, percent-clipped=0.0 2024-06-21 14:37:53,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397208.1666666667, ans=0.1 2024-06-21 14:38:21,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397281.5, ans=0.1 2024-06-21 14:38:24,628 INFO [train.py:1028] (0/2) Epoch 22, batch 4250, loss[loss=0.1804, simple_loss=0.2408, pruned_loss=0.06006, over 13298.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2342, pruned_loss=0.06486, over 2582297.23 frames. 
], batch size: 46, lr: 2.64e-03, grad_scale: 32.0 2024-06-21 14:38:24,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=397299.8333333333, ans=0.125 2024-06-21 14:38:28,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=397299.8333333333, ans=0.125 2024-06-21 14:38:30,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=397318.1666666667, ans=0.125 2024-06-21 14:38:35,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397318.1666666667, ans=0.1 2024-06-21 14:38:39,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=397336.5, ans=0.125 2024-06-21 14:38:45,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=397354.8333333333, ans=0.05 2024-06-21 14:38:51,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=397373.1666666667, ans=0.025 2024-06-21 14:38:58,010 INFO [train.py:1028] (0/2) Epoch 22, batch 4300, loss[loss=0.1812, simple_loss=0.226, pruned_loss=0.06818, over 13255.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2342, pruned_loss=0.0649, over 2582170.57 frames. ], batch size: 59, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:38:58,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.018e+02 2.157e+02 2.357e+02 3.339e+02, threshold=4.314e+02, percent-clipped=0.0 2024-06-21 14:39:01,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=397391.5, ans=0.07 2024-06-21 14:39:05,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=397409.8333333333, ans=0.2 2024-06-21 14:39:13,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=397428.1666666667, ans=0.025 2024-06-21 14:39:28,805 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2024-06-21 14:39:32,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=397464.8333333333, ans=0.2 2024-06-21 14:39:36,690 INFO [train.py:1028] (0/2) Epoch 22, batch 4350, loss[loss=0.1771, simple_loss=0.2327, pruned_loss=0.06078, over 13212.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2334, pruned_loss=0.06476, over 2587380.89 frames. 
], batch size: 59, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:39:38,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=397483.1666666667, ans=0.125 2024-06-21 14:39:40,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=397483.1666666667, ans=0.125 2024-06-21 14:39:42,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=397483.1666666667, ans=0.125 2024-06-21 14:39:51,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397501.5, ans=0.1 2024-06-21 14:39:56,938 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.82 vs. limit=12.0 2024-06-21 14:39:58,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=397519.8333333333, ans=0.125 2024-06-21 14:40:01,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=397538.1666666667, ans=0.0 2024-06-21 14:40:13,181 INFO [train.py:1028] (0/2) Epoch 22, batch 4400, loss[loss=0.1958, simple_loss=0.2413, pruned_loss=0.07512, over 13244.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2332, pruned_loss=0.06493, over 2586557.83 frames. ], batch size: 83, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:40:13,776 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.716e+02 1.992e+02 2.108e+02 2.311e+02 3.009e+02, threshold=4.215e+02, percent-clipped=0.0 2024-06-21 14:40:18,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=397574.8333333333, ans=0.025 2024-06-21 14:40:37,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2024-06-21 14:40:38,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=397629.8333333333, ans=0.125 2024-06-21 14:40:38,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=397648.1666666667, ans=0.0 2024-06-21 14:40:46,601 INFO [train.py:1028] (0/2) Epoch 22, batch 4450, loss[loss=0.1733, simple_loss=0.2288, pruned_loss=0.0589, over 12852.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2336, pruned_loss=0.06523, over 2581074.61 frames. 
], batch size: 33, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:40:50,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=397666.5, ans=0.125 2024-06-21 14:40:50,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=397666.5, ans=0.125 2024-06-21 14:41:04,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=397703.1666666667, ans=0.125 2024-06-21 14:41:04,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=397703.1666666667, ans=0.125 2024-06-21 14:41:05,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2024-06-21 14:41:12,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=397739.8333333333, ans=0.0 2024-06-21 14:41:19,230 INFO [train.py:1028] (0/2) Epoch 22, batch 4500, loss[loss=0.1787, simple_loss=0.2215, pruned_loss=0.068, over 13258.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2335, pruned_loss=0.06528, over 2586066.58 frames. ], batch size: 89, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:41:19,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.24 vs. limit=10.0 2024-06-21 14:41:19,838 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.022e+02 2.162e+02 2.341e+02 3.125e+02, threshold=4.323e+02, percent-clipped=0.0 2024-06-21 14:41:29,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=397776.5, ans=0.125 2024-06-21 14:41:33,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=397776.5, ans=0.0 2024-06-21 14:41:37,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.04 vs. limit=15.0 2024-06-21 14:41:45,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=397813.1666666667, ans=0.125 2024-06-21 14:41:50,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397813.1666666667, ans=0.1 2024-06-21 14:41:53,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=397831.5, ans=0.035 2024-06-21 14:41:53,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=15.0 2024-06-21 14:41:59,241 INFO [train.py:1028] (0/2) Epoch 22, batch 4550, loss[loss=0.1887, simple_loss=0.2436, pruned_loss=0.06688, over 13257.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2333, pruned_loss=0.06531, over 2589851.24 frames. 
], batch size: 52, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:42:05,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397868.1666666667, ans=0.1 2024-06-21 14:42:27,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=397923.1666666667, ans=0.0 2024-06-21 14:42:28,834 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.43 vs. limit=15.0 2024-06-21 14:42:29,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=397923.1666666667, ans=0.125 2024-06-21 14:42:31,682 INFO [train.py:1028] (0/2) Epoch 22, batch 4600, loss[loss=0.1778, simple_loss=0.2251, pruned_loss=0.06528, over 12549.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.233, pruned_loss=0.06475, over 2585799.79 frames. ], batch size: 202, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:42:31,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=397941.5, ans=0.0 2024-06-21 14:42:32,332 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.055e+02 2.218e+02 2.375e+02 3.149e+02, threshold=4.435e+02, percent-clipped=0.0 2024-06-21 14:42:33,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=397941.5, ans=0.125 2024-06-21 14:42:35,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=397941.5, ans=0.025 2024-06-21 14:42:35,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=397941.5, ans=0.2 2024-06-21 14:42:44,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397978.1666666667, ans=0.1 2024-06-21 14:42:47,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=397978.1666666667, ans=0.2 2024-06-21 14:42:56,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.20 vs. limit=22.5 2024-06-21 14:43:03,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=15.0 2024-06-21 14:43:04,228 INFO [train.py:1028] (0/2) Epoch 22, batch 4650, loss[loss=0.187, simple_loss=0.2367, pruned_loss=0.06868, over 13062.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.2324, pruned_loss=0.06465, over 2588375.41 frames. ], batch size: 132, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:43:13,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=398051.5, ans=0.0 2024-06-21 14:43:16,827 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. 
limit=15.0 2024-06-21 14:43:17,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=398069.8333333333, ans=0.125 2024-06-21 14:43:26,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.76 vs. limit=10.0 2024-06-21 14:43:29,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=15.0 2024-06-21 14:43:30,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2024-06-21 14:43:35,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.86 vs. limit=10.0 2024-06-21 14:43:41,271 INFO [train.py:1028] (0/2) Epoch 22, batch 4700, loss[loss=0.1779, simple_loss=0.2427, pruned_loss=0.0566, over 12344.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2329, pruned_loss=0.06507, over 2584023.31 frames. ], batch size: 25, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:43:41,941 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.008e+02 2.163e+02 2.336e+02 3.155e+02, threshold=4.326e+02, percent-clipped=0.0 2024-06-21 14:43:51,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=398143.1666666667, ans=0.125 2024-06-21 14:44:17,598 INFO [train.py:1028] (0/2) Epoch 22, batch 4750, loss[loss=0.2005, simple_loss=0.2478, pruned_loss=0.07659, over 12560.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2321, pruned_loss=0.06462, over 2580390.05 frames. ], batch size: 202, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:44:19,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=398216.5, ans=0.2 2024-06-21 14:44:21,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=398216.5, ans=0.125 2024-06-21 14:44:31,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.20 vs. limit=22.5 2024-06-21 14:44:37,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=398271.5, ans=0.125 2024-06-21 14:44:38,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398271.5, ans=0.1 2024-06-21 14:44:44,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=398289.8333333333, ans=0.1 2024-06-21 14:44:46,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=398289.8333333333, ans=0.0 2024-06-21 14:44:48,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=398289.8333333333, ans=0.125 2024-06-21 14:44:50,211 INFO [train.py:1028] (0/2) Epoch 22, batch 4800, loss[loss=0.1754, simple_loss=0.2196, pruned_loss=0.06563, over 13227.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.232, pruned_loss=0.06481, over 2576822.34 frames. 
], batch size: 63, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:44:50,898 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.024e+02 2.117e+02 2.256e+02 2.983e+02, threshold=4.234e+02, percent-clipped=0.0 2024-06-21 14:44:54,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=398308.1666666667, ans=0.0 2024-06-21 14:44:59,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=398326.5, ans=0.2 2024-06-21 14:45:04,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2024-06-21 14:45:05,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=398344.8333333333, ans=0.0 2024-06-21 14:45:12,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398363.1666666667, ans=0.125 2024-06-21 14:45:15,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=398381.5, ans=0.0 2024-06-21 14:45:26,217 INFO [train.py:1028] (0/2) Epoch 22, batch 4850, loss[loss=0.1674, simple_loss=0.213, pruned_loss=0.06096, over 13215.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2319, pruned_loss=0.06469, over 2574271.50 frames. ], batch size: 89, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:45:27,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=398399.8333333333, ans=0.0 2024-06-21 14:45:28,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.21 vs. limit=15.0 2024-06-21 14:45:33,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=398418.1666666667, ans=0.0 2024-06-21 14:45:47,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=398436.5, ans=0.125 2024-06-21 14:45:52,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=15.0 2024-06-21 14:45:56,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=398454.8333333333, ans=0.07 2024-06-21 14:46:04,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.50 vs. limit=12.0 2024-06-21 14:46:05,844 INFO [train.py:1028] (0/2) Epoch 22, batch 4900, loss[loss=0.1522, simple_loss=0.2008, pruned_loss=0.05184, over 13193.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2319, pruned_loss=0.06468, over 2575086.67 frames. 
], batch size: 59, lr: 2.64e-03, grad_scale: 64.0 2024-06-21 14:46:06,434 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 1.994e+02 2.150e+02 2.264e+02 3.021e+02, threshold=4.300e+02, percent-clipped=0.0 2024-06-21 14:46:08,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=398491.5, ans=0.0 2024-06-21 14:46:09,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398491.5, ans=0.1 2024-06-21 14:46:12,404 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.00 vs. limit=15.0 2024-06-21 14:46:23,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=398528.1666666667, ans=0.125 2024-06-21 14:46:27,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=398546.5, ans=0.0 2024-06-21 14:46:38,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=398583.1666666667, ans=0.125 2024-06-21 14:46:39,310 INFO [train.py:1028] (0/2) Epoch 22, batch 4950, loss[loss=0.1846, simple_loss=0.2245, pruned_loss=0.07241, over 11132.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.2322, pruned_loss=0.06505, over 2570711.30 frames. ], batch size: 304, lr: 2.63e-03, grad_scale: 64.0 2024-06-21 14:46:41,390 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:46:50,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=398601.5, ans=0.0 2024-06-21 14:46:52,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=398619.8333333333, ans=0.125 2024-06-21 14:46:54,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=15.0 2024-06-21 14:46:56,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398619.8333333333, ans=0.125 2024-06-21 14:46:59,457 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.76 vs. limit=15.0 2024-06-21 14:46:59,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.08 vs. limit=15.0 2024-06-21 14:47:09,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2024-06-21 14:47:11,554 INFO [train.py:1028] (0/2) Epoch 22, batch 5000, loss[loss=0.1907, simple_loss=0.2284, pruned_loss=0.07654, over 13130.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2316, pruned_loss=0.06471, over 2573700.19 frames. 
], batch size: 95, lr: 2.63e-03, grad_scale: 64.0 2024-06-21 14:47:12,145 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.646e+02 2.033e+02 2.212e+02 2.427e+02 3.168e+02, threshold=4.423e+02, percent-clipped=0.0 2024-06-21 14:47:12,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398674.8333333333, ans=0.125 2024-06-21 14:47:31,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=398711.5, ans=10.0 2024-06-21 14:47:33,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=398711.5, ans=0.0 2024-06-21 14:47:52,089 INFO [train.py:1028] (0/2) Epoch 22, batch 5050, loss[loss=0.1583, simple_loss=0.2185, pruned_loss=0.04903, over 12936.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2319, pruned_loss=0.06444, over 2572612.92 frames. ], batch size: 36, lr: 2.63e-03, grad_scale: 64.0 2024-06-21 14:48:00,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=398784.8333333333, ans=0.1 2024-06-21 14:48:03,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=398784.8333333333, ans=0.1 2024-06-21 14:48:05,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=398803.1666666667, ans=0.125 2024-06-21 14:48:05,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=398803.1666666667, ans=0.125 2024-06-21 14:48:06,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=398803.1666666667, ans=0.2 2024-06-21 14:48:08,980 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:48:18,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=398839.8333333333, ans=0.0 2024-06-21 14:48:24,948 INFO [train.py:1028] (0/2) Epoch 22, batch 5100, loss[loss=0.1968, simple_loss=0.2519, pruned_loss=0.07088, over 12925.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.2323, pruned_loss=0.06466, over 2568901.56 frames. 
], batch size: 39, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:48:26,354 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.782e+02 1.998e+02 2.133e+02 2.340e+02 3.001e+02, threshold=4.265e+02, percent-clipped=0.0 2024-06-21 14:48:33,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=398876.5, ans=0.2 2024-06-21 14:48:44,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398913.1666666667, ans=0.1 2024-06-21 14:48:45,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=398913.1666666667, ans=0.0 2024-06-21 14:48:50,835 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:48:52,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=398931.5, ans=0.1 2024-06-21 14:48:54,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398931.5, ans=0.1 2024-06-21 14:48:57,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=398949.8333333333, ans=0.025 2024-06-21 14:48:58,432 INFO [train.py:1028] (0/2) Epoch 22, batch 5150, loss[loss=0.1827, simple_loss=0.2272, pruned_loss=0.06913, over 13106.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2323, pruned_loss=0.06479, over 2570725.37 frames. ], batch size: 132, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:49:05,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=398968.1666666667, ans=0.2 2024-06-21 14:49:08,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=398968.1666666667, ans=0.0 2024-06-21 14:49:15,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=398986.5, ans=0.125 2024-06-21 14:49:18,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2024-06-21 14:49:32,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=399023.1666666667, ans=0.125 2024-06-21 14:49:33,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=399023.1666666667, ans=0.025 2024-06-21 14:49:34,953 INFO [train.py:1028] (0/2) Epoch 22, batch 5200, loss[loss=0.1684, simple_loss=0.2181, pruned_loss=0.05936, over 13096.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.232, pruned_loss=0.06446, over 2573397.54 frames. 
], batch size: 95, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:49:35,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=399041.5, ans=0.05 2024-06-21 14:49:36,279 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 2.022e+02 2.167e+02 2.270e+02 3.344e+02, threshold=4.334e+02, percent-clipped=0.0 2024-06-21 14:49:44,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=399059.8333333333, ans=0.125 2024-06-21 14:49:46,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=15.0 2024-06-21 14:49:47,124 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.56 vs. limit=6.0 2024-06-21 14:49:52,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=399078.1666666667, ans=0.2 2024-06-21 14:49:53,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399078.1666666667, ans=0.1 2024-06-21 14:50:02,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=399096.5, ans=10.0 2024-06-21 14:50:05,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399114.8333333333, ans=0.1 2024-06-21 14:50:09,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=399114.8333333333, ans=0.125 2024-06-21 14:50:11,897 INFO [train.py:1028] (0/2) Epoch 22, batch 5250, loss[loss=0.1859, simple_loss=0.2424, pruned_loss=0.06465, over 13263.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.233, pruned_loss=0.06515, over 2570751.26 frames. ], batch size: 52, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:50:22,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=399151.5, ans=0.125 2024-06-21 14:50:29,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399169.8333333333, ans=0.1 2024-06-21 14:50:36,438 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=15.0 2024-06-21 14:50:40,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.04 vs. limit=15.0 2024-06-21 14:50:45,260 INFO [train.py:1028] (0/2) Epoch 22, batch 5300, loss[loss=0.1807, simple_loss=0.2232, pruned_loss=0.06916, over 12996.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2329, pruned_loss=0.06516, over 2567342.76 frames. 
], batch size: 144, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:50:46,493 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.717e+02 2.015e+02 2.116e+02 2.248e+02 2.806e+02, threshold=4.232e+02, percent-clipped=0.0 2024-06-21 14:50:51,960 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:50:52,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=399243.1666666667, ans=0.125 2024-06-21 14:51:02,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=399261.5, ans=0.125 2024-06-21 14:51:08,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=15.0 2024-06-21 14:51:17,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.69 vs. limit=12.0 2024-06-21 14:51:18,640 INFO [train.py:1028] (0/2) Epoch 22, batch 5350, loss[loss=0.1652, simple_loss=0.2263, pruned_loss=0.05207, over 11325.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.2323, pruned_loss=0.06517, over 2573276.22 frames. ], batch size: 16, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:51:23,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=399316.5, ans=0.0 2024-06-21 14:51:24,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399316.5, ans=0.1 2024-06-21 14:51:25,066 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=8.422e+00 2024-06-21 14:51:27,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=399316.5, ans=0.125 2024-06-21 14:51:29,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=399334.8333333333, ans=0.0 2024-06-21 14:51:29,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=399334.8333333333, ans=0.125 2024-06-21 14:51:42,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=399353.1666666667, ans=0.125 2024-06-21 14:51:46,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=399371.5, ans=0.0 2024-06-21 14:51:58,137 INFO [train.py:1028] (0/2) Epoch 22, batch 5400, loss[loss=0.1935, simple_loss=0.2384, pruned_loss=0.07429, over 12255.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2326, pruned_loss=0.06539, over 2566498.94 frames. ], batch size: 241, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:52:00,118 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.009e+02 2.150e+02 2.300e+02 3.036e+02, threshold=4.299e+02, percent-clipped=0.0 2024-06-21 14:52:18,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.65 vs. 
limit=22.5 2024-06-21 14:52:27,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=399481.5, ans=0.125 2024-06-21 14:52:31,956 INFO [train.py:1028] (0/2) Epoch 22, batch 5450, loss[loss=0.1819, simple_loss=0.2299, pruned_loss=0.06702, over 12367.00 frames. ], tot_loss[loss=0.1814, simple_loss=0.2326, pruned_loss=0.0651, over 2571035.91 frames. ], batch size: 25, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:52:45,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=399536.5, ans=0.125 2024-06-21 14:52:46,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.88 vs. limit=15.0 2024-06-21 14:52:53,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399554.8333333333, ans=0.1 2024-06-21 14:52:54,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=399554.8333333333, ans=0.125 2024-06-21 14:53:05,124 INFO [train.py:1028] (0/2) Epoch 22, batch 5500, loss[loss=0.1996, simple_loss=0.2458, pruned_loss=0.07668, over 12122.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2329, pruned_loss=0.06507, over 2564292.62 frames. ], batch size: 240, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:53:06,885 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.045e+02 2.140e+02 2.287e+02 3.017e+02, threshold=4.281e+02, percent-clipped=0.0 2024-06-21 14:53:29,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=399646.5, ans=10.0 2024-06-21 14:53:46,277 INFO [train.py:1028] (0/2) Epoch 22, batch 5550, loss[loss=0.1672, simple_loss=0.2215, pruned_loss=0.05644, over 13159.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2322, pruned_loss=0.06443, over 2567147.76 frames. ], batch size: 43, lr: 2.63e-03, grad_scale: 16.0 2024-06-21 14:53:49,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=399683.1666666667, ans=0.0 2024-06-21 14:53:49,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=399683.1666666667, ans=0.0 2024-06-21 14:53:50,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=399683.1666666667, ans=0.0 2024-06-21 14:53:54,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.22 vs. 
limit=12.0 2024-06-21 14:53:57,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=399701.5, ans=0.0 2024-06-21 14:54:04,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399719.8333333333, ans=0.1 2024-06-21 14:54:05,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=399738.1666666667, ans=0.04949747468305833 2024-06-21 14:54:07,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=399738.1666666667, ans=0.0 2024-06-21 14:54:10,498 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399738.1666666667, ans=0.1 2024-06-21 14:54:17,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=399774.8333333333, ans=0.0 2024-06-21 14:54:18,020 INFO [train.py:1028] (0/2) Epoch 22, batch 5600, loss[loss=0.1666, simple_loss=0.2187, pruned_loss=0.05721, over 13221.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2317, pruned_loss=0.06422, over 2571228.29 frames. ], batch size: 89, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:54:19,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=399774.8333333333, ans=0.125 2024-06-21 14:54:20,013 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 1.966e+02 2.142e+02 2.327e+02 5.044e+02, threshold=4.284e+02, percent-clipped=1.0 2024-06-21 14:54:21,082 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2024-06-21 14:54:24,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399793.1666666667, ans=0.1 2024-06-21 14:54:39,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=399829.8333333333, ans=0.125 2024-06-21 14:54:40,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=399829.8333333333, ans=10.0 2024-06-21 14:54:41,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=399829.8333333333, ans=0.0 2024-06-21 14:54:49,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399848.1666666667, ans=0.1 2024-06-21 14:54:51,109 INFO [train.py:1028] (0/2) Epoch 22, batch 5650, loss[loss=0.1919, simple_loss=0.2394, pruned_loss=0.07217, over 12559.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2317, pruned_loss=0.06429, over 2576911.07 frames. 
], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:54:53,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=399866.5, ans=0.0 2024-06-21 14:54:58,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=399884.8333333333, ans=0.0 2024-06-21 14:55:03,808 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2024-06-21 14:55:06,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.14 vs. limit=5.0 2024-06-21 14:55:12,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=399921.5, ans=0.0 2024-06-21 14:55:16,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=399921.5, ans=0.125 2024-06-21 14:55:27,092 INFO [train.py:1028] (0/2) Epoch 22, batch 5700, loss[loss=0.1654, simple_loss=0.2211, pruned_loss=0.0548, over 13244.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.2312, pruned_loss=0.06423, over 2581052.25 frames. ], batch size: 63, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:55:29,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 1.975e+02 2.072e+02 2.214e+02 2.791e+02, threshold=4.145e+02, percent-clipped=0.0 2024-06-21 14:55:29,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=399958.1666666667, ans=0.125 2024-06-21 14:55:34,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.44 vs. limit=10.0 2024-06-21 14:55:36,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.88 vs. limit=22.5 2024-06-21 14:55:36,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.58 vs. limit=22.5 2024-06-21 14:55:39,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=399976.5, ans=0.125 2024-06-21 14:55:48,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=399994.8333333333, ans=0.125 2024-06-21 14:55:48,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399994.8333333333, ans=0.1 2024-06-21 14:55:52,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400013.1666666667, ans=0.1 2024-06-21 14:55:53,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=400013.1666666667, ans=0.0 2024-06-21 14:56:00,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=400031.5, ans=0.0 2024-06-21 14:56:03,207 INFO [train.py:1028] (0/2) Epoch 22, batch 5750, loss[loss=0.1812, simple_loss=0.2272, pruned_loss=0.06755, over 12709.00 frames. ], tot_loss[loss=0.1802, simple_loss=0.2316, pruned_loss=0.06437, over 2580431.71 frames. 
], batch size: 176, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:56:04,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=400049.8333333333, ans=0.025 2024-06-21 14:56:04,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.01 vs. limit=22.5 2024-06-21 14:56:05,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400049.8333333333, ans=0.1 2024-06-21 14:56:09,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=400068.1666666667, ans=0.0 2024-06-21 14:56:12,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=400068.1666666667, ans=0.025 2024-06-21 14:56:25,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=400104.8333333333, ans=0.125 2024-06-21 14:56:29,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=400123.1666666667, ans=0.1 2024-06-21 14:56:35,175 INFO [train.py:1028] (0/2) Epoch 22, batch 5800, loss[loss=0.1865, simple_loss=0.2386, pruned_loss=0.06718, over 12837.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2329, pruned_loss=0.06514, over 2579294.16 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:56:36,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=400141.5, ans=0.125 2024-06-21 14:56:37,140 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.130e+02 2.282e+02 2.432e+02 3.124e+02, threshold=4.564e+02, percent-clipped=0.0 2024-06-21 14:56:41,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=400159.8333333333, ans=0.125 2024-06-21 14:56:44,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=400159.8333333333, ans=0.2 2024-06-21 14:56:44,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.17 vs. limit=15.0 2024-06-21 14:56:45,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400159.8333333333, ans=0.1 2024-06-21 14:56:48,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=400178.1666666667, ans=0.0 2024-06-21 14:56:50,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=400178.1666666667, ans=0.125 2024-06-21 14:56:50,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=400178.1666666667, ans=0.025 2024-06-21 14:56:53,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.32 vs. 
limit=22.5 2024-06-21 14:57:06,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=400233.1666666667, ans=0.125 2024-06-21 14:57:07,475 INFO [train.py:1028] (0/2) Epoch 22, batch 5850, loss[loss=0.1913, simple_loss=0.2397, pruned_loss=0.07145, over 12577.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2345, pruned_loss=0.06573, over 2577614.98 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:57:10,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.00 vs. limit=22.5 2024-06-21 14:57:23,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0 2024-06-21 14:57:32,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400269.8333333333, ans=0.1 2024-06-21 14:57:39,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=400288.1666666667, ans=0.0 2024-06-21 14:57:39,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=400288.1666666667, ans=0.1 2024-06-21 14:57:41,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=400306.5, ans=0.125 2024-06-21 14:57:48,783 INFO [train.py:1028] (0/2) Epoch 22, batch 5900, loss[loss=0.1751, simple_loss=0.2258, pruned_loss=0.06224, over 13113.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2363, pruned_loss=0.06634, over 2577175.15 frames. ], batch size: 121, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:57:50,943 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.731e+02 2.054e+02 2.189e+02 2.439e+02 3.591e+02, threshold=4.378e+02, percent-clipped=0.0 2024-06-21 14:57:54,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400324.8333333333, ans=0.125 2024-06-21 14:58:01,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2024-06-21 14:58:06,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=400361.5, ans=0.125 2024-06-21 14:58:10,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.60 vs. limit=6.0 2024-06-21 14:58:15,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=400398.1666666667, ans=0.0 2024-06-21 14:58:20,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400398.1666666667, ans=0.1 2024-06-21 14:58:22,210 INFO [train.py:1028] (0/2) Epoch 22, batch 5950, loss[loss=0.1879, simple_loss=0.2331, pruned_loss=0.07139, over 13076.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2375, pruned_loss=0.06691, over 2580680.92 frames. 
], batch size: 121, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:58:22,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.97 vs. limit=22.5 2024-06-21 14:58:26,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.76 vs. limit=15.0 2024-06-21 14:58:28,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=400434.8333333333, ans=0.05 2024-06-21 14:58:29,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=400434.8333333333, ans=0.0 2024-06-21 14:58:35,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=400453.1666666667, ans=0.2 2024-06-21 14:58:41,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=400471.5, ans=0.125 2024-06-21 14:58:41,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=400471.5, ans=0.0 2024-06-21 14:58:42,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=400471.5, ans=0.125 2024-06-21 14:58:55,382 INFO [train.py:1028] (0/2) Epoch 22, batch 6000, loss[loss=0.2179, simple_loss=0.2611, pruned_loss=0.08736, over 12131.00 frames. ], tot_loss[loss=0.187, simple_loss=0.239, pruned_loss=0.06747, over 2573812.46 frames. ], batch size: 240, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:58:55,382 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 14:59:03,338 INFO [train.py:1060] (0/2) Epoch 22, validation: loss=0.1876, simple_loss=0.251, pruned_loss=0.06212, over 351949.00 frames. 2024-06-21 14:59:03,338 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 14:59:05,444 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.073e+02 2.237e+02 2.446e+02 3.016e+02, threshold=4.475e+02, percent-clipped=0.0 2024-06-21 14:59:13,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=400526.5, ans=0.125 2024-06-21 14:59:28,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=400563.1666666667, ans=0.1 2024-06-21 14:59:43,575 INFO [train.py:1028] (0/2) Epoch 22, batch 6050, loss[loss=0.1984, simple_loss=0.2485, pruned_loss=0.07412, over 12795.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2402, pruned_loss=0.06777, over 2577272.97 frames. ], batch size: 39, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 14:59:44,513 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 14:59:50,785 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=27.51 vs. limit=22.5 2024-06-21 14:59:53,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=400618.1666666667, ans=0.0 2024-06-21 15:00:17,160 INFO [train.py:1028] (0/2) Epoch 22, batch 6100, loss[loss=0.1688, simple_loss=0.2188, pruned_loss=0.05943, over 13082.00 frames. 
], tot_loss[loss=0.1888, simple_loss=0.2417, pruned_loss=0.06798, over 2579493.83 frames. ], batch size: 121, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:00:19,149 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.054e+02 2.165e+02 2.342e+02 3.200e+02, threshold=4.330e+02, percent-clipped=0.0 2024-06-21 15:00:26,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=400709.8333333333, ans=0.035 2024-06-21 15:00:30,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=400728.1666666667, ans=0.5 2024-06-21 15:00:40,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400746.5, ans=0.1 2024-06-21 15:00:41,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=400746.5, ans=0.025 2024-06-21 15:00:52,302 INFO [train.py:1028] (0/2) Epoch 22, batch 6150, loss[loss=0.2073, simple_loss=0.2448, pruned_loss=0.08494, over 10931.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2436, pruned_loss=0.06877, over 2577996.54 frames. ], batch size: 304, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:00:53,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400783.1666666667, ans=0.1 2024-06-21 15:01:00,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=400801.5, ans=0.125 2024-06-21 15:01:31,034 INFO [train.py:1028] (0/2) Epoch 22, batch 6200, loss[loss=0.2282, simple_loss=0.2835, pruned_loss=0.08645, over 13297.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.245, pruned_loss=0.06938, over 2575144.74 frames. ], batch size: 89, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:01:38,292 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.235e+02 2.433e+02 2.785e+02 4.406e+02, threshold=4.866e+02, percent-clipped=1.0 2024-06-21 15:01:52,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=400911.5, ans=0.125 2024-06-21 15:02:08,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=400948.1666666667, ans=0.2 2024-06-21 15:02:11,418 INFO [train.py:1028] (0/2) Epoch 22, batch 6250, loss[loss=0.2158, simple_loss=0.2699, pruned_loss=0.08084, over 13211.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2468, pruned_loss=0.06987, over 2568077.85 frames. ], batch size: 83, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:02:15,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2024-06-21 15:02:27,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.93 vs. limit=15.0 2024-06-21 15:02:30,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=401021.5, ans=0.0 2024-06-21 15:02:32,763 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.82 vs. 
limit=10.0 2024-06-21 15:02:38,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=401039.8333333333, ans=0.125 2024-06-21 15:02:44,501 INFO [train.py:1028] (0/2) Epoch 22, batch 6300, loss[loss=0.1671, simple_loss=0.2242, pruned_loss=0.05497, over 11425.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2476, pruned_loss=0.07012, over 2564462.79 frames. ], batch size: 16, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:02:46,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0 2024-06-21 15:02:46,499 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.233e+02 2.386e+02 2.704e+02 4.213e+02, threshold=4.771e+02, percent-clipped=0.0 2024-06-21 15:02:47,473 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.356e+01 2024-06-21 15:02:57,136 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.16 vs. limit=15.0 2024-06-21 15:03:00,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=401094.8333333333, ans=0.07 2024-06-21 15:03:02,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=401094.8333333333, ans=0.0 2024-06-21 15:03:07,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=401113.1666666667, ans=0.0 2024-06-21 15:03:15,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=401131.5, ans=0.125 2024-06-21 15:03:18,019 INFO [train.py:1028] (0/2) Epoch 22, batch 6350, loss[loss=0.2187, simple_loss=0.2662, pruned_loss=0.08561, over 12553.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2489, pruned_loss=0.07015, over 2575299.74 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:03:24,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=401168.1666666667, ans=0.025 2024-06-21 15:03:39,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=401186.5, ans=0.2 2024-06-21 15:03:41,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=401204.8333333333, ans=0.125 2024-06-21 15:03:51,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.41 vs. limit=22.5 2024-06-21 15:03:52,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0 2024-06-21 15:03:55,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=401223.1666666667, ans=0.125 2024-06-21 15:03:57,488 INFO [train.py:1028] (0/2) Epoch 22, batch 6400, loss[loss=0.1934, simple_loss=0.2561, pruned_loss=0.06533, over 13212.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2505, pruned_loss=0.07075, over 2576043.64 frames. 
], batch size: 67, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:03:59,540 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.114e+02 2.250e+02 2.485e+02 3.994e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-21 15:04:00,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=401241.5, ans=0.125 2024-06-21 15:04:02,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=401241.5, ans=0.125 2024-06-21 15:04:10,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.90 vs. limit=22.5 2024-06-21 15:04:12,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=401278.1666666667, ans=0.07 2024-06-21 15:04:13,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=401278.1666666667, ans=0.0 2024-06-21 15:04:19,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=401296.5, ans=10.0 2024-06-21 15:04:21,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401296.5, ans=0.0 2024-06-21 15:04:24,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.38 vs. limit=12.0 2024-06-21 15:04:27,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.50 vs. limit=15.0 2024-06-21 15:04:29,993 INFO [train.py:1028] (0/2) Epoch 22, batch 6450, loss[loss=0.2194, simple_loss=0.2706, pruned_loss=0.08411, over 12577.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2518, pruned_loss=0.07133, over 2581680.00 frames. ], batch size: 202, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:04:35,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=401333.1666666667, ans=0.125 2024-06-21 15:04:36,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.50 vs. limit=10.0 2024-06-21 15:04:47,812 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:04:54,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=401388.1666666667, ans=0.125 2024-06-21 15:04:56,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2024-06-21 15:05:01,540 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.28 vs. limit=15.0 2024-06-21 15:05:02,393 INFO [train.py:1028] (0/2) Epoch 22, batch 6500, loss[loss=0.2094, simple_loss=0.2521, pruned_loss=0.08332, over 10813.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2532, pruned_loss=0.07168, over 2585538.99 frames. 
], batch size: 303, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:05:03,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=401424.8333333333, ans=0.025 2024-06-21 15:05:03,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=401424.8333333333, ans=0.0 2024-06-21 15:05:04,302 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.171e+02 2.322e+02 2.518e+02 3.300e+02, threshold=4.645e+02, percent-clipped=0.0 2024-06-21 15:05:27,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=401498.1666666667, ans=0.2 2024-06-21 15:05:31,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=401498.1666666667, ans=0.09899494936611666 2024-06-21 15:05:33,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=401516.5, ans=0.125 2024-06-21 15:05:34,344 INFO [train.py:1028] (0/2) Epoch 22, batch 6550, loss[loss=0.181, simple_loss=0.2364, pruned_loss=0.06281, over 12523.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2543, pruned_loss=0.07165, over 2590441.91 frames. ], batch size: 22, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:05:34,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=401516.5, ans=0.0 2024-06-21 15:05:52,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=401553.1666666667, ans=0.0 2024-06-21 15:05:54,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=401553.1666666667, ans=0.0 2024-06-21 15:06:01,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=401571.5, ans=0.125 2024-06-21 15:06:06,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=401589.8333333333, ans=0.125 2024-06-21 15:06:10,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=401589.8333333333, ans=0.0 2024-06-21 15:06:11,374 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.20 vs. limit=10.0 2024-06-21 15:06:12,153 INFO [train.py:1028] (0/2) Epoch 22, batch 6600, loss[loss=0.197, simple_loss=0.2526, pruned_loss=0.07075, over 13247.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2552, pruned_loss=0.0721, over 2592964.60 frames. 
], batch size: 72, lr: 2.63e-03, grad_scale: 32.0 2024-06-21 15:06:14,237 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.212e+02 2.350e+02 2.504e+02 3.124e+02, threshold=4.701e+02, percent-clipped=0.0 2024-06-21 15:06:14,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=401608.1666666667, ans=0.0 2024-06-21 15:06:21,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=401626.5, ans=0.0 2024-06-21 15:06:24,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=401626.5, ans=0.125 2024-06-21 15:06:33,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=401663.1666666667, ans=0.2 2024-06-21 15:06:33,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=401663.1666666667, ans=0.09899494936611666 2024-06-21 15:06:34,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.67 vs. limit=22.5 2024-06-21 15:06:34,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=401663.1666666667, ans=0.125 2024-06-21 15:06:36,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=401663.1666666667, ans=0.0 2024-06-21 15:06:39,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=401681.5, ans=0.0 2024-06-21 15:06:44,800 INFO [train.py:1028] (0/2) Epoch 22, batch 6650, loss[loss=0.2162, simple_loss=0.2658, pruned_loss=0.0833, over 13013.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.257, pruned_loss=0.07283, over 2587094.62 frames. ], batch size: 158, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:06:54,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=401718.1666666667, ans=0.125 2024-06-21 15:07:02,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=401736.5, ans=0.0 2024-06-21 15:07:11,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=401773.1666666667, ans=0.125 2024-06-21 15:07:17,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401791.5, ans=0.1 2024-06-21 15:07:17,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401791.5, ans=0.1 2024-06-21 15:07:17,635 INFO [train.py:1028] (0/2) Epoch 22, batch 6700, loss[loss=0.2135, simple_loss=0.2683, pruned_loss=0.07933, over 12732.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2582, pruned_loss=0.07339, over 2586244.01 frames. 
], batch size: 176, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:07:19,454 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.237e+02 2.396e+02 2.624e+02 3.925e+02, threshold=4.792e+02, percent-clipped=0.0 2024-06-21 15:07:25,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=401809.8333333333, ans=0.2 2024-06-21 15:07:27,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=401809.8333333333, ans=0.0 2024-06-21 15:07:28,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401809.8333333333, ans=0.0 2024-06-21 15:07:30,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=401809.8333333333, ans=0.2 2024-06-21 15:07:36,173 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0 2024-06-21 15:07:57,636 INFO [train.py:1028] (0/2) Epoch 22, batch 6750, loss[loss=0.2499, simple_loss=0.2926, pruned_loss=0.1036, over 12181.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2591, pruned_loss=0.07362, over 2580439.74 frames. ], batch size: 240, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:08:04,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.13 vs. limit=22.5 2024-06-21 15:08:06,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=401901.5, ans=0.025 2024-06-21 15:08:14,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=401919.8333333333, ans=0.0 2024-06-21 15:08:16,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=401938.1666666667, ans=0.125 2024-06-21 15:08:16,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.61 vs. limit=22.5 2024-06-21 15:08:25,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=401956.5, ans=0.125 2024-06-21 15:08:28,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401956.5, ans=0.1 2024-06-21 15:08:29,658 INFO [train.py:1028] (0/2) Epoch 22, batch 6800, loss[loss=0.1956, simple_loss=0.2516, pruned_loss=0.06977, over 13257.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2596, pruned_loss=0.07351, over 2581645.74 frames. ], batch size: 67, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:08:31,784 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.213e+02 2.403e+02 2.710e+02 4.229e+02, threshold=4.805e+02, percent-clipped=0.0 2024-06-21 15:08:34,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.59 vs. 
limit=15.0 2024-06-21 15:08:42,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=402011.5, ans=0.125 2024-06-21 15:08:45,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=402011.5, ans=0.125 2024-06-21 15:08:49,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=15.0 2024-06-21 15:08:54,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.75 vs. limit=22.5 2024-06-21 15:08:57,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=402048.1666666667, ans=0.125 2024-06-21 15:09:02,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=402066.5, ans=0.0 2024-06-21 15:09:02,674 INFO [train.py:1028] (0/2) Epoch 22, batch 6850, loss[loss=0.2248, simple_loss=0.29, pruned_loss=0.07977, over 13277.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2611, pruned_loss=0.07392, over 2584471.05 frames. ], batch size: 63, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:09:03,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2024-06-21 15:09:20,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402103.1666666667, ans=0.1 2024-06-21 15:09:34,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=402158.1666666667, ans=0.2 2024-06-21 15:09:34,743 INFO [train.py:1028] (0/2) Epoch 22, batch 6900, loss[loss=0.1931, simple_loss=0.2519, pruned_loss=0.06721, over 13093.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2615, pruned_loss=0.07432, over 2585961.97 frames. ], batch size: 48, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:09:41,826 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.217e+02 2.472e+02 2.679e+02 3.885e+02, threshold=4.944e+02, percent-clipped=0.0 2024-06-21 15:09:44,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=402158.1666666667, ans=0.125 2024-06-21 15:09:59,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2024-06-21 15:10:06,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402213.1666666667, ans=0.1 2024-06-21 15:10:07,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=402213.1666666667, ans=0.015 2024-06-21 15:10:14,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=402231.5, ans=0.2 2024-06-21 15:10:16,875 INFO [train.py:1028] (0/2) Epoch 22, batch 6950, loss[loss=0.1772, simple_loss=0.2413, pruned_loss=0.05661, over 11342.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2617, pruned_loss=0.07427, over 2578865.30 frames. 
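The ScheduledFloat lines track hyperparameters (skip rates, dropout probabilities, balancer limits) whose current value "ans" is a function of batch_count rather than a fixed constant. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints below are illustrative, not the model's actual schedule:

def scheduled_float(batch_count, schedule=((0.0, 0.3), (20000.0, 0.1), (80000.0, 0.0))):
    # Piecewise-linear in batch_count, clamped to the endpoint values
    # outside the breakpoint range.
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count < x1:
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
    return schedule[-1][1]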
], batch size: 16, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:10:17,860 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.93 vs. limit=12.0 2024-06-21 15:10:19,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=402249.8333333333, ans=0.125 2024-06-21 15:10:22,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=402268.1666666667, ans=0.125 2024-06-21 15:10:27,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=402268.1666666667, ans=0.0 2024-06-21 15:10:44,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=402323.1666666667, ans=0.125 2024-06-21 15:10:49,620 INFO [train.py:1028] (0/2) Epoch 22, batch 7000, loss[loss=0.2147, simple_loss=0.2715, pruned_loss=0.0789, over 12942.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2617, pruned_loss=0.07416, over 2576545.39 frames. ], batch size: 158, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:10:49,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=402341.5, ans=0.125 2024-06-21 15:10:51,523 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 2.184e+02 2.366e+02 2.623e+02 4.675e+02, threshold=4.731e+02, percent-clipped=0.0 2024-06-21 15:11:15,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=402396.5, ans=0.09899494936611666 2024-06-21 15:11:21,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.71 vs. limit=15.0 2024-06-21 15:11:23,604 INFO [train.py:1028] (0/2) Epoch 22, batch 7050, loss[loss=0.2323, simple_loss=0.2836, pruned_loss=0.09054, over 12715.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2618, pruned_loss=0.07421, over 2583819.03 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:11:25,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402433.1666666667, ans=0.1 2024-06-21 15:11:26,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=402433.1666666667, ans=0.125 2024-06-21 15:11:29,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=402451.5, ans=0.09899494936611666 2024-06-21 15:11:43,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=402488.1666666667, ans=0.0 2024-06-21 15:11:45,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=402488.1666666667, ans=0.07 2024-06-21 15:11:56,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=402506.5, ans=0.125 2024-06-21 15:11:56,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=402506.5, ans=0.125 2024-06-21 15:12:03,186 INFO [train.py:1028] (0/2) Epoch 22, batch 7100, loss[loss=0.2164, simple_loss=0.2782, pruned_loss=0.07735, over 13168.00 frames. 
], tot_loss[loss=0.2059, simple_loss=0.2624, pruned_loss=0.07466, over 2575931.25 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:12:03,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402524.8333333333, ans=0.1 2024-06-21 15:12:05,151 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.161e+02 2.310e+02 2.495e+02 4.017e+02, threshold=4.621e+02, percent-clipped=0.0 2024-06-21 15:12:10,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=402543.1666666667, ans=0.125 2024-06-21 15:12:11,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=402543.1666666667, ans=0.125 2024-06-21 15:12:15,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2024-06-21 15:12:16,595 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.95 vs. limit=15.0 2024-06-21 15:12:17,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=402561.5, ans=0.0 2024-06-21 15:12:19,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.30 vs. limit=15.0 2024-06-21 15:12:27,489 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2024-06-21 15:12:35,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=402598.1666666667, ans=0.05 2024-06-21 15:12:36,515 INFO [train.py:1028] (0/2) Epoch 22, batch 7150, loss[loss=0.2303, simple_loss=0.281, pruned_loss=0.08975, over 12526.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2631, pruned_loss=0.07457, over 2573467.14 frames. ], batch size: 202, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:12:43,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=402634.8333333333, ans=0.125 2024-06-21 15:12:44,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=402634.8333333333, ans=0.2 2024-06-21 15:12:58,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=402671.5, ans=0.0 2024-06-21 15:13:03,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=402689.8333333333, ans=0.0 2024-06-21 15:13:05,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=402689.8333333333, ans=0.0 2024-06-21 15:13:08,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=402708.1666666667, ans=0.2 2024-06-21 15:13:08,518 INFO [train.py:1028] (0/2) Epoch 22, batch 7200, loss[loss=0.205, simple_loss=0.2631, pruned_loss=0.07349, over 13156.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2644, pruned_loss=0.07505, over 2578290.10 frames. 
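The three numbers inside each loss[...] and tot_loss[...] bracket are not independent: throughout this span the logged loss agrees with 0.5 * simple_loss + pruned_loss (for the tot_loss just above, 0.5 * 0.2644 + 0.07505 = 0.20725, matching the logged 0.2072), i.e. a fixed-weight combination of the simple and pruned transducer losses. A one-line check, with the 0.5 weight read off these logged values rather than quoted from any configuration:

simple_loss, pruned_loss = 0.2644, 0.07505
loss = 0.5 * simple_loss + pruned_loss  # -> 0.20725, i.e. the logged 0.2072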
], batch size: 112, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:13:09,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=402708.1666666667, ans=0.125 2024-06-21 15:13:10,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.152e+02 2.335e+02 2.606e+02 3.795e+02, threshold=4.669e+02, percent-clipped=0.0 2024-06-21 15:13:15,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=402726.5, ans=0.125 2024-06-21 15:13:15,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=402726.5, ans=0.0 2024-06-21 15:13:25,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402744.8333333333, ans=0.1 2024-06-21 15:13:28,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=402763.1666666667, ans=0.125 2024-06-21 15:13:31,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=402763.1666666667, ans=0.125 2024-06-21 15:13:34,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=402781.5, ans=0.0 2024-06-21 15:13:34,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=402781.5, ans=0.125 2024-06-21 15:13:35,049 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.50 vs. limit=12.0 2024-06-21 15:13:39,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=402781.5, ans=0.025 2024-06-21 15:13:41,283 INFO [train.py:1028] (0/2) Epoch 22, batch 7250, loss[loss=0.2137, simple_loss=0.2729, pruned_loss=0.07729, over 12921.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2656, pruned_loss=0.07535, over 2579469.93 frames. ], batch size: 36, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:13:49,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=402818.1666666667, ans=0.0 2024-06-21 15:13:49,324 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=12.0 2024-06-21 15:13:50,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402818.1666666667, ans=0.1 2024-06-21 15:13:51,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=402818.1666666667, ans=0.07 2024-06-21 15:14:05,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=402836.5, ans=0.2 2024-06-21 15:14:20,815 INFO [train.py:1028] (0/2) Epoch 22, batch 7300, loss[loss=0.2122, simple_loss=0.2726, pruned_loss=0.0759, over 12980.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2673, pruned_loss=0.07614, over 2579257.33 frames. 
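The Whitening lines compare a per-module "metric" against a scheduled "limit": the metric gauges how far a module's output covariance is from a scalar multiple of the identity, 1.0 when perfectly white and larger as the channels become more anisotropic. A minimal sketch of one such metric, assuming an eigenvalue-dispersion definition; the actual scaling.py formula may differ:

import torch

def whitening_metric(x):
    # x: (num_frames, num_channels) activations from one module.
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    # E[lambda^2] / E[lambda]^2 >= 1, equal to 1 iff all eigenvalues coincide,
    # i.e. iff the covariance is a multiple of the identity.
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()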
], batch size: 36, lr: 2.62e-03, grad_scale: 32.0 2024-06-21 15:14:22,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=402891.5, ans=0.0 2024-06-21 15:14:22,767 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.349e+02 2.581e+02 2.877e+02 4.145e+02, threshold=5.162e+02, percent-clipped=0.0 2024-06-21 15:14:23,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=402891.5, ans=0.125 2024-06-21 15:14:24,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402891.5, ans=0.1 2024-06-21 15:14:26,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=402891.5, ans=0.125 2024-06-21 15:14:30,012 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2024-06-21 15:14:50,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402964.8333333333, ans=0.1 2024-06-21 15:14:53,184 INFO [train.py:1028] (0/2) Epoch 22, batch 7350, loss[loss=0.2106, simple_loss=0.2652, pruned_loss=0.07801, over 13296.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.267, pruned_loss=0.07611, over 2580293.82 frames. ], batch size: 46, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:15:01,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=403001.5, ans=0.125 2024-06-21 15:15:09,354 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.80 vs. limit=10.0 2024-06-21 15:15:26,132 INFO [train.py:1028] (0/2) Epoch 22, batch 7400, loss[loss=0.2369, simple_loss=0.3008, pruned_loss=0.08651, over 13196.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2671, pruned_loss=0.07589, over 2586752.86 frames. ], batch size: 63, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:15:28,127 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.244e+02 2.421e+02 2.720e+02 3.518e+02, threshold=4.842e+02, percent-clipped=0.0 2024-06-21 15:15:38,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=403093.1666666667, ans=0.125 2024-06-21 15:15:41,575 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.56 vs. limit=22.5 2024-06-21 15:16:03,835 INFO [train.py:1028] (0/2) Epoch 22, batch 7450, loss[loss=0.1639, simple_loss=0.226, pruned_loss=0.05083, over 12744.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2671, pruned_loss=0.07586, over 2580067.03 frames. ], batch size: 29, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:16:05,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. 
limit=12.0 2024-06-21 15:16:15,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=403184.8333333333, ans=0.0 2024-06-21 15:16:21,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=403203.1666666667, ans=0.0 2024-06-21 15:16:21,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2024-06-21 15:16:35,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=403239.8333333333, ans=0.125 2024-06-21 15:16:40,957 INFO [train.py:1028] (0/2) Epoch 22, batch 7500, loss[loss=0.2167, simple_loss=0.2635, pruned_loss=0.08492, over 10874.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2678, pruned_loss=0.07638, over 2578210.95 frames. ], batch size: 304, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:16:42,894 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.238e+02 2.423e+02 2.635e+02 3.666e+02, threshold=4.846e+02, percent-clipped=0.0 2024-06-21 15:16:45,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=403258.1666666667, ans=0.2 2024-06-21 15:16:50,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=403276.5, ans=0.1 2024-06-21 15:16:51,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2024-06-21 15:16:58,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.97 vs. limit=22.5 2024-06-21 15:17:00,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=403313.1666666667, ans=0.125 2024-06-21 15:17:04,030 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:17:07,788 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-220000.pt 2024-06-21 15:17:19,193 INFO [train.py:1028] (0/2) Epoch 22, batch 7550, loss[loss=0.2181, simple_loss=0.2674, pruned_loss=0.08445, over 12960.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2689, pruned_loss=0.07711, over 2577625.90 frames. ], batch size: 158, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:17:25,671 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.26 vs. limit=12.0 2024-06-21 15:17:27,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=403368.1666666667, ans=0.2 2024-06-21 15:17:32,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2024-06-21 15:17:33,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=403386.5, ans=0.0 2024-06-21 15:17:34,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.01 vs. 
limit=12.0 2024-06-21 15:17:40,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=403404.8333333333, ans=15.0 2024-06-21 15:17:44,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=403404.8333333333, ans=0.125 2024-06-21 15:17:52,510 INFO [train.py:1028] (0/2) Epoch 22, batch 7600, loss[loss=0.2165, simple_loss=0.2741, pruned_loss=0.07942, over 13188.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2689, pruned_loss=0.07701, over 2576443.98 frames. ], batch size: 83, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:17:54,590 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.194e+02 2.369e+02 2.611e+02 4.041e+02, threshold=4.737e+02, percent-clipped=0.0 2024-06-21 15:17:55,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.67 vs. limit=22.5 2024-06-21 15:17:56,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.69 vs. limit=15.0 2024-06-21 15:18:02,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=403459.8333333333, ans=0.125 2024-06-21 15:18:05,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=403478.1666666667, ans=0.2 2024-06-21 15:18:06,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403478.1666666667, ans=0.1 2024-06-21 15:18:20,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=403496.5, ans=0.1 2024-06-21 15:18:29,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=403514.8333333333, ans=0.5 2024-06-21 15:18:32,888 INFO [train.py:1028] (0/2) Epoch 22, batch 7650, loss[loss=0.1911, simple_loss=0.2592, pruned_loss=0.06149, over 12941.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2694, pruned_loss=0.07722, over 2572968.32 frames. ], batch size: 33, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:18:33,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403533.1666666667, ans=0.1 2024-06-21 15:18:37,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.73 vs. limit=15.0 2024-06-21 15:18:56,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=403588.1666666667, ans=0.0 2024-06-21 15:19:06,283 INFO [train.py:1028] (0/2) Epoch 22, batch 7700, loss[loss=0.2365, simple_loss=0.2964, pruned_loss=0.0883, over 13257.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2696, pruned_loss=0.0771, over 2569432.50 frames. 
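The checkpoint.py line shortly above ("Saving checkpoint to zipformer/exp/checkpoint-220000.pt") is a batch-indexed save taken mid-epoch, separate from any per-epoch checkpoints. A minimal sketch of that pattern; the every_n argument and the state-dict field names are assumptions for illustration:

import torch

def maybe_save_checkpoint(batch_idx_train, every_n, model, optimizer, exp_dir):
    # Save a numbered mid-epoch checkpoint such as checkpoint-220000.pt.
    if batch_idx_train > 0 and batch_idx_train % every_n == 0:
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            f"{exp_dir}/checkpoint-{batch_idx_train}.pt",
        )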
], batch size: 63, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:19:08,338 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.221e+02 2.397e+02 2.603e+02 3.385e+02, threshold=4.794e+02, percent-clipped=0.0 2024-06-21 15:19:14,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=403643.1666666667, ans=0.0 2024-06-21 15:19:32,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=403698.1666666667, ans=0.125 2024-06-21 15:19:32,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.09 vs. limit=22.5 2024-06-21 15:19:35,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=403698.1666666667, ans=0.0 2024-06-21 15:19:39,434 INFO [train.py:1028] (0/2) Epoch 22, batch 7750, loss[loss=0.1938, simple_loss=0.2629, pruned_loss=0.06236, over 13206.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2708, pruned_loss=0.07785, over 2574098.63 frames. ], batch size: 72, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:19:45,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=403734.8333333333, ans=0.025 2024-06-21 15:19:54,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=403753.1666666667, ans=0.1 2024-06-21 15:19:55,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2024-06-21 15:19:56,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=403753.1666666667, ans=0.0 2024-06-21 15:20:03,775 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2024-06-21 15:20:15,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.38 vs. limit=10.0 2024-06-21 15:20:17,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=403789.8333333333, ans=0.0 2024-06-21 15:20:19,151 INFO [train.py:1028] (0/2) Epoch 22, batch 7800, loss[loss=0.2314, simple_loss=0.283, pruned_loss=0.0899, over 13111.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2715, pruned_loss=0.07801, over 2578911.03 frames. ], batch size: 95, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:20:19,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=8.0 2024-06-21 15:20:21,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.304e+02 2.551e+02 2.790e+02 3.705e+02, threshold=5.101e+02, percent-clipped=0.0 2024-06-21 15:20:23,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.91 vs. 
limit=15.0 2024-06-21 15:20:29,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=403826.5, ans=0.125 2024-06-21 15:20:35,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=403844.8333333333, ans=0.125 2024-06-21 15:20:45,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=403881.5, ans=0.125 2024-06-21 15:20:52,956 INFO [train.py:1028] (0/2) Epoch 22, batch 7850, loss[loss=0.2026, simple_loss=0.266, pruned_loss=0.06959, over 11568.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2725, pruned_loss=0.07837, over 2572766.40 frames. ], batch size: 16, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:20:58,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403899.8333333333, ans=0.1 2024-06-21 15:21:00,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403918.1666666667, ans=0.1 2024-06-21 15:21:08,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403936.5, ans=0.1 2024-06-21 15:21:25,503 INFO [train.py:1028] (0/2) Epoch 22, batch 7900, loss[loss=0.2138, simple_loss=0.2656, pruned_loss=0.08098, over 13169.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2725, pruned_loss=0.07848, over 2572266.59 frames. ], batch size: 77, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:21:27,616 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.330e+02 2.467e+02 2.824e+02 4.195e+02, threshold=4.933e+02, percent-clipped=0.0 2024-06-21 15:21:35,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=404009.8333333333, ans=0.0 2024-06-21 15:21:36,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=404009.8333333333, ans=0.125 2024-06-21 15:21:36,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=404009.8333333333, ans=0.95 2024-06-21 15:21:41,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=404028.1666666667, ans=0.0 2024-06-21 15:21:42,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=404028.1666666667, ans=0.125 2024-06-21 15:21:58,986 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=22.5 2024-06-21 15:21:59,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=404064.8333333333, ans=0.2 2024-06-21 15:21:59,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=404064.8333333333, ans=0.2 2024-06-21 15:22:06,424 INFO [train.py:1028] (0/2) Epoch 22, batch 7950, loss[loss=0.2347, simple_loss=0.282, pruned_loss=0.09368, over 10562.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2732, pruned_loss=0.07881, over 2575727.76 frames. 
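Each train.py:1028 line carries two brackets: loss[...] over the current batch alone (roughly 10k-13k frames here) and tot_loss[...] over about 2.57-2.59 million frames, an aggregate of many recent batches, which is why tot_loss drifts slowly while the per-batch loss jumps around. A minimal sketch of a frame-weighted running average of that kind; the exponential decay is an assumption, since the logs show only the aggregate:

class RunningLoss:
    def __init__(self, decay=0.999):
        self.decay = decay  # assumed forgetting factor
        self.weighted_loss = 0.0
        self.frames = 0.0

    def update(self, batch_loss, batch_frames):
        # Accumulate frame-weighted loss with exponential forgetting.
        self.weighted_loss = self.decay * self.weighted_loss + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    def value(self):
        return self.weighted_loss / max(self.frames, 1.0)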
], batch size: 303, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:22:18,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=404101.5, ans=0.0 2024-06-21 15:22:28,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. limit=5.0 2024-06-21 15:22:28,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=404138.1666666667, ans=0.125 2024-06-21 15:22:32,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=404156.5, ans=10.0 2024-06-21 15:22:33,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=404156.5, ans=0.125 2024-06-21 15:22:36,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404156.5, ans=0.125 2024-06-21 15:22:37,749 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2024-06-21 15:22:39,594 INFO [train.py:1028] (0/2) Epoch 22, batch 8000, loss[loss=0.1837, simple_loss=0.2489, pruned_loss=0.05929, over 12604.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2732, pruned_loss=0.07839, over 2571468.16 frames. ], batch size: 29, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:22:41,639 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.266e+02 2.478e+02 2.713e+02 3.698e+02, threshold=4.957e+02, percent-clipped=0.0 2024-06-21 15:22:41,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=404174.8333333333, ans=0.125 2024-06-21 15:22:44,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404174.8333333333, ans=0.1 2024-06-21 15:22:49,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0 2024-06-21 15:22:50,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=404193.1666666667, ans=0.0 2024-06-21 15:22:54,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=404211.5, ans=0.0 2024-06-21 15:22:56,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=404211.5, ans=0.025 2024-06-21 15:22:56,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=404211.5, ans=0.125 2024-06-21 15:22:59,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=404229.8333333333, ans=0.125 2024-06-21 15:23:11,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404248.1666666667, ans=0.1 2024-06-21 15:23:12,948 INFO [train.py:1028] (0/2) Epoch 22, batch 8050, loss[loss=0.2238, simple_loss=0.2805, pruned_loss=0.08355, over 13224.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2735, pruned_loss=0.07856, over 2571759.17 frames. 
], batch size: 83, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:23:17,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404266.5, ans=0.1 2024-06-21 15:23:23,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=404284.8333333333, ans=0.0 2024-06-21 15:23:44,971 INFO [train.py:1028] (0/2) Epoch 22, batch 8100, loss[loss=0.2119, simple_loss=0.2711, pruned_loss=0.07638, over 13183.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2737, pruned_loss=0.07877, over 2576329.98 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:23:45,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=404358.1666666667, ans=0.125 2024-06-21 15:23:46,903 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.260e+02 2.365e+02 2.604e+02 3.308e+02, threshold=4.729e+02, percent-clipped=0.0 2024-06-21 15:23:48,170 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2024-06-21 15:23:54,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.77 vs. limit=10.0 2024-06-21 15:24:17,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=404413.1666666667, ans=0.2 2024-06-21 15:24:27,426 INFO [train.py:1028] (0/2) Epoch 22, batch 8150, loss[loss=0.2174, simple_loss=0.272, pruned_loss=0.08147, over 13037.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2738, pruned_loss=0.0784, over 2579855.38 frames. ], batch size: 121, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:24:27,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=404449.8333333333, ans=0.125 2024-06-21 15:24:28,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404449.8333333333, ans=0.1 2024-06-21 15:24:30,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=404449.8333333333, ans=0.125 2024-06-21 15:24:30,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2024-06-21 15:24:31,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=404449.8333333333, ans=0.2 2024-06-21 15:24:33,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=404468.1666666667, ans=0.0 2024-06-21 15:24:43,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=404486.5, ans=0.125 2024-06-21 15:24:46,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=404504.8333333333, ans=0.125 2024-06-21 15:24:54,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.62 vs. 
limit=22.5 2024-06-21 15:24:59,718 INFO [train.py:1028] (0/2) Epoch 22, batch 8200, loss[loss=0.2013, simple_loss=0.2711, pruned_loss=0.06578, over 13091.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2748, pruned_loss=0.07864, over 2583276.81 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:25:00,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=404541.5, ans=0.2 2024-06-21 15:25:01,511 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.260e+02 2.446e+02 2.726e+02 3.209e+02, threshold=4.892e+02, percent-clipped=0.0 2024-06-21 15:25:02,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404541.5, ans=0.1 2024-06-21 15:25:14,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404578.1666666667, ans=0.125 2024-06-21 15:25:25,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404596.5, ans=0.125 2024-06-21 15:25:25,578 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:25:32,912 INFO [train.py:1028] (0/2) Epoch 22, batch 8250, loss[loss=0.2237, simple_loss=0.2905, pruned_loss=0.07842, over 13301.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2751, pruned_loss=0.07875, over 2584205.34 frames. ], batch size: 52, lr: 2.62e-03, grad_scale: 64.0 2024-06-21 15:25:36,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404633.1666666667, ans=0.1 2024-06-21 15:25:40,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=404651.5, ans=0.125 2024-06-21 15:25:45,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=404669.8333333333, ans=0.0 2024-06-21 15:25:45,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404669.8333333333, ans=0.1 2024-06-21 15:26:05,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=404706.5, ans=0.125 2024-06-21 15:26:08,301 INFO [train.py:1028] (0/2) Epoch 22, batch 8300, loss[loss=0.2432, simple_loss=0.3024, pruned_loss=0.09195, over 13022.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2742, pruned_loss=0.07831, over 2581835.57 frames. 
], batch size: 102, lr: 2.61e-03, grad_scale: 64.0 2024-06-21 15:26:13,456 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.206e+02 2.332e+02 2.465e+02 3.147e+02, threshold=4.664e+02, percent-clipped=0.0 2024-06-21 15:26:14,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=404724.8333333333, ans=0.125 2024-06-21 15:26:18,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=404743.1666666667, ans=0.125 2024-06-21 15:26:18,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404743.1666666667, ans=0.1 2024-06-21 15:26:31,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=404779.8333333333, ans=0.125 2024-06-21 15:26:33,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2024-06-21 15:26:44,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=404816.5, ans=0.125 2024-06-21 15:26:44,461 INFO [train.py:1028] (0/2) Epoch 22, batch 8350, loss[loss=0.2137, simple_loss=0.2727, pruned_loss=0.07736, over 13194.00 frames. ], tot_loss[loss=0.215, simple_loss=0.274, pruned_loss=0.07795, over 2582463.34 frames. ], batch size: 112, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:26:57,567 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.569e-03 2024-06-21 15:27:00,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.98 vs. limit=22.5 2024-06-21 15:27:04,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.69 vs. limit=22.5 2024-06-21 15:27:06,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=404871.5, ans=0.125 2024-06-21 15:27:07,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=404871.5, ans=0.015 2024-06-21 15:27:14,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=404889.8333333333, ans=0.125 2024-06-21 15:27:18,194 INFO [train.py:1028] (0/2) Epoch 22, batch 8400, loss[loss=0.1987, simple_loss=0.262, pruned_loss=0.06767, over 12952.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2743, pruned_loss=0.07818, over 2579393.41 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:27:20,796 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.276e+02 2.418e+02 2.610e+02 3.662e+02, threshold=4.835e+02, percent-clipped=0.0 2024-06-21 15:27:29,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=15.0 2024-06-21 15:27:50,800 INFO [train.py:1028] (0/2) Epoch 22, batch 8450, loss[loss=0.2149, simple_loss=0.2743, pruned_loss=0.07776, over 13172.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.275, pruned_loss=0.07832, over 2579525.74 frames. 
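The grad_scale field reflects dynamic loss scaling for mixed-precision training: it doubled from 32.0 to 64.0 around batch 7350 and is back at 32.0 by batch 8350 above, consistent with the usual grow-after-stable-steps, halve-on-overflow behavior. A minimal sketch using PyTorch's stock GradScaler; icefall carries its own scaling logic, so treat this as the generic pattern rather than the project's exact code:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def training_step(model, batch, optimizer, compute_loss):
    # compute_loss is an assumed helper returning the training loss.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped automatically if inf/nan gradients appear
    scaler.update()         # grows the scale after enough good steps and
                            # halves it on overflow -- cf. 32 -> 64 -> 32 above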
], batch size: 112, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:27:54,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=404999.8333333333, ans=0.2 2024-06-21 15:27:55,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=404999.8333333333, ans=0.125 2024-06-21 15:28:05,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=405018.1666666667, ans=0.0 2024-06-21 15:28:12,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=405036.5, ans=0.0 2024-06-21 15:28:12,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=405036.5, ans=0.05 2024-06-21 15:28:13,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=405036.5, ans=0.025 2024-06-21 15:28:24,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=405054.8333333333, ans=0.0 2024-06-21 15:28:24,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=405054.8333333333, ans=0.125 2024-06-21 15:28:31,868 INFO [train.py:1028] (0/2) Epoch 22, batch 8500, loss[loss=0.2066, simple_loss=0.267, pruned_loss=0.07309, over 12656.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2756, pruned_loss=0.07845, over 2577887.01 frames. ], batch size: 29, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:28:34,348 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.245e+02 2.423e+02 2.671e+02 3.802e+02, threshold=4.845e+02, percent-clipped=0.0 2024-06-21 15:28:34,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=405091.5, ans=0.2 2024-06-21 15:28:37,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=405109.8333333333, ans=0.125 2024-06-21 15:28:37,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=405109.8333333333, ans=0.125 2024-06-21 15:28:39,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=405109.8333333333, ans=0.0 2024-06-21 15:28:44,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2024-06-21 15:28:49,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=405128.1666666667, ans=0.125 2024-06-21 15:28:49,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=405128.1666666667, ans=0.07 2024-06-21 15:28:55,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=405146.5, ans=0.0 2024-06-21 15:29:00,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=405164.8333333333, ans=0.07 2024-06-21 15:29:01,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.79 vs. 
limit=15.0 2024-06-21 15:29:04,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405164.8333333333, ans=0.1 2024-06-21 15:29:04,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=405183.1666666667, ans=0.0 2024-06-21 15:29:05,473 INFO [train.py:1028] (0/2) Epoch 22, batch 8550, loss[loss=0.2196, simple_loss=0.2856, pruned_loss=0.07679, over 12632.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2753, pruned_loss=0.07842, over 2576565.88 frames. ], batch size: 22, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:29:07,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2024-06-21 15:29:12,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=405201.5, ans=0.125 2024-06-21 15:29:13,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=405201.5, ans=0.125 2024-06-21 15:29:14,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405201.5, ans=0.125 2024-06-21 15:29:23,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=405219.8333333333, ans=0.125 2024-06-21 15:29:28,027 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:29:37,866 INFO [train.py:1028] (0/2) Epoch 22, batch 8600, loss[loss=0.2099, simple_loss=0.2658, pruned_loss=0.07698, over 13147.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2758, pruned_loss=0.07861, over 2572659.90 frames. ], batch size: 112, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:29:40,481 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.241e+02 2.376e+02 2.574e+02 3.486e+02, threshold=4.753e+02, percent-clipped=0.0 2024-06-21 15:29:45,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=405293.1666666667, ans=0.2 2024-06-21 15:29:56,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2024-06-21 15:30:15,924 INFO [train.py:1028] (0/2) Epoch 22, batch 8650, loss[loss=0.202, simple_loss=0.253, pruned_loss=0.07545, over 12982.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.276, pruned_loss=0.07846, over 2575484.03 frames. ], batch size: 102, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:30:48,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0 2024-06-21 15:30:53,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2024-06-21 15:30:54,098 INFO [train.py:1028] (0/2) Epoch 22, batch 8700, loss[loss=0.2099, simple_loss=0.2719, pruned_loss=0.07397, over 13149.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2769, pruned_loss=0.07915, over 2572738.09 frames. 
], batch size: 59, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:30:56,833 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.283e+02 2.430e+02 2.624e+02 3.622e+02, threshold=4.860e+02, percent-clipped=0.0 2024-06-21 15:30:57,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=405458.1666666667, ans=0.2 2024-06-21 15:30:59,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=405458.1666666667, ans=0.125 2024-06-21 15:31:03,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.65 vs. limit=22.5 2024-06-21 15:31:06,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=405476.5, ans=0.0 2024-06-21 15:31:06,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=405476.5, ans=0.04949747468305833 2024-06-21 15:31:08,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.88 vs. limit=6.0 2024-06-21 15:31:14,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=405513.1666666667, ans=0.1 2024-06-21 15:31:21,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=405531.5, ans=0.025 2024-06-21 15:31:27,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=405531.5, ans=0.125 2024-06-21 15:31:28,916 INFO [train.py:1028] (0/2) Epoch 22, batch 8750, loss[loss=0.2247, simple_loss=0.2751, pruned_loss=0.08715, over 13105.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2761, pruned_loss=0.07875, over 2567802.13 frames. ], batch size: 121, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:31:34,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=405549.8333333333, ans=0.0 2024-06-21 15:31:34,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=405549.8333333333, ans=0.125 2024-06-21 15:31:36,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=405568.1666666667, ans=0.1 2024-06-21 15:31:42,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=405586.5, ans=0.0 2024-06-21 15:31:43,724 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=12.0 2024-06-21 15:31:45,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=405586.5, ans=0.05 2024-06-21 15:31:46,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=6.0 2024-06-21 15:31:58,308 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. 
limit=22.5 2024-06-21 15:32:06,797 INFO [train.py:1028] (0/2) Epoch 22, batch 8800, loss[loss=0.2087, simple_loss=0.275, pruned_loss=0.07115, over 13123.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2764, pruned_loss=0.07903, over 2573132.39 frames. ], batch size: 71, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:32:09,357 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.221e+02 2.365e+02 2.503e+02 3.280e+02, threshold=4.730e+02, percent-clipped=0.0 2024-06-21 15:32:13,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=405659.8333333333, ans=0.0 2024-06-21 15:32:27,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=405678.1666666667, ans=0.125 2024-06-21 15:32:35,924 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2024-06-21 15:32:36,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=405696.5, ans=0.125 2024-06-21 15:32:40,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405714.8333333333, ans=0.1 2024-06-21 15:32:40,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=405714.8333333333, ans=0.2 2024-06-21 15:32:45,669 INFO [train.py:1028] (0/2) Epoch 22, batch 8850, loss[loss=0.2325, simple_loss=0.2775, pruned_loss=0.09372, over 12567.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2765, pruned_loss=0.07957, over 2563357.83 frames. ], batch size: 202, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:32:49,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=405733.1666666667, ans=0.125 2024-06-21 15:32:58,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=405769.8333333333, ans=0.125 2024-06-21 15:33:00,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=405769.8333333333, ans=0.125 2024-06-21 15:33:11,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=405788.1666666667, ans=0.1 2024-06-21 15:33:18,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=405824.8333333333, ans=0.2 2024-06-21 15:33:19,075 INFO [train.py:1028] (0/2) Epoch 22, batch 8900, loss[loss=0.2406, simple_loss=0.301, pruned_loss=0.09007, over 12885.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2778, pruned_loss=0.08048, over 2561844.69 frames. ], batch size: 33, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:33:19,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.54 vs. 
limit=15.0 2024-06-21 15:33:21,714 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.305e+02 2.488e+02 2.717e+02 3.445e+02, threshold=4.976e+02, percent-clipped=0.0 2024-06-21 15:33:26,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405843.1666666667, ans=0.1 2024-06-21 15:33:33,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=405861.5, ans=0.125 2024-06-21 15:33:34,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=405861.5, ans=0.0 2024-06-21 15:33:39,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2024-06-21 15:33:52,857 INFO [train.py:1028] (0/2) Epoch 22, batch 8950, loss[loss=0.2376, simple_loss=0.2928, pruned_loss=0.09118, over 12536.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2775, pruned_loss=0.08004, over 2562914.41 frames. ], batch size: 202, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:34:09,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.43 vs. limit=10.0 2024-06-21 15:34:16,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405971.5, ans=0.1 2024-06-21 15:34:16,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=405971.5, ans=0.125 2024-06-21 15:34:22,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.06 vs. limit=15.0 2024-06-21 15:34:33,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=406008.1666666667, ans=0.0 2024-06-21 15:34:33,726 INFO [train.py:1028] (0/2) Epoch 22, batch 9000, loss[loss=0.2027, simple_loss=0.2687, pruned_loss=0.06834, over 13256.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2776, pruned_loss=0.07965, over 2569338.98 frames. ], batch size: 46, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:34:33,727 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 15:34:41,943 INFO [train.py:1060] (0/2) Epoch 22, validation: loss=0.1872, simple_loss=0.2511, pruned_loss=0.06169, over 351949.00 frames. 2024-06-21 15:34:41,944 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 15:34:44,816 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.219e+02 2.364e+02 2.565e+02 3.217e+02, threshold=4.728e+02, percent-clipped=0.0 2024-06-21 15:34:45,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.32 vs. limit=15.0 2024-06-21 15:34:50,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406026.5, ans=0.1 2024-06-21 15:34:50,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. 
limit=6.0 2024-06-21 15:34:59,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=406044.8333333333, ans=0.0 2024-06-21 15:35:14,543 INFO [train.py:1028] (0/2) Epoch 22, batch 9050, loss[loss=0.1972, simple_loss=0.2512, pruned_loss=0.07162, over 10706.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.278, pruned_loss=0.08009, over 2567769.73 frames. ], batch size: 16, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:35:17,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=406099.8333333333, ans=0.95 2024-06-21 15:35:18,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=406099.8333333333, ans=0.2 2024-06-21 15:35:27,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=406136.5, ans=0.2 2024-06-21 15:35:35,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=406154.8333333333, ans=0.125 2024-06-21 15:35:41,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.53 vs. limit=22.5 2024-06-21 15:35:47,203 INFO [train.py:1028] (0/2) Epoch 22, batch 9100, loss[loss=0.2305, simple_loss=0.2922, pruned_loss=0.08444, over 13211.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2782, pruned_loss=0.08003, over 2570590.65 frames. ], batch size: 72, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:35:49,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=406191.5, ans=0.0 2024-06-21 15:35:49,324 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:35:49,748 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.308e+02 2.428e+02 2.622e+02 3.376e+02, threshold=4.856e+02, percent-clipped=0.0 2024-06-21 15:35:51,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=406191.5, ans=0.125 2024-06-21 15:36:04,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=406228.1666666667, ans=0.125 2024-06-21 15:36:07,361 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2024-06-21 15:36:10,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406246.5, ans=0.1 2024-06-21 15:36:12,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2024-06-21 15:36:17,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.67 vs. limit=15.0 2024-06-21 15:36:19,228 INFO [train.py:1028] (0/2) Epoch 22, batch 9150, loss[loss=0.205, simple_loss=0.2779, pruned_loss=0.06605, over 13190.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2777, pruned_loss=0.07993, over 2570968.12 frames. 
], batch size: 77, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:36:24,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=406283.1666666667, ans=0.2 2024-06-21 15:36:30,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=406301.5, ans=0.125 2024-06-21 15:36:34,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=406319.8333333333, ans=0.0 2024-06-21 15:36:38,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=406338.1666666667, ans=0.0 2024-06-21 15:36:38,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=406338.1666666667, ans=0.0 2024-06-21 15:36:49,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2024-06-21 15:36:51,218 INFO [train.py:1028] (0/2) Epoch 22, batch 9200, loss[loss=0.2185, simple_loss=0.2815, pruned_loss=0.07778, over 12958.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2774, pruned_loss=0.07927, over 2572812.88 frames. ], batch size: 36, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:36:52,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=406374.8333333333, ans=10.0 2024-06-21 15:36:53,669 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.275e+02 2.411e+02 2.679e+02 3.309e+02, threshold=4.822e+02, percent-clipped=0.0 2024-06-21 15:36:57,638 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:37:17,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=406448.1666666667, ans=0.0 2024-06-21 15:37:23,034 INFO [train.py:1028] (0/2) Epoch 22, batch 9250, loss[loss=0.1947, simple_loss=0.2597, pruned_loss=0.06485, over 13233.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2778, pruned_loss=0.07913, over 2573152.76 frames. ], batch size: 67, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:37:25,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.68 vs. limit=6.0 2024-06-21 15:37:26,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=406466.5, ans=0.035 2024-06-21 15:37:29,915 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2024-06-21 15:37:31,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=406484.8333333333, ans=0.0 2024-06-21 15:37:37,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=406484.8333333333, ans=0.025 2024-06-21 15:37:57,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=406558.1666666667, ans=0.2 2024-06-21 15:37:58,251 INFO [train.py:1028] (0/2) Epoch 22, batch 9300, loss[loss=0.1988, simple_loss=0.2579, pruned_loss=0.06984, over 13221.00 frames. 
], tot_loss[loss=0.2167, simple_loss=0.2767, pruned_loss=0.07837, over 2570485.04 frames. ], batch size: 40, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:37:59,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=406558.1666666667, ans=0.125 2024-06-21 15:38:00,741 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.226e+02 2.441e+02 2.618e+02 3.305e+02, threshold=4.883e+02, percent-clipped=0.0 2024-06-21 15:38:19,010 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=15.0 2024-06-21 15:38:20,431 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.06 vs. limit=22.5 2024-06-21 15:38:21,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=406613.1666666667, ans=0.1 2024-06-21 15:38:32,441 INFO [train.py:1028] (0/2) Epoch 22, batch 9350, loss[loss=0.2159, simple_loss=0.275, pruned_loss=0.07842, over 12446.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2772, pruned_loss=0.07852, over 2568076.24 frames. ], batch size: 22, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:38:39,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=406668.1666666667, ans=0.125 2024-06-21 15:38:46,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=406686.5, ans=0.1 2024-06-21 15:38:59,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=406723.1666666667, ans=0.125 2024-06-21 15:39:04,005 INFO [train.py:1028] (0/2) Epoch 22, batch 9400, loss[loss=0.2317, simple_loss=0.2903, pruned_loss=0.08654, over 13286.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2781, pruned_loss=0.07915, over 2567215.90 frames. ], batch size: 52, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:39:06,364 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.243e+02 2.355e+02 2.634e+02 3.526e+02, threshold=4.710e+02, percent-clipped=0.0 2024-06-21 15:39:06,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=406741.5, ans=0.125 2024-06-21 15:39:07,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=406741.5, ans=0.2 2024-06-21 15:39:11,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=406759.8333333333, ans=0.2 2024-06-21 15:39:13,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=406759.8333333333, ans=0.0 2024-06-21 15:39:29,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=406814.8333333333, ans=0.0 2024-06-21 15:39:35,140 INFO [train.py:1028] (0/2) Epoch 22, batch 9450, loss[loss=0.2202, simple_loss=0.2838, pruned_loss=0.07834, over 12697.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2789, pruned_loss=0.07953, over 2567850.80 frames. 
], batch size: 22, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:39:35,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406833.1666666667, ans=0.1 2024-06-21 15:39:37,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=406833.1666666667, ans=0.0 2024-06-21 15:39:38,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=406833.1666666667, ans=0.025 2024-06-21 15:39:39,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=406833.1666666667, ans=0.125 2024-06-21 15:39:40,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=406833.1666666667, ans=0.1 2024-06-21 15:39:40,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.97 vs. limit=12.0 2024-06-21 15:39:41,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=406851.5, ans=0.0 2024-06-21 15:39:42,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=406851.5, ans=0.0 2024-06-21 15:39:44,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0 2024-06-21 15:39:50,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=406869.8333333333, ans=0.1 2024-06-21 15:39:57,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=406888.1666666667, ans=0.125 2024-06-21 15:40:03,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=12.0 2024-06-21 15:40:05,856 INFO [train.py:1028] (0/2) Epoch 22, batch 9500, loss[loss=0.2265, simple_loss=0.2741, pruned_loss=0.08948, over 13247.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2786, pruned_loss=0.0793, over 2577188.39 frames. ], batch size: 43, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:40:08,125 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.247e+02 2.353e+02 2.587e+02 4.190e+02, threshold=4.706e+02, percent-clipped=0.0 2024-06-21 15:40:08,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. 
limit=15.0 2024-06-21 15:40:18,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=406961.5, ans=0.2 2024-06-21 15:40:19,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=406961.5, ans=0.0 2024-06-21 15:40:23,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=406979.8333333333, ans=0.125 2024-06-21 15:40:36,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=406998.1666666667, ans=0.5 2024-06-21 15:40:39,223 INFO [train.py:1028] (0/2) Epoch 22, batch 9550, loss[loss=0.1874, simple_loss=0.2493, pruned_loss=0.06279, over 12955.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2775, pruned_loss=0.07867, over 2571009.76 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:40:42,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=407016.5, ans=0.0 2024-06-21 15:40:54,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=407053.1666666667, ans=0.035 2024-06-21 15:40:54,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=407053.1666666667, ans=0.0 2024-06-21 15:40:55,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.36 vs. limit=6.0 2024-06-21 15:41:04,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=407071.5, ans=0.125 2024-06-21 15:41:09,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=407089.8333333333, ans=0.0 2024-06-21 15:41:11,980 INFO [train.py:1028] (0/2) Epoch 22, batch 9600, loss[loss=0.2285, simple_loss=0.2732, pruned_loss=0.0919, over 10300.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2773, pruned_loss=0.07887, over 2569805.41 frames. ], batch size: 303, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:41:14,537 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 2.223e+02 2.369e+02 2.583e+02 3.273e+02, threshold=4.738e+02, percent-clipped=0.0 2024-06-21 15:41:14,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=407108.1666666667, ans=0.125 2024-06-21 15:41:40,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=407181.5, ans=0.125 2024-06-21 15:41:42,469 INFO [train.py:1028] (0/2) Epoch 22, batch 9650, loss[loss=0.2148, simple_loss=0.2689, pruned_loss=0.08035, over 13106.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2774, pruned_loss=0.07947, over 2559465.32 frames. ], batch size: 132, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:41:43,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407199.8333333333, ans=0.1 2024-06-21 15:41:45,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. 
limit=10.0 2024-06-21 15:42:05,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=407254.8333333333, ans=0.1 2024-06-21 15:42:11,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=407273.1666666667, ans=0.125 2024-06-21 15:42:13,008 INFO [train.py:1028] (0/2) Epoch 22, batch 9700, loss[loss=0.2082, simple_loss=0.2633, pruned_loss=0.07657, over 12987.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2775, pruned_loss=0.07959, over 2553450.35 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:42:14,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=407291.5, ans=0.1 2024-06-21 15:42:15,252 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.308e+02 2.437e+02 2.668e+02 3.352e+02, threshold=4.874e+02, percent-clipped=0.0 2024-06-21 15:42:16,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=15.0 2024-06-21 15:42:25,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407328.1666666667, ans=0.1 2024-06-21 15:42:26,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=407328.1666666667, ans=0.2 2024-06-21 15:42:28,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=8.0 2024-06-21 15:42:35,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=407346.5, ans=0.0 2024-06-21 15:42:37,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=407364.8333333333, ans=0.025 2024-06-21 15:42:45,440 INFO [train.py:1028] (0/2) Epoch 22, batch 9750, loss[loss=0.2126, simple_loss=0.2669, pruned_loss=0.07919, over 13045.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2769, pruned_loss=0.07946, over 2550698.45 frames. ], batch size: 132, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:42:47,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.19 vs. limit=15.0 2024-06-21 15:42:47,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=407383.1666666667, ans=0.5 2024-06-21 15:43:12,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=407456.5, ans=0.0 2024-06-21 15:43:16,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=407456.5, ans=0.125 2024-06-21 15:43:16,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=407456.5, ans=0.125 2024-06-21 15:43:17,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2024-06-21 15:43:18,087 INFO [train.py:1028] (0/2) Epoch 22, batch 9800, loss[loss=0.1819, simple_loss=0.2443, pruned_loss=0.05971, over 12966.00 frames. 
], tot_loss[loss=0.2171, simple_loss=0.2762, pruned_loss=0.07894, over 2544537.78 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:43:20,482 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.224e+02 2.364e+02 2.590e+02 3.586e+02, threshold=4.729e+02, percent-clipped=0.0 2024-06-21 15:43:29,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=407511.5, ans=0.125 2024-06-21 15:43:32,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=407511.5, ans=0.1 2024-06-21 15:43:37,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=407529.8333333333, ans=0.2 2024-06-21 15:43:40,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=407529.8333333333, ans=0.0 2024-06-21 15:43:43,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=407548.1666666667, ans=0.5 2024-06-21 15:43:45,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=407548.1666666667, ans=0.025 2024-06-21 15:43:48,675 INFO [train.py:1028] (0/2) Epoch 22, batch 9850, loss[loss=0.2189, simple_loss=0.2796, pruned_loss=0.07907, over 13025.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.276, pruned_loss=0.07879, over 2536808.65 frames. ], batch size: 102, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:44:06,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2024-06-21 15:44:18,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=407639.8333333333, ans=0.07 2024-06-21 15:44:19,282 INFO [train.py:1028] (0/2) Epoch 22, batch 9900, loss[loss=0.1816, simple_loss=0.249, pruned_loss=0.05712, over 13002.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2748, pruned_loss=0.07875, over 2530079.60 frames. ], batch size: 39, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:44:22,885 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.175e+02 2.297e+02 2.531e+02 3.263e+02, threshold=4.594e+02, percent-clipped=0.0 2024-06-21 15:44:35,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=407694.8333333333, ans=0.2 2024-06-21 15:44:39,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=407713.1666666667, ans=0.125 2024-06-21 15:44:47,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=407731.5, ans=0.125 2024-06-21 15:44:51,354 INFO [train.py:1028] (0/2) Epoch 22, batch 9950, loss[loss=0.2234, simple_loss=0.2801, pruned_loss=0.08334, over 12743.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2734, pruned_loss=0.0787, over 2524822.11 frames. ], batch size: 29, lr: 2.61e-03, grad_scale: 32.0 2024-06-21 15:45:00,035 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.54 vs. 
limit=15.0 2024-06-21 15:45:01,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=407768.1666666667, ans=0.125 2024-06-21 15:45:05,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=407786.5, ans=0.125 2024-06-21 15:45:09,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=407804.8333333333, ans=0.125 2024-06-21 15:45:09,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=407804.8333333333, ans=0.125 2024-06-21 15:45:10,749 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=15.0 2024-06-21 15:45:23,030 INFO [train.py:1028] (0/2) Epoch 22, batch 10000, loss[loss=0.2272, simple_loss=0.2922, pruned_loss=0.08109, over 12514.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2733, pruned_loss=0.07918, over 2486635.62 frames. ], batch size: 22, lr: 2.60e-03, grad_scale: 32.0 2024-06-21 15:45:25,467 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.266e+02 2.442e+02 2.655e+02 3.714e+02, threshold=4.884e+02, percent-clipped=0.0 2024-06-21 15:45:27,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.06 vs. limit=15.0 2024-06-21 15:45:28,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=407841.5, ans=0.0 2024-06-21 15:45:42,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=407896.5, ans=0.025 2024-06-21 15:45:42,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=407896.5, ans=0.2 2024-06-21 15:45:42,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.69 vs. limit=15.0 2024-06-21 15:45:54,838 INFO [train.py:1028] (0/2) Epoch 22, batch 10050, loss[loss=0.2495, simple_loss=0.3129, pruned_loss=0.09305, over 12671.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2738, pruned_loss=0.07982, over 2444454.67 frames. ], batch size: 22, lr: 2.60e-03, grad_scale: 32.0 2024-06-21 15:46:06,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2024-06-21 15:46:08,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=407969.8333333333, ans=0.0 2024-06-21 15:46:12,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=407988.1666666667, ans=0.125 2024-06-21 15:46:16,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=407988.1666666667, ans=0.125 2024-06-21 15:46:18,883 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.41 vs. 
limit=12.0 2024-06-21 15:46:22,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2024-06-21 15:46:24,713 INFO [train.py:1028] (0/2) Epoch 22, batch 10100, loss[loss=0.1977, simple_loss=0.2545, pruned_loss=0.07041, over 11295.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2727, pruned_loss=0.07893, over 2425703.09 frames. ], batch size: 17, lr: 2.60e-03, grad_scale: 32.0 2024-06-21 15:46:26,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408024.8333333333, ans=0.1 2024-06-21 15:46:27,141 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.270e+02 2.484e+02 2.752e+02 5.288e+02, threshold=4.968e+02, percent-clipped=1.0 2024-06-21 15:46:34,129 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.63 vs. limit=10.0 2024-06-21 15:46:38,407 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-22.pt 2024-06-21 15:48:44,207 INFO [train.py:1028] (0/2) Epoch 23, batch 0, loss[loss=0.1902, simple_loss=0.2474, pruned_loss=0.0665, over 12908.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2474, pruned_loss=0.0665, over 12908.00 frames. ], batch size: 36, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:48:44,208 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 15:48:51,154 INFO [train.py:1060] (0/2) Epoch 23, validation: loss=0.1885, simple_loss=0.2525, pruned_loss=0.06224, over 351949.00 frames. 2024-06-21 15:48:51,155 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 15:48:52,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=408057.8333333333, ans=0.125 2024-06-21 15:49:13,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=408112.8333333333, ans=0.025 2024-06-21 15:49:24,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0 2024-06-21 15:49:25,105 INFO [train.py:1028] (0/2) Epoch 23, batch 50, loss[loss=0.1831, simple_loss=0.2458, pruned_loss=0.0602, over 12767.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2556, pruned_loss=0.07181, over 574087.81 frames. 
], batch size: 29, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:49:25,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=408149.5, ans=0.09899494936611666 2024-06-21 15:49:29,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=408149.5, ans=0.125 2024-06-21 15:49:30,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=408149.5, ans=0.125 2024-06-21 15:49:31,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408167.8333333333, ans=0.1 2024-06-21 15:49:33,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408167.8333333333, ans=0.1 2024-06-21 15:49:37,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=408186.1666666667, ans=0.125 2024-06-21 15:49:47,668 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.117e+02 2.231e+02 2.405e+02 4.538e+02, threshold=4.462e+02, percent-clipped=0.0 2024-06-21 15:49:49,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=408204.5, ans=0.2 2024-06-21 15:49:53,792 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.51 vs. limit=22.5 2024-06-21 15:50:01,814 INFO [train.py:1028] (0/2) Epoch 23, batch 100, loss[loss=0.2032, simple_loss=0.2657, pruned_loss=0.07036, over 13310.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2545, pruned_loss=0.07144, over 1017017.15 frames. ], batch size: 46, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:50:04,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=408241.1666666667, ans=0.05 2024-06-21 15:50:37,676 INFO [train.py:1028] (0/2) Epoch 23, batch 150, loss[loss=0.1907, simple_loss=0.2585, pruned_loss=0.06143, over 12692.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2542, pruned_loss=0.07025, over 1363908.19 frames. ], batch size: 29, lr: 2.55e-03, grad_scale: 32.0 2024-06-21 15:50:44,993 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:50:56,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=408369.5, ans=0.0 2024-06-21 15:50:56,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=408387.8333333333, ans=0.125 2024-06-21 15:51:00,834 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.142e+02 2.276e+02 2.640e+02 4.088e+02, threshold=4.553e+02, percent-clipped=0.0 2024-06-21 15:51:06,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.73 vs. limit=10.0 2024-06-21 15:51:10,061 INFO [train.py:1028] (0/2) Epoch 23, batch 200, loss[loss=0.2231, simple_loss=0.269, pruned_loss=0.08857, over 12495.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2543, pruned_loss=0.07018, over 1633792.47 frames. 
], batch size: 202, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:51:17,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=408442.8333333333, ans=0.2 2024-06-21 15:51:23,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=408461.1666666667, ans=0.0 2024-06-21 15:51:25,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=408461.1666666667, ans=0.0 2024-06-21 15:51:26,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=408461.1666666667, ans=0.0 2024-06-21 15:51:31,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=408479.5, ans=0.0 2024-06-21 15:51:41,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2024-06-21 15:51:42,134 INFO [train.py:1028] (0/2) Epoch 23, batch 250, loss[loss=0.2016, simple_loss=0.2445, pruned_loss=0.0794, over 13041.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2542, pruned_loss=0.07026, over 1845533.64 frames. ], batch size: 144, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:51:45,799 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.24 vs. limit=22.5 2024-06-21 15:51:54,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=408552.8333333333, ans=0.125 2024-06-21 15:51:58,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=408552.8333333333, ans=0.2 2024-06-21 15:52:02,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=408571.1666666667, ans=0.09899494936611666 2024-06-21 15:52:08,949 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.104e+02 2.235e+02 2.394e+02 2.888e+02, threshold=4.469e+02, percent-clipped=0.0 2024-06-21 15:52:12,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=408589.5, ans=0.125 2024-06-21 15:52:22,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=408607.8333333333, ans=0.125 2024-06-21 15:52:22,674 INFO [train.py:1028] (0/2) Epoch 23, batch 300, loss[loss=0.1945, simple_loss=0.2501, pruned_loss=0.06941, over 13201.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2542, pruned_loss=0.07037, over 2008557.56 frames. ], batch size: 112, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:52:35,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=408644.5, ans=0.125 2024-06-21 15:52:36,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=408644.5, ans=0.09899494936611666 2024-06-21 15:52:41,407 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.02 vs. 
limit=15.0 2024-06-21 15:52:48,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=408681.1666666667, ans=0.0 2024-06-21 15:52:53,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408699.5, ans=0.1 2024-06-21 15:52:54,073 INFO [train.py:1028] (0/2) Epoch 23, batch 350, loss[loss=0.2053, simple_loss=0.2696, pruned_loss=0.07046, over 12865.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.254, pruned_loss=0.07035, over 2138549.04 frames. ], batch size: 33, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:53:03,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=408717.8333333333, ans=0.125 2024-06-21 15:53:08,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=408736.1666666667, ans=0.0 2024-06-21 15:53:09,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=408736.1666666667, ans=0.025 2024-06-21 15:53:11,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=408736.1666666667, ans=0.125 2024-06-21 15:53:16,988 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.137e+02 2.255e+02 2.406e+02 3.009e+02, threshold=4.510e+02, percent-clipped=0.0 2024-06-21 15:53:17,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.70 vs. limit=15.0 2024-06-21 15:53:19,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=408772.8333333333, ans=0.125 2024-06-21 15:53:21,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=408772.8333333333, ans=0.0 2024-06-21 15:53:22,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=408772.8333333333, ans=0.0 2024-06-21 15:53:25,773 INFO [train.py:1028] (0/2) Epoch 23, batch 400, loss[loss=0.1981, simple_loss=0.2529, pruned_loss=0.07163, over 13266.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2542, pruned_loss=0.07021, over 2239629.36 frames. ], batch size: 63, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:53:28,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=408791.1666666667, ans=0.125 2024-06-21 15:53:29,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=408791.1666666667, ans=0.025 2024-06-21 15:53:41,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=408827.8333333333, ans=0.125 2024-06-21 15:53:45,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=408846.1666666667, ans=0.0 2024-06-21 15:53:53,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.46 vs. 
limit=15.0 2024-06-21 15:53:55,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.76 vs. limit=15.0 2024-06-21 15:53:57,756 INFO [train.py:1028] (0/2) Epoch 23, batch 450, loss[loss=0.2065, simple_loss=0.2676, pruned_loss=0.07265, over 13214.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2542, pruned_loss=0.07001, over 2314205.76 frames. ], batch size: 67, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:54:11,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=408901.1666666667, ans=0.2 2024-06-21 15:54:14,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.22 vs. limit=22.5 2024-06-21 15:54:23,853 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.132e+02 2.260e+02 2.448e+02 3.944e+02, threshold=4.520e+02, percent-clipped=0.0 2024-06-21 15:54:35,936 INFO [train.py:1028] (0/2) Epoch 23, batch 500, loss[loss=0.1919, simple_loss=0.243, pruned_loss=0.0704, over 13136.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.255, pruned_loss=0.0701, over 2376659.24 frames. ], batch size: 121, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:54:49,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=409011.1666666667, ans=0.0 2024-06-21 15:54:52,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409011.1666666667, ans=0.1 2024-06-21 15:54:54,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=409011.1666666667, ans=0.0 2024-06-21 15:55:07,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409066.1666666667, ans=0.1 2024-06-21 15:55:08,136 INFO [train.py:1028] (0/2) Epoch 23, batch 550, loss[loss=0.2084, simple_loss=0.2585, pruned_loss=0.07912, over 12926.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.255, pruned_loss=0.07017, over 2420955.95 frames. 
], batch size: 158, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:55:08,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=409066.1666666667, ans=0.0 2024-06-21 15:55:14,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=409084.5, ans=0.125 2024-06-21 15:55:14,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=409084.5, ans=0.0 2024-06-21 15:55:14,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=409084.5, ans=0.07 2024-06-21 15:55:18,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=409084.5, ans=0.0 2024-06-21 15:55:27,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409121.1666666667, ans=0.1 2024-06-21 15:55:30,914 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.127e+02 2.243e+02 2.449e+02 3.168e+02, threshold=4.485e+02, percent-clipped=0.0 2024-06-21 15:55:31,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=409121.1666666667, ans=0.04949747468305833 2024-06-21 15:55:38,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=409139.5, ans=0.125 2024-06-21 15:55:39,759 INFO [train.py:1028] (0/2) Epoch 23, batch 600, loss[loss=0.158, simple_loss=0.2165, pruned_loss=0.04979, over 13101.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.254, pruned_loss=0.06954, over 2458907.77 frames. ], batch size: 144, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:55:59,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2024-06-21 15:56:02,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=15.0 2024-06-21 15:56:14,430 INFO [train.py:1028] (0/2) Epoch 23, batch 650, loss[loss=0.1915, simple_loss=0.2506, pruned_loss=0.06617, over 13175.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2537, pruned_loss=0.06918, over 2489812.59 frames. ], batch size: 59, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:56:24,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=409267.8333333333, ans=0.125 2024-06-21 15:56:30,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=409286.1666666667, ans=0.1 2024-06-21 15:56:30,954 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:56:38,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409304.5, ans=0.1 2024-06-21 15:56:39,861 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.168e+02 2.337e+02 2.527e+02 3.744e+02, threshold=4.674e+02, percent-clipped=0.0 2024-06-21 15:56:48,518 INFO [train.py:1028] (0/2) Epoch 23, batch 700, loss[loss=0.1834, simple_loss=0.2472, pruned_loss=0.05981, over 13316.00 frames. 
], tot_loss[loss=0.1958, simple_loss=0.253, pruned_loss=0.0693, over 2512491.80 frames. ], batch size: 46, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:56:48,614 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 15:56:50,813 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.73 vs. limit=10.0 2024-06-21 15:57:19,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=409414.5, ans=0.0 2024-06-21 15:57:19,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.74 vs. limit=15.0 2024-06-21 15:57:20,465 INFO [train.py:1028] (0/2) Epoch 23, batch 750, loss[loss=0.1749, simple_loss=0.2395, pruned_loss=0.05519, over 13255.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2539, pruned_loss=0.06938, over 2528128.13 frames. ], batch size: 63, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:57:21,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=409432.8333333333, ans=0.0 2024-06-21 15:57:21,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=409432.8333333333, ans=0.125 2024-06-21 15:57:26,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=409451.1666666667, ans=0.0 2024-06-21 15:57:42,953 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.152e+02 2.285e+02 2.475e+02 3.135e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-21 15:57:51,958 INFO [train.py:1028] (0/2) Epoch 23, batch 800, loss[loss=0.1845, simple_loss=0.2469, pruned_loss=0.06105, over 12931.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.254, pruned_loss=0.06956, over 2541136.97 frames. ], batch size: 36, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:57:56,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=409524.5, ans=0.0 2024-06-21 15:58:05,815 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.01 vs. limit=15.0 2024-06-21 15:58:08,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=409561.1666666667, ans=0.125 2024-06-21 15:58:09,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=409561.1666666667, ans=0.125 2024-06-21 15:58:12,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=409579.5, ans=0.025 2024-06-21 15:58:27,678 INFO [train.py:1028] (0/2) Epoch 23, batch 850, loss[loss=0.2062, simple_loss=0.2596, pruned_loss=0.07641, over 13147.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2541, pruned_loss=0.06967, over 2552285.70 frames. ], batch size: 95, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:58:27,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=409616.1666666667, ans=0.0 2024-06-21 15:58:34,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.96 vs. 
limit=10.0 2024-06-21 15:58:53,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=409671.1666666667, ans=10.0 2024-06-21 15:58:54,190 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.149e+02 2.289e+02 2.463e+02 3.648e+02, threshold=4.578e+02, percent-clipped=0.0 2024-06-21 15:59:01,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409689.5, ans=0.1 2024-06-21 15:59:03,465 INFO [train.py:1028] (0/2) Epoch 23, batch 900, loss[loss=0.199, simple_loss=0.2604, pruned_loss=0.06878, over 12873.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.254, pruned_loss=0.06985, over 2556510.70 frames. ], batch size: 36, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:59:09,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=409726.1666666667, ans=0.125 2024-06-21 15:59:10,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=409726.1666666667, ans=0.04949747468305833 2024-06-21 15:59:12,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=409726.1666666667, ans=0.0 2024-06-21 15:59:13,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=409726.1666666667, ans=0.0 2024-06-21 15:59:15,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=409726.1666666667, ans=0.5 2024-06-21 15:59:21,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.97 vs. limit=15.0 2024-06-21 15:59:25,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=409762.8333333333, ans=0.125 2024-06-21 15:59:26,896 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.64 vs. limit=22.5 2024-06-21 15:59:36,238 INFO [train.py:1028] (0/2) Epoch 23, batch 950, loss[loss=0.1795, simple_loss=0.2386, pruned_loss=0.06023, over 13226.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.254, pruned_loss=0.06981, over 2560063.82 frames. ], batch size: 40, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 15:59:44,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. limit=10.0 2024-06-21 15:59:47,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=409817.8333333333, ans=0.05 2024-06-21 15:59:47,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=409817.8333333333, ans=0.0 2024-06-21 15:59:56,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.03 vs. 
limit=15.0 2024-06-21 15:59:57,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=409854.5, ans=0.125 2024-06-21 15:59:59,041 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 2.132e+02 2.283e+02 2.449e+02 2.805e+02, threshold=4.565e+02, percent-clipped=0.0 2024-06-21 16:00:10,746 INFO [train.py:1028] (0/2) Epoch 23, batch 1000, loss[loss=0.2013, simple_loss=0.2624, pruned_loss=0.07013, over 13086.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2542, pruned_loss=0.07029, over 2562842.36 frames. ], batch size: 48, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:00:28,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=409927.8333333333, ans=0.125 2024-06-21 16:00:31,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2024-06-21 16:00:36,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=409946.1666666667, ans=0.125 2024-06-21 16:00:47,747 INFO [train.py:1028] (0/2) Epoch 23, batch 1050, loss[loss=0.2056, simple_loss=0.272, pruned_loss=0.06958, over 13116.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2544, pruned_loss=0.07017, over 2566388.87 frames. ], batch size: 77, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:00:53,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409982.8333333333, ans=0.1 2024-06-21 16:00:55,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.90 vs. limit=15.0 2024-06-21 16:01:05,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=410019.5, ans=0.025 2024-06-21 16:01:11,523 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.103e+02 2.287e+02 2.506e+02 3.300e+02, threshold=4.574e+02, percent-clipped=0.0 2024-06-21 16:01:14,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=410056.1666666667, ans=0.035 2024-06-21 16:01:14,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=410056.1666666667, ans=0.125 2024-06-21 16:01:20,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410074.5, ans=0.125 2024-06-21 16:01:20,967 INFO [train.py:1028] (0/2) Epoch 23, batch 1100, loss[loss=0.214, simple_loss=0.2714, pruned_loss=0.07827, over 13248.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2552, pruned_loss=0.07049, over 2571050.76 frames. 
], batch size: 52, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:01:28,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=410092.8333333333, ans=0.0 2024-06-21 16:01:28,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410092.8333333333, ans=0.125 2024-06-21 16:01:40,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=410129.5, ans=0.025 2024-06-21 16:01:40,631 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2024-06-21 16:01:49,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=410147.8333333333, ans=0.125 2024-06-21 16:01:50,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=410147.8333333333, ans=0.125 2024-06-21 16:01:51,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=410147.8333333333, ans=0.125 2024-06-21 16:01:53,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=410147.8333333333, ans=0.2 2024-06-21 16:01:54,295 INFO [train.py:1028] (0/2) Epoch 23, batch 1150, loss[loss=0.1928, simple_loss=0.256, pruned_loss=0.06481, over 13264.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2554, pruned_loss=0.07082, over 2572041.28 frames. ], batch size: 52, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:01:54,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=410166.1666666667, ans=0.125 2024-06-21 16:02:03,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410184.5, ans=0.125 2024-06-21 16:02:06,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=410184.5, ans=0.125 2024-06-21 16:02:13,039 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=15.0 2024-06-21 16:02:15,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=410202.8333333333, ans=0.125 2024-06-21 16:02:16,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.22 vs. limit=12.0 2024-06-21 16:02:20,230 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.150e+02 2.238e+02 2.488e+02 3.313e+02, threshold=4.476e+02, percent-clipped=0.0 2024-06-21 16:02:29,575 INFO [train.py:1028] (0/2) Epoch 23, batch 1200, loss[loss=0.1708, simple_loss=0.2297, pruned_loss=0.05593, over 13153.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2551, pruned_loss=0.07084, over 2574250.71 frames. 
], batch size: 77, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:02:32,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=410257.8333333333, ans=10.0 2024-06-21 16:02:34,681 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:02:45,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=410294.5, ans=0.2 2024-06-21 16:02:55,967 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:02:58,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=410331.1666666667, ans=0.0 2024-06-21 16:03:01,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-21 16:03:04,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=410349.5, ans=0.125 2024-06-21 16:03:04,964 INFO [train.py:1028] (0/2) Epoch 23, batch 1250, loss[loss=0.1837, simple_loss=0.2399, pruned_loss=0.06376, over 13156.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2553, pruned_loss=0.07063, over 2584023.59 frames. ], batch size: 112, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:03:06,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=410349.5, ans=0.1 2024-06-21 16:03:09,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=410349.5, ans=0.125 2024-06-21 16:03:10,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=410349.5, ans=0.125 2024-06-21 16:03:11,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410367.8333333333, ans=0.1 2024-06-21 16:03:20,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=410386.1666666667, ans=0.125 2024-06-21 16:03:22,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.39 vs. limit=15.0 2024-06-21 16:03:28,482 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.089e+02 2.211e+02 2.343e+02 3.024e+02, threshold=4.422e+02, percent-clipped=0.0 2024-06-21 16:03:28,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=410404.5, ans=0.0 2024-06-21 16:03:37,730 INFO [train.py:1028] (0/2) Epoch 23, batch 1300, loss[loss=0.1835, simple_loss=0.2399, pruned_loss=0.06352, over 12778.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2558, pruned_loss=0.07071, over 2585468.15 frames. 
], batch size: 176, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:03:47,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=410459.5, ans=0.125 2024-06-21 16:03:49,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=410459.5, ans=0.2 2024-06-21 16:03:54,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=410477.8333333333, ans=0.0 2024-06-21 16:04:06,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=410514.5, ans=0.2 2024-06-21 16:04:09,883 INFO [train.py:1028] (0/2) Epoch 23, batch 1350, loss[loss=0.1889, simple_loss=0.2511, pruned_loss=0.06338, over 13219.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2556, pruned_loss=0.07049, over 2587573.51 frames. ], batch size: 59, lr: 2.54e-03, grad_scale: 64.0 2024-06-21 16:04:14,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2024-06-21 16:04:18,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=410532.8333333333, ans=0.125 2024-06-21 16:04:19,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=410551.1666666667, ans=0.1 2024-06-21 16:04:25,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=410569.5, ans=0.0 2024-06-21 16:04:25,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410569.5, ans=0.1 2024-06-21 16:04:29,066 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.02 vs. limit=15.0 2024-06-21 16:04:36,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=410587.8333333333, ans=0.125 2024-06-21 16:04:36,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=410587.8333333333, ans=0.0 2024-06-21 16:04:37,430 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.138e+02 2.264e+02 2.397e+02 2.852e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 16:04:48,578 INFO [train.py:1028] (0/2) Epoch 23, batch 1400, loss[loss=0.2191, simple_loss=0.2823, pruned_loss=0.078, over 12737.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2553, pruned_loss=0.07045, over 2589031.96 frames. 
], batch size: 26, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:04:49,431 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:04:57,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=410642.8333333333, ans=0.0 2024-06-21 16:05:00,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=410642.8333333333, ans=0.2 2024-06-21 16:05:02,998 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-224000.pt 2024-06-21 16:05:11,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=410661.1666666667, ans=0.025 2024-06-21 16:05:12,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=410679.5, ans=0.0 2024-06-21 16:05:17,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=410679.5, ans=0.0 2024-06-21 16:05:18,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=410679.5, ans=0.125 2024-06-21 16:05:21,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=410697.8333333333, ans=0.05 2024-06-21 16:05:26,789 INFO [train.py:1028] (0/2) Epoch 23, batch 1450, loss[loss=0.1875, simple_loss=0.2486, pruned_loss=0.06318, over 13075.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2548, pruned_loss=0.0704, over 2587929.59 frames. ], batch size: 121, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:05:28,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=410716.1666666667, ans=0.0 2024-06-21 16:05:36,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=410734.5, ans=0.125 2024-06-21 16:05:38,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=410734.5, ans=0.125 2024-06-21 16:05:39,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=410752.8333333333, ans=0.0 2024-06-21 16:05:40,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=410752.8333333333, ans=0.0 2024-06-21 16:05:43,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=410752.8333333333, ans=0.125 2024-06-21 16:05:49,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=410771.1666666667, ans=0.2 2024-06-21 16:05:50,208 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2024-06-21 16:05:50,443 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.166e+02 2.315e+02 2.437e+02 3.068e+02, threshold=4.630e+02, percent-clipped=0.0 2024-06-21 16:05:50,808 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.73 vs. 
limit=15.0 2024-06-21 16:05:54,581 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2024-06-21 16:05:55,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=410789.5, ans=0.05 2024-06-21 16:05:58,857 INFO [train.py:1028] (0/2) Epoch 23, batch 1500, loss[loss=0.2082, simple_loss=0.2598, pruned_loss=0.07829, over 13167.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.255, pruned_loss=0.07065, over 2589656.85 frames. ], batch size: 83, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:06:00,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=410807.8333333333, ans=0.0 2024-06-21 16:06:10,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=410826.1666666667, ans=0.0 2024-06-21 16:06:11,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.30 vs. limit=15.0 2024-06-21 16:06:17,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.97 vs. limit=12.0 2024-06-21 16:06:32,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2024-06-21 16:06:34,100 INFO [train.py:1028] (0/2) Epoch 23, batch 1550, loss[loss=0.2098, simple_loss=0.2611, pruned_loss=0.07925, over 13026.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2547, pruned_loss=0.07049, over 2584624.89 frames. ], batch size: 102, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:06:46,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=410917.8333333333, ans=0.125 2024-06-21 16:06:46,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410917.8333333333, ans=0.1 2024-06-21 16:06:49,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=410936.1666666667, ans=0.125 2024-06-21 16:07:02,020 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.216e+02 2.384e+02 2.553e+02 3.642e+02, threshold=4.768e+02, percent-clipped=0.0 2024-06-21 16:07:09,435 INFO [train.py:1028] (0/2) Epoch 23, batch 1600, loss[loss=0.2048, simple_loss=0.2653, pruned_loss=0.07218, over 13195.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2549, pruned_loss=0.0705, over 2580254.26 frames. ], batch size: 77, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:07:18,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=411009.5, ans=0.0 2024-06-21 16:07:42,351 INFO [train.py:1028] (0/2) Epoch 23, batch 1650, loss[loss=0.2082, simple_loss=0.2611, pruned_loss=0.07761, over 13150.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2558, pruned_loss=0.07113, over 2576246.64 frames. 
], batch size: 95, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:07:45,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=411082.8333333333, ans=0.1 2024-06-21 16:07:46,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=411082.8333333333, ans=0.0 2024-06-21 16:07:57,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411119.5, ans=0.1 2024-06-21 16:07:59,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=411119.5, ans=0.07 2024-06-21 16:08:02,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=411137.8333333333, ans=0.125 2024-06-21 16:08:07,094 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.123e+02 2.253e+02 2.398e+02 2.932e+02, threshold=4.506e+02, percent-clipped=0.0 2024-06-21 16:08:16,484 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.81 vs. limit=22.5 2024-06-21 16:08:19,315 INFO [train.py:1028] (0/2) Epoch 23, batch 1700, loss[loss=0.1982, simple_loss=0.2617, pruned_loss=0.06739, over 12756.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2555, pruned_loss=0.07075, over 2581789.44 frames. ], batch size: 26, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:08:22,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=411174.5, ans=0.0 2024-06-21 16:08:25,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2024-06-21 16:08:29,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=411192.8333333333, ans=0.125 2024-06-21 16:08:42,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=411229.5, ans=10.0 2024-06-21 16:08:54,152 INFO [train.py:1028] (0/2) Epoch 23, batch 1750, loss[loss=0.2245, simple_loss=0.2841, pruned_loss=0.0824, over 12443.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2562, pruned_loss=0.07117, over 2583575.26 frames. 
], batch size: 22, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:08:57,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=411266.1666666667, ans=0.05 2024-06-21 16:09:08,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=411302.8333333333, ans=0.05 2024-06-21 16:09:18,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=411321.1666666667, ans=0.0 2024-06-21 16:09:18,516 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.118e+02 2.248e+02 2.419e+02 3.038e+02, threshold=4.496e+02, percent-clipped=0.0 2024-06-21 16:09:25,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=411357.8333333333, ans=0.125 2024-06-21 16:09:26,120 INFO [train.py:1028] (0/2) Epoch 23, batch 1800, loss[loss=0.2014, simple_loss=0.2682, pruned_loss=0.06726, over 13250.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2558, pruned_loss=0.07098, over 2583510.81 frames. ], batch size: 67, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:09:41,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=411394.5, ans=0.0 2024-06-21 16:09:49,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=411412.8333333333, ans=0.0 2024-06-21 16:09:53,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.73 vs. limit=22.5 2024-06-21 16:09:55,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=411431.1666666667, ans=0.125 2024-06-21 16:09:56,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=411431.1666666667, ans=0.125 2024-06-21 16:09:58,579 INFO [train.py:1028] (0/2) Epoch 23, batch 1850, loss[loss=0.2307, simple_loss=0.2851, pruned_loss=0.08817, over 13236.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2559, pruned_loss=0.07104, over 2584890.95 frames. ], batch size: 83, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:10:26,471 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.138e+02 2.267e+02 2.439e+02 2.868e+02, threshold=4.535e+02, percent-clipped=0.0 2024-06-21 16:10:31,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=411522.8333333333, ans=0.125 2024-06-21 16:10:34,496 INFO [train.py:1028] (0/2) Epoch 23, batch 1900, loss[loss=0.1996, simple_loss=0.2575, pruned_loss=0.07085, over 13174.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2554, pruned_loss=0.07092, over 2587066.98 frames. ], batch size: 95, lr: 2.54e-03, grad_scale: 32.0 2024-06-21 16:10:38,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. 
limit=10.0 2024-06-21 16:10:49,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=411559.5, ans=0.0 2024-06-21 16:10:52,921 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2024-06-21 16:10:54,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2024-06-21 16:11:02,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=411596.1666666667, ans=0.0 2024-06-21 16:11:03,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=411614.5, ans=0.0 2024-06-21 16:11:06,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411614.5, ans=0.1 2024-06-21 16:11:09,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=411614.5, ans=15.0 2024-06-21 16:11:10,000 INFO [train.py:1028] (0/2) Epoch 23, batch 1950, loss[loss=0.1866, simple_loss=0.2478, pruned_loss=0.0627, over 13267.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2547, pruned_loss=0.07079, over 2592548.61 frames. ], batch size: 52, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:11:11,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=411632.8333333333, ans=0.0 2024-06-21 16:11:23,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.09 vs. limit=10.0 2024-06-21 16:11:34,612 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.173e+02 2.354e+02 2.588e+02 3.801e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 16:11:42,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.43 vs. limit=15.0 2024-06-21 16:11:42,743 INFO [train.py:1028] (0/2) Epoch 23, batch 2000, loss[loss=0.1945, simple_loss=0.2563, pruned_loss=0.06641, over 12532.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.254, pruned_loss=0.07046, over 2588289.20 frames. ], batch size: 22, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:11:43,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=411724.5, ans=0.0 2024-06-21 16:11:43,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=411724.5, ans=0.0 2024-06-21 16:11:48,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=411724.5, ans=0.04949747468305833 2024-06-21 16:11:50,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411742.8333333333, ans=0.1 2024-06-21 16:11:52,345 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. 
limit=6.0 2024-06-21 16:12:14,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=411797.8333333333, ans=0.125 2024-06-21 16:12:19,546 INFO [train.py:1028] (0/2) Epoch 23, batch 2050, loss[loss=0.191, simple_loss=0.2483, pruned_loss=0.06683, over 12602.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2546, pruned_loss=0.07076, over 2584388.24 frames. ], batch size: 29, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:12:31,338 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.60 vs. limit=15.0 2024-06-21 16:12:36,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=411852.8333333333, ans=0.125 2024-06-21 16:12:44,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.165e+02 2.276e+02 2.490e+02 3.428e+02, threshold=4.553e+02, percent-clipped=0.0 2024-06-21 16:12:55,493 INFO [train.py:1028] (0/2) Epoch 23, batch 2100, loss[loss=0.1974, simple_loss=0.2539, pruned_loss=0.0704, over 13208.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2546, pruned_loss=0.07045, over 2587221.01 frames. ], batch size: 59, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:12:55,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=411907.8333333333, ans=0.0 2024-06-21 16:13:05,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411926.1666666667, ans=0.125 2024-06-21 16:13:28,857 INFO [train.py:1028] (0/2) Epoch 23, batch 2150, loss[loss=0.1935, simple_loss=0.2595, pruned_loss=0.06372, over 13272.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2551, pruned_loss=0.07041, over 2589283.78 frames. ], batch size: 52, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:13:31,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=411999.5, ans=0.0 2024-06-21 16:13:53,419 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.208e+02 2.368e+02 2.567e+02 3.186e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 16:13:53,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=412054.5, ans=0.5 2024-06-21 16:14:01,351 INFO [train.py:1028] (0/2) Epoch 23, batch 2200, loss[loss=0.2269, simple_loss=0.2767, pruned_loss=0.08858, over 13227.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2553, pruned_loss=0.07065, over 2589567.45 frames. ], batch size: 83, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:14:09,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=412109.5, ans=0.2 2024-06-21 16:14:17,754 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.49 vs. 
limit=10.0 2024-06-21 16:14:26,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=412146.1666666667, ans=0.125 2024-06-21 16:14:28,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412146.1666666667, ans=0.1 2024-06-21 16:14:29,320 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:14:34,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=412164.5, ans=0.125 2024-06-21 16:14:35,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=412182.8333333333, ans=0.2 2024-06-21 16:14:36,494 INFO [train.py:1028] (0/2) Epoch 23, batch 2250, loss[loss=0.202, simple_loss=0.2676, pruned_loss=0.06824, over 13324.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2556, pruned_loss=0.07075, over 2588508.80 frames. ], batch size: 63, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:14:49,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=412219.5, ans=0.0 2024-06-21 16:14:51,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=15.0 2024-06-21 16:14:52,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=412219.5, ans=0.125 2024-06-21 16:14:57,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=412219.5, ans=0.025 2024-06-21 16:15:00,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=412237.8333333333, ans=0.0 2024-06-21 16:15:03,733 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.191e+02 2.347e+02 2.542e+02 2.948e+02, threshold=4.694e+02, percent-clipped=0.0 2024-06-21 16:15:11,920 INFO [train.py:1028] (0/2) Epoch 23, batch 2300, loss[loss=0.1825, simple_loss=0.2431, pruned_loss=0.06091, over 12986.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2557, pruned_loss=0.07063, over 2583213.96 frames. ], batch size: 33, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:15:16,599 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:15:31,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=12.0 2024-06-21 16:15:37,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=412347.8333333333, ans=0.125 2024-06-21 16:15:44,358 INFO [train.py:1028] (0/2) Epoch 23, batch 2350, loss[loss=0.2036, simple_loss=0.2591, pruned_loss=0.0741, over 13271.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2558, pruned_loss=0.07088, over 2586675.61 frames. 
], batch size: 67, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:16:06,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=412421.1666666667, ans=0.125 2024-06-21 16:16:09,432 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.138e+02 2.256e+02 2.402e+02 3.027e+02, threshold=4.511e+02, percent-clipped=0.0 2024-06-21 16:16:20,120 INFO [train.py:1028] (0/2) Epoch 23, batch 2400, loss[loss=0.1935, simple_loss=0.2545, pruned_loss=0.06629, over 13293.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2549, pruned_loss=0.07065, over 2589926.79 frames. ], batch size: 46, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:16:29,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412476.1666666667, ans=0.1 2024-06-21 16:16:33,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=412494.5, ans=0.0 2024-06-21 16:16:35,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=412494.5, ans=0.07 2024-06-21 16:16:35,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2024-06-21 16:16:51,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=412531.1666666667, ans=0.0 2024-06-21 16:16:53,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=412531.1666666667, ans=0.0 2024-06-21 16:16:53,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412531.1666666667, ans=0.1 2024-06-21 16:16:54,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=412531.1666666667, ans=0.1 2024-06-21 16:16:55,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=412549.5, ans=0.0 2024-06-21 16:16:55,995 INFO [train.py:1028] (0/2) Epoch 23, batch 2450, loss[loss=0.1955, simple_loss=0.2547, pruned_loss=0.0682, over 13279.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2543, pruned_loss=0.0706, over 2586022.38 frames. ], batch size: 63, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:17:04,683 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2024-06-21 16:17:05,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.99 vs. limit=6.0 2024-06-21 16:17:13,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=412586.1666666667, ans=0.125 2024-06-21 16:17:20,677 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.161e+02 2.269e+02 2.540e+02 3.446e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-21 16:17:25,043 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.56 vs. 
limit=22.5 2024-06-21 16:17:28,553 INFO [train.py:1028] (0/2) Epoch 23, batch 2500, loss[loss=0.184, simple_loss=0.2359, pruned_loss=0.06607, over 13212.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2529, pruned_loss=0.07014, over 2589608.93 frames. ], batch size: 83, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:17:31,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=412641.1666666667, ans=0.125 2024-06-21 16:17:32,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=412641.1666666667, ans=0.125 2024-06-21 16:17:33,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=412641.1666666667, ans=0.125 2024-06-21 16:17:37,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=412659.5, ans=0.125 2024-06-21 16:17:40,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=412677.8333333333, ans=0.1 2024-06-21 16:17:48,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=412696.1666666667, ans=0.0 2024-06-21 16:17:51,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=412696.1666666667, ans=0.125 2024-06-21 16:17:55,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2024-06-21 16:17:59,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=412732.8333333333, ans=0.025 2024-06-21 16:18:00,538 INFO [train.py:1028] (0/2) Epoch 23, batch 2550, loss[loss=0.2114, simple_loss=0.2735, pruned_loss=0.07466, over 12512.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2522, pruned_loss=0.06991, over 2587912.05 frames. ], batch size: 22, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:18:06,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=412732.8333333333, ans=0.0 2024-06-21 16:18:07,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.34 vs. limit=15.0 2024-06-21 16:18:23,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=412787.8333333333, ans=0.0 2024-06-21 16:18:27,682 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.132e+02 2.240e+02 2.447e+02 2.935e+02, threshold=4.480e+02, percent-clipped=0.0 2024-06-21 16:18:38,018 INFO [train.py:1028] (0/2) Epoch 23, batch 2600, loss[loss=0.1871, simple_loss=0.2486, pruned_loss=0.06283, over 13350.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.251, pruned_loss=0.06976, over 2587200.37 frames. 
], batch size: 52, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:18:39,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412824.5, ans=0.1 2024-06-21 16:18:46,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=412842.8333333333, ans=0.025 2024-06-21 16:18:47,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=412842.8333333333, ans=0.125 2024-06-21 16:18:48,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=412842.8333333333, ans=0.125 2024-06-21 16:18:55,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.80 vs. limit=15.0 2024-06-21 16:18:58,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=412879.5, ans=0.125 2024-06-21 16:19:05,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.93 vs. limit=22.5 2024-06-21 16:19:08,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=412897.8333333333, ans=0.125 2024-06-21 16:19:10,563 INFO [train.py:1028] (0/2) Epoch 23, batch 2650, loss[loss=0.1955, simple_loss=0.2482, pruned_loss=0.07147, over 13005.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2494, pruned_loss=0.06912, over 2587008.22 frames. ], batch size: 144, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:19:13,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=412916.1666666667, ans=0.05 2024-06-21 16:19:19,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=412934.5, ans=0.0 2024-06-21 16:19:21,489 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2024-06-21 16:19:30,460 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:19:34,894 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.089e+02 2.242e+02 2.444e+02 2.931e+02, threshold=4.484e+02, percent-clipped=0.0 2024-06-21 16:19:39,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412989.5, ans=0.125 2024-06-21 16:19:40,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=412989.5, ans=0.2 2024-06-21 16:19:42,937 INFO [train.py:1028] (0/2) Epoch 23, batch 2700, loss[loss=0.1919, simple_loss=0.247, pruned_loss=0.06839, over 13304.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.248, pruned_loss=0.06884, over 2585754.19 frames. ], batch size: 89, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:19:52,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.72 vs. 
limit=22.5 2024-06-21 16:19:59,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=15.0 2024-06-21 16:20:04,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=413044.5, ans=0.0 2024-06-21 16:20:04,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=413044.5, ans=0.0 2024-06-21 16:20:07,015 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.52 vs. limit=15.0 2024-06-21 16:20:15,367 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.808e+01 2024-06-21 16:20:19,885 INFO [train.py:1028] (0/2) Epoch 23, batch 2750, loss[loss=0.234, simple_loss=0.2853, pruned_loss=0.09137, over 13233.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2475, pruned_loss=0.06839, over 2582192.54 frames. ], batch size: 43, lr: 2.53e-03, grad_scale: 16.0 2024-06-21 16:20:22,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=413099.5, ans=0.125 2024-06-21 16:20:28,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=413117.8333333333, ans=0.2 2024-06-21 16:20:30,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=413117.8333333333, ans=0.125 2024-06-21 16:20:35,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=413117.8333333333, ans=0.125 2024-06-21 16:20:49,127 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.079e+02 2.196e+02 2.353e+02 2.927e+02, threshold=4.392e+02, percent-clipped=0.0 2024-06-21 16:20:54,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=413172.8333333333, ans=0.0 2024-06-21 16:20:56,532 INFO [train.py:1028] (0/2) Epoch 23, batch 2800, loss[loss=0.1722, simple_loss=0.2134, pruned_loss=0.06553, over 10725.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2468, pruned_loss=0.06832, over 2579760.35 frames. ], batch size: 303, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:21:01,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=413191.1666666667, ans=0.5 2024-06-21 16:21:02,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=15.0 2024-06-21 16:21:06,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=6.0 2024-06-21 16:21:08,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.38 vs. limit=15.0 2024-06-21 16:21:12,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.17 vs. 
limit=22.5 2024-06-21 16:21:16,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2024-06-21 16:21:18,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=413246.1666666667, ans=0.125 2024-06-21 16:21:22,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=413264.5, ans=0.125 2024-06-21 16:21:22,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=413264.5, ans=0.125 2024-06-21 16:21:28,708 INFO [train.py:1028] (0/2) Epoch 23, batch 2850, loss[loss=0.1751, simple_loss=0.232, pruned_loss=0.05904, over 13291.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2457, pruned_loss=0.06798, over 2577360.34 frames. ], batch size: 49, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:21:34,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=413282.8333333333, ans=12.0 2024-06-21 16:21:35,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.44 vs. limit=15.0 2024-06-21 16:21:38,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=413301.1666666667, ans=0.1 2024-06-21 16:21:38,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=413301.1666666667, ans=0.125 2024-06-21 16:21:42,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=413319.5, ans=0.0 2024-06-21 16:21:51,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=413337.8333333333, ans=0.0 2024-06-21 16:21:56,706 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.097e+02 2.224e+02 2.385e+02 2.857e+02, threshold=4.449e+02, percent-clipped=0.0 2024-06-21 16:22:03,704 INFO [train.py:1028] (0/2) Epoch 23, batch 2900, loss[loss=0.1766, simple_loss=0.235, pruned_loss=0.05909, over 13141.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2444, pruned_loss=0.06767, over 2584758.37 frames. 
], batch size: 55, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:22:03,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=413374.5, ans=0.0 2024-06-21 16:22:09,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=413392.8333333333, ans=0.025 2024-06-21 16:22:16,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=413411.1666666667, ans=0.125 2024-06-21 16:22:26,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=413429.5, ans=0.0 2024-06-21 16:22:27,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=413429.5, ans=0.07 2024-06-21 16:22:30,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=413429.5, ans=0.125 2024-06-21 16:22:35,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=413447.8333333333, ans=0.0 2024-06-21 16:22:37,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=413447.8333333333, ans=0.125 2024-06-21 16:22:39,898 INFO [train.py:1028] (0/2) Epoch 23, batch 2950, loss[loss=0.1714, simple_loss=0.229, pruned_loss=0.05685, over 13236.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.244, pruned_loss=0.06729, over 2580187.99 frames. ], batch size: 43, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:22:40,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=413466.1666666667, ans=0.025 2024-06-21 16:22:41,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=413466.1666666667, ans=0.5 2024-06-21 16:22:45,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2024-06-21 16:22:55,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2024-06-21 16:23:03,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=413521.1666666667, ans=0.2 2024-06-21 16:23:05,977 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.067e+02 2.205e+02 2.441e+02 3.726e+02, threshold=4.411e+02, percent-clipped=0.0 2024-06-21 16:23:06,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=413539.5, ans=0.1 2024-06-21 16:23:12,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=413557.8333333333, ans=0.125 2024-06-21 16:23:13,437 INFO [train.py:1028] (0/2) Epoch 23, batch 3000, loss[loss=0.1886, simple_loss=0.2511, pruned_loss=0.06309, over 13217.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2431, pruned_loss=0.067, over 2578080.83 frames. 
], batch size: 59, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:23:13,439 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 16:23:18,114 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.1675, 2.4239, 2.8650, 1.7218], device='cuda:0') 2024-06-21 16:23:19,310 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5139, 2.1889, 4.1513, 3.8987], device='cuda:0') 2024-06-21 16:23:19,700 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.9968, 4.5933, 4.8953, 4.4221], device='cuda:0') 2024-06-21 16:23:21,499 INFO [train.py:1060] (0/2) Epoch 23, validation: loss=0.1874, simple_loss=0.2508, pruned_loss=0.06199, over 351949.00 frames. 2024-06-21 16:23:21,500 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 16:23:49,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=413612.8333333333, ans=0.015 2024-06-21 16:23:58,344 INFO [train.py:1028] (0/2) Epoch 23, batch 3050, loss[loss=0.1921, simple_loss=0.2476, pruned_loss=0.0683, over 13246.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2425, pruned_loss=0.06713, over 2577951.91 frames. ], batch size: 46, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:24:10,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=413686.1666666667, ans=0.125 2024-06-21 16:24:21,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=22.5 2024-06-21 16:24:29,191 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.067e+02 2.193e+02 2.332e+02 2.841e+02, threshold=4.385e+02, percent-clipped=0.0 2024-06-21 16:24:31,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=413722.8333333333, ans=0.1 2024-06-21 16:24:32,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=413722.8333333333, ans=0.125 2024-06-21 16:24:35,771 INFO [train.py:1028] (0/2) Epoch 23, batch 3100, loss[loss=0.1964, simple_loss=0.2527, pruned_loss=0.07008, over 13056.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.242, pruned_loss=0.06677, over 2578620.54 frames. ], batch size: 144, lr: 2.53e-03, grad_scale: 16.0 2024-06-21 16:24:41,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=413741.1666666667, ans=0.125 2024-06-21 16:24:49,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=413777.8333333333, ans=0.125 2024-06-21 16:25:05,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=413814.5, ans=0.125 2024-06-21 16:25:09,068 INFO [train.py:1028] (0/2) Epoch 23, batch 3150, loss[loss=0.2032, simple_loss=0.2524, pruned_loss=0.07703, over 12939.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2413, pruned_loss=0.0664, over 2580308.79 frames. 
], batch size: 158, lr: 2.53e-03, grad_scale: 16.0 2024-06-21 16:25:09,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=413832.8333333333, ans=0.05 2024-06-21 16:25:11,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=413832.8333333333, ans=0.025 2024-06-21 16:25:16,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=413851.1666666667, ans=0.09899494936611666 2024-06-21 16:25:19,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=413851.1666666667, ans=0.125 2024-06-21 16:25:35,658 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.083e+02 2.230e+02 2.420e+02 3.144e+02, threshold=4.460e+02, percent-clipped=0.0 2024-06-21 16:25:35,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=413906.1666666667, ans=0.07 2024-06-21 16:25:36,649 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.70 vs. limit=22.5 2024-06-21 16:25:42,121 INFO [train.py:1028] (0/2) Epoch 23, batch 3200, loss[loss=0.1617, simple_loss=0.2234, pruned_loss=0.04997, over 13195.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2402, pruned_loss=0.06611, over 2581239.03 frames. ], batch size: 55, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:25:52,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=413942.8333333333, ans=0.2 2024-06-21 16:26:05,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=413979.5, ans=0.0 2024-06-21 16:26:09,254 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.39 vs. limit=15.0 2024-06-21 16:26:17,047 INFO [train.py:1028] (0/2) Epoch 23, batch 3250, loss[loss=0.1929, simple_loss=0.2437, pruned_loss=0.07104, over 13088.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2397, pruned_loss=0.06611, over 2585434.53 frames. ], batch size: 71, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:26:23,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414034.5, ans=0.1 2024-06-21 16:26:25,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=414034.5, ans=0.125 2024-06-21 16:26:41,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=414071.1666666667, ans=0.125 2024-06-21 16:26:42,068 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. 
limit=6.0 2024-06-21 16:26:44,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=414071.1666666667, ans=0.125 2024-06-21 16:26:46,427 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.055e+02 2.163e+02 2.294e+02 3.373e+02, threshold=4.326e+02, percent-clipped=0.0 2024-06-21 16:26:48,678 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:26:51,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414089.5, ans=0.1 2024-06-21 16:26:52,007 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:26:53,140 INFO [train.py:1028] (0/2) Epoch 23, batch 3300, loss[loss=0.1956, simple_loss=0.2472, pruned_loss=0.07206, over 12771.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.239, pruned_loss=0.06583, over 2581806.13 frames. ], batch size: 176, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:26:53,270 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:26:58,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=414107.8333333333, ans=0.125 2024-06-21 16:27:17,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=414162.8333333333, ans=0.125 2024-06-21 16:27:25,132 INFO [train.py:1028] (0/2) Epoch 23, batch 3350, loss[loss=0.1886, simple_loss=0.2364, pruned_loss=0.07043, over 12907.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2387, pruned_loss=0.06593, over 2577469.30 frames. 
], batch size: 158, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:27:40,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=414217.8333333333, ans=0.125 2024-06-21 16:27:44,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=414236.1666666667, ans=0.0 2024-06-21 16:27:46,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=414236.1666666667, ans=0.125 2024-06-21 16:27:47,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=414254.5, ans=0.0 2024-06-21 16:27:48,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=414254.5, ans=0.125 2024-06-21 16:27:51,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=414254.5, ans=0.125 2024-06-21 16:27:51,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=414254.5, ans=0.09899494936611666 2024-06-21 16:27:53,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=414254.5, ans=0.125 2024-06-21 16:27:54,165 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.155e+02 2.244e+02 2.442e+02 2.825e+02, threshold=4.487e+02, percent-clipped=0.0 2024-06-21 16:27:56,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=414272.8333333333, ans=0.125 2024-06-21 16:27:58,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=414272.8333333333, ans=0.025 2024-06-21 16:28:00,680 INFO [train.py:1028] (0/2) Epoch 23, batch 3400, loss[loss=0.1945, simple_loss=0.2629, pruned_loss=0.06298, over 12814.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2388, pruned_loss=0.06639, over 2577724.52 frames. ], batch size: 22, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:28:17,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=414327.8333333333, ans=0.1 2024-06-21 16:28:25,056 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:28:28,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=414346.1666666667, ans=0.125 2024-06-21 16:28:36,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=414382.8333333333, ans=0.125 2024-06-21 16:28:36,549 INFO [train.py:1028] (0/2) Epoch 23, batch 3450, loss[loss=0.1999, simple_loss=0.2497, pruned_loss=0.07503, over 12840.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2385, pruned_loss=0.06608, over 2578470.14 frames. 
], batch size: 176, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:28:39,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=414382.8333333333, ans=0.04949747468305833 2024-06-21 16:28:50,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=414419.5, ans=0.125 2024-06-21 16:28:56,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=15.0 2024-06-21 16:28:57,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=414437.8333333333, ans=0.2 2024-06-21 16:29:02,272 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.105e+02 2.269e+02 2.452e+02 3.193e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-21 16:29:09,116 INFO [train.py:1028] (0/2) Epoch 23, batch 3500, loss[loss=0.167, simple_loss=0.2276, pruned_loss=0.05323, over 13055.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2378, pruned_loss=0.06548, over 2577704.77 frames. ], batch size: 33, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:29:14,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. limit=6.0 2024-06-21 16:29:15,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=414492.8333333333, ans=0.125 2024-06-21 16:29:19,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=414492.8333333333, ans=0.0 2024-06-21 16:29:26,761 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.57 vs. limit=6.0 2024-06-21 16:29:31,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.67 vs. limit=10.0 2024-06-21 16:29:32,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=414529.5, ans=0.0 2024-06-21 16:29:43,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414547.8333333333, ans=0.1 2024-06-21 16:29:45,750 INFO [train.py:1028] (0/2) Epoch 23, batch 3550, loss[loss=0.1688, simple_loss=0.2139, pruned_loss=0.06182, over 13100.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.237, pruned_loss=0.0651, over 2578973.44 frames. 
], batch size: 95, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:29:48,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=414566.1666666667, ans=0.125 2024-06-21 16:29:50,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=414566.1666666667, ans=0.125 2024-06-21 16:30:11,213 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.679e+02 2.067e+02 2.199e+02 2.413e+02 3.033e+02, threshold=4.399e+02, percent-clipped=0.0 2024-06-21 16:30:16,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=414639.5, ans=0.125 2024-06-21 16:30:17,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=15.0 2024-06-21 16:30:21,198 INFO [train.py:1028] (0/2) Epoch 23, batch 3600, loss[loss=0.1735, simple_loss=0.2308, pruned_loss=0.05809, over 13328.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2368, pruned_loss=0.0651, over 2582294.55 frames. ], batch size: 49, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:30:28,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=414676.1666666667, ans=0.0 2024-06-21 16:30:32,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=414676.1666666667, ans=0.125 2024-06-21 16:30:43,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=414712.8333333333, ans=0.125 2024-06-21 16:30:50,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-21 16:30:52,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414731.1666666667, ans=0.1 2024-06-21 16:30:55,895 INFO [train.py:1028] (0/2) Epoch 23, batch 3650, loss[loss=0.1648, simple_loss=0.2102, pruned_loss=0.05968, over 13000.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2366, pruned_loss=0.06478, over 2579127.31 frames. ], batch size: 102, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:30:59,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=414749.5, ans=0.0 2024-06-21 16:31:19,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=414804.5, ans=0.125 2024-06-21 16:31:22,553 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.067e+02 2.165e+02 2.287e+02 3.362e+02, threshold=4.330e+02, percent-clipped=0.0 2024-06-21 16:31:29,560 INFO [train.py:1028] (0/2) Epoch 23, batch 3700, loss[loss=0.1637, simple_loss=0.2181, pruned_loss=0.0547, over 13227.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2363, pruned_loss=0.06448, over 2583575.26 frames. 
], batch size: 72, lr: 2.53e-03, grad_scale: 32.0 2024-06-21 16:31:41,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=414859.5, ans=0.1 2024-06-21 16:32:02,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=414914.5, ans=0.0 2024-06-21 16:32:07,511 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.43 vs. limit=22.5 2024-06-21 16:32:07,797 INFO [train.py:1028] (0/2) Epoch 23, batch 3750, loss[loss=0.1964, simple_loss=0.2607, pruned_loss=0.06608, over 12577.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2357, pruned_loss=0.06409, over 2585757.51 frames. ], batch size: 22, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:32:10,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=414932.8333333333, ans=0.125 2024-06-21 16:32:19,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=414951.1666666667, ans=0.0 2024-06-21 16:32:22,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414969.5, ans=0.1 2024-06-21 16:32:32,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=414987.8333333333, ans=0.125 2024-06-21 16:32:36,931 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.032e+02 2.163e+02 2.344e+02 3.282e+02, threshold=4.326e+02, percent-clipped=0.0 2024-06-21 16:32:39,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=415006.1666666667, ans=0.0 2024-06-21 16:32:40,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=415006.1666666667, ans=0.125 2024-06-21 16:32:41,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=415006.1666666667, ans=0.125 2024-06-21 16:32:42,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=415006.1666666667, ans=0.2 2024-06-21 16:32:43,783 INFO [train.py:1028] (0/2) Epoch 23, batch 3800, loss[loss=0.1952, simple_loss=0.236, pruned_loss=0.07718, over 13135.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.236, pruned_loss=0.06445, over 2584689.97 frames. ], batch size: 83, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:32:54,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=415042.8333333333, ans=0.125 2024-06-21 16:32:58,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=415061.1666666667, ans=0.125 2024-06-21 16:33:06,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415079.5, ans=0.1 2024-06-21 16:33:17,569 INFO [train.py:1028] (0/2) Epoch 23, batch 3850, loss[loss=0.1821, simple_loss=0.226, pruned_loss=0.06912, over 12966.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2356, pruned_loss=0.06411, over 2584520.70 frames. 
], batch size: 144, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:33:28,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.78 vs. limit=6.0 2024-06-21 16:33:29,077 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. limit=6.0 2024-06-21 16:33:31,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415152.8333333333, ans=0.1 2024-06-21 16:33:33,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2024-06-21 16:33:36,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2024-06-21 16:33:38,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=415171.1666666667, ans=0.125 2024-06-21 16:33:42,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=415171.1666666667, ans=15.0 2024-06-21 16:33:42,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=415189.5, ans=0.0 2024-06-21 16:33:42,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=415189.5, ans=0.0 2024-06-21 16:33:42,834 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.064e+02 2.244e+02 2.445e+02 3.288e+02, threshold=4.488e+02, percent-clipped=0.0 2024-06-21 16:33:44,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=415189.5, ans=0.0 2024-06-21 16:33:45,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2024-06-21 16:33:48,975 INFO [train.py:1028] (0/2) Epoch 23, batch 3900, loss[loss=0.1869, simple_loss=0.2363, pruned_loss=0.06877, over 13216.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2357, pruned_loss=0.06437, over 2587623.11 frames. ], batch size: 83, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:33:57,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=415226.1666666667, ans=0.125 2024-06-21 16:34:04,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=415244.5, ans=0.125 2024-06-21 16:34:15,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.19 vs. 
limit=10.0 2024-06-21 16:34:15,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=415262.8333333333, ans=0.125 2024-06-21 16:34:19,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=415281.1666666667, ans=0.125 2024-06-21 16:34:23,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=415281.1666666667, ans=0.0 2024-06-21 16:34:23,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=415281.1666666667, ans=0.0 2024-06-21 16:34:23,465 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.80 vs. limit=22.5 2024-06-21 16:34:24,480 INFO [train.py:1028] (0/2) Epoch 23, batch 3950, loss[loss=0.1953, simple_loss=0.2411, pruned_loss=0.07478, over 13082.00 frames. ], tot_loss[loss=0.1814, simple_loss=0.2348, pruned_loss=0.06397, over 2589028.13 frames. ], batch size: 132, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:34:38,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=415317.8333333333, ans=0.125 2024-06-21 16:34:45,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=415336.1666666667, ans=0.0 2024-06-21 16:34:50,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=415354.5, ans=0.0 2024-06-21 16:34:52,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=415372.8333333333, ans=0.125 2024-06-21 16:34:53,091 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.032e+02 2.128e+02 2.271e+02 3.267e+02, threshold=4.257e+02, percent-clipped=0.0 2024-06-21 16:34:59,678 INFO [train.py:1028] (0/2) Epoch 23, batch 4000, loss[loss=0.1688, simple_loss=0.2264, pruned_loss=0.05557, over 12943.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2342, pruned_loss=0.06384, over 2584230.96 frames. ], batch size: 39, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:35:04,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=415391.1666666667, ans=0.0 2024-06-21 16:35:08,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.36 vs. limit=22.5 2024-06-21 16:35:24,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=415446.1666666667, ans=0.0 2024-06-21 16:35:28,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=415464.5, ans=0.125 2024-06-21 16:35:33,804 INFO [train.py:1028] (0/2) Epoch 23, batch 4050, loss[loss=0.1997, simple_loss=0.2469, pruned_loss=0.07621, over 10894.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.2339, pruned_loss=0.06388, over 2581759.00 frames. 
], batch size: 304, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:35:37,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=415482.8333333333, ans=0.07 2024-06-21 16:35:45,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2024-06-21 16:35:45,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.92 vs. limit=10.0 2024-06-21 16:36:03,089 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.014e+02 2.190e+02 2.392e+02 3.266e+02, threshold=4.380e+02, percent-clipped=0.0 2024-06-21 16:36:07,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0 2024-06-21 16:36:09,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=415574.5, ans=0.04949747468305833 2024-06-21 16:36:09,685 INFO [train.py:1028] (0/2) Epoch 23, batch 4100, loss[loss=0.1895, simple_loss=0.2376, pruned_loss=0.07065, over 13041.00 frames. ], tot_loss[loss=0.1814, simple_loss=0.2343, pruned_loss=0.06431, over 2577179.20 frames. ], batch size: 102, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:36:09,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=415574.5, ans=0.125 2024-06-21 16:36:20,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2024-06-21 16:36:27,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=415611.1666666667, ans=0.125 2024-06-21 16:36:29,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=415629.5, ans=0.125 2024-06-21 16:36:31,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=415629.5, ans=0.125 2024-06-21 16:36:36,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=415629.5, ans=0.09899494936611666 2024-06-21 16:36:36,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=415629.5, ans=0.125 2024-06-21 16:36:36,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.96 vs. limit=15.0 2024-06-21 16:36:43,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=415647.8333333333, ans=0.0 2024-06-21 16:36:43,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=415647.8333333333, ans=0.125 2024-06-21 16:36:45,365 INFO [train.py:1028] (0/2) Epoch 23, batch 4150, loss[loss=0.1677, simple_loss=0.2246, pruned_loss=0.05545, over 13108.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.2339, pruned_loss=0.06386, over 2577590.96 frames. 
], batch size: 55, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:36:45,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=415666.1666666667, ans=0.125 2024-06-21 16:36:54,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=415684.5, ans=0.025 2024-06-21 16:36:54,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=415684.5, ans=0.0 2024-06-21 16:36:57,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=415684.5, ans=0.0 2024-06-21 16:36:59,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=415702.8333333333, ans=0.2 2024-06-21 16:37:00,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.62 vs. limit=10.0 2024-06-21 16:37:05,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=415721.1666666667, ans=0.0 2024-06-21 16:37:11,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=415739.5, ans=0.125 2024-06-21 16:37:12,231 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.836e+02 2.067e+02 2.180e+02 2.371e+02 2.955e+02, threshold=4.361e+02, percent-clipped=0.0 2024-06-21 16:37:12,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=415739.5, ans=0.125 2024-06-21 16:37:15,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=415739.5, ans=0.95 2024-06-21 16:37:18,718 INFO [train.py:1028] (0/2) Epoch 23, batch 4200, loss[loss=0.1731, simple_loss=0.2195, pruned_loss=0.06332, over 13131.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2336, pruned_loss=0.06385, over 2579948.05 frames. ], batch size: 103, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:37:18,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=415757.8333333333, ans=0.1 2024-06-21 16:37:24,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=415776.1666666667, ans=0.125 2024-06-21 16:37:34,534 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2024-06-21 16:37:37,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=15.0 2024-06-21 16:37:48,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.84 vs. limit=10.0 2024-06-21 16:37:55,105 INFO [train.py:1028] (0/2) Epoch 23, batch 4250, loss[loss=0.1702, simple_loss=0.2257, pruned_loss=0.05736, over 13306.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2334, pruned_loss=0.06337, over 2582878.04 frames. 
], batch size: 46, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:37:55,630 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.40 vs. limit=22.5 2024-06-21 16:38:03,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=415867.8333333333, ans=12.0 2024-06-21 16:38:21,486 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 2.017e+02 2.105e+02 2.277e+02 3.385e+02, threshold=4.210e+02, percent-clipped=0.0 2024-06-21 16:38:22,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2024-06-21 16:38:22,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=415922.8333333333, ans=0.0 2024-06-21 16:38:27,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=415941.1666666667, ans=0.125 2024-06-21 16:38:28,107 INFO [train.py:1028] (0/2) Epoch 23, batch 4300, loss[loss=0.1783, simple_loss=0.2334, pruned_loss=0.06157, over 13229.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.233, pruned_loss=0.06332, over 2582662.46 frames. ], batch size: 59, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:38:30,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=415941.1666666667, ans=0.07 2024-06-21 16:39:01,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.64 vs. limit=15.0 2024-06-21 16:39:02,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=416014.5, ans=0.125 2024-06-21 16:39:03,992 INFO [train.py:1028] (0/2) Epoch 23, batch 4350, loss[loss=0.1662, simple_loss=0.2256, pruned_loss=0.05344, over 13145.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2325, pruned_loss=0.06333, over 2586662.88 frames. 
], batch size: 59, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:39:12,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=416051.1666666667, ans=0.125 2024-06-21 16:39:19,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=416069.5, ans=0.125 2024-06-21 16:39:20,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=416069.5, ans=0.2 2024-06-21 16:39:25,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=416087.8333333333, ans=0.125 2024-06-21 16:39:28,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416087.8333333333, ans=0.1 2024-06-21 16:39:29,876 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.086e+02 2.197e+02 2.398e+02 3.484e+02, threshold=4.393e+02, percent-clipped=0.0 2024-06-21 16:39:32,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=416106.1666666667, ans=0.125 2024-06-21 16:39:33,965 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2024-06-21 16:39:36,902 INFO [train.py:1028] (0/2) Epoch 23, batch 4400, loss[loss=0.1817, simple_loss=0.231, pruned_loss=0.06616, over 13232.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2323, pruned_loss=0.06347, over 2586891.55 frames. ], batch size: 83, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:39:46,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=416142.8333333333, ans=0.2 2024-06-21 16:40:00,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=416179.5, ans=0.125 2024-06-21 16:40:03,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=416179.5, ans=0.2 2024-06-21 16:40:04,787 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:40:05,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.59 vs. limit=12.0 2024-06-21 16:40:06,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416197.8333333333, ans=0.1 2024-06-21 16:40:09,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=416197.8333333333, ans=0.125 2024-06-21 16:40:11,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.01 vs. limit=22.5 2024-06-21 16:40:13,468 INFO [train.py:1028] (0/2) Epoch 23, batch 4450, loss[loss=0.1745, simple_loss=0.2317, pruned_loss=0.0587, over 12946.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2324, pruned_loss=0.06379, over 2580774.29 frames. 
], batch size: 33, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:40:16,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=416216.1666666667, ans=0.0 2024-06-21 16:40:16,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=416216.1666666667, ans=0.04949747468305833 2024-06-21 16:40:24,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=416234.5, ans=0.0 2024-06-21 16:40:29,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=416252.8333333333, ans=0.0 2024-06-21 16:40:38,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=416271.1666666667, ans=0.0 2024-06-21 16:40:42,530 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.158e+02 2.289e+02 2.418e+02 3.102e+02, threshold=4.579e+02, percent-clipped=0.0 2024-06-21 16:40:49,052 INFO [train.py:1028] (0/2) Epoch 23, batch 4500, loss[loss=0.1705, simple_loss=0.2228, pruned_loss=0.05915, over 13225.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2321, pruned_loss=0.0637, over 2584821.67 frames. ], batch size: 89, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:40:53,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=416307.8333333333, ans=0.0 2024-06-21 16:40:55,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=416326.1666666667, ans=0.125 2024-06-21 16:41:01,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=416344.5, ans=0.0 2024-06-21 16:41:02,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=416344.5, ans=0.1 2024-06-21 16:41:04,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=416344.5, ans=0.0 2024-06-21 16:41:08,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=416362.8333333333, ans=0.125 2024-06-21 16:41:13,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=416362.8333333333, ans=0.125 2024-06-21 16:41:22,281 INFO [train.py:1028] (0/2) Epoch 23, batch 4550, loss[loss=0.1911, simple_loss=0.2484, pruned_loss=0.06688, over 13267.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.232, pruned_loss=0.06356, over 2588775.51 frames. 
], batch size: 52, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:41:27,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=416399.5, ans=0.0 2024-06-21 16:41:38,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1.whitening_limit, batch_count=416436.1666666667, ans=10.0 2024-06-21 16:41:46,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=416454.5, ans=0.125 2024-06-21 16:41:51,698 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.060e+02 2.246e+02 2.493e+02 3.852e+02, threshold=4.493e+02, percent-clipped=0.0 2024-06-21 16:41:55,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=416472.8333333333, ans=0.0 2024-06-21 16:41:56,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=416472.8333333333, ans=0.1 2024-06-21 16:41:58,189 INFO [train.py:1028] (0/2) Epoch 23, batch 4600, loss[loss=0.1746, simple_loss=0.2261, pruned_loss=0.06155, over 12476.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2319, pruned_loss=0.06348, over 2584294.82 frames. ], batch size: 202, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:41:59,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=416491.1666666667, ans=0.0 2024-06-21 16:41:59,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=416491.1666666667, ans=0.125 2024-06-21 16:42:05,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2024-06-21 16:42:08,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=416509.5, ans=0.95 2024-06-21 16:42:09,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2024-06-21 16:42:10,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=416509.5, ans=0.0 2024-06-21 16:42:14,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=416527.8333333333, ans=0.0 2024-06-21 16:42:30,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.46 vs. limit=22.5 2024-06-21 16:42:31,100 INFO [train.py:1028] (0/2) Epoch 23, batch 4650, loss[loss=0.1697, simple_loss=0.2147, pruned_loss=0.06238, over 13151.00 frames. ], tot_loss[loss=0.1789, simple_loss=0.2313, pruned_loss=0.06327, over 2587616.05 frames. 
], batch size: 132, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:42:31,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=416582.8333333333, ans=0.015 2024-06-21 16:42:32,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=416582.8333333333, ans=0.2 2024-06-21 16:42:33,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=416582.8333333333, ans=0.125 2024-06-21 16:42:39,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=416582.8333333333, ans=0.125 2024-06-21 16:42:42,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.43 vs. limit=15.0 2024-06-21 16:42:46,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416601.1666666667, ans=0.125 2024-06-21 16:42:49,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.42 vs. limit=10.0 2024-06-21 16:42:50,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.43 vs. limit=15.0 2024-06-21 16:42:53,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=416637.8333333333, ans=0.125 2024-06-21 16:42:56,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=416637.8333333333, ans=0.1 2024-06-21 16:43:00,082 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.045e+02 2.200e+02 2.369e+02 2.947e+02, threshold=4.400e+02, percent-clipped=0.0 2024-06-21 16:43:04,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416656.1666666667, ans=0.1 2024-06-21 16:43:06,820 INFO [train.py:1028] (0/2) Epoch 23, batch 4700, loss[loss=0.1517, simple_loss=0.2122, pruned_loss=0.04563, over 12339.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2318, pruned_loss=0.06341, over 2581854.55 frames. 
], batch size: 25, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:43:07,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416674.5, ans=0.125 2024-06-21 16:43:08,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416674.5, ans=0.1 2024-06-21 16:43:10,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=416674.5, ans=0.125 2024-06-21 16:43:13,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416692.8333333333, ans=0.1 2024-06-21 16:43:21,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416711.1666666667, ans=0.1 2024-06-21 16:43:22,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=416711.1666666667, ans=0.125 2024-06-21 16:43:31,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.26 vs. limit=6.0 2024-06-21 16:43:39,981 INFO [train.py:1028] (0/2) Epoch 23, batch 4750, loss[loss=0.1951, simple_loss=0.2398, pruned_loss=0.07522, over 12519.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2319, pruned_loss=0.06378, over 2578797.27 frames. ], batch size: 202, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:43:42,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=416766.1666666667, ans=0.125 2024-06-21 16:43:59,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=416802.8333333333, ans=0.125 2024-06-21 16:43:59,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=416802.8333333333, ans=0.125 2024-06-21 16:44:01,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=416802.8333333333, ans=0.0 2024-06-21 16:44:01,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=416802.8333333333, ans=0.0 2024-06-21 16:44:03,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=416821.1666666667, ans=0.05 2024-06-21 16:44:06,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.31 vs. limit=10.0 2024-06-21 16:44:09,688 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.052e+02 2.194e+02 2.357e+02 3.097e+02, threshold=4.388e+02, percent-clipped=0.0 2024-06-21 16:44:11,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=416839.5, ans=0.125 2024-06-21 16:44:16,128 INFO [train.py:1028] (0/2) Epoch 23, batch 4800, loss[loss=0.1621, simple_loss=0.2155, pruned_loss=0.05439, over 13278.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2315, pruned_loss=0.06342, over 2576143.76 frames. 
], batch size: 63, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:44:34,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=416894.5, ans=0.0 2024-06-21 16:44:35,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=416894.5, ans=0.125 2024-06-21 16:44:39,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416912.8333333333, ans=0.1 2024-06-21 16:44:40,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.25 vs. limit=15.0 2024-06-21 16:44:51,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=416931.1666666667, ans=0.0 2024-06-21 16:44:52,913 INFO [train.py:1028] (0/2) Epoch 23, batch 4850, loss[loss=0.179, simple_loss=0.2326, pruned_loss=0.06266, over 13221.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2315, pruned_loss=0.06338, over 2573291.91 frames. ], batch size: 89, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:45:07,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.41 vs. limit=22.5 2024-06-21 16:45:09,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416986.1666666667, ans=0.1 2024-06-21 16:45:17,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.30 vs. limit=15.0 2024-06-21 16:45:19,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=417022.8333333333, ans=0.0 2024-06-21 16:45:20,060 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.009e+02 2.120e+02 2.254e+02 3.202e+02, threshold=4.240e+02, percent-clipped=0.0 2024-06-21 16:45:26,689 INFO [train.py:1028] (0/2) Epoch 23, batch 4900, loss[loss=0.1573, simple_loss=0.2193, pruned_loss=0.04761, over 13230.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2314, pruned_loss=0.06329, over 2574600.68 frames. ], batch size: 59, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:45:29,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.47 vs. limit=15.0 2024-06-21 16:45:31,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=417041.1666666667, ans=0.125 2024-06-21 16:45:32,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=417041.1666666667, ans=0.125 2024-06-21 16:45:52,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=417096.1666666667, ans=0.0 2024-06-21 16:46:04,972 INFO [train.py:1028] (0/2) Epoch 23, batch 4950, loss[loss=0.193, simple_loss=0.2353, pruned_loss=0.07535, over 11078.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2315, pruned_loss=0.0636, over 2570346.87 frames. 
], batch size: 304, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:46:12,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=417151.1666666667, ans=0.0 2024-06-21 16:46:13,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=417151.1666666667, ans=0.125 2024-06-21 16:46:14,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=417151.1666666667, ans=0.125 2024-06-21 16:46:21,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=417169.5, ans=0.0 2024-06-21 16:46:23,316 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=15.0 2024-06-21 16:46:23,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417187.8333333333, ans=0.1 2024-06-21 16:46:27,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=417187.8333333333, ans=0.125 2024-06-21 16:46:30,891 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.018e+02 2.135e+02 2.283e+02 2.656e+02, threshold=4.270e+02, percent-clipped=0.0 2024-06-21 16:46:37,222 INFO [train.py:1028] (0/2) Epoch 23, batch 5000, loss[loss=0.1795, simple_loss=0.222, pruned_loss=0.06851, over 13116.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2314, pruned_loss=0.06356, over 2573836.71 frames. ], batch size: 95, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:46:53,286 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.78 vs. limit=15.0 2024-06-21 16:47:00,283 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.35 vs. limit=6.0 2024-06-21 16:47:01,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=417279.5, ans=0.125 2024-06-21 16:47:10,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=417297.8333333333, ans=0.125 2024-06-21 16:47:14,656 INFO [train.py:1028] (0/2) Epoch 23, batch 5050, loss[loss=0.1642, simple_loss=0.2216, pruned_loss=0.05337, over 12828.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2312, pruned_loss=0.06316, over 2573469.86 frames. ], batch size: 36, lr: 2.52e-03, grad_scale: 32.0 2024-06-21 16:47:17,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.78 vs. 
limit=15.0 2024-06-21 16:47:20,239 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:47:24,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=417334.5, ans=0.2 2024-06-21 16:47:41,329 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.113e+02 2.228e+02 2.433e+02 3.192e+02, threshold=4.457e+02, percent-clipped=0.0 2024-06-21 16:47:41,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=417389.5, ans=0.2 2024-06-21 16:47:51,526 INFO [train.py:1028] (0/2) Epoch 23, batch 5100, loss[loss=0.1883, simple_loss=0.2487, pruned_loss=0.06393, over 13251.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2308, pruned_loss=0.06322, over 2568831.02 frames. ], batch size: 40, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:47:53,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=417407.8333333333, ans=0.0 2024-06-21 16:47:58,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=417426.1666666667, ans=0.0 2024-06-21 16:47:59,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=417426.1666666667, ans=0.1 2024-06-21 16:48:01,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.63 vs. limit=6.0 2024-06-21 16:48:10,753 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:48:19,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=417481.1666666667, ans=0.1 2024-06-21 16:48:21,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=417481.1666666667, ans=22.5 2024-06-21 16:48:23,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=417481.1666666667, ans=0.0 2024-06-21 16:48:24,806 INFO [train.py:1028] (0/2) Epoch 23, batch 5150, loss[loss=0.1873, simple_loss=0.2276, pruned_loss=0.07355, over 13115.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2306, pruned_loss=0.0635, over 2570833.12 frames. ], batch size: 132, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:48:39,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=417517.8333333333, ans=0.2 2024-06-21 16:48:41,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. 
limit=15.0 2024-06-21 16:48:50,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=417554.5, ans=0.2 2024-06-21 16:48:50,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=417554.5, ans=0.125 2024-06-21 16:48:54,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=417572.8333333333, ans=0.125 2024-06-21 16:48:54,815 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.029e+02 2.159e+02 2.371e+02 3.039e+02, threshold=4.317e+02, percent-clipped=0.0 2024-06-21 16:49:00,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=417572.8333333333, ans=0.025 2024-06-21 16:49:01,409 INFO [train.py:1028] (0/2) Epoch 23, batch 5200, loss[loss=0.1786, simple_loss=0.2274, pruned_loss=0.06493, over 13154.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2308, pruned_loss=0.06335, over 2573902.62 frames. ], batch size: 95, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:49:04,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=417591.1666666667, ans=0.125 2024-06-21 16:49:05,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.18 vs. limit=10.0 2024-06-21 16:49:17,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=417627.8333333333, ans=0.125 2024-06-21 16:49:23,455 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.85 vs. limit=22.5 2024-06-21 16:49:33,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=417664.5, ans=0.95 2024-06-21 16:49:34,787 INFO [train.py:1028] (0/2) Epoch 23, batch 5250, loss[loss=0.1732, simple_loss=0.2235, pruned_loss=0.06149, over 13265.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2309, pruned_loss=0.06352, over 2570823.45 frames. ], batch size: 52, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:49:58,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=417737.8333333333, ans=10.0 2024-06-21 16:49:59,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=417737.8333333333, ans=0.125 2024-06-21 16:50:04,343 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.065e+02 2.283e+02 2.534e+02 3.251e+02, threshold=4.566e+02, percent-clipped=0.0 2024-06-21 16:50:11,018 INFO [train.py:1028] (0/2) Epoch 23, batch 5300, loss[loss=0.1736, simple_loss=0.2207, pruned_loss=0.06329, over 13030.00 frames. ], tot_loss[loss=0.1784, simple_loss=0.2305, pruned_loss=0.06311, over 2568374.45 frames. ], batch size: 144, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:50:13,474 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.43 vs. 
limit=15.0 2024-06-21 16:50:15,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=417774.5, ans=0.0 2024-06-21 16:50:16,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=417774.5, ans=0.025 2024-06-21 16:50:16,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2024-06-21 16:50:48,437 INFO [train.py:1028] (0/2) Epoch 23, batch 5350, loss[loss=0.1916, simple_loss=0.2479, pruned_loss=0.06762, over 12328.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2302, pruned_loss=0.06302, over 2575843.85 frames. ], batch size: 18, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:50:48,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=417866.1666666667, ans=0.125 2024-06-21 16:50:48,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.72 vs. limit=6.0 2024-06-21 16:50:51,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=417866.1666666667, ans=0.0 2024-06-21 16:51:07,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=417921.1666666667, ans=0.025 2024-06-21 16:51:12,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=417921.1666666667, ans=0.1 2024-06-21 16:51:12,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=417921.1666666667, ans=0.0 2024-06-21 16:51:13,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=417939.5, ans=0.0 2024-06-21 16:51:14,398 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.062e+02 2.185e+02 2.313e+02 2.927e+02, threshold=4.369e+02, percent-clipped=0.0 2024-06-21 16:51:19,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=417939.5, ans=0.0 2024-06-21 16:51:20,899 INFO [train.py:1028] (0/2) Epoch 23, batch 5400, loss[loss=0.1901, simple_loss=0.2354, pruned_loss=0.07242, over 12322.00 frames. ], tot_loss[loss=0.1789, simple_loss=0.2307, pruned_loss=0.06352, over 2568170.22 frames. ], batch size: 241, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:51:23,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.93 vs. 
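Annotation: the scaling.py:1023 Whitening lines compare a covariance metric against a limit (e.g. metric=3.06 vs. limit=6.0 for whiten_keys with num_groups=4, num_channels=128 above). The sketch below assumes the metric measures how far each group's feature covariance is from a multiple of the identity, with 1.0 meaning perfectly "white"; it paraphrases the idea rather than copying scaling.py.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Sketch: >= 1.0, and equal to 1.0 iff each channel group's
    covariance is a multiple of the identity (features are white)."""
    x = x.reshape(-1, x.shape[-1])                 # (frames, channels)
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups               # channels per group
    x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)            # centre per group
    covar = torch.matmul(x.transpose(1, 2), x)     # (groups, cpg, cpg)
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    mean_sq = (covar ** 2).sum() / (num_groups * cpg)
    return (mean_sq / (mean_diag ** 2 + 1e-20)).item()

# White-by-construction features give a metric close to 1.0; correlated
# features push it toward the logged limits (6.0, 15.0, 22.5, ...).
feats = torch.randn(2000, 128)
print(whitening_metric(feats, num_groups=4))
```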
limit=15.0 2024-06-21 16:51:34,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=417994.5, ans=0.0 2024-06-21 16:51:35,698 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-228000.pt 2024-06-21 16:51:44,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=417994.5, ans=0.2 2024-06-21 16:51:46,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=418012.8333333333, ans=0.125 2024-06-21 16:51:49,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=418012.8333333333, ans=0.125 2024-06-21 16:51:59,116 INFO [train.py:1028] (0/2) Epoch 23, batch 5450, loss[loss=0.1817, simple_loss=0.2393, pruned_loss=0.06209, over 12386.00 frames. ], tot_loss[loss=0.179, simple_loss=0.231, pruned_loss=0.06349, over 2571072.59 frames. ], batch size: 25, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:52:28,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.036e+02 2.159e+02 2.324e+02 2.858e+02, threshold=4.318e+02, percent-clipped=0.0 2024-06-21 16:52:30,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=418122.8333333333, ans=0.0 2024-06-21 16:52:35,499 INFO [train.py:1028] (0/2) Epoch 23, batch 5500, loss[loss=0.1969, simple_loss=0.2383, pruned_loss=0.07777, over 12204.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.231, pruned_loss=0.06357, over 2564729.35 frames. ], batch size: 240, lr: 2.52e-03, grad_scale: 64.0 2024-06-21 16:52:43,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=26.51 vs. limit=22.5 2024-06-21 16:52:43,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=418159.5, ans=0.125 2024-06-21 16:52:50,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=418177.8333333333, ans=0.125 2024-06-21 16:52:52,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=418177.8333333333, ans=0.125 2024-06-21 16:53:04,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.04 vs. limit=22.5 2024-06-21 16:53:06,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=418214.5, ans=0.0 2024-06-21 16:53:08,054 INFO [train.py:1028] (0/2) Epoch 23, batch 5550, loss[loss=0.1744, simple_loss=0.2288, pruned_loss=0.06001, over 13203.00 frames. ], tot_loss[loss=0.179, simple_loss=0.231, pruned_loss=0.0635, over 2568043.56 frames. ], batch size: 43, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:53:19,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=418251.1666666667, ans=0.0 2024-06-21 16:53:30,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.21 vs. 
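Annotation: the checkpoint.py:75 entry above saves zipformer/exp/checkpoint-228000.pt mid-epoch, named by the global batch index. A hedged sketch of that batch-interval policy follows; the helper, its argument names, and N=4000 (one divisor consistent with 228000) are assumptions, not read from the code.

```python
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, scheduler,
                          batch_idx_train: int, exp_dir: Path,
                          save_every_n: int = 4000) -> None:
    """Sketch: every save_every_n global batches, dump a resumable
    snapshot named checkpoint-<batch_idx>.pt, as in the log above."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    ckpt = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    torch.save(ckpt, exp_dir / f"checkpoint-{batch_idx_train}.pt")

# 228000 % 4000 == 0, consistent with checkpoint-228000.pt being written
# between the batch 5400 and batch 5450 log entries.
```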
limit=22.5 2024-06-21 16:53:34,094 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.732e+02 2.032e+02 2.154e+02 2.273e+02 2.743e+02, threshold=4.309e+02, percent-clipped=0.0 2024-06-21 16:53:43,691 INFO [train.py:1028] (0/2) Epoch 23, batch 5600, loss[loss=0.17, simple_loss=0.2167, pruned_loss=0.06162, over 13227.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2307, pruned_loss=0.06332, over 2570384.42 frames. ], batch size: 89, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:53:51,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=418342.8333333333, ans=0.0 2024-06-21 16:53:57,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=418361.1666666667, ans=0.2 2024-06-21 16:53:58,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=418361.1666666667, ans=0.125 2024-06-21 16:54:04,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=418379.5, ans=0.125 2024-06-21 16:54:07,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=8.0 2024-06-21 16:54:10,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=418397.8333333333, ans=0.125 2024-06-21 16:54:11,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418397.8333333333, ans=0.1 2024-06-21 16:54:11,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=418397.8333333333, ans=0.025 2024-06-21 16:54:12,248 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.34 vs. limit=15.0 2024-06-21 16:54:13,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=418397.8333333333, ans=0.09899494936611666 2024-06-21 16:54:14,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=418397.8333333333, ans=0.0 2024-06-21 16:54:16,189 INFO [train.py:1028] (0/2) Epoch 23, batch 5650, loss[loss=0.1922, simple_loss=0.2414, pruned_loss=0.07151, over 12505.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2309, pruned_loss=0.06323, over 2576115.07 frames. 
], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:54:16,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=418416.1666666667, ans=0.025 2024-06-21 16:54:29,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=418434.5, ans=0.0 2024-06-21 16:54:32,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=418452.8333333333, ans=15.0 2024-06-21 16:54:46,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=418489.5, ans=0.07 2024-06-21 16:54:46,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.049e+02 2.187e+02 2.377e+02 3.305e+02, threshold=4.373e+02, percent-clipped=0.0 2024-06-21 16:54:53,290 INFO [train.py:1028] (0/2) Epoch 23, batch 5700, loss[loss=0.1667, simple_loss=0.2237, pruned_loss=0.05485, over 13252.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2308, pruned_loss=0.06324, over 2580073.65 frames. ], batch size: 63, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:54:57,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=418507.8333333333, ans=0.125 2024-06-21 16:55:03,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=418526.1666666667, ans=0.1 2024-06-21 16:55:03,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=418526.1666666667, ans=0.0 2024-06-21 16:55:04,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=418526.1666666667, ans=0.125 2024-06-21 16:55:12,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=418562.8333333333, ans=0.2 2024-06-21 16:55:26,558 INFO [train.py:1028] (0/2) Epoch 23, batch 5750, loss[loss=0.2058, simple_loss=0.2496, pruned_loss=0.08097, over 12793.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2313, pruned_loss=0.0634, over 2580084.33 frames. ], batch size: 176, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:55:28,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=418599.5, ans=0.025 2024-06-21 16:55:31,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=418599.5, ans=0.0 2024-06-21 16:55:34,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=418617.8333333333, ans=0.0 2024-06-21 16:55:56,671 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.052e+02 2.216e+02 2.399e+02 3.058e+02, threshold=4.431e+02, percent-clipped=0.0 2024-06-21 16:55:56,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418672.8333333333, ans=0.1 2024-06-21 16:56:02,971 INFO [train.py:1028] (0/2) Epoch 23, batch 5800, loss[loss=0.1797, simple_loss=0.2315, pruned_loss=0.06397, over 12757.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2326, pruned_loss=0.0642, over 2579206.35 frames. 
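Annotation: each train.py:1028 line pairs the current batch's loss ("over 13252.00 frames") with a pooled tot_loss ("over ~2.58M frames"). The sketch below assumes tot_loss is a frame-weighted aggregate of recent batches with a mild decay; the exact pooling in train.py may differ.

```python
class FrameWeightedLoss:
    """Sketch: accumulate (loss * frames, frames) with decay so that
    tot_loss is the frame-weighted mean the log prints 'over N frames'."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay          # 1.0 would give a plain running mean
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + loss * num_frames
        self.frames = self.frames * self.decay + num_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = FrameWeightedLoss()
tracker.update(loss=0.1667, num_frames=13252.0)  # batch 5700 above
print(tracker.tot_loss)  # 0.1667 after a single batch
```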
], batch size: 176, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:56:06,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=418691.1666666667, ans=0.0 2024-06-21 16:56:08,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=418691.1666666667, ans=0.125 2024-06-21 16:56:12,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.19 vs. limit=22.5 2024-06-21 16:56:16,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=418727.8333333333, ans=0.025 2024-06-21 16:56:31,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=418746.1666666667, ans=0.0 2024-06-21 16:56:33,933 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:56:39,878 INFO [train.py:1028] (0/2) Epoch 23, batch 5850, loss[loss=0.1982, simple_loss=0.248, pruned_loss=0.07414, over 12560.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2347, pruned_loss=0.06516, over 2576903.52 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:56:50,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=418801.1666666667, ans=0.125 2024-06-21 16:56:57,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=418819.5, ans=10.0 2024-06-21 16:57:02,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=418837.8333333333, ans=0.0 2024-06-21 16:57:07,053 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.171e+02 2.301e+02 2.500e+02 3.619e+02, threshold=4.603e+02, percent-clipped=0.0 2024-06-21 16:57:13,699 INFO [train.py:1028] (0/2) Epoch 23, batch 5900, loss[loss=0.1708, simple_loss=0.2202, pruned_loss=0.06065, over 13165.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2358, pruned_loss=0.06531, over 2576613.96 frames. ], batch size: 121, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:57:14,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=418874.5, ans=0.125 2024-06-21 16:57:15,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.64 vs. limit=22.5 2024-06-21 16:57:16,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.63 vs. limit=10.0 2024-06-21 16:57:52,438 INFO [train.py:1028] (0/2) Epoch 23, batch 5950, loss[loss=0.1794, simple_loss=0.2265, pruned_loss=0.06613, over 13081.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2371, pruned_loss=0.06568, over 2580809.39 frames. 
], batch size: 121, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:58:01,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=418984.5, ans=0.0 2024-06-21 16:58:19,032 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.113e+02 2.280e+02 2.544e+02 4.069e+02, threshold=4.560e+02, percent-clipped=0.0 2024-06-21 16:58:21,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=419039.5, ans=0.125 2024-06-21 16:58:24,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=419057.8333333333, ans=0.125 2024-06-21 16:58:25,498 INFO [train.py:1028] (0/2) Epoch 23, batch 6000, loss[loss=0.2104, simple_loss=0.2535, pruned_loss=0.08366, over 12152.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2382, pruned_loss=0.06611, over 2574139.10 frames. ], batch size: 241, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:58:25,499 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 16:58:37,922 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([4.2763, 3.3929, 3.6698, 3.3876, 2.7681, 3.3930, 3.7776, 3.5759], device='cuda:0') 2024-06-21 16:58:38,710 INFO [train.py:1060] (0/2) Epoch 23, validation: loss=0.1878, simple_loss=0.2508, pruned_loss=0.06241, over 351949.00 frames. 2024-06-21 16:58:38,710 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 16:58:40,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419057.8333333333, ans=0.125 2024-06-21 16:58:55,218 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:59:03,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.10 vs. limit=10.0 2024-06-21 16:59:12,938 INFO [train.py:1028] (0/2) Epoch 23, batch 6050, loss[loss=0.1786, simple_loss=0.2328, pruned_loss=0.06223, over 12963.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2393, pruned_loss=0.06629, over 2576491.48 frames. ], batch size: 39, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 16:59:19,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=419167.8333333333, ans=0.0 2024-06-21 16:59:33,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=419204.5, ans=0.5 2024-06-21 16:59:35,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=419204.5, ans=0.0 2024-06-21 16:59:39,469 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.190e+02 2.300e+02 2.576e+02 3.315e+02, threshold=4.601e+02, percent-clipped=0.0 2024-06-21 16:59:45,270 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 16:59:46,437 INFO [train.py:1028] (0/2) Epoch 23, batch 6100, loss[loss=0.1934, simple_loss=0.2398, pruned_loss=0.07351, over 13083.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2403, pruned_loss=0.06641, over 2578925.85 frames. 
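Annotation: at batch 6000 the run pauses training ("Computing validation loss"), prints one attention-weights-entropy diagnostic tensor from zipformer.py, then reports "validation: loss=0.1878 ... over 351949.00 frames" and the peak memory. A sketch of that evaluation pass; the batch keys and the model's return signature are placeholders, not icefall's actual API.

```python
import torch

def compute_validation_loss(model, valid_loader, device) -> float:
    """Sketch of the dev-set pass: eval mode, no gradients, and a
    frame-weighted loss over the whole dev dataloader."""
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            feats = batch["inputs"].to(device)          # assumed key
            feat_lens = batch["input_lens"].to(device)  # assumed key
            loss, num_frames = model(feats, feat_lens)  # placeholder API
            loss_sum += loss.item() * num_frames
            frames += num_frames
    model.train()
    return loss_sum / frames
```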
], batch size: 121, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:00:01,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=419259.5, ans=0.125 2024-06-21 17:00:07,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=419277.8333333333, ans=0.2 2024-06-21 17:00:11,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419296.1666666667, ans=0.1 2024-06-21 17:00:15,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=419296.1666666667, ans=0.04949747468305833 2024-06-21 17:00:17,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=419314.5, ans=0.2 2024-06-21 17:00:25,328 INFO [train.py:1028] (0/2) Epoch 23, batch 6150, loss[loss=0.1949, simple_loss=0.2426, pruned_loss=0.07367, over 10809.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2423, pruned_loss=0.06717, over 2576659.19 frames. ], batch size: 303, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:00:25,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419332.8333333333, ans=0.125 2024-06-21 17:00:27,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419332.8333333333, ans=0.1 2024-06-21 17:00:28,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2024-06-21 17:00:47,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=419369.5, ans=0.125 2024-06-21 17:00:51,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=419387.8333333333, ans=0.0 2024-06-21 17:00:52,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=12.0 2024-06-21 17:00:53,630 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0 2024-06-21 17:00:57,466 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.168e+02 2.448e+02 2.867e+02 4.396e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-21 17:00:57,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2024-06-21 17:01:04,709 INFO [train.py:1028] (0/2) Epoch 23, batch 6200, loss[loss=0.1944, simple_loss=0.2551, pruned_loss=0.06685, over 13236.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2439, pruned_loss=0.06764, over 2574804.72 frames. 
], batch size: 89, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:01:13,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=419442.8333333333, ans=0.0 2024-06-21 17:01:13,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=419442.8333333333, ans=0.2 2024-06-21 17:01:16,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=419442.8333333333, ans=0.0 2024-06-21 17:01:20,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=419461.1666666667, ans=0.125 2024-06-21 17:01:26,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-21 17:01:31,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.10 vs. limit=15.0 2024-06-21 17:01:32,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=419497.8333333333, ans=0.125 2024-06-21 17:01:39,293 INFO [train.py:1028] (0/2) Epoch 23, batch 6250, loss[loss=0.1905, simple_loss=0.2407, pruned_loss=0.07011, over 13226.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2445, pruned_loss=0.06797, over 2567833.84 frames. ], batch size: 83, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:01:45,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419534.5, ans=0.1 2024-06-21 17:01:46,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=419534.5, ans=0.125 2024-06-21 17:01:48,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=419534.5, ans=0.0 2024-06-21 17:01:57,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=419552.8333333333, ans=0.2 2024-06-21 17:02:01,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=419571.1666666667, ans=0.125 2024-06-21 17:02:01,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=419571.1666666667, ans=0.0 2024-06-21 17:02:02,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=419571.1666666667, ans=0.125 2024-06-21 17:02:04,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=419571.1666666667, ans=0.2 2024-06-21 17:02:08,636 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.127e+02 2.363e+02 2.536e+02 3.890e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-21 17:02:14,936 INFO [train.py:1028] (0/2) Epoch 23, batch 6300, loss[loss=0.1839, simple_loss=0.2484, pruned_loss=0.0597, over 11625.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2452, pruned_loss=0.06818, over 2562487.82 frames. 
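Annotation: the three numbers in each train.py:1028 loss report are internally consistent with loss ~= 0.5 * simple_loss + pruned_loss. A worked check below; the 0.5 weight is inferred from these very numbers, so treat it as read off the log rather than documented.

```python
def transducer_loss(simple_loss: float, pruned_loss: float,
                    simple_loss_scale: float = 0.5) -> float:
    """Inferred combination once the pruned branch is fully warmed up:
    loss = 0.5 * simple + 1.0 * pruned (weights read off this log)."""
    return simple_loss_scale * simple_loss + pruned_loss

# Batch 6250 above: 0.5 * 0.2407 + 0.07011 = 0.19046 ~= logged 0.1905
print(transducer_loss(0.2407, 0.07011))
# Pooled stats:     0.5 * 0.2445 + 0.06797 = 0.19022 ~= logged 0.1902
print(transducer_loss(0.2445, 0.06797))
```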
], batch size: 17, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:02:15,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.19 vs. limit=22.5 2024-06-21 17:02:15,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419607.8333333333, ans=0.1 2024-06-21 17:02:34,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=419662.8333333333, ans=0.0 2024-06-21 17:02:38,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=419662.8333333333, ans=0.125 2024-06-21 17:02:51,132 INFO [train.py:1028] (0/2) Epoch 23, batch 6350, loss[loss=0.2719, simple_loss=0.305, pruned_loss=0.1194, over 12602.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2471, pruned_loss=0.06878, over 2571847.07 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:03:03,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.39 vs. limit=22.5 2024-06-21 17:03:17,236 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.244e+02 2.400e+02 2.607e+02 3.431e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-21 17:03:23,883 INFO [train.py:1028] (0/2) Epoch 23, batch 6400, loss[loss=0.1597, simple_loss=0.2191, pruned_loss=0.05018, over 13244.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2493, pruned_loss=0.06961, over 2574407.94 frames. ], batch size: 67, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:03:31,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=419809.5, ans=0.125 2024-06-21 17:03:36,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=419809.5, ans=0.2 2024-06-21 17:03:38,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.59 vs. limit=15.0 2024-06-21 17:03:41,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.96 vs. limit=10.0 2024-06-21 17:03:45,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=419846.1666666667, ans=0.1 2024-06-21 17:03:57,116 INFO [train.py:1028] (0/2) Epoch 23, batch 6450, loss[loss=0.2325, simple_loss=0.2808, pruned_loss=0.09213, over 12530.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.251, pruned_loss=0.0704, over 2580724.63 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 64.0 2024-06-21 17:04:15,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.45 vs. limit=15.0 2024-06-21 17:04:28,403 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.221e+02 2.366e+02 2.529e+02 3.614e+02, threshold=4.733e+02, percent-clipped=0.0 2024-06-21 17:04:34,827 INFO [train.py:1028] (0/2) Epoch 23, batch 6500, loss[loss=0.1847, simple_loss=0.2306, pruned_loss=0.06944, over 10587.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2525, pruned_loss=0.07058, over 2583048.04 frames. 
], batch size: 304, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:04:36,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=15.0 2024-06-21 17:04:38,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.01 vs. limit=5.0 2024-06-21 17:04:40,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=419974.5, ans=0.0 2024-06-21 17:04:41,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=419992.8333333333, ans=0.125 2024-06-21 17:04:42,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=419992.8333333333, ans=0.04949747468305833 2024-06-21 17:04:49,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=420011.1666666667, ans=0.125 2024-06-21 17:05:05,477 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:05:09,706 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.88 vs. limit=15.0 2024-06-21 17:05:11,448 INFO [train.py:1028] (0/2) Epoch 23, batch 6550, loss[loss=0.1783, simple_loss=0.2424, pruned_loss=0.05712, over 12566.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2532, pruned_loss=0.07052, over 2587497.61 frames. ], batch size: 22, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:05:13,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.80 vs. limit=15.0 2024-06-21 17:05:18,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.33 vs. limit=15.0 2024-06-21 17:05:21,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=420084.5, ans=0.125 2024-06-21 17:05:30,100 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:05:38,178 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.203e+02 2.341e+02 2.506e+02 3.544e+02, threshold=4.683e+02, percent-clipped=0.0 2024-06-21 17:05:43,743 INFO [train.py:1028] (0/2) Epoch 23, batch 6600, loss[loss=0.1892, simple_loss=0.2552, pruned_loss=0.06157, over 13277.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2536, pruned_loss=0.07082, over 2589352.62 frames. ], batch size: 72, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:05:46,930 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.34 vs. 
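Annotation: between batch 6450 and batch 6500 the logged grad_scale halves from 64.0 to 32.0, the signature of dynamic fp16 loss scaling backing off after a step overflowed. A sketch using PyTorch's stock GradScaler semantics (whether this run uses that exact class here is an assumption; the snippet needs a CUDA device).

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=64.0, growth_interval=10000)
model = torch.nn.Linear(10, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, device="cuda")
with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()   # gradients carry the 64.0 factor
scaler.step(opt)                # unscales; skips the step on inf/nan
scaler.update()                 # halves the scale if this step overflowed,
                                # i.e. 64.0 -> 32.0 as seen in the log
print(scaler.get_scale())
```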
limit=6.0 2024-06-21 17:05:53,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=420176.1666666667, ans=0.125 2024-06-21 17:05:55,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=420176.1666666667, ans=0.1 2024-06-21 17:06:01,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=420194.5, ans=0.0 2024-06-21 17:06:03,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.78 vs. limit=22.5 2024-06-21 17:06:09,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=420212.8333333333, ans=0.125 2024-06-21 17:06:14,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=420231.1666666667, ans=0.2 2024-06-21 17:06:14,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=420231.1666666667, ans=0.125 2024-06-21 17:06:20,114 INFO [train.py:1028] (0/2) Epoch 23, batch 6650, loss[loss=0.2136, simple_loss=0.2684, pruned_loss=0.07941, over 12969.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2554, pruned_loss=0.07147, over 2583111.77 frames. ], batch size: 158, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:06:24,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=420249.5, ans=0.025 2024-06-21 17:06:25,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=420249.5, ans=0.2 2024-06-21 17:06:29,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.85 vs. limit=12.0 2024-06-21 17:06:33,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.01 vs. limit=15.0 2024-06-21 17:06:33,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=420286.1666666667, ans=0.125 2024-06-21 17:06:40,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=420304.5, ans=0.1 2024-06-21 17:06:40,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=420304.5, ans=0.0 2024-06-21 17:06:47,494 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.193e+02 2.403e+02 2.706e+02 4.064e+02, threshold=4.806e+02, percent-clipped=0.0 2024-06-21 17:06:48,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=420322.8333333333, ans=0.0 2024-06-21 17:06:51,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=420322.8333333333, ans=0.2 2024-06-21 17:06:53,420 INFO [train.py:1028] (0/2) Epoch 23, batch 6700, loss[loss=0.2213, simple_loss=0.2654, pruned_loss=0.08864, over 12793.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2567, pruned_loss=0.07217, over 2582710.85 frames. 
], batch size: 177, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:07:07,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=420359.5, ans=0.025 2024-06-21 17:07:10,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=420377.8333333333, ans=0.125 2024-06-21 17:07:17,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=420396.1666666667, ans=0.0 2024-06-21 17:07:17,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=420396.1666666667, ans=0.125 2024-06-21 17:07:21,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=420396.1666666667, ans=0.95 2024-06-21 17:07:24,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=420414.5, ans=0.0 2024-06-21 17:07:25,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=420414.5, ans=0.0 2024-06-21 17:07:28,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2024-06-21 17:07:30,151 INFO [train.py:1028] (0/2) Epoch 23, batch 6750, loss[loss=0.2438, simple_loss=0.2931, pruned_loss=0.0972, over 12124.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2581, pruned_loss=0.07287, over 2576723.54 frames. ], batch size: 240, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:07:31,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2024-06-21 17:07:45,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=420469.5, ans=0.125 2024-06-21 17:07:52,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=420487.8333333333, ans=12.0 2024-06-21 17:07:57,233 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.308e+02 2.490e+02 2.770e+02 3.712e+02, threshold=4.981e+02, percent-clipped=0.0 2024-06-21 17:08:03,108 INFO [train.py:1028] (0/2) Epoch 23, batch 6800, loss[loss=0.1997, simple_loss=0.2526, pruned_loss=0.07344, over 13268.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2589, pruned_loss=0.0731, over 2578929.53 frames. 
], batch size: 67, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:08:10,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=420524.5, ans=0.2 2024-06-21 17:08:12,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=420542.8333333333, ans=0.125 2024-06-21 17:08:12,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=420542.8333333333, ans=0.0 2024-06-21 17:08:17,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=420542.8333333333, ans=0.125 2024-06-21 17:08:22,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=420561.1666666667, ans=0.0 2024-06-21 17:08:33,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=420597.8333333333, ans=0.125 2024-06-21 17:08:36,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=420597.8333333333, ans=0.125 2024-06-21 17:08:36,565 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=15.0 2024-06-21 17:08:36,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=420597.8333333333, ans=0.5 2024-06-21 17:08:37,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.09 vs. limit=22.5 2024-06-21 17:08:37,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=420597.8333333333, ans=0.125 2024-06-21 17:08:39,270 INFO [train.py:1028] (0/2) Epoch 23, batch 6850, loss[loss=0.2083, simple_loss=0.2706, pruned_loss=0.07301, over 13267.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2592, pruned_loss=0.07279, over 2583771.99 frames. ], batch size: 63, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:08:41,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=420616.1666666667, ans=0.025 2024-06-21 17:08:48,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=420634.5, ans=0.125 2024-06-21 17:08:49,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2024-06-21 17:08:51,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.03 vs. 
limit=10.0 2024-06-21 17:08:51,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=420634.5, ans=0.0 2024-06-21 17:08:57,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=420652.8333333333, ans=0.125 2024-06-21 17:09:07,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=420671.1666666667, ans=0.0 2024-06-21 17:09:09,822 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.215e+02 2.460e+02 2.801e+02 3.938e+02, threshold=4.919e+02, percent-clipped=0.0 2024-06-21 17:09:15,972 INFO [train.py:1028] (0/2) Epoch 23, batch 6900, loss[loss=0.1881, simple_loss=0.249, pruned_loss=0.06356, over 13265.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2593, pruned_loss=0.07276, over 2585480.71 frames. ], batch size: 49, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:09:18,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=420707.8333333333, ans=0.125 2024-06-21 17:09:25,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=420726.1666666667, ans=0.0 2024-06-21 17:09:31,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.95 vs. limit=22.5 2024-06-21 17:09:34,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=420762.8333333333, ans=0.125 2024-06-21 17:09:36,817 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.68 vs. limit=15.0 2024-06-21 17:09:45,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=420781.1666666667, ans=0.2 2024-06-21 17:09:47,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=420781.1666666667, ans=0.125 2024-06-21 17:09:49,129 INFO [train.py:1028] (0/2) Epoch 23, batch 6950, loss[loss=0.2108, simple_loss=0.2691, pruned_loss=0.0763, over 11150.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2596, pruned_loss=0.07281, over 2579103.86 frames. ], batch size: 16, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:10:02,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=420836.1666666667, ans=0.025 2024-06-21 17:10:05,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=420836.1666666667, ans=0.125 2024-06-21 17:10:15,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=420854.5, ans=0.0 2024-06-21 17:10:16,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=420854.5, ans=0.07 2024-06-21 17:10:19,611 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.171e+02 2.391e+02 2.564e+02 3.307e+02, threshold=4.782e+02, percent-clipped=0.0 2024-06-21 17:10:25,344 INFO [train.py:1028] (0/2) Epoch 23, batch 7000, loss[loss=0.1988, simple_loss=0.2589, pruned_loss=0.06932, over 12913.00 frames. 
], tot_loss[loss=0.2022, simple_loss=0.2595, pruned_loss=0.07247, over 2576264.45 frames. ], batch size: 158, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:10:26,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=420891.1666666667, ans=0.0 2024-06-21 17:10:30,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420891.1666666667, ans=0.1 2024-06-21 17:10:33,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=420909.5, ans=0.0 2024-06-21 17:10:47,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=420946.1666666667, ans=0.07 2024-06-21 17:11:00,145 INFO [train.py:1028] (0/2) Epoch 23, batch 7050, loss[loss=0.2196, simple_loss=0.2707, pruned_loss=0.08429, over 12751.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2609, pruned_loss=0.07307, over 2582537.03 frames. ], batch size: 176, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:11:02,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=420982.8333333333, ans=0.125 2024-06-21 17:11:04,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420982.8333333333, ans=0.125 2024-06-21 17:11:13,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=421001.1666666667, ans=0.0 2024-06-21 17:11:21,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=421019.5, ans=0.0 2024-06-21 17:11:22,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=421037.8333333333, ans=0.0 2024-06-21 17:11:25,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=421037.8333333333, ans=0.0 2024-06-21 17:11:29,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=421056.1666666667, ans=0.125 2024-06-21 17:11:29,655 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.236e+02 2.440e+02 2.712e+02 3.589e+02, threshold=4.880e+02, percent-clipped=0.0 2024-06-21 17:11:30,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=421056.1666666667, ans=0.2 2024-06-21 17:11:31,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=421056.1666666667, ans=0.0 2024-06-21 17:11:35,230 INFO [train.py:1028] (0/2) Epoch 23, batch 7100, loss[loss=0.2101, simple_loss=0.2766, pruned_loss=0.07183, over 13144.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2618, pruned_loss=0.07362, over 2575520.06 frames. 
], batch size: 112, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:11:36,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=421074.5, ans=0.04949747468305833 2024-06-21 17:11:49,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=421111.1666666667, ans=12.0 2024-06-21 17:11:51,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=421111.1666666667, ans=0.0 2024-06-21 17:11:57,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.92 vs. limit=22.5 2024-06-21 17:12:00,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=421129.5, ans=0.0 2024-06-21 17:12:01,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=421147.8333333333, ans=0.125 2024-06-21 17:12:04,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=421147.8333333333, ans=0.0 2024-06-21 17:12:08,887 INFO [train.py:1028] (0/2) Epoch 23, batch 7150, loss[loss=0.2318, simple_loss=0.2816, pruned_loss=0.09101, over 12548.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2623, pruned_loss=0.07359, over 2573580.07 frames. ], batch size: 202, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:12:11,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=421166.1666666667, ans=0.0 2024-06-21 17:12:19,280 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.18 vs. limit=15.0 2024-06-21 17:12:29,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=421202.8333333333, ans=0.0 2024-06-21 17:12:33,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=421221.1666666667, ans=0.2 2024-06-21 17:12:34,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=421221.1666666667, ans=0.125 2024-06-21 17:12:35,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421221.1666666667, ans=0.1 2024-06-21 17:12:39,415 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.241e+02 2.374e+02 2.639e+02 4.131e+02, threshold=4.749e+02, percent-clipped=0.0 2024-06-21 17:12:45,139 INFO [train.py:1028] (0/2) Epoch 23, batch 7200, loss[loss=0.2217, simple_loss=0.2793, pruned_loss=0.08203, over 13186.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2638, pruned_loss=0.0739, over 2579012.21 frames. ], batch size: 112, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:12:57,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2024-06-21 17:13:00,237 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. 
limit=6.0 2024-06-21 17:13:06,484 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:13:21,857 INFO [train.py:1028] (0/2) Epoch 23, batch 7250, loss[loss=0.2224, simple_loss=0.2877, pruned_loss=0.07848, over 13011.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2645, pruned_loss=0.07389, over 2578672.48 frames. ], batch size: 36, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:13:25,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421349.5, ans=0.0 2024-06-21 17:13:31,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=15.0 2024-06-21 17:13:32,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=421367.8333333333, ans=0.1 2024-06-21 17:13:33,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=421367.8333333333, ans=0.09899494936611666 2024-06-21 17:13:36,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=421386.1666666667, ans=0.09899494936611666 2024-06-21 17:13:40,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=421386.1666666667, ans=0.125 2024-06-21 17:13:43,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=421404.5, ans=0.125 2024-06-21 17:13:48,969 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.888e+02 2.269e+02 2.470e+02 2.809e+02 3.868e+02, threshold=4.940e+02, percent-clipped=0.0 2024-06-21 17:13:49,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421422.8333333333, ans=0.1 2024-06-21 17:13:54,448 INFO [train.py:1028] (0/2) Epoch 23, batch 7300, loss[loss=0.2265, simple_loss=0.2835, pruned_loss=0.08472, over 12926.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2665, pruned_loss=0.07456, over 2578815.28 frames. ], batch size: 36, lr: 2.51e-03, grad_scale: 32.0 2024-06-21 17:13:57,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=421441.1666666667, ans=0.0 2024-06-21 17:13:57,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=421441.1666666667, ans=0.125 2024-06-21 17:13:57,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=421441.1666666667, ans=0.125 2024-06-21 17:14:06,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421459.5, ans=0.125 2024-06-21 17:14:10,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.61 vs. limit=15.0 2024-06-21 17:14:30,773 INFO [train.py:1028] (0/2) Epoch 23, batch 7350, loss[loss=0.2128, simple_loss=0.2749, pruned_loss=0.07542, over 13371.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2661, pruned_loss=0.07447, over 2580607.58 frames. 
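Annotation: the scaling.py:1119 WithLoss lines report a per-module auxiliary loss sum (0.000e+00 here for the self_attn_weights modules, i.e. the penalty evaluated to zero). One way to attach such a penalty without changing a module's output is an autograd function that passes activations through and routes a unit gradient into the penalty; this is a sketch of the idea, not scaling.py's actual code.

```python
import torch

class AttachLoss(torch.autograd.Function):
    """Sketch: forward returns x unchanged; backward feeds a gradient
    of 1 into `aux`, as if `aux` had been added to the training loss."""

    @staticmethod
    def forward(ctx, x, aux):
        ctx.aux_shape = aux.shape
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad, torch.ones(ctx.aux_shape, dtype=grad.dtype,
                                device=grad.device)

def with_aux_loss(attn_weights, aux_loss, tracker: dict):
    # Accumulate the value for periodic logging, like loss-sum above.
    tracker["loss_sum"] = tracker.get("loss_sum", 0.0) + aux_loss.item()
    return AttachLoss.apply(attn_weights, aux_loss)
```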
], batch size: 46, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:14:42,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=421551.1666666667, ans=0.2 2024-06-21 17:14:43,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=421569.5, ans=0.125 2024-06-21 17:14:44,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=421569.5, ans=0.0 2024-06-21 17:14:52,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=421587.8333333333, ans=0.0 2024-06-21 17:14:57,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=421606.1666666667, ans=0.0 2024-06-21 17:14:58,012 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.271e+02 2.454e+02 2.689e+02 3.742e+02, threshold=4.907e+02, percent-clipped=0.0 2024-06-21 17:15:03,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=421624.5, ans=0.1 2024-06-21 17:15:03,917 INFO [train.py:1028] (0/2) Epoch 23, batch 7400, loss[loss=0.2322, simple_loss=0.3068, pruned_loss=0.07876, over 13281.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2659, pruned_loss=0.07417, over 2586378.43 frames. ], batch size: 63, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:15:30,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=421679.5, ans=0.125 2024-06-21 17:15:31,477 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:15:37,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2024-06-21 17:15:41,626 INFO [train.py:1028] (0/2) Epoch 23, batch 7450, loss[loss=0.1812, simple_loss=0.2435, pruned_loss=0.05948, over 12561.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2659, pruned_loss=0.07413, over 2581087.13 frames. ], batch size: 29, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:15:50,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=421734.5, ans=0.0 2024-06-21 17:15:52,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2024-06-21 17:15:56,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=421752.8333333333, ans=0.125 2024-06-21 17:15:59,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=421752.8333333333, ans=0.0 2024-06-21 17:16:05,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=421771.1666666667, ans=0.0 2024-06-21 17:16:08,067 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. 
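Annotation: the learning rate ticks down from 2.51e-03 to 2.50e-03 on this line (it was 2.52e-03 at the start of the section), so the schedule decays smoothly in both batch index and epoch at this depth of training. A sketch of an Eden-style rule that produces this kind of slow power-law decay; the constants are illustrative assumptions, not read from this run.

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Sketch of an Eden-style schedule: smooth power-law decay in both
    batch index and epoch. All constants here are illustrative."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Deep into epoch 23 the decay is nearly flat batch-to-batch, which is
# why the logged lr creeps from 2.52e-03 down to 2.50e-03 over ~4000
# batches rather than stepping.
print(eden_lr(0.035, batch=420000, epoch=23))
```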
limit=10.0 2024-06-21 17:16:09,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=421789.5, ans=0.0 2024-06-21 17:16:09,612 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.263e+02 2.400e+02 2.749e+02 3.809e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-21 17:16:12,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421789.5, ans=0.0 2024-06-21 17:16:14,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.67 vs. limit=15.0 2024-06-21 17:16:15,287 INFO [train.py:1028] (0/2) Epoch 23, batch 7500, loss[loss=0.2352, simple_loss=0.2773, pruned_loss=0.09658, over 10358.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2668, pruned_loss=0.07474, over 2578519.84 frames. ], batch size: 304, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:16:16,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=421807.8333333333, ans=0.0 2024-06-21 17:16:17,084 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.46 vs. limit=15.0 2024-06-21 17:16:33,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=421844.5, ans=0.025 2024-06-21 17:16:40,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=421862.8333333333, ans=0.0 2024-06-21 17:16:41,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=421862.8333333333, ans=0.125 2024-06-21 17:16:47,788 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.92 vs. limit=6.0 2024-06-21 17:16:51,343 INFO [train.py:1028] (0/2) Epoch 23, batch 7550, loss[loss=0.2017, simple_loss=0.2576, pruned_loss=0.0729, over 13003.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2676, pruned_loss=0.07562, over 2577710.02 frames. ], batch size: 158, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:17:05,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=12.0 2024-06-21 17:17:14,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=421954.5, ans=0.1 2024-06-21 17:17:16,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2024-06-21 17:17:18,463 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.284e+02 2.397e+02 2.675e+02 3.911e+02, threshold=4.793e+02, percent-clipped=0.0 2024-06-21 17:17:21,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=421972.8333333333, ans=0.125 2024-06-21 17:17:28,247 INFO [train.py:1028] (0/2) Epoch 23, batch 7600, loss[loss=0.2096, simple_loss=0.2746, pruned_loss=0.07237, over 13166.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2679, pruned_loss=0.07576, over 2578492.44 frames. 
], batch size: 83, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:17:29,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=421991.1666666667, ans=0.0 2024-06-21 17:17:32,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=421991.1666666667, ans=0.125 2024-06-21 17:17:36,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=422009.5, ans=0.1 2024-06-21 17:17:42,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=15.0 2024-06-21 17:17:51,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=422046.1666666667, ans=0.0 2024-06-21 17:17:52,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=422046.1666666667, ans=0.1 2024-06-21 17:17:53,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=15.0 2024-06-21 17:18:02,113 INFO [train.py:1028] (0/2) Epoch 23, batch 7650, loss[loss=0.2181, simple_loss=0.2761, pruned_loss=0.08004, over 12879.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2689, pruned_loss=0.07618, over 2572713.15 frames. ], batch size: 33, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:18:03,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.18 vs. limit=10.0 2024-06-21 17:18:03,427 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.49 vs. 
limit=15.0 2024-06-21 17:18:03,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=422082.8333333333, ans=0.125 2024-06-21 17:18:18,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=422119.5, ans=0.0 2024-06-21 17:18:27,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=422137.8333333333, ans=0.125 2024-06-21 17:18:28,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422137.8333333333, ans=0.125 2024-06-21 17:18:31,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422137.8333333333, ans=0.125 2024-06-21 17:18:32,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=422156.1666666667, ans=0.125 2024-06-21 17:18:33,805 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.270e+02 2.433e+02 2.654e+02 3.335e+02, threshold=4.867e+02, percent-clipped=0.0 2024-06-21 17:18:34,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=422156.1666666667, ans=0.125 2024-06-21 17:18:35,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=422156.1666666667, ans=0.2 2024-06-21 17:18:39,964 INFO [train.py:1028] (0/2) Epoch 23, batch 7700, loss[loss=0.202, simple_loss=0.2721, pruned_loss=0.06596, over 13254.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2691, pruned_loss=0.07604, over 2569430.45 frames. ], batch size: 63, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:18:45,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=422192.8333333333, ans=0.125 2024-06-21 17:18:49,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=422192.8333333333, ans=0.125 2024-06-21 17:18:58,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=422229.5, ans=0.125 2024-06-21 17:19:05,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=422229.5, ans=15.0 2024-06-21 17:19:05,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2024-06-21 17:19:05,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422247.8333333333, ans=0.125 2024-06-21 17:19:09,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2024-06-21 17:19:12,664 INFO [train.py:1028] (0/2) Epoch 23, batch 7750, loss[loss=0.2027, simple_loss=0.273, pruned_loss=0.06619, over 13078.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2696, pruned_loss=0.07659, over 2573560.70 frames. 
], batch size: 71, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:19:31,158 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2024-06-21 17:19:42,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=422339.5, ans=0.2 2024-06-21 17:19:43,202 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.297e+02 2.433e+02 2.676e+02 3.659e+02, threshold=4.866e+02, percent-clipped=0.0 2024-06-21 17:19:49,469 INFO [train.py:1028] (0/2) Epoch 23, batch 7800, loss[loss=0.2082, simple_loss=0.2643, pruned_loss=0.076, over 13234.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2704, pruned_loss=0.07666, over 2578447.01 frames. ], batch size: 95, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:20:05,403 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=8.0 2024-06-21 17:20:06,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=422394.5, ans=0.07 2024-06-21 17:20:10,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.98 vs. limit=10.0 2024-06-21 17:20:12,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=422412.8333333333, ans=0.125 2024-06-21 17:20:20,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=422431.1666666667, ans=0.0 2024-06-21 17:20:23,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=422431.1666666667, ans=0.07 2024-06-21 17:20:25,834 INFO [train.py:1028] (0/2) Epoch 23, batch 7850, loss[loss=0.1906, simple_loss=0.2514, pruned_loss=0.06488, over 11143.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2709, pruned_loss=0.07673, over 2572326.85 frames. ], batch size: 16, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:20:28,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.53 vs. limit=10.0 2024-06-21 17:20:32,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=422467.8333333333, ans=0.025 2024-06-21 17:20:42,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.60 vs. limit=15.0 2024-06-21 17:20:52,644 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.296e+02 2.425e+02 2.552e+02 3.284e+02, threshold=4.850e+02, percent-clipped=0.0 2024-06-21 17:20:58,639 INFO [train.py:1028] (0/2) Epoch 23, batch 7900, loss[loss=0.2104, simple_loss=0.2737, pruned_loss=0.07358, over 13149.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2709, pruned_loss=0.07688, over 2571676.01 frames. 
], batch size: 77, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:21:09,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=422559.5, ans=0.025 2024-06-21 17:21:10,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=422559.5, ans=0.125 2024-06-21 17:21:12,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=422577.8333333333, ans=0.0 2024-06-21 17:21:34,972 INFO [train.py:1028] (0/2) Epoch 23, batch 7950, loss[loss=0.2189, simple_loss=0.2668, pruned_loss=0.08549, over 10655.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2716, pruned_loss=0.07677, over 2575003.44 frames. ], batch size: 303, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:21:51,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2024-06-21 17:21:52,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.94 vs. limit=6.0 2024-06-21 17:21:54,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422687.8333333333, ans=0.125 2024-06-21 17:21:57,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=422687.8333333333, ans=0.0 2024-06-21 17:22:00,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=422687.8333333333, ans=0.125 2024-06-21 17:22:01,897 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.249e+02 2.398e+02 2.663e+02 3.324e+02, threshold=4.796e+02, percent-clipped=0.0 2024-06-21 17:22:08,119 INFO [train.py:1028] (0/2) Epoch 23, batch 8000, loss[loss=0.2011, simple_loss=0.2684, pruned_loss=0.06694, over 12749.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2723, pruned_loss=0.07705, over 2573209.83 frames. ], batch size: 29, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:22:08,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=422724.5, ans=0.125 2024-06-21 17:22:24,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=422761.1666666667, ans=0.025 2024-06-21 17:22:24,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=422761.1666666667, ans=0.125 2024-06-21 17:22:26,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=422761.1666666667, ans=0.2 2024-06-21 17:22:41,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=422797.8333333333, ans=0.025 2024-06-21 17:22:45,759 INFO [train.py:1028] (0/2) Epoch 23, batch 8050, loss[loss=0.2278, simple_loss=0.2815, pruned_loss=0.08708, over 13169.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2716, pruned_loss=0.07682, over 2573165.57 frames. ], batch size: 83, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:22:48,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.71 vs. 
limit=15.0 2024-06-21 17:22:52,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=422834.5, ans=0.1 2024-06-21 17:22:54,025 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.28 vs. limit=6.0 2024-06-21 17:23:00,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422852.8333333333, ans=0.1 2024-06-21 17:23:02,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=422852.8333333333, ans=0.0 2024-06-21 17:23:12,262 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.258e+02 2.389e+02 2.590e+02 3.796e+02, threshold=4.777e+02, percent-clipped=0.0 2024-06-21 17:23:13,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=422889.5, ans=0.0 2024-06-21 17:23:14,652 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2024-06-21 17:23:21,341 INFO [train.py:1028] (0/2) Epoch 23, batch 8100, loss[loss=0.2292, simple_loss=0.2884, pruned_loss=0.08495, over 13148.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2721, pruned_loss=0.07679, over 2577715.53 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:23:30,465 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.47 vs. limit=15.0 2024-06-21 17:23:40,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=422944.5, ans=0.125 2024-06-21 17:23:54,838 INFO [train.py:1028] (0/2) Epoch 23, batch 8150, loss[loss=0.1994, simple_loss=0.2497, pruned_loss=0.07452, over 13062.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.272, pruned_loss=0.07648, over 2580345.44 frames. ], batch size: 121, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:23:58,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.90 vs. limit=22.5 2024-06-21 17:23:59,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=422999.5, ans=0.125 2024-06-21 17:24:01,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.07 vs. limit=10.0 2024-06-21 17:24:08,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=423036.1666666667, ans=0.125 2024-06-21 17:24:08,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.18 vs. 
limit=12.0 2024-06-21 17:24:12,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=423036.1666666667, ans=0.125 2024-06-21 17:24:13,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=423036.1666666667, ans=0.0 2024-06-21 17:24:18,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2024-06-21 17:24:24,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.289e+02 2.395e+02 2.540e+02 3.096e+02, threshold=4.791e+02, percent-clipped=0.0 2024-06-21 17:24:30,768 INFO [train.py:1028] (0/2) Epoch 23, batch 8200, loss[loss=0.2446, simple_loss=0.2945, pruned_loss=0.09733, over 13128.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2726, pruned_loss=0.07668, over 2583264.11 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:24:38,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=423109.5, ans=0.025 2024-06-21 17:24:47,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423127.8333333333, ans=0.1 2024-06-21 17:24:48,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=423127.8333333333, ans=0.0 2024-06-21 17:24:50,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.47 vs. limit=22.5 2024-06-21 17:24:52,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=423146.1666666667, ans=0.0 2024-06-21 17:24:52,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=423146.1666666667, ans=0.0 2024-06-21 17:24:55,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=423146.1666666667, ans=0.0 2024-06-21 17:24:59,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=423164.5, ans=0.0 2024-06-21 17:25:01,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=423164.5, ans=0.125 2024-06-21 17:25:04,408 INFO [train.py:1028] (0/2) Epoch 23, batch 8250, loss[loss=0.1917, simple_loss=0.2633, pruned_loss=0.06001, over 13236.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2728, pruned_loss=0.07684, over 2584041.81 frames. ], batch size: 52, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:25:11,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=423201.1666666667, ans=0.125 2024-06-21 17:25:23,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=423219.5, ans=0.125 2024-06-21 17:25:24,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.21 vs. 
limit=15.0 2024-06-21 17:25:29,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=423237.8333333333, ans=0.125 2024-06-21 17:25:33,547 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.244e+02 2.353e+02 2.485e+02 4.515e+02, threshold=4.706e+02, percent-clipped=0.0 2024-06-21 17:25:39,134 INFO [train.py:1028] (0/2) Epoch 23, batch 8300, loss[loss=0.2017, simple_loss=0.2561, pruned_loss=0.07362, over 13019.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.272, pruned_loss=0.07641, over 2582135.24 frames. ], batch size: 102, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:25:39,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=423274.5, ans=0.2 2024-06-21 17:25:53,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=423311.1666666667, ans=0.125 2024-06-21 17:25:56,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=423311.1666666667, ans=0.125 2024-06-21 17:26:02,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=423329.5, ans=0.125 2024-06-21 17:26:06,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=423347.8333333333, ans=0.0 2024-06-21 17:26:07,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=423347.8333333333, ans=0.0 2024-06-21 17:26:12,253 INFO [train.py:1028] (0/2) Epoch 23, batch 8350, loss[loss=0.211, simple_loss=0.27, pruned_loss=0.07597, over 13220.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2724, pruned_loss=0.07634, over 2581831.51 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:26:12,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=423366.1666666667, ans=0.125 2024-06-21 17:26:14,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=423366.1666666667, ans=0.015 2024-06-21 17:26:20,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423366.1666666667, ans=0.1 2024-06-21 17:26:29,029 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.35 vs. limit=15.0 2024-06-21 17:26:29,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=15.0 2024-06-21 17:26:43,211 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.281e+02 2.427e+02 2.658e+02 3.952e+02, threshold=4.854e+02, percent-clipped=0.0 2024-06-21 17:26:49,490 INFO [train.py:1028] (0/2) Epoch 23, batch 8400, loss[loss=0.1922, simple_loss=0.2571, pruned_loss=0.06362, over 12940.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2722, pruned_loss=0.07656, over 2577906.88 frames. 
], batch size: 39, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:27:13,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=423512.8333333333, ans=0.0 2024-06-21 17:27:17,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.97 vs. limit=22.5 2024-06-21 17:27:26,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=423549.5, ans=0.1 2024-06-21 17:27:26,504 INFO [train.py:1028] (0/2) Epoch 23, batch 8450, loss[loss=0.2309, simple_loss=0.2856, pruned_loss=0.08812, over 13107.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2728, pruned_loss=0.07655, over 2579449.56 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:27:37,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=423567.8333333333, ans=0.2 2024-06-21 17:27:41,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=423586.1666666667, ans=15.0 2024-06-21 17:27:49,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=423604.5, ans=0.0 2024-06-21 17:27:51,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423604.5, ans=0.1 2024-06-21 17:27:53,928 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.283e+02 2.427e+02 2.737e+02 3.622e+02, threshold=4.853e+02, percent-clipped=0.0 2024-06-21 17:27:54,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=423622.8333333333, ans=0.0 2024-06-21 17:27:59,891 INFO [train.py:1028] (0/2) Epoch 23, batch 8500, loss[loss=0.2082, simple_loss=0.2654, pruned_loss=0.07553, over 12502.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2734, pruned_loss=0.07669, over 2577882.18 frames. ], batch size: 29, lr: 2.50e-03, grad_scale: 64.0 2024-06-21 17:28:01,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=423641.1666666667, ans=0.125 2024-06-21 17:28:08,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=423659.5, ans=0.2 2024-06-21 17:28:09,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=423659.5, ans=0.2 2024-06-21 17:28:33,882 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.42 vs. limit=10.0 2024-06-21 17:28:37,447 INFO [train.py:1028] (0/2) Epoch 23, batch 8550, loss[loss=0.2076, simple_loss=0.2785, pruned_loss=0.06834, over 12605.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2723, pruned_loss=0.07576, over 2577034.35 frames. 
], batch size: 22, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:28:38,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=423732.8333333333, ans=0.0 2024-06-21 17:28:39,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=423732.8333333333, ans=0.0 2024-06-21 17:28:40,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=423732.8333333333, ans=0.2 2024-06-21 17:28:44,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=423751.1666666667, ans=0.1 2024-06-21 17:29:05,019 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.307e+02 2.470e+02 2.765e+02 3.807e+02, threshold=4.940e+02, percent-clipped=0.0 2024-06-21 17:29:10,347 INFO [train.py:1028] (0/2) Epoch 23, batch 8600, loss[loss=0.2094, simple_loss=0.264, pruned_loss=0.07742, over 13136.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2735, pruned_loss=0.07651, over 2574090.09 frames. ], batch size: 112, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:29:11,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=423824.5, ans=0.125 2024-06-21 17:29:20,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=12.0 2024-06-21 17:29:34,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=423879.5, ans=0.125 2024-06-21 17:29:41,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.55 vs. limit=22.5 2024-06-21 17:29:43,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=423897.8333333333, ans=0.125 2024-06-21 17:29:45,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=423897.8333333333, ans=0.07 2024-06-21 17:29:47,799 INFO [train.py:1028] (0/2) Epoch 23, batch 8650, loss[loss=0.202, simple_loss=0.2556, pruned_loss=0.07424, over 13104.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2738, pruned_loss=0.07648, over 2577529.41 frames. 
], batch size: 103, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:30:00,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=423952.8333333333, ans=0.125 2024-06-21 17:30:05,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423952.8333333333, ans=0.1 2024-06-21 17:30:06,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=423952.8333333333, ans=0.05 2024-06-21 17:30:15,195 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.277e+02 2.375e+02 2.589e+02 3.238e+02, threshold=4.751e+02, percent-clipped=0.0 2024-06-21 17:30:25,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=424007.8333333333, ans=0.125 2024-06-21 17:30:26,238 INFO [train.py:1028] (0/2) Epoch 23, batch 8700, loss[loss=0.229, simple_loss=0.2956, pruned_loss=0.08121, over 13218.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2745, pruned_loss=0.0772, over 2575243.30 frames. ], batch size: 59, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:30:26,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424007.8333333333, ans=0.1 2024-06-21 17:30:26,406 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:30:29,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=424007.8333333333, ans=0.0 2024-06-21 17:30:29,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=424007.8333333333, ans=0.0 2024-06-21 17:30:32,364 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2024-06-21 17:30:34,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=424026.1666666667, ans=0.125 2024-06-21 17:30:37,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=424026.1666666667, ans=0.025 2024-06-21 17:30:37,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=424026.1666666667, ans=0.1 2024-06-21 17:30:45,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.05 vs. limit=15.0 2024-06-21 17:30:49,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=424062.8333333333, ans=0.025 2024-06-21 17:30:57,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=424081.1666666667, ans=0.025 2024-06-21 17:31:00,914 INFO [train.py:1028] (0/2) Epoch 23, batch 8750, loss[loss=0.2077, simple_loss=0.2595, pruned_loss=0.07796, over 13096.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2751, pruned_loss=0.07773, over 2571037.15 frames. 
], batch size: 121, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:31:07,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=424117.8333333333, ans=0.125 2024-06-21 17:31:10,601 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:31:17,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424136.1666666667, ans=0.1 2024-06-21 17:31:18,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=424136.1666666667, ans=0.125 2024-06-21 17:31:23,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=424154.5, ans=0.0 2024-06-21 17:31:32,906 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.322e+02 2.457e+02 2.775e+02 3.821e+02, threshold=4.913e+02, percent-clipped=0.0 2024-06-21 17:31:33,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=424172.8333333333, ans=0.125 2024-06-21 17:31:34,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=424172.8333333333, ans=0.1 2024-06-21 17:31:38,477 INFO [train.py:1028] (0/2) Epoch 23, batch 8800, loss[loss=0.2054, simple_loss=0.2756, pruned_loss=0.06757, over 13199.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2749, pruned_loss=0.07743, over 2575302.36 frames. ], batch size: 72, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:31:40,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.66 vs. limit=22.5 2024-06-21 17:31:53,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=424227.8333333333, ans=0.025 2024-06-21 17:32:03,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.85 vs. limit=15.0 2024-06-21 17:32:12,745 INFO [train.py:1028] (0/2) Epoch 23, batch 8850, loss[loss=0.2317, simple_loss=0.2897, pruned_loss=0.08684, over 12564.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2749, pruned_loss=0.07796, over 2565249.90 frames. 
], batch size: 202, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:32:15,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=424282.8333333333, ans=0.0 2024-06-21 17:32:21,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=424282.8333333333, ans=0.1 2024-06-21 17:32:27,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=424301.1666666667, ans=0.125 2024-06-21 17:32:34,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=424319.5, ans=0.125 2024-06-21 17:32:45,766 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.298e+02 2.512e+02 2.729e+02 3.474e+02, threshold=5.024e+02, percent-clipped=0.0 2024-06-21 17:32:51,226 INFO [train.py:1028] (0/2) Epoch 23, batch 8900, loss[loss=0.228, simple_loss=0.2877, pruned_loss=0.08414, over 12911.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2756, pruned_loss=0.07836, over 2564462.68 frames. ], batch size: 33, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:32:52,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=424374.5, ans=0.125 2024-06-21 17:32:53,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=424374.5, ans=0.125 2024-06-21 17:32:58,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=424392.8333333333, ans=0.125 2024-06-21 17:32:58,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=424392.8333333333, ans=0.0 2024-06-21 17:33:03,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=424392.8333333333, ans=0.025 2024-06-21 17:33:04,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=424411.1666666667, ans=0.125 2024-06-21 17:33:12,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=424429.5, ans=0.09899494936611666 2024-06-21 17:33:21,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=424447.8333333333, ans=0.2 2024-06-21 17:33:28,690 INFO [train.py:1028] (0/2) Epoch 23, batch 8950, loss[loss=0.2332, simple_loss=0.2893, pruned_loss=0.08856, over 12607.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2759, pruned_loss=0.07805, over 2563894.03 frames. ], batch size: 202, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:33:38,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=424484.5, ans=0.2 2024-06-21 17:33:43,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=424502.8333333333, ans=10.0 2024-06-21 17:33:49,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. 
limit=15.0 2024-06-21 17:33:52,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=424521.1666666667, ans=0.125 2024-06-21 17:33:56,805 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.317e+02 2.447e+02 2.721e+02 3.537e+02, threshold=4.893e+02, percent-clipped=0.0 2024-06-21 17:33:57,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=424539.5, ans=0.2 2024-06-21 17:33:58,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.21 vs. limit=15.0 2024-06-21 17:34:01,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0 2024-06-21 17:34:02,030 INFO [train.py:1028] (0/2) Epoch 23, batch 9000, loss[loss=0.2057, simple_loss=0.2734, pruned_loss=0.06904, over 13312.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2759, pruned_loss=0.07775, over 2569915.35 frames. ], batch size: 46, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:34:02,031 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 17:34:07,942 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.8546, 3.3594, 3.2589, 2.6625, 3.0065, 3.1172, 3.1020, 3.0734], device='cuda:0') 2024-06-21 17:34:10,001 INFO [train.py:1060] (0/2) Epoch 23, validation: loss=0.1885, simple_loss=0.2513, pruned_loss=0.06289, over 351949.00 frames. 2024-06-21 17:34:10,002 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 17480MB 2024-06-21 17:34:35,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=424612.8333333333, ans=0.2 2024-06-21 17:34:41,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=424631.1666666667, ans=0.2 2024-06-21 17:34:45,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424649.5, ans=0.1 2024-06-21 17:34:46,058 INFO [train.py:1028] (0/2) Epoch 23, batch 9050, loss[loss=0.2133, simple_loss=0.2784, pruned_loss=0.07412, over 11369.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2767, pruned_loss=0.07821, over 2568390.22 frames. ], batch size: 17, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:34:51,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=424649.5, ans=0.0 2024-06-21 17:34:54,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=424667.8333333333, ans=0.125 2024-06-21 17:34:58,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.49 vs. 
limit=15.0 2024-06-21 17:34:59,151 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:35:03,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=424686.1666666667, ans=0.0 2024-06-21 17:35:13,535 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.312e+02 2.452e+02 2.669e+02 3.532e+02, threshold=4.905e+02, percent-clipped=0.0 2024-06-21 17:35:18,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=424741.1666666667, ans=0.0 2024-06-21 17:35:18,753 INFO [train.py:1028] (0/2) Epoch 23, batch 9100, loss[loss=0.1945, simple_loss=0.2587, pruned_loss=0.06511, over 13052.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2759, pruned_loss=0.07784, over 2568435.86 frames. ], batch size: 71, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:35:33,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424777.8333333333, ans=0.1 2024-06-21 17:35:50,329 INFO [train.py:1028] (0/2) Epoch 23, batch 9150, loss[loss=0.179, simple_loss=0.2473, pruned_loss=0.05541, over 13137.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2762, pruned_loss=0.07788, over 2569493.97 frames. ], batch size: 77, lr: 2.50e-03, grad_scale: 32.0 2024-06-21 17:35:52,080 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.49 vs. limit=6.0 2024-06-21 17:35:57,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=424851.1666666667, ans=0.025 2024-06-21 17:35:57,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=424851.1666666667, ans=0.025 2024-06-21 17:35:59,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=424851.1666666667, ans=0.125 2024-06-21 17:36:01,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=424851.1666666667, ans=0.0 2024-06-21 17:36:17,199 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.199e+02 2.303e+02 2.511e+02 3.040e+02, threshold=4.605e+02, percent-clipped=0.0 2024-06-21 17:36:19,952 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:36:22,389 INFO [train.py:1028] (0/2) Epoch 23, batch 9200, loss[loss=0.2056, simple_loss=0.2685, pruned_loss=0.07138, over 12944.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2756, pruned_loss=0.0772, over 2572946.56 frames. 
], batch size: 36, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:36:33,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=424942.8333333333, ans=0.125 2024-06-21 17:36:34,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=424961.1666666667, ans=0.125 2024-06-21 17:36:48,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424979.5, ans=0.1 2024-06-21 17:36:48,621 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.50 vs. limit=5.0 2024-06-21 17:36:52,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.48 vs. limit=10.0 2024-06-21 17:36:57,421 INFO [train.py:1028] (0/2) Epoch 23, batch 9250, loss[loss=0.2282, simple_loss=0.2958, pruned_loss=0.08025, over 13225.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2758, pruned_loss=0.07711, over 2574542.58 frames. ], batch size: 67, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:37:01,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=425016.1666666667, ans=0.125 2024-06-21 17:37:12,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=425052.8333333333, ans=0.2 2024-06-21 17:37:23,531 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2024-06-21 17:37:24,476 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.269e+02 2.451e+02 2.577e+02 3.290e+02, threshold=4.903e+02, percent-clipped=0.0 2024-06-21 17:37:27,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=425089.5, ans=0.125 2024-06-21 17:37:29,672 INFO [train.py:1028] (0/2) Epoch 23, batch 9300, loss[loss=0.1898, simple_loss=0.2481, pruned_loss=0.06572, over 12890.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2756, pruned_loss=0.07703, over 2571705.58 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:37:35,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=425126.1666666667, ans=0.0 2024-06-21 17:37:42,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=15.0 2024-06-21 17:37:45,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=425144.5, ans=0.1 2024-06-21 17:37:56,936 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:37:59,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.10 vs. limit=6.0 2024-06-21 17:38:01,548 INFO [train.py:1028] (0/2) Epoch 23, batch 9350, loss[loss=0.2097, simple_loss=0.2668, pruned_loss=0.07632, over 12555.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2754, pruned_loss=0.07695, over 2568642.93 frames. 
], batch size: 22, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:38:10,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=425217.8333333333, ans=0.125 2024-06-21 17:38:17,855 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.21 vs. limit=15.0 2024-06-21 17:38:28,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=425272.8333333333, ans=0.02 2024-06-21 17:38:28,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=425272.8333333333, ans=0.0 2024-06-21 17:38:28,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=425272.8333333333, ans=0.125 2024-06-21 17:38:29,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=425272.8333333333, ans=0.125 2024-06-21 17:38:29,410 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.296e+02 2.428e+02 2.618e+02 3.615e+02, threshold=4.857e+02, percent-clipped=0.0 2024-06-21 17:38:32,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2024-06-21 17:38:34,236 INFO [train.py:1028] (0/2) Epoch 23, batch 9400, loss[loss=0.2032, simple_loss=0.2718, pruned_loss=0.06727, over 13255.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2754, pruned_loss=0.077, over 2567757.50 frames. ], batch size: 52, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:38:36,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=425291.1666666667, ans=0.125 2024-06-21 17:38:37,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=425291.1666666667, ans=15.0 2024-06-21 17:38:45,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425309.5, ans=0.1 2024-06-21 17:38:46,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=425327.8333333333, ans=0.09899494936611666 2024-06-21 17:38:46,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=425327.8333333333, ans=0.0 2024-06-21 17:38:47,529 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-232000.pt 2024-06-21 17:38:59,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=425346.1666666667, ans=0.125 2024-06-21 17:38:59,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425346.1666666667, ans=0.1 2024-06-21 17:38:59,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=425346.1666666667, ans=0.125 2024-06-21 17:39:02,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425346.1666666667, ans=0.1 2024-06-21 17:39:05,984 INFO [scaling.py:214] (0/2) ScheduledFloat: 
name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=425364.5, ans=0.125 2024-06-21 17:39:10,000 INFO [train.py:1028] (0/2) Epoch 23, batch 9450, loss[loss=0.2201, simple_loss=0.2788, pruned_loss=0.08067, over 12685.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2763, pruned_loss=0.07747, over 2567408.09 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:39:11,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=425382.8333333333, ans=0.125 2024-06-21 17:39:30,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425437.8333333333, ans=0.1 2024-06-21 17:39:35,537 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.271e+02 2.413e+02 2.592e+02 3.343e+02, threshold=4.826e+02, percent-clipped=0.0 2024-06-21 17:39:40,442 INFO [train.py:1028] (0/2) Epoch 23, batch 9500, loss[loss=0.1974, simple_loss=0.2679, pruned_loss=0.06351, over 13213.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2758, pruned_loss=0.07685, over 2576521.88 frames. ], batch size: 43, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:39:46,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=425492.8333333333, ans=0.125 2024-06-21 17:39:46,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=425492.8333333333, ans=0.0 2024-06-21 17:39:50,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.81 vs. limit=22.5 2024-06-21 17:39:54,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=425511.1666666667, ans=0.5 2024-06-21 17:39:58,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.34 vs. limit=12.0 2024-06-21 17:40:13,274 INFO [train.py:1028] (0/2) Epoch 23, batch 9550, loss[loss=0.2015, simple_loss=0.2618, pruned_loss=0.07063, over 12948.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2762, pruned_loss=0.07756, over 2573259.26 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:40:29,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.78 vs. limit=10.0 2024-06-21 17:40:31,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=425621.1666666667, ans=0.125 2024-06-21 17:40:39,093 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 2.265e+02 2.426e+02 2.648e+02 3.442e+02, threshold=4.853e+02, percent-clipped=0.0 2024-06-21 17:40:44,258 INFO [train.py:1028] (0/2) Epoch 23, batch 9600, loss[loss=0.2316, simple_loss=0.2739, pruned_loss=0.0946, over 10626.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.276, pruned_loss=0.0776, over 2570972.77 frames. 
], batch size: 304, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:41:03,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425712.8333333333, ans=0.1 2024-06-21 17:41:13,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=425731.1666666667, ans=0.125 2024-06-21 17:41:17,014 INFO [train.py:1028] (0/2) Epoch 23, batch 9650, loss[loss=0.2137, simple_loss=0.2746, pruned_loss=0.07645, over 13124.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2758, pruned_loss=0.0778, over 2561123.10 frames. ], batch size: 132, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:41:19,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=425749.5, ans=0.125 2024-06-21 17:41:23,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=425767.8333333333, ans=0.1 2024-06-21 17:41:29,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=425786.1666666667, ans=0.125 2024-06-21 17:41:43,029 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.324e+02 2.526e+02 2.748e+02 4.281e+02, threshold=5.052e+02, percent-clipped=0.0 2024-06-21 17:41:43,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=425822.8333333333, ans=0.125 2024-06-21 17:41:43,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=425822.8333333333, ans=0.0 2024-06-21 17:41:47,882 INFO [train.py:1028] (0/2) Epoch 23, batch 9700, loss[loss=0.2164, simple_loss=0.2705, pruned_loss=0.08116, over 13045.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2751, pruned_loss=0.0777, over 2554912.36 frames. ], batch size: 144, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:41:48,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.23 vs. limit=15.0 2024-06-21 17:42:03,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2024-06-21 17:42:18,044 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:42:20,185 INFO [train.py:1028] (0/2) Epoch 23, batch 9750, loss[loss=0.221, simple_loss=0.2795, pruned_loss=0.08124, over 12998.00 frames. ], tot_loss[loss=0.214, simple_loss=0.274, pruned_loss=0.077, over 2551469.29 frames. ], batch size: 132, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:42:24,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=425932.8333333333, ans=0.125 2024-06-21 17:42:42,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2024-06-21 17:42:45,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.40 vs. 
limit=15.0 2024-06-21 17:42:46,362 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.276e+02 2.368e+02 2.518e+02 3.119e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 17:42:48,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=426006.1666666667, ans=0.035 2024-06-21 17:42:51,365 INFO [train.py:1028] (0/2) Epoch 23, batch 9800, loss[loss=0.1818, simple_loss=0.2543, pruned_loss=0.05468, over 12902.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2729, pruned_loss=0.07628, over 2545556.96 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:42:55,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=426024.5, ans=10.0 2024-06-21 17:43:00,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=426042.8333333333, ans=0.0 2024-06-21 17:43:00,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=426042.8333333333, ans=0.125 2024-06-21 17:43:03,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=426042.8333333333, ans=0.0 2024-06-21 17:43:09,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.06 vs. limit=15.0 2024-06-21 17:43:17,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.15 vs. limit=22.5 2024-06-21 17:43:23,863 INFO [train.py:1028] (0/2) Epoch 23, batch 9850, loss[loss=0.2104, simple_loss=0.2669, pruned_loss=0.07694, over 12982.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2724, pruned_loss=0.0759, over 2538056.12 frames. ], batch size: 102, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:43:50,152 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.280e+02 2.409e+02 2.617e+02 3.405e+02, threshold=4.817e+02, percent-clipped=0.0 2024-06-21 17:43:54,958 INFO [train.py:1028] (0/2) Epoch 23, batch 9900, loss[loss=0.1897, simple_loss=0.2486, pruned_loss=0.06537, over 12881.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2724, pruned_loss=0.07642, over 2529759.49 frames. ], batch size: 39, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:44:08,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=426244.5, ans=0.0 2024-06-21 17:44:08,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=426244.5, ans=0.2 2024-06-21 17:44:19,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.90 vs. limit=22.5 2024-06-21 17:44:19,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=426262.8333333333, ans=0.125 2024-06-21 17:44:21,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=426281.1666666667, ans=0.125 2024-06-21 17:44:26,869 INFO [train.py:1028] (0/2) Epoch 23, batch 9950, loss[loss=0.226, simple_loss=0.2888, pruned_loss=0.08165, over 12608.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2718, pruned_loss=0.0767, over 2525199.47 frames. 
], batch size: 29, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:44:36,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=426317.8333333333, ans=0.0 2024-06-21 17:44:47,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=426354.5, ans=0.2 2024-06-21 17:44:47,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=426354.5, ans=0.2 2024-06-21 17:44:49,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=426354.5, ans=0.2 2024-06-21 17:44:50,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=426354.5, ans=0.2 2024-06-21 17:44:53,927 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.277e+02 2.439e+02 2.633e+02 3.435e+02, threshold=4.879e+02, percent-clipped=0.0 2024-06-21 17:44:58,886 INFO [train.py:1028] (0/2) Epoch 23, batch 10000, loss[loss=0.2364, simple_loss=0.3018, pruned_loss=0.08546, over 12446.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2719, pruned_loss=0.07666, over 2487286.03 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:45:02,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=426391.1666666667, ans=0.09899494936611666 2024-06-21 17:45:09,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=426409.5, ans=0.125 2024-06-21 17:45:18,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=426446.1666666667, ans=0.0 2024-06-21 17:45:19,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=426446.1666666667, ans=0.05 2024-06-21 17:45:26,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=24.93 vs. limit=22.5 2024-06-21 17:45:31,337 INFO [train.py:1028] (0/2) Epoch 23, batch 10050, loss[loss=0.1965, simple_loss=0.2605, pruned_loss=0.06622, over 12556.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2722, pruned_loss=0.0774, over 2444275.16 frames. ], batch size: 22, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:45:33,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2024-06-21 17:45:38,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.41 vs. limit=22.5 2024-06-21 17:45:44,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=426519.5, ans=0.0 2024-06-21 17:45:46,193 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.66 vs. 
limit=15.0 2024-06-21 17:45:47,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=426519.5, ans=0.125 2024-06-21 17:45:48,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=426519.5, ans=0.5 2024-06-21 17:45:50,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=426537.8333333333, ans=0.125 2024-06-21 17:45:55,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=426556.1666666667, ans=0.0 2024-06-21 17:45:56,645 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.314e+02 2.443e+02 2.672e+02 4.002e+02, threshold=4.887e+02, percent-clipped=0.0 2024-06-21 17:45:57,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=426556.1666666667, ans=0.2 2024-06-21 17:45:59,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=426556.1666666667, ans=15.0 2024-06-21 17:46:01,839 INFO [train.py:1028] (0/2) Epoch 23, batch 10100, loss[loss=0.1901, simple_loss=0.2634, pruned_loss=0.05843, over 11274.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2728, pruned_loss=0.07757, over 2424960.29 frames. ], batch size: 17, lr: 2.49e-03, grad_scale: 32.0 2024-06-21 17:46:04,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.01 vs. limit=22.5 2024-06-21 17:46:05,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=426574.5, ans=0.2 2024-06-21 17:46:08,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.54 vs. limit=22.5 2024-06-21 17:46:09,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=426592.8333333333, ans=0.2 2024-06-21 17:46:15,000 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-23.pt 2024-06-21 17:48:14,255 INFO [train.py:1028] (0/2) Epoch 24, batch 0, loss[loss=0.2033, simple_loss=0.2609, pruned_loss=0.07284, over 12912.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2609, pruned_loss=0.07284, over 12912.00 frames. ], batch size: 36, lr: 2.44e-03, grad_scale: 32.0 2024-06-21 17:48:14,256 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 17:48:21,352 INFO [train.py:1060] (0/2) Epoch 24, validation: loss=0.189, simple_loss=0.252, pruned_loss=0.06296, over 351949.00 frames. 
2024-06-21 17:48:21,352 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 17:48:31,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=426624.0, ans=0.95 2024-06-21 17:48:32,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=426624.0, ans=0.0 2024-06-21 17:48:34,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=426642.3333333333, ans=0.125 2024-06-21 17:48:45,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=426660.6666666667, ans=0.125 2024-06-21 17:48:55,164 INFO [train.py:1028] (0/2) Epoch 24, batch 50, loss[loss=0.1856, simple_loss=0.2446, pruned_loss=0.06325, over 12560.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2523, pruned_loss=0.06983, over 575028.45 frames. ], batch size: 29, lr: 2.44e-03, grad_scale: 32.0 2024-06-21 17:48:56,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=426697.3333333333, ans=0.0 2024-06-21 17:49:01,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.44 vs. limit=22.5 2024-06-21 17:49:01,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=15.0 2024-06-21 17:49:02,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=426715.6666666667, ans=0.0 2024-06-21 17:49:09,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.82 vs. limit=15.0 2024-06-21 17:49:10,703 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.162e+02 2.257e+02 2.402e+02 2.871e+02, threshold=4.515e+02, percent-clipped=0.0 2024-06-21 17:49:14,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=426752.3333333333, ans=0.0 2024-06-21 17:49:16,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=8.0 2024-06-21 17:49:29,126 INFO [train.py:1028] (0/2) Epoch 24, batch 100, loss[loss=0.1829, simple_loss=0.246, pruned_loss=0.05995, over 13279.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.253, pruned_loss=0.06967, over 1017974.86 frames. ], batch size: 46, lr: 2.44e-03, grad_scale: 32.0 2024-06-21 17:49:34,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=426789.0, ans=0.125 2024-06-21 17:49:53,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=426844.0, ans=0.0 2024-06-21 17:49:53,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.78 vs. 
limit=22.5 2024-06-21 17:50:01,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=426862.3333333333, ans=0.125 2024-06-21 17:50:05,780 INFO [train.py:1028] (0/2) Epoch 24, batch 150, loss[loss=0.1885, simple_loss=0.252, pruned_loss=0.06247, over 12795.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2528, pruned_loss=0.06855, over 1365299.03 frames. ], batch size: 29, lr: 2.44e-03, grad_scale: 32.0 2024-06-21 17:50:21,747 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.187e+02 2.331e+02 2.568e+02 3.088e+02, threshold=4.663e+02, percent-clipped=0.0 2024-06-21 17:50:22,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=426917.3333333333, ans=0.0 2024-06-21 17:50:29,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=426935.6666666667, ans=0.05 2024-06-21 17:50:34,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=426954.0, ans=0.0 2024-06-21 17:50:34,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=426954.0, ans=0.5 2024-06-21 17:50:37,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426972.3333333333, ans=0.125 2024-06-21 17:50:37,822 INFO [train.py:1028] (0/2) Epoch 24, batch 200, loss[loss=0.2279, simple_loss=0.2758, pruned_loss=0.09001, over 12528.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2536, pruned_loss=0.06927, over 1634506.53 frames. ], batch size: 202, lr: 2.44e-03, grad_scale: 32.0 2024-06-21 17:50:46,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.88 vs. limit=22.5 2024-06-21 17:50:55,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=12.0 2024-06-21 17:51:09,666 INFO [train.py:1028] (0/2) Epoch 24, batch 250, loss[loss=0.1781, simple_loss=0.2256, pruned_loss=0.06532, over 13033.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2531, pruned_loss=0.06897, over 1846062.32 frames. ], batch size: 144, lr: 2.44e-03, grad_scale: 32.0 2024-06-21 17:51:13,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427064.0, ans=0.1 2024-06-21 17:51:18,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.40 vs. 
limit=12.0 2024-06-21 17:51:19,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=427082.3333333333, ans=0.0 2024-06-21 17:51:26,391 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.204e+02 2.311e+02 2.506e+02 3.068e+02, threshold=4.622e+02, percent-clipped=0.0 2024-06-21 17:51:28,447 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:51:36,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=427119.0, ans=0.125 2024-06-21 17:51:48,444 INFO [train.py:1028] (0/2) Epoch 24, batch 300, loss[loss=0.192, simple_loss=0.246, pruned_loss=0.06905, over 13212.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2535, pruned_loss=0.06937, over 2009823.89 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 32.0 2024-06-21 17:51:48,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=427155.6666666667, ans=0.0 2024-06-21 17:51:54,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=427174.0, ans=0.0 2024-06-21 17:51:55,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=427174.0, ans=0.09899494936611666 2024-06-21 17:51:59,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=427174.0, ans=0.125 2024-06-21 17:52:02,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=427192.3333333333, ans=0.0 2024-06-21 17:52:04,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=427192.3333333333, ans=0.125 2024-06-21 17:52:07,113 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:52:19,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=427229.0, ans=0.2 2024-06-21 17:52:20,306 INFO [train.py:1028] (0/2) Epoch 24, batch 350, loss[loss=0.1776, simple_loss=0.2474, pruned_loss=0.05392, over 12776.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2531, pruned_loss=0.06904, over 2138309.61 frames. 
], batch size: 33, lr: 2.43e-03, grad_scale: 32.0 2024-06-21 17:52:22,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=427247.3333333333, ans=0.125 2024-06-21 17:52:25,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=427247.3333333333, ans=0.0 2024-06-21 17:52:28,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=427265.6666666667, ans=0.125 2024-06-21 17:52:31,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=427265.6666666667, ans=0.0 2024-06-21 17:52:34,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=427284.0, ans=0.0 2024-06-21 17:52:36,322 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.136e+02 2.248e+02 2.490e+02 3.097e+02, threshold=4.495e+02, percent-clipped=0.0 2024-06-21 17:52:38,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=427302.3333333333, ans=0.125 2024-06-21 17:52:46,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=427320.6666666667, ans=0.125 2024-06-21 17:52:52,003 INFO [train.py:1028] (0/2) Epoch 24, batch 400, loss[loss=0.1998, simple_loss=0.2605, pruned_loss=0.06952, over 13269.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2529, pruned_loss=0.06889, over 2239937.60 frames. ], batch size: 63, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:53:01,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=427357.3333333333, ans=0.0 2024-06-21 17:53:04,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=427375.6666666667, ans=0.0 2024-06-21 17:53:08,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=427375.6666666667, ans=0.125 2024-06-21 17:53:11,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=427394.0, ans=0.0 2024-06-21 17:53:16,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.63 vs. limit=22.5 2024-06-21 17:53:20,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.33 vs. limit=15.0 2024-06-21 17:53:20,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=427412.3333333333, ans=0.2 2024-06-21 17:53:22,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=427412.3333333333, ans=0.125 2024-06-21 17:53:24,130 INFO [train.py:1028] (0/2) Epoch 24, batch 450, loss[loss=0.1893, simple_loss=0.2506, pruned_loss=0.06403, over 13268.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2531, pruned_loss=0.06853, over 2314642.42 frames. 
], batch size: 67, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:53:27,996 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:53:43,022 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.139e+02 2.272e+02 2.402e+02 2.964e+02, threshold=4.544e+02, percent-clipped=0.0 2024-06-21 17:53:44,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.19 vs. limit=15.0 2024-06-21 17:53:46,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.16 vs. limit=6.0 2024-06-21 17:53:47,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=427485.6666666667, ans=0.125 2024-06-21 17:53:51,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=427485.6666666667, ans=0.2 2024-06-21 17:53:56,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=427504.0, ans=0.2 2024-06-21 17:54:02,096 INFO [train.py:1028] (0/2) Epoch 24, batch 500, loss[loss=0.178, simple_loss=0.2361, pruned_loss=0.05994, over 13106.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2534, pruned_loss=0.06844, over 2376490.58 frames. ], batch size: 121, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:54:04,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.11 vs. limit=15.0 2024-06-21 17:54:06,807 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.33 vs. limit=15.0 2024-06-21 17:54:14,380 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=15.0 2024-06-21 17:54:14,389 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2024-06-21 17:54:19,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=427559.0, ans=0.1 2024-06-21 17:54:24,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=427577.3333333333, ans=0.2 2024-06-21 17:54:34,229 INFO [train.py:1028] (0/2) Epoch 24, batch 550, loss[loss=0.1891, simple_loss=0.2423, pruned_loss=0.06792, over 12955.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2536, pruned_loss=0.06825, over 2420913.03 frames. 
], batch size: 158, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:54:49,966 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.213e+02 2.331e+02 2.537e+02 3.094e+02, threshold=4.661e+02, percent-clipped=0.0 2024-06-21 17:54:50,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=427650.6666666667, ans=0.025 2024-06-21 17:54:58,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=427687.3333333333, ans=0.0 2024-06-21 17:55:03,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=427687.3333333333, ans=0.125 2024-06-21 17:55:05,764 INFO [train.py:1028] (0/2) Epoch 24, batch 600, loss[loss=0.1946, simple_loss=0.2416, pruned_loss=0.07382, over 13054.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2536, pruned_loss=0.06827, over 2458262.56 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:55:07,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=427705.6666666667, ans=10.0 2024-06-21 17:55:13,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=427724.0, ans=0.125 2024-06-21 17:55:26,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=12.0 2024-06-21 17:55:41,203 INFO [train.py:1028] (0/2) Epoch 24, batch 650, loss[loss=0.1927, simple_loss=0.2523, pruned_loss=0.06653, over 13236.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2535, pruned_loss=0.06803, over 2490184.67 frames. ], batch size: 59, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:55:44,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2024-06-21 17:55:47,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=427815.6666666667, ans=0.125 2024-06-21 17:55:49,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=427815.6666666667, ans=10.0 2024-06-21 17:55:50,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427815.6666666667, ans=0.125 2024-06-21 17:55:51,349 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=2.537e-03 2024-06-21 17:55:57,079 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.080e+02 2.204e+02 2.343e+02 2.882e+02, threshold=4.408e+02, percent-clipped=0.0 2024-06-21 17:56:07,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=427852.3333333333, ans=0.0 2024-06-21 17:56:15,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=427889.0, ans=0.125 2024-06-21 17:56:15,667 INFO [train.py:1028] (0/2) Epoch 24, batch 700, loss[loss=0.1958, simple_loss=0.2552, pruned_loss=0.06816, over 13271.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2527, pruned_loss=0.06803, over 2513235.05 frames. 
], batch size: 46, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:56:34,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=427944.0, ans=0.0 2024-06-21 17:56:36,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.45 vs. limit=15.0 2024-06-21 17:56:43,450 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.53 vs. limit=15.0 2024-06-21 17:56:48,140 INFO [train.py:1028] (0/2) Epoch 24, batch 750, loss[loss=0.2082, simple_loss=0.2663, pruned_loss=0.07507, over 13282.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2525, pruned_loss=0.06814, over 2527517.34 frames. ], batch size: 63, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:56:49,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=427980.6666666667, ans=0.125 2024-06-21 17:56:52,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=427980.6666666667, ans=0.025 2024-06-21 17:56:58,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=427999.0, ans=0.125 2024-06-21 17:57:00,873 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.16 vs. limit=22.5 2024-06-21 17:57:04,497 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.150e+02 2.249e+02 2.402e+02 2.853e+02, threshold=4.498e+02, percent-clipped=0.0 2024-06-21 17:57:09,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=428035.6666666667, ans=0.2 2024-06-21 17:57:16,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=428054.0, ans=0.0 2024-06-21 17:57:20,423 INFO [train.py:1028] (0/2) Epoch 24, batch 800, loss[loss=0.1948, simple_loss=0.2591, pruned_loss=0.06527, over 12907.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2525, pruned_loss=0.06819, over 2540804.14 frames. 
], batch size: 36, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:57:21,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=428072.3333333333, ans=0.125 2024-06-21 17:57:21,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=428072.3333333333, ans=0.125 2024-06-21 17:57:23,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=428072.3333333333, ans=0.09899494936611666 2024-06-21 17:57:32,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=428090.6666666667, ans=0.2 2024-06-21 17:57:34,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=428109.0, ans=0.125 2024-06-21 17:57:36,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=428109.0, ans=0.125 2024-06-21 17:57:37,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=428109.0, ans=0.125 2024-06-21 17:57:39,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=428127.3333333333, ans=0.0 2024-06-21 17:57:41,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=428127.3333333333, ans=0.125 2024-06-21 17:57:46,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=428127.3333333333, ans=0.0 2024-06-21 17:57:47,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=428127.3333333333, ans=0.125 2024-06-21 17:57:50,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=428145.6666666667, ans=0.2 2024-06-21 17:57:52,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.87 vs. limit=15.0 2024-06-21 17:57:56,427 INFO [train.py:1028] (0/2) Epoch 24, batch 850, loss[loss=0.2069, simple_loss=0.2599, pruned_loss=0.07697, over 13214.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2521, pruned_loss=0.06775, over 2551718.79 frames. ], batch size: 95, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:57:57,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2024-06-21 17:58:00,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=428164.0, ans=0.125 2024-06-21 17:58:09,516 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. 
limit=15.0 2024-06-21 17:58:11,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=428200.6666666667, ans=0.0 2024-06-21 17:58:14,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=428200.6666666667, ans=0.025 2024-06-21 17:58:15,215 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.188e+02 2.338e+02 2.577e+02 3.264e+02, threshold=4.675e+02, percent-clipped=0.0 2024-06-21 17:58:25,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=428237.3333333333, ans=0.0 2024-06-21 17:58:31,199 INFO [train.py:1028] (0/2) Epoch 24, batch 900, loss[loss=0.1817, simple_loss=0.2391, pruned_loss=0.06213, over 12945.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2522, pruned_loss=0.06795, over 2556739.41 frames. ], batch size: 36, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:58:36,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2024-06-21 17:58:56,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.92 vs. limit=6.0 2024-06-21 17:59:01,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=428329.0, ans=0.0 2024-06-21 17:59:03,570 INFO [train.py:1028] (0/2) Epoch 24, batch 950, loss[loss=0.1951, simple_loss=0.2608, pruned_loss=0.06466, over 13009.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2527, pruned_loss=0.06789, over 2561068.94 frames. ], batch size: 39, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:59:05,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2024-06-21 17:59:07,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=428347.3333333333, ans=0.125 2024-06-21 17:59:19,418 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.193e+02 2.334e+02 2.505e+02 3.278e+02, threshold=4.668e+02, percent-clipped=0.0 2024-06-21 17:59:25,742 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 17:59:32,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2024-06-21 17:59:34,715 INFO [train.py:1028] (0/2) Epoch 24, batch 1000, loss[loss=0.2091, simple_loss=0.2656, pruned_loss=0.07625, over 13279.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2529, pruned_loss=0.06821, over 2563193.10 frames. 
], batch size: 49, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 17:59:40,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428457.3333333333, ans=0.1 2024-06-21 17:59:56,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=428494.0, ans=0.2 2024-06-21 17:59:57,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=428494.0, ans=0.0 2024-06-21 17:59:58,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=428494.0, ans=0.125 2024-06-21 18:00:04,086 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:00:09,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=428512.3333333333, ans=0.125 2024-06-21 18:00:14,758 INFO [train.py:1028] (0/2) Epoch 24, batch 1050, loss[loss=0.1915, simple_loss=0.2625, pruned_loss=0.06029, over 13192.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.253, pruned_loss=0.06791, over 2566865.59 frames. ], batch size: 77, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:00:27,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=428567.3333333333, ans=0.0 2024-06-21 18:00:28,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=428567.3333333333, ans=0.125 2024-06-21 18:00:31,199 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.118e+02 2.255e+02 2.442e+02 3.113e+02, threshold=4.509e+02, percent-clipped=0.0 2024-06-21 18:00:32,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.24 vs. limit=15.0 2024-06-21 18:00:35,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=428585.6666666667, ans=0.125 2024-06-21 18:00:36,631 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:00:44,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=428604.0, ans=0.125 2024-06-21 18:00:48,189 INFO [train.py:1028] (0/2) Epoch 24, batch 1100, loss[loss=0.1913, simple_loss=0.2612, pruned_loss=0.06065, over 13240.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2528, pruned_loss=0.06766, over 2571484.40 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:00:57,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428640.6666666667, ans=0.1 2024-06-21 18:00:58,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=428640.6666666667, ans=0.2 2024-06-21 18:01:21,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2024-06-21 18:01:21,327 INFO [train.py:1028] (0/2) Epoch 24, batch 1150, loss[loss=0.2102, simple_loss=0.2701, pruned_loss=0.07516, over 13250.00 frames. 
], tot_loss[loss=0.1943, simple_loss=0.2527, pruned_loss=0.06796, over 2571945.12 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:01:21,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=428714.0, ans=10.0 2024-06-21 18:01:30,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=428732.3333333333, ans=0.125 2024-06-21 18:01:34,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=428750.6666666667, ans=0.0 2024-06-21 18:01:34,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=428750.6666666667, ans=0.125 2024-06-21 18:01:37,329 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.164e+02 2.379e+02 2.600e+02 3.386e+02, threshold=4.758e+02, percent-clipped=0.0 2024-06-21 18:01:42,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428750.6666666667, ans=0.1 2024-06-21 18:01:53,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=428787.3333333333, ans=0.125 2024-06-21 18:01:56,177 INFO [train.py:1028] (0/2) Epoch 24, batch 1200, loss[loss=0.1929, simple_loss=0.2506, pruned_loss=0.06755, over 13157.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2533, pruned_loss=0.06865, over 2574024.14 frames. ], batch size: 77, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:02:03,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=428805.6666666667, ans=0.1 2024-06-21 18:02:15,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428842.3333333333, ans=0.1 2024-06-21 18:02:16,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2024-06-21 18:02:27,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=428879.0, ans=0.0 2024-06-21 18:02:29,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.21 vs. limit=22.5 2024-06-21 18:02:30,773 INFO [train.py:1028] (0/2) Epoch 24, batch 1250, loss[loss=0.195, simple_loss=0.2508, pruned_loss=0.06958, over 13146.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2533, pruned_loss=0.0687, over 2582934.28 frames. 
], batch size: 112, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:02:46,794 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.135e+02 2.257e+02 2.419e+02 3.120e+02, threshold=4.515e+02, percent-clipped=0.0 2024-06-21 18:02:55,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=428952.3333333333, ans=0.0 2024-06-21 18:02:59,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=428970.6666666667, ans=0.09899494936611666 2024-06-21 18:03:01,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=428970.6666666667, ans=0.07 2024-06-21 18:03:03,087 INFO [train.py:1028] (0/2) Epoch 24, batch 1300, loss[loss=0.2001, simple_loss=0.2477, pruned_loss=0.07625, over 12702.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2537, pruned_loss=0.06887, over 2583128.75 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:03:03,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428989.0, ans=0.1 2024-06-21 18:03:15,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=429025.6666666667, ans=0.2 2024-06-21 18:03:21,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=429025.6666666667, ans=0.025 2024-06-21 18:03:23,183 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.95 vs. limit=12.0 2024-06-21 18:03:27,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=429044.0, ans=0.125 2024-06-21 18:03:27,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.46 vs. limit=15.0 2024-06-21 18:03:31,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=429062.3333333333, ans=0.125 2024-06-21 18:03:36,156 INFO [train.py:1028] (0/2) Epoch 24, batch 1350, loss[loss=0.1983, simple_loss=0.2631, pruned_loss=0.06677, over 13214.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.254, pruned_loss=0.06906, over 2585189.98 frames. ], batch size: 59, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:03:37,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=429080.6666666667, ans=0.125 2024-06-21 18:03:39,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. 
limit=6.0 2024-06-21 18:03:40,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=429080.6666666667, ans=0.0 2024-06-21 18:03:43,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=429099.0, ans=0.125 2024-06-21 18:03:55,685 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.126e+02 2.280e+02 2.421e+02 3.286e+02, threshold=4.559e+02, percent-clipped=0.0 2024-06-21 18:04:03,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=429135.6666666667, ans=0.0 2024-06-21 18:04:08,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.77 vs. limit=10.0 2024-06-21 18:04:15,456 INFO [train.py:1028] (0/2) Epoch 24, batch 1400, loss[loss=0.2047, simple_loss=0.2714, pruned_loss=0.06894, over 12507.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2538, pruned_loss=0.06915, over 2585954.01 frames. ], batch size: 25, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:04:18,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=429172.3333333333, ans=0.95 2024-06-21 18:04:25,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=429190.6666666667, ans=0.125 2024-06-21 18:04:28,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=429209.0, ans=10.0 2024-06-21 18:04:34,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2024-06-21 18:04:35,314 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2024-06-21 18:04:39,031 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:04:44,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=429245.6666666667, ans=0.2 2024-06-21 18:04:47,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=429264.0, ans=0.125 2024-06-21 18:04:47,691 INFO [train.py:1028] (0/2) Epoch 24, batch 1450, loss[loss=0.1851, simple_loss=0.2372, pruned_loss=0.06652, over 13101.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.254, pruned_loss=0.06935, over 2586267.57 frames. ], batch size: 121, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:04:50,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=429264.0, ans=0.025 2024-06-21 18:04:52,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=429264.0, ans=10.0 2024-06-21 18:04:56,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429282.3333333333, ans=0.1 2024-06-21 18:05:02,910 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.10 vs. 
limit=15.0 2024-06-21 18:05:04,178 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.150e+02 2.242e+02 2.405e+02 2.827e+02, threshold=4.484e+02, percent-clipped=0.0 2024-06-21 18:05:10,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=429319.0, ans=0.125 2024-06-21 18:05:15,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=429337.3333333333, ans=0.0 2024-06-21 18:05:20,546 INFO [train.py:1028] (0/2) Epoch 24, batch 1500, loss[loss=0.2003, simple_loss=0.2476, pruned_loss=0.07653, over 13231.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2534, pruned_loss=0.06932, over 2588866.28 frames. ], batch size: 83, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:05:20,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=429355.6666666667, ans=0.125 2024-06-21 18:05:26,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=429374.0, ans=0.0 2024-06-21 18:05:45,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=429410.6666666667, ans=0.125 2024-06-21 18:05:45,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.10 vs. limit=12.0 2024-06-21 18:05:52,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=12.0 2024-06-21 18:05:56,850 INFO [train.py:1028] (0/2) Epoch 24, batch 1550, loss[loss=0.1974, simple_loss=0.2514, pruned_loss=0.07174, over 12987.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2537, pruned_loss=0.06965, over 2583494.96 frames. ], batch size: 102, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:06:16,382 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.208e+02 2.302e+02 2.469e+02 3.333e+02, threshold=4.604e+02, percent-clipped=0.0 2024-06-21 18:06:19,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=429502.3333333333, ans=0.125 2024-06-21 18:06:24,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=429502.3333333333, ans=0.125 2024-06-21 18:06:25,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2024-06-21 18:06:32,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2024-06-21 18:06:33,063 INFO [train.py:1028] (0/2) Epoch 24, batch 1600, loss[loss=0.177, simple_loss=0.2412, pruned_loss=0.05634, over 13168.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2541, pruned_loss=0.06972, over 2579218.58 frames. ], batch size: 77, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:06:50,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.69 vs. 
limit=15.0 2024-06-21 18:06:52,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=429594.0, ans=0.125 2024-06-21 18:06:56,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=429594.0, ans=0.0 2024-06-21 18:06:59,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=429612.3333333333, ans=0.025 2024-06-21 18:07:00,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=429612.3333333333, ans=0.1 2024-06-21 18:07:03,337 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2024-06-21 18:07:05,640 INFO [train.py:1028] (0/2) Epoch 24, batch 1650, loss[loss=0.2091, simple_loss=0.255, pruned_loss=0.08162, over 13132.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2538, pruned_loss=0.06961, over 2575701.86 frames. ], batch size: 95, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:07:12,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.50 vs. limit=15.0 2024-06-21 18:07:13,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=429649.0, ans=0.125 2024-06-21 18:07:21,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=429667.3333333333, ans=0.0 2024-06-21 18:07:21,610 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.179e+02 2.299e+02 2.474e+02 3.086e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 18:07:37,665 INFO [train.py:1028] (0/2) Epoch 24, batch 1700, loss[loss=0.1861, simple_loss=0.2499, pruned_loss=0.06114, over 12805.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2538, pruned_loss=0.06904, over 2581491.26 frames. ], batch size: 25, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:07:39,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=429722.3333333333, ans=0.2 2024-06-21 18:07:52,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=429759.0, ans=0.125 2024-06-21 18:08:00,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429759.0, ans=0.1 2024-06-21 18:08:08,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=429795.6666666667, ans=0.0 2024-06-21 18:08:08,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=429795.6666666667, ans=0.2 2024-06-21 18:08:15,271 INFO [train.py:1028] (0/2) Epoch 24, batch 1750, loss[loss=0.2039, simple_loss=0.2734, pruned_loss=0.06726, over 12475.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2544, pruned_loss=0.0692, over 2581941.23 frames. 
], batch size: 22, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:08:17,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=429814.0, ans=0.2 2024-06-21 18:08:23,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429832.3333333333, ans=0.1 2024-06-21 18:08:25,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.71 vs. limit=22.5 2024-06-21 18:08:29,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=429850.6666666667, ans=0.05 2024-06-21 18:08:31,336 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.148e+02 2.317e+02 2.451e+02 4.030e+02, threshold=4.633e+02, percent-clipped=0.0 2024-06-21 18:08:32,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=429850.6666666667, ans=0.95 2024-06-21 18:08:36,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.73 vs. limit=22.5 2024-06-21 18:08:47,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=429905.6666666667, ans=0.0 2024-06-21 18:08:47,381 INFO [train.py:1028] (0/2) Epoch 24, batch 1800, loss[loss=0.195, simple_loss=0.2509, pruned_loss=0.06955, over 13219.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2543, pruned_loss=0.06923, over 2583164.75 frames. ], batch size: 67, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:08:51,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2024-06-21 18:08:59,614 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.54 vs. limit=6.0 2024-06-21 18:09:00,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2024-06-21 18:09:03,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=429942.3333333333, ans=0.025 2024-06-21 18:09:08,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=429960.6666666667, ans=0.125 2024-06-21 18:09:18,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=429979.0, ans=0.125 2024-06-21 18:09:20,259 INFO [train.py:1028] (0/2) Epoch 24, batch 1850, loss[loss=0.2013, simple_loss=0.257, pruned_loss=0.07277, over 13259.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2538, pruned_loss=0.06889, over 2584701.84 frames. 
], batch size: 83, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:09:21,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=429997.3333333333, ans=0.025 2024-06-21 18:09:21,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=429997.3333333333, ans=0.125 2024-06-21 18:09:30,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=430015.6666666667, ans=0.07 2024-06-21 18:09:31,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=430015.6666666667, ans=0.2 2024-06-21 18:09:36,130 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.152e+02 2.261e+02 2.429e+02 2.923e+02, threshold=4.523e+02, percent-clipped=0.0 2024-06-21 18:09:48,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2024-06-21 18:09:51,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=430070.6666666667, ans=0.125 2024-06-21 18:09:52,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2024-06-21 18:09:53,290 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.72 vs. limit=15.0 2024-06-21 18:09:56,669 INFO [train.py:1028] (0/2) Epoch 24, batch 1900, loss[loss=0.1831, simple_loss=0.2404, pruned_loss=0.06289, over 13154.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.253, pruned_loss=0.06875, over 2586627.14 frames. ], batch size: 95, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:09:57,751 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.97 vs. 
limit=6.0 2024-06-21 18:09:58,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=430089.0, ans=0.125 2024-06-21 18:10:04,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=430089.0, ans=0.125 2024-06-21 18:10:06,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=430107.3333333333, ans=0.025 2024-06-21 18:10:06,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=430107.3333333333, ans=0.09899494936611666 2024-06-21 18:10:07,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=430107.3333333333, ans=0.125 2024-06-21 18:10:14,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=430125.6666666667, ans=0.0 2024-06-21 18:10:14,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=430125.6666666667, ans=0.0 2024-06-21 18:10:17,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430125.6666666667, ans=0.1 2024-06-21 18:10:24,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=430144.0, ans=0.125 2024-06-21 18:10:30,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=430162.3333333333, ans=0.125 2024-06-21 18:10:32,061 INFO [train.py:1028] (0/2) Epoch 24, batch 1950, loss[loss=0.1807, simple_loss=0.2453, pruned_loss=0.05808, over 13208.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2524, pruned_loss=0.06877, over 2592644.73 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:10:33,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2024-06-21 18:10:36,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=430180.6666666667, ans=0.0 2024-06-21 18:10:45,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=430217.3333333333, ans=0.2 2024-06-21 18:10:48,240 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.210e+02 2.339e+02 2.462e+02 3.423e+02, threshold=4.679e+02, percent-clipped=0.0 2024-06-21 18:10:53,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=430235.6666666667, ans=0.0 2024-06-21 18:11:01,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=430254.0, ans=0.1 2024-06-21 18:11:04,171 INFO [train.py:1028] (0/2) Epoch 24, batch 2000, loss[loss=0.1946, simple_loss=0.2624, pruned_loss=0.06334, over 12587.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2524, pruned_loss=0.06869, over 2587810.84 frames. 
], batch size: 22, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:11:20,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=430309.0, ans=0.125 2024-06-21 18:11:28,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=430327.3333333333, ans=0.2 2024-06-21 18:11:36,819 INFO [train.py:1028] (0/2) Epoch 24, batch 2050, loss[loss=0.1913, simple_loss=0.2526, pruned_loss=0.06502, over 12493.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2524, pruned_loss=0.06857, over 2581419.43 frames. ], batch size: 29, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:11:42,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.49 vs. limit=15.0 2024-06-21 18:11:45,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430364.0, ans=0.1 2024-06-21 18:11:53,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=430400.6666666667, ans=0.125 2024-06-21 18:11:55,985 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.195e+02 2.324e+02 2.549e+02 3.134e+02, threshold=4.648e+02, percent-clipped=0.0 2024-06-21 18:12:01,872 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.10 vs. limit=15.0 2024-06-21 18:12:07,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=430437.3333333333, ans=0.0 2024-06-21 18:12:07,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=430437.3333333333, ans=0.0 2024-06-21 18:12:14,701 INFO [train.py:1028] (0/2) Epoch 24, batch 2100, loss[loss=0.1773, simple_loss=0.2417, pruned_loss=0.05646, over 13241.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2529, pruned_loss=0.06862, over 2584277.20 frames. ], batch size: 59, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:12:21,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=430474.0, ans=10.0 2024-06-21 18:12:36,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=430510.6666666667, ans=0.0 2024-06-21 18:12:36,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=430510.6666666667, ans=0.035 2024-06-21 18:12:37,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=430510.6666666667, ans=0.2 2024-06-21 18:12:47,438 INFO [train.py:1028] (0/2) Epoch 24, batch 2150, loss[loss=0.1805, simple_loss=0.2457, pruned_loss=0.05765, over 13272.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2527, pruned_loss=0.06833, over 2587117.14 frames. ], batch size: 52, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:12:47,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=430547.3333333333, ans=0.0 2024-06-21 18:12:49,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.66 vs. 
limit=15.0 2024-06-21 18:12:51,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=430547.3333333333, ans=0.125 2024-06-21 18:12:59,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=430584.0, ans=0.125 2024-06-21 18:13:00,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=430584.0, ans=0.0 2024-06-21 18:13:02,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=430584.0, ans=0.125 2024-06-21 18:13:03,663 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.150e+02 2.252e+02 2.394e+02 3.073e+02, threshold=4.504e+02, percent-clipped=0.0 2024-06-21 18:13:07,836 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=430602.3333333333, ans=0.125 2024-06-21 18:13:18,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=430620.6666666667, ans=0.0 2024-06-21 18:13:20,284 INFO [train.py:1028] (0/2) Epoch 24, batch 2200, loss[loss=0.2138, simple_loss=0.2679, pruned_loss=0.07982, over 13194.00 frames. ], tot_loss[loss=0.195, simple_loss=0.253, pruned_loss=0.06854, over 2587615.02 frames. ], batch size: 83, lr: 2.43e-03, grad_scale: 64.0 2024-06-21 18:13:24,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=12.0 2024-06-21 18:13:27,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=430657.3333333333, ans=0.125 2024-06-21 18:13:28,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.90 vs. limit=12.0 2024-06-21 18:13:34,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=430675.6666666667, ans=0.125 2024-06-21 18:13:54,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=430712.3333333333, ans=0.125 2024-06-21 18:13:57,059 INFO [train.py:1028] (0/2) Epoch 24, batch 2250, loss[loss=0.1946, simple_loss=0.2509, pruned_loss=0.06913, over 13286.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2533, pruned_loss=0.06863, over 2586345.30 frames. 
], batch size: 63, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:13:58,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=430730.6666666667, ans=0.0 2024-06-21 18:14:11,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=430749.0, ans=0.1 2024-06-21 18:14:16,316 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.186e+02 2.359e+02 2.509e+02 3.108e+02, threshold=4.718e+02, percent-clipped=0.0 2024-06-21 18:14:23,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=430785.6666666667, ans=0.125 2024-06-21 18:14:30,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=430804.0, ans=0.0 2024-06-21 18:14:33,308 INFO [train.py:1028] (0/2) Epoch 24, batch 2300, loss[loss=0.2068, simple_loss=0.2693, pruned_loss=0.07214, over 12941.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2532, pruned_loss=0.06845, over 2581025.97 frames. ], batch size: 33, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:14:36,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=430822.3333333333, ans=0.125 2024-06-21 18:14:37,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=430822.3333333333, ans=0.125 2024-06-21 18:14:45,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=430840.6666666667, ans=0.125 2024-06-21 18:14:51,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=430859.0, ans=0.125 2024-06-21 18:14:55,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=15.0 2024-06-21 18:15:02,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=430895.6666666667, ans=0.0 2024-06-21 18:15:06,400 INFO [train.py:1028] (0/2) Epoch 24, batch 2350, loss[loss=0.1962, simple_loss=0.2518, pruned_loss=0.07027, over 13251.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2544, pruned_loss=0.06927, over 2584783.57 frames. ], batch size: 67, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:15:13,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.57 vs. limit=15.0 2024-06-21 18:15:17,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.14 vs. limit=22.5 2024-06-21 18:15:22,959 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.173e+02 2.304e+02 2.444e+02 3.454e+02, threshold=4.608e+02, percent-clipped=0.0 2024-06-21 18:15:39,750 INFO [train.py:1028] (0/2) Epoch 24, batch 2400, loss[loss=0.1845, simple_loss=0.2472, pruned_loss=0.06094, over 13351.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2536, pruned_loss=0.06903, over 2587153.53 frames. 
], batch size: 46, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:15:45,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=431005.6666666667, ans=0.025 2024-06-21 18:15:52,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=431024.0, ans=0.04949747468305833 2024-06-21 18:16:00,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=431042.3333333333, ans=0.05 2024-06-21 18:16:04,814 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2024-06-21 18:16:06,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=431060.6666666667, ans=0.125 2024-06-21 18:16:06,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=431060.6666666667, ans=0.0 2024-06-21 18:16:06,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=22.5 2024-06-21 18:16:11,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=431060.6666666667, ans=0.04949747468305833 2024-06-21 18:16:19,189 INFO [train.py:1028] (0/2) Epoch 24, batch 2450, loss[loss=0.1666, simple_loss=0.232, pruned_loss=0.05055, over 13227.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2524, pruned_loss=0.06889, over 2583420.95 frames. ], batch size: 63, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:16:21,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=431097.3333333333, ans=0.125 2024-06-21 18:16:34,873 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.173e+02 2.332e+02 2.539e+02 3.140e+02, threshold=4.664e+02, percent-clipped=0.0 2024-06-21 18:16:44,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=431170.6666666667, ans=0.125 2024-06-21 18:16:48,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=431170.6666666667, ans=0.125 2024-06-21 18:16:49,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=431170.6666666667, ans=0.2 2024-06-21 18:16:51,291 INFO [train.py:1028] (0/2) Epoch 24, batch 2500, loss[loss=0.1948, simple_loss=0.2502, pruned_loss=0.06964, over 13231.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2521, pruned_loss=0.06889, over 2586498.47 frames. ], batch size: 83, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:16:55,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=431189.0, ans=0.0 2024-06-21 18:16:59,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.80 vs. 
limit=12.0 2024-06-21 18:17:04,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=431225.6666666667, ans=0.125 2024-06-21 18:17:21,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=431262.3333333333, ans=0.125 2024-06-21 18:17:22,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=431262.3333333333, ans=0.025 2024-06-21 18:17:23,645 INFO [train.py:1028] (0/2) Epoch 24, batch 2550, loss[loss=0.2067, simple_loss=0.2651, pruned_loss=0.07412, over 12572.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2505, pruned_loss=0.06823, over 2587686.66 frames. ], batch size: 22, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:17:25,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=431280.6666666667, ans=0.0 2024-06-21 18:17:42,295 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.135e+02 2.254e+02 2.392e+02 3.018e+02, threshold=4.509e+02, percent-clipped=0.0 2024-06-21 18:17:57,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=431354.0, ans=0.04949747468305833 2024-06-21 18:18:01,535 INFO [train.py:1028] (0/2) Epoch 24, batch 2600, loss[loss=0.1738, simple_loss=0.2317, pruned_loss=0.05799, over 13284.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2493, pruned_loss=0.06818, over 2586589.58 frames. ], batch size: 52, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:18:13,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=431390.6666666667, ans=0.0 2024-06-21 18:18:13,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=431409.0, ans=0.07 2024-06-21 18:18:33,765 INFO [train.py:1028] (0/2) Epoch 24, batch 2650, loss[loss=0.2059, simple_loss=0.2496, pruned_loss=0.08115, over 13053.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2484, pruned_loss=0.06792, over 2587015.09 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:18:33,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431464.0, ans=0.1 2024-06-21 18:18:49,870 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.198e+02 2.343e+02 2.646e+02 3.229e+02, threshold=4.685e+02, percent-clipped=0.0 2024-06-21 18:18:50,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=431500.6666666667, ans=0.125 2024-06-21 18:19:06,172 INFO [train.py:1028] (0/2) Epoch 24, batch 2700, loss[loss=0.1823, simple_loss=0.2367, pruned_loss=0.0639, over 13225.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2474, pruned_loss=0.0678, over 2584777.12 frames. ], batch size: 89, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:19:28,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.79 vs. 
limit=15.0 2024-06-21 18:19:31,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=431610.6666666667, ans=0.125 2024-06-21 18:19:41,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=431629.0, ans=0.2 2024-06-21 18:19:42,971 INFO [train.py:1028] (0/2) Epoch 24, batch 2750, loss[loss=0.1954, simple_loss=0.2454, pruned_loss=0.07265, over 13297.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2465, pruned_loss=0.06697, over 2582169.58 frames. ], batch size: 43, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:19:59,900 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.175e+02 2.264e+02 2.535e+02 3.560e+02, threshold=4.528e+02, percent-clipped=0.0 2024-06-21 18:20:00,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=431684.0, ans=0.0 2024-06-21 18:20:04,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=431702.3333333333, ans=0.1 2024-06-21 18:20:10,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431720.6666666667, ans=0.1 2024-06-21 18:20:16,516 INFO [train.py:1028] (0/2) Epoch 24, batch 2800, loss[loss=0.1908, simple_loss=0.2416, pruned_loss=0.07005, over 10995.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2465, pruned_loss=0.06729, over 2579969.25 frames. ], batch size: 304, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:20:20,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=431739.0, ans=0.125 2024-06-21 18:20:22,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=431757.3333333333, ans=0.015 2024-06-21 18:20:26,888 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:20:36,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=15.0 2024-06-21 18:20:37,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=431794.0, ans=0.125 2024-06-21 18:20:37,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2024-06-21 18:20:41,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=431812.3333333333, ans=0.125 2024-06-21 18:20:44,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=431812.3333333333, ans=0.125 2024-06-21 18:20:46,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=431812.3333333333, ans=0.0 2024-06-21 18:20:48,787 INFO [train.py:1028] (0/2) Epoch 24, batch 2850, loss[loss=0.1723, simple_loss=0.2355, pruned_loss=0.05458, over 13310.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2458, pruned_loss=0.06731, over 2578449.25 frames. 
], batch size: 49, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:20:58,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=431849.0, ans=0.2 2024-06-21 18:21:05,175 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.139e+02 2.287e+02 2.457e+02 3.072e+02, threshold=4.575e+02, percent-clipped=0.0 2024-06-21 18:21:12,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=431885.6666666667, ans=0.125 2024-06-21 18:21:24,370 INFO [train.py:1028] (0/2) Epoch 24, batch 2900, loss[loss=0.1709, simple_loss=0.2248, pruned_loss=0.05848, over 13163.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2434, pruned_loss=0.06639, over 2587613.05 frames. ], batch size: 55, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:21:28,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=431922.3333333333, ans=0.0 2024-06-21 18:21:28,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=431922.3333333333, ans=0.09899494936611666 2024-06-21 18:21:30,246 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=15.0 2024-06-21 18:21:44,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2024-06-21 18:21:52,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=431977.3333333333, ans=0.125 2024-06-21 18:21:53,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.21 vs. limit=15.0 2024-06-21 18:21:55,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=431995.6666666667, ans=0.125 2024-06-21 18:21:58,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=431995.6666666667, ans=0.0 2024-06-21 18:22:01,223 INFO [train.py:1028] (0/2) Epoch 24, batch 2950, loss[loss=0.1766, simple_loss=0.2309, pruned_loss=0.0612, over 13180.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.243, pruned_loss=0.06608, over 2582598.13 frames. ], batch size: 43, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:22:01,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=432014.0, ans=0.125 2024-06-21 18:22:01,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432014.0, ans=0.1 2024-06-21 18:22:03,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=432014.0, ans=0.125 2024-06-21 18:22:05,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=432014.0, ans=0.035 2024-06-21 18:22:06,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. 
limit=15.0 2024-06-21 18:22:16,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=432050.6666666667, ans=0.0 2024-06-21 18:22:17,824 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.158e+02 2.332e+02 2.540e+02 3.517e+02, threshold=4.663e+02, percent-clipped=0.0 2024-06-21 18:22:20,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2024-06-21 18:22:21,567 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.87 vs. limit=12.0 2024-06-21 18:22:23,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=432069.0, ans=0.0 2024-06-21 18:22:33,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=432087.3333333333, ans=0.95 2024-06-21 18:22:34,095 INFO [train.py:1028] (0/2) Epoch 24, batch 3000, loss[loss=0.1796, simple_loss=0.2424, pruned_loss=0.05845, over 13215.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2415, pruned_loss=0.06551, over 2582393.21 frames. ], batch size: 59, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:22:34,096 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 18:22:39,855 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.5999, 4.2344, 5.1034, 4.9199], device='cuda:0') 2024-06-21 18:22:42,125 INFO [train.py:1060] (0/2) Epoch 24, validation: loss=0.1881, simple_loss=0.2507, pruned_loss=0.0627, over 351949.00 frames. 2024-06-21 18:22:42,126 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 18:22:53,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=432124.0, ans=0.2 2024-06-21 18:22:56,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.18 vs. limit=15.0 2024-06-21 18:23:11,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=432179.0, ans=0.0 2024-06-21 18:23:13,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=432179.0, ans=0.2 2024-06-21 18:23:15,075 INFO [train.py:1028] (0/2) Epoch 24, batch 3050, loss[loss=0.1806, simple_loss=0.2341, pruned_loss=0.06352, over 13321.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2404, pruned_loss=0.06555, over 2581761.00 frames. 
], batch size: 46, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:23:16,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=432197.3333333333, ans=0.2 2024-06-21 18:23:25,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=432215.6666666667, ans=0.125 2024-06-21 18:23:37,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.131e+02 2.236e+02 2.419e+02 2.947e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 18:23:47,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=432270.6666666667, ans=0.125 2024-06-21 18:23:53,595 INFO [train.py:1028] (0/2) Epoch 24, batch 3100, loss[loss=0.1745, simple_loss=0.2291, pruned_loss=0.05994, over 12973.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2391, pruned_loss=0.06478, over 2582541.15 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 128.0 2024-06-21 18:23:53,826 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:23:54,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=432289.0, ans=0.2 2024-06-21 18:24:07,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.47 vs. limit=15.0 2024-06-21 18:24:08,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=432325.6666666667, ans=0.125 2024-06-21 18:24:20,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=432362.3333333333, ans=0.025 2024-06-21 18:24:26,250 INFO [train.py:1028] (0/2) Epoch 24, batch 3150, loss[loss=0.1751, simple_loss=0.2283, pruned_loss=0.06099, over 12929.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2387, pruned_loss=0.06436, over 2584605.13 frames. ], batch size: 158, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:24:32,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=432399.0, ans=0.125 2024-06-21 18:24:39,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.50 vs. 
limit=15.0 2024-06-21 18:24:40,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=432417.3333333333, ans=0.125 2024-06-21 18:24:41,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=432417.3333333333, ans=0.04949747468305833 2024-06-21 18:24:43,122 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.115e+02 2.287e+02 2.443e+02 3.354e+02, threshold=4.574e+02, percent-clipped=0.0 2024-06-21 18:24:45,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=432435.6666666667, ans=0.125 2024-06-21 18:24:45,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=432435.6666666667, ans=0.125 2024-06-21 18:24:58,601 INFO [train.py:1028] (0/2) Epoch 24, batch 3200, loss[loss=0.1668, simple_loss=0.2311, pruned_loss=0.05124, over 13090.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2384, pruned_loss=0.06422, over 2584688.83 frames. ], batch size: 55, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:25:29,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432527.3333333333, ans=0.1 2024-06-21 18:25:30,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=432545.6666666667, ans=0.5 2024-06-21 18:25:37,041 INFO [train.py:1028] (0/2) Epoch 24, batch 3250, loss[loss=0.1682, simple_loss=0.2273, pruned_loss=0.05452, over 13236.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2383, pruned_loss=0.06433, over 2587909.08 frames. ], batch size: 72, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:25:40,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432564.0, ans=0.1 2024-06-21 18:25:40,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=432564.0, ans=0.0 2024-06-21 18:25:55,126 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.158e+02 2.271e+02 2.513e+02 4.932e+02, threshold=4.543e+02, percent-clipped=1.0 2024-06-21 18:26:01,803 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:26:06,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=432637.3333333333, ans=0.2 2024-06-21 18:26:10,912 INFO [train.py:1028] (0/2) Epoch 24, batch 3300, loss[loss=0.2083, simple_loss=0.2574, pruned_loss=0.07956, over 12721.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2378, pruned_loss=0.06413, over 2583380.73 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 64.0 2024-06-21 18:26:14,070 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-236000.pt 2024-06-21 18:26:21,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432674.0, ans=0.1 2024-06-21 18:26:25,030 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:26:25,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. 
limit=6.0 2024-06-21 18:26:28,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432692.3333333333, ans=0.1 2024-06-21 18:26:37,086 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2024-06-21 18:26:40,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=432710.6666666667, ans=0.125 2024-06-21 18:26:48,249 INFO [train.py:1028] (0/2) Epoch 24, batch 3350, loss[loss=0.1759, simple_loss=0.2218, pruned_loss=0.06494, over 12879.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2376, pruned_loss=0.06435, over 2577985.58 frames. ], batch size: 158, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:26:53,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=432747.3333333333, ans=0.125 2024-06-21 18:26:58,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432765.6666666667, ans=0.1 2024-06-21 18:26:58,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432765.6666666667, ans=0.1 2024-06-21 18:27:01,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.42 vs. limit=10.0 2024-06-21 18:27:05,795 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.187e+02 2.430e+02 2.629e+02 3.288e+02, threshold=4.860e+02, percent-clipped=0.0 2024-06-21 18:27:22,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=432820.6666666667, ans=0.09899494936611666 2024-06-21 18:27:22,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432820.6666666667, ans=0.1 2024-06-21 18:27:23,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.77 vs. limit=22.5 2024-06-21 18:27:23,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=432820.6666666667, ans=0.0 2024-06-21 18:27:27,255 INFO [train.py:1028] (0/2) Epoch 24, batch 3400, loss[loss=0.195, simple_loss=0.2489, pruned_loss=0.07058, over 12648.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2378, pruned_loss=0.06509, over 2576031.91 frames. ], batch size: 22, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:27:29,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432839.0, ans=0.1 2024-06-21 18:27:41,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.41 vs. 
limit=15.0 2024-06-21 18:27:50,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432894.0, ans=0.1 2024-06-21 18:27:58,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=432912.3333333333, ans=0.0 2024-06-21 18:27:58,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=432912.3333333333, ans=0.07 2024-06-21 18:28:00,667 INFO [train.py:1028] (0/2) Epoch 24, batch 3450, loss[loss=0.2016, simple_loss=0.2506, pruned_loss=0.07634, over 12684.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.237, pruned_loss=0.06465, over 2576994.22 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:28:17,890 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.120e+02 2.236e+02 2.448e+02 3.660e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 18:28:23,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=432985.6666666667, ans=0.0 2024-06-21 18:28:31,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=433004.0, ans=0.0 2024-06-21 18:28:33,092 INFO [train.py:1028] (0/2) Epoch 24, batch 3500, loss[loss=0.1968, simple_loss=0.2481, pruned_loss=0.07273, over 12943.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2375, pruned_loss=0.06482, over 2576760.42 frames. ], batch size: 33, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:28:35,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=433022.3333333333, ans=0.2 2024-06-21 18:28:47,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=433059.0, ans=0.125 2024-06-21 18:28:51,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=22.5 2024-06-21 18:28:52,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=433077.3333333333, ans=0.0 2024-06-21 18:28:53,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.35 vs. limit=10.0 2024-06-21 18:29:05,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.54 vs. limit=10.0 2024-06-21 18:29:06,619 INFO [train.py:1028] (0/2) Epoch 24, batch 3550, loss[loss=0.1682, simple_loss=0.2259, pruned_loss=0.05526, over 13201.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2373, pruned_loss=0.06451, over 2578300.95 frames. 
], batch size: 95, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:29:06,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=433114.0, ans=0.0 2024-06-21 18:29:12,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=433132.3333333333, ans=0.125 2024-06-21 18:29:30,597 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.095e+02 2.211e+02 2.403e+02 3.107e+02, threshold=4.422e+02, percent-clipped=0.0 2024-06-21 18:29:42,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=433187.3333333333, ans=0.125 2024-06-21 18:29:45,818 INFO [train.py:1028] (0/2) Epoch 24, batch 3600, loss[loss=0.1724, simple_loss=0.2306, pruned_loss=0.05708, over 13230.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2366, pruned_loss=0.06448, over 2580818.40 frames. ], batch size: 49, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:29:48,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=15.0 2024-06-21 18:29:49,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=433205.6666666667, ans=0.0 2024-06-21 18:29:53,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.39 vs. limit=22.5 2024-06-21 18:29:59,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=433242.3333333333, ans=0.2 2024-06-21 18:30:14,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=433279.0, ans=0.125 2024-06-21 18:30:15,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2024-06-21 18:30:19,163 INFO [train.py:1028] (0/2) Epoch 24, batch 3650, loss[loss=0.1871, simple_loss=0.2355, pruned_loss=0.06931, over 13044.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2359, pruned_loss=0.06396, over 2580800.04 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:30:29,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2024-06-21 18:30:37,309 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.089e+02 2.208e+02 2.378e+02 3.146e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 18:30:49,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=433370.6666666667, ans=0.0 2024-06-21 18:30:52,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.84 vs. limit=15.0 2024-06-21 18:30:53,730 INFO [train.py:1028] (0/2) Epoch 24, batch 3700, loss[loss=0.1865, simple_loss=0.2437, pruned_loss=0.06471, over 13192.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2347, pruned_loss=0.06338, over 2585476.48 frames. 
], batch size: 72, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:31:17,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=433444.0, ans=0.125 2024-06-21 18:31:33,425 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.55 vs. limit=15.0 2024-06-21 18:31:35,104 INFO [train.py:1028] (0/2) Epoch 24, batch 3750, loss[loss=0.1816, simple_loss=0.2381, pruned_loss=0.06256, over 12603.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2344, pruned_loss=0.06328, over 2587188.31 frames. ], batch size: 22, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:31:40,454 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.549e+01 2024-06-21 18:31:43,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=433499.0, ans=0.125 2024-06-21 18:31:47,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=433517.3333333333, ans=0.07 2024-06-21 18:31:50,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433517.3333333333, ans=0.1 2024-06-21 18:31:52,426 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.118e+02 2.251e+02 2.479e+02 3.160e+02, threshold=4.502e+02, percent-clipped=0.0 2024-06-21 18:31:53,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=433517.3333333333, ans=0.0 2024-06-21 18:31:59,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=433535.6666666667, ans=0.0 2024-06-21 18:32:08,323 INFO [train.py:1028] (0/2) Epoch 24, batch 3800, loss[loss=0.1904, simple_loss=0.2372, pruned_loss=0.07184, over 13186.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.2343, pruned_loss=0.06316, over 2585039.94 frames. ], batch size: 83, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:32:19,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=433590.6666666667, ans=0.025 2024-06-21 18:32:22,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=433609.0, ans=0.025 2024-06-21 18:32:24,525 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.07 vs. limit=22.5 2024-06-21 18:32:41,392 INFO [train.py:1028] (0/2) Epoch 24, batch 3850, loss[loss=0.1759, simple_loss=0.2245, pruned_loss=0.06363, over 13056.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.2344, pruned_loss=0.06309, over 2584012.52 frames. 
], batch size: 144, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:32:47,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=433682.3333333333, ans=0.125 2024-06-21 18:32:51,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=433682.3333333333, ans=0.2 2024-06-21 18:32:59,054 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.108e+02 2.245e+02 2.477e+02 3.503e+02, threshold=4.489e+02, percent-clipped=0.0 2024-06-21 18:32:59,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=433700.6666666667, ans=0.125 2024-06-21 18:33:01,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=433719.0, ans=0.125 2024-06-21 18:33:09,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.56 vs. limit=10.0 2024-06-21 18:33:14,006 INFO [train.py:1028] (0/2) Epoch 24, batch 3900, loss[loss=0.1775, simple_loss=0.23, pruned_loss=0.06249, over 13231.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2337, pruned_loss=0.06281, over 2587603.35 frames. ], batch size: 83, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:33:20,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=433774.0, ans=0.0 2024-06-21 18:33:20,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=433774.0, ans=0.125 2024-06-21 18:33:27,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=433792.3333333333, ans=0.0 2024-06-21 18:33:33,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.75 vs. limit=15.0 2024-06-21 18:33:40,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433810.6666666667, ans=0.1 2024-06-21 18:33:54,688 INFO [train.py:1028] (0/2) Epoch 24, batch 3950, loss[loss=0.1774, simple_loss=0.2256, pruned_loss=0.06463, over 13140.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2332, pruned_loss=0.06241, over 2590018.72 frames. ], batch size: 132, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:33:55,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=433847.3333333333, ans=0.0 2024-06-21 18:34:01,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=433865.6666666667, ans=0.125 2024-06-21 18:34:12,582 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.067e+02 2.149e+02 2.262e+02 2.872e+02, threshold=4.297e+02, percent-clipped=0.0 2024-06-21 18:34:13,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=433884.0, ans=0.125 2024-06-21 18:34:28,135 INFO [train.py:1028] (0/2) Epoch 24, batch 4000, loss[loss=0.181, simple_loss=0.2399, pruned_loss=0.06109, over 12983.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2331, pruned_loss=0.06254, over 2583750.93 frames. 
], batch size: 39, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:34:28,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=433939.0, ans=0.125 2024-06-21 18:34:29,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=433939.0, ans=0.125 2024-06-21 18:34:33,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=433939.0, ans=0.0 2024-06-21 18:34:35,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=433957.3333333333, ans=0.125 2024-06-21 18:34:58,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=434012.3333333333, ans=0.125 2024-06-21 18:35:00,941 INFO [train.py:1028] (0/2) Epoch 24, batch 4050, loss[loss=0.2001, simple_loss=0.2417, pruned_loss=0.07922, over 11166.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2332, pruned_loss=0.06269, over 2581375.52 frames. ], batch size: 304, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:35:06,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=434030.6666666667, ans=0.0 2024-06-21 18:35:14,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=434067.3333333333, ans=0.125 2024-06-21 18:35:14,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434067.3333333333, ans=0.1 2024-06-21 18:35:18,305 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.072e+02 2.235e+02 2.378e+02 3.027e+02, threshold=4.470e+02, percent-clipped=0.0 2024-06-21 18:35:31,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=434104.0, ans=0.125 2024-06-21 18:35:33,634 INFO [train.py:1028] (0/2) Epoch 24, batch 4100, loss[loss=0.1979, simple_loss=0.2387, pruned_loss=0.07855, over 13044.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2331, pruned_loss=0.06299, over 2577934.18 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:35:48,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=434140.6666666667, ans=0.125 2024-06-21 18:35:49,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=434140.6666666667, ans=15.0 2024-06-21 18:36:02,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=434177.3333333333, ans=0.125 2024-06-21 18:36:04,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=434177.3333333333, ans=0.2 2024-06-21 18:36:06,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=434195.6666666667, ans=0.0 2024-06-21 18:36:13,404 INFO [train.py:1028] (0/2) Epoch 24, batch 4150, loss[loss=0.1809, simple_loss=0.2337, pruned_loss=0.06404, over 13152.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2325, pruned_loss=0.06252, over 2575720.20 frames. 
], batch size: 55, lr: 2.42e-03, grad_scale: 32.0 2024-06-21 18:36:25,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=434232.3333333333, ans=0.125 2024-06-21 18:36:29,239 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=12.0 2024-06-21 18:36:31,297 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.063e+02 2.268e+02 2.474e+02 3.628e+02, threshold=4.536e+02, percent-clipped=0.0 2024-06-21 18:36:34,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=434269.0, ans=0.0 2024-06-21 18:36:35,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=434269.0, ans=0.2 2024-06-21 18:36:46,587 INFO [train.py:1028] (0/2) Epoch 24, batch 4200, loss[loss=0.1861, simple_loss=0.2352, pruned_loss=0.06849, over 13038.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2316, pruned_loss=0.06237, over 2578869.58 frames. ], batch size: 102, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:36:46,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=434305.6666666667, ans=0.0 2024-06-21 18:37:11,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=434379.0, ans=0.2 2024-06-21 18:37:17,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=434379.0, ans=0.0 2024-06-21 18:37:18,419 INFO [train.py:1028] (0/2) Epoch 24, batch 4250, loss[loss=0.1496, simple_loss=0.2071, pruned_loss=0.04607, over 13333.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2309, pruned_loss=0.06176, over 2581217.58 frames. ], batch size: 46, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:37:26,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=434415.6666666667, ans=0.0 2024-06-21 18:37:39,497 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 2.073e+02 2.207e+02 2.347e+02 4.161e+02, threshold=4.413e+02, percent-clipped=0.0 2024-06-21 18:37:48,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=434452.3333333333, ans=0.125 2024-06-21 18:37:58,259 INFO [train.py:1028] (0/2) Epoch 24, batch 4300, loss[loss=0.1983, simple_loss=0.2535, pruned_loss=0.07155, over 13205.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2308, pruned_loss=0.06178, over 2581577.87 frames. ], batch size: 59, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:37:59,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=434489.0, ans=0.125 2024-06-21 18:38:11,571 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:38:30,245 INFO [train.py:1028] (0/2) Epoch 24, batch 4350, loss[loss=0.1816, simple_loss=0.2371, pruned_loss=0.06307, over 13201.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2309, pruned_loss=0.06206, over 2585844.51 frames. 
], batch size: 59, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:38:36,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=434599.0, ans=0.125 2024-06-21 18:38:44,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=434617.3333333333, ans=0.2 2024-06-21 18:38:47,877 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.086e+02 2.187e+02 2.344e+02 2.916e+02, threshold=4.373e+02, percent-clipped=0.0 2024-06-21 18:38:52,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=434635.6666666667, ans=0.0 2024-06-21 18:39:00,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=434654.0, ans=0.0 2024-06-21 18:39:02,909 INFO [train.py:1028] (0/2) Epoch 24, batch 4400, loss[loss=0.1776, simple_loss=0.2269, pruned_loss=0.06416, over 13251.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2305, pruned_loss=0.06197, over 2585312.79 frames. ], batch size: 83, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:39:24,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=434727.3333333333, ans=0.0 2024-06-21 18:39:38,445 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2024-06-21 18:39:42,759 INFO [train.py:1028] (0/2) Epoch 24, batch 4450, loss[loss=0.1805, simple_loss=0.2374, pruned_loss=0.06174, over 12995.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2308, pruned_loss=0.06228, over 2580489.91 frames. ], batch size: 33, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:39:46,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=434764.0, ans=0.0 2024-06-21 18:39:47,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=434764.0, ans=0.07 2024-06-21 18:39:58,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=434800.6666666667, ans=0.2 2024-06-21 18:40:00,006 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.054e+02 2.175e+02 2.325e+02 3.144e+02, threshold=4.351e+02, percent-clipped=0.0 2024-06-21 18:40:02,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.51 vs. limit=15.0 2024-06-21 18:40:14,920 INFO [train.py:1028] (0/2) Epoch 24, batch 4500, loss[loss=0.1828, simple_loss=0.2363, pruned_loss=0.06467, over 13273.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2301, pruned_loss=0.06179, over 2584708.76 frames. 
], batch size: 89, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:40:15,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=434855.6666666667, ans=0.125 2024-06-21 18:40:16,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=434855.6666666667, ans=0.0 2024-06-21 18:40:22,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=434874.0, ans=0.0 2024-06-21 18:40:31,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434892.3333333333, ans=0.1 2024-06-21 18:40:34,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.24 vs. limit=10.0 2024-06-21 18:40:43,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=434929.0, ans=0.1 2024-06-21 18:40:46,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=434929.0, ans=0.125 2024-06-21 18:40:48,046 INFO [train.py:1028] (0/2) Epoch 24, batch 4550, loss[loss=0.1652, simple_loss=0.2232, pruned_loss=0.0536, over 13239.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2303, pruned_loss=0.06183, over 2587985.07 frames. ], batch size: 52, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:40:52,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434947.3333333333, ans=0.1 2024-06-21 18:40:59,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=434965.6666666667, ans=0.0 2024-06-21 18:41:01,320 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:41:05,843 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.803e+02 2.057e+02 2.156e+02 2.362e+02 3.371e+02, threshold=4.313e+02, percent-clipped=0.0 2024-06-21 18:41:08,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=435002.3333333333, ans=0.125 2024-06-21 18:41:11,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=435002.3333333333, ans=0.125 2024-06-21 18:41:11,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=435002.3333333333, ans=0.0 2024-06-21 18:41:16,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=435020.6666666667, ans=0.125 2024-06-21 18:41:17,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.13 vs. limit=10.0 2024-06-21 18:41:21,344 INFO [train.py:1028] (0/2) Epoch 24, batch 4600, loss[loss=0.1862, simple_loss=0.2377, pruned_loss=0.06736, over 12547.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2305, pruned_loss=0.06174, over 2582529.48 frames. 
], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:41:41,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=435075.6666666667, ans=0.125 2024-06-21 18:41:50,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=435094.0, ans=15.0 2024-06-21 18:41:52,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=435094.0, ans=0.125 2024-06-21 18:41:59,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435112.3333333333, ans=0.1 2024-06-21 18:42:00,482 INFO [train.py:1028] (0/2) Epoch 24, batch 4650, loss[loss=0.1763, simple_loss=0.221, pruned_loss=0.06582, over 13100.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2302, pruned_loss=0.06188, over 2585979.10 frames. ], batch size: 132, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:42:05,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=435130.6666666667, ans=0.125 2024-06-21 18:42:12,280 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2024-06-21 18:42:18,512 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.077e+02 2.210e+02 2.490e+02 3.095e+02, threshold=4.419e+02, percent-clipped=0.0 2024-06-21 18:42:28,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=435204.0, ans=0.125 2024-06-21 18:42:34,293 INFO [train.py:1028] (0/2) Epoch 24, batch 4700, loss[loss=0.1659, simple_loss=0.2223, pruned_loss=0.05475, over 12443.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2302, pruned_loss=0.06219, over 2581292.70 frames. ], batch size: 25, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:42:46,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=435240.6666666667, ans=0.125 2024-06-21 18:42:57,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=22.5 2024-06-21 18:43:02,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=435295.6666666667, ans=0.2 2024-06-21 18:43:03,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435295.6666666667, ans=0.0 2024-06-21 18:43:04,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435295.6666666667, ans=0.1 2024-06-21 18:43:07,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=435314.0, ans=0.95 2024-06-21 18:43:07,912 INFO [train.py:1028] (0/2) Epoch 24, batch 4750, loss[loss=0.1751, simple_loss=0.2251, pruned_loss=0.06257, over 12556.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2299, pruned_loss=0.06215, over 2577850.86 frames. 
], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:43:20,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=435350.6666666667, ans=0.1 2024-06-21 18:43:23,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435350.6666666667, ans=0.1 2024-06-21 18:43:25,304 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.807e+02 2.149e+02 2.272e+02 2.512e+02 3.513e+02, threshold=4.544e+02, percent-clipped=0.0 2024-06-21 18:43:25,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=435350.6666666667, ans=0.125 2024-06-21 18:43:30,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=435369.0, ans=0.0 2024-06-21 18:43:31,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=435369.0, ans=0.0 2024-06-21 18:43:39,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=435387.3333333333, ans=0.125 2024-06-21 18:43:44,254 INFO [train.py:1028] (0/2) Epoch 24, batch 4800, loss[loss=0.1686, simple_loss=0.2284, pruned_loss=0.05443, over 13278.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2295, pruned_loss=0.06208, over 2574834.37 frames. ], batch size: 63, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:43:44,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=435405.6666666667, ans=0.125 2024-06-21 18:43:57,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435424.0, ans=0.1 2024-06-21 18:43:59,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=435424.0, ans=0.125 2024-06-21 18:44:17,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=15.0 2024-06-21 18:44:17,776 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2024-06-21 18:44:20,657 INFO [train.py:1028] (0/2) Epoch 24, batch 4850, loss[loss=0.1923, simple_loss=0.2408, pruned_loss=0.0719, over 13268.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2288, pruned_loss=0.0616, over 2573252.69 frames. 
], batch size: 89, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:44:21,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435497.3333333333, ans=0.1 2024-06-21 18:44:21,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=435497.3333333333, ans=0.125 2024-06-21 18:44:38,674 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.043e+02 2.175e+02 2.367e+02 3.258e+02, threshold=4.350e+02, percent-clipped=0.0 2024-06-21 18:44:39,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=435534.0, ans=0.1 2024-06-21 18:44:46,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435552.3333333333, ans=0.1 2024-06-21 18:44:54,757 INFO [train.py:1028] (0/2) Epoch 24, batch 4900, loss[loss=0.1706, simple_loss=0.2263, pruned_loss=0.05744, over 13214.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2293, pruned_loss=0.06186, over 2574759.21 frames. ], batch size: 59, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:45:02,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.53 vs. limit=15.0 2024-06-21 18:45:10,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=435625.6666666667, ans=0.125 2024-06-21 18:45:11,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=435625.6666666667, ans=0.025 2024-06-21 18:45:21,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=435662.3333333333, ans=0.1 2024-06-21 18:45:31,457 INFO [train.py:1028] (0/2) Epoch 24, batch 4950, loss[loss=0.1833, simple_loss=0.2261, pruned_loss=0.07022, over 11249.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2297, pruned_loss=0.06237, over 2569366.79 frames. ], batch size: 304, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:45:47,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=435717.3333333333, ans=0.2 2024-06-21 18:45:51,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=435717.3333333333, ans=0.125 2024-06-21 18:45:51,919 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.109e+02 2.238e+02 2.396e+02 2.990e+02, threshold=4.476e+02, percent-clipped=0.0 2024-06-21 18:45:53,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0 2024-06-21 18:46:03,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=435754.0, ans=0.04949747468305833 2024-06-21 18:46:03,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=435754.0, ans=0.0 2024-06-21 18:46:06,972 INFO [train.py:1028] (0/2) Epoch 24, batch 5000, loss[loss=0.1827, simple_loss=0.2357, pruned_loss=0.06488, over 13200.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2295, pruned_loss=0.06207, over 2573395.03 frames. 
], batch size: 95, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:46:07,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=435772.3333333333, ans=0.09899494936611666 2024-06-21 18:46:07,330 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2024-06-21 18:46:09,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=8.0 2024-06-21 18:46:17,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=435790.6666666667, ans=0.0 2024-06-21 18:46:21,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=435809.0, ans=0.125 2024-06-21 18:46:31,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=435827.3333333333, ans=0.125 2024-06-21 18:46:33,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=435845.6666666667, ans=0.125 2024-06-21 18:46:34,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.60 vs. limit=22.5 2024-06-21 18:46:37,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=435845.6666666667, ans=0.0 2024-06-21 18:46:38,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=435845.6666666667, ans=0.125 2024-06-21 18:46:39,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435845.6666666667, ans=0.1 2024-06-21 18:46:40,182 INFO [train.py:1028] (0/2) Epoch 24, batch 5050, loss[loss=0.1638, simple_loss=0.225, pruned_loss=0.05137, over 12930.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2299, pruned_loss=0.06223, over 2571889.96 frames. ], batch size: 36, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:46:43,563 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:46:47,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435882.3333333333, ans=0.0 2024-06-21 18:46:57,532 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.127e+02 2.254e+02 2.556e+02 3.101e+02, threshold=4.507e+02, percent-clipped=0.0 2024-06-21 18:47:00,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=435919.0, ans=0.125 2024-06-21 18:47:03,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=12.0 2024-06-21 18:47:04,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=435919.0, ans=0.95 2024-06-21 18:47:07,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.67 vs. 
limit=15.0 2024-06-21 18:47:12,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=435955.6666666667, ans=0.125 2024-06-21 18:47:12,806 INFO [train.py:1028] (0/2) Epoch 24, batch 5100, loss[loss=0.1798, simple_loss=0.2355, pruned_loss=0.06203, over 12830.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2297, pruned_loss=0.0623, over 2567783.20 frames. ], batch size: 39, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:47:33,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=435992.3333333333, ans=0.2 2024-06-21 18:47:44,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=436010.6666666667, ans=0.025 2024-06-21 18:47:50,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=436029.0, ans=0.125 2024-06-21 18:47:53,310 INFO [train.py:1028] (0/2) Epoch 24, batch 5150, loss[loss=0.1695, simple_loss=0.2219, pruned_loss=0.05853, over 13132.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2297, pruned_loss=0.06264, over 2570107.85 frames. ], batch size: 132, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:48:01,368 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:48:03,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=436065.6666666667, ans=0.125 2024-06-21 18:48:07,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=436084.0, ans=0.125 2024-06-21 18:48:11,130 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.126e+02 2.304e+02 2.475e+02 3.617e+02, threshold=4.607e+02, percent-clipped=0.0 2024-06-21 18:48:17,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=436102.3333333333, ans=10.0 2024-06-21 18:48:21,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.85 vs. limit=6.0 2024-06-21 18:48:26,721 INFO [train.py:1028] (0/2) Epoch 24, batch 5200, loss[loss=0.1857, simple_loss=0.2339, pruned_loss=0.06877, over 13157.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2299, pruned_loss=0.06265, over 2573471.48 frames. ], batch size: 95, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:48:28,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=436139.0, ans=0.2 2024-06-21 18:48:51,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=436194.0, ans=0.0 2024-06-21 18:48:58,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=436212.3333333333, ans=0.0 2024-06-21 18:49:00,073 INFO [train.py:1028] (0/2) Epoch 24, batch 5250, loss[loss=0.1757, simple_loss=0.2266, pruned_loss=0.06243, over 13211.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2298, pruned_loss=0.06259, over 2571490.37 frames. 
], batch size: 52, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:49:04,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=436230.6666666667, ans=0.0 2024-06-21 18:49:11,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2024-06-21 18:49:13,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0 2024-06-21 18:49:17,996 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.067e+02 2.139e+02 2.296e+02 2.879e+02, threshold=4.278e+02, percent-clipped=0.0 2024-06-21 18:49:18,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=436267.3333333333, ans=0.04949747468305833 2024-06-21 18:49:36,740 INFO [train.py:1028] (0/2) Epoch 24, batch 5300, loss[loss=0.1836, simple_loss=0.2386, pruned_loss=0.06429, over 13060.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2297, pruned_loss=0.06225, over 2568552.15 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:49:36,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=436322.3333333333, ans=0.0 2024-06-21 18:49:38,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436322.3333333333, ans=0.1 2024-06-21 18:49:48,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=436340.6666666667, ans=0.125 2024-06-21 18:50:02,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=436377.3333333333, ans=0.5 2024-06-21 18:50:04,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.60 vs. limit=10.0 2024-06-21 18:50:08,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=436395.6666666667, ans=0.125 2024-06-21 18:50:10,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=436395.6666666667, ans=0.2 2024-06-21 18:50:15,224 INFO [train.py:1028] (0/2) Epoch 24, batch 5350, loss[loss=0.1807, simple_loss=0.2488, pruned_loss=0.05626, over 11285.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2301, pruned_loss=0.06238, over 2573898.62 frames. ], batch size: 16, lr: 2.41e-03, grad_scale: 64.0 2024-06-21 18:50:18,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=436414.0, ans=0.0 2024-06-21 18:50:22,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.79 vs. 
limit=15.0 2024-06-21 18:50:32,900 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.071e+02 2.179e+02 2.326e+02 2.934e+02, threshold=4.357e+02, percent-clipped=0.0 2024-06-21 18:50:38,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436469.0, ans=0.1 2024-06-21 18:50:38,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=436469.0, ans=0.0 2024-06-21 18:50:45,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=436487.3333333333, ans=0.0 2024-06-21 18:50:47,622 INFO [train.py:1028] (0/2) Epoch 24, batch 5400, loss[loss=0.1974, simple_loss=0.238, pruned_loss=0.07838, over 12184.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2303, pruned_loss=0.06287, over 2566337.75 frames. ], batch size: 240, lr: 2.41e-03, grad_scale: 64.0 2024-06-21 18:50:54,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=436524.0, ans=0.0 2024-06-21 18:50:55,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=436524.0, ans=0.125 2024-06-21 18:51:02,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436542.3333333333, ans=0.1 2024-06-21 18:51:23,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=436579.0, ans=0.0 2024-06-21 18:51:25,426 INFO [train.py:1028] (0/2) Epoch 24, batch 5450, loss[loss=0.184, simple_loss=0.2425, pruned_loss=0.06268, over 12336.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2301, pruned_loss=0.06255, over 2570211.99 frames. ], batch size: 25, lr: 2.41e-03, grad_scale: 64.0 2024-06-21 18:51:37,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=436615.6666666667, ans=0.125 2024-06-21 18:51:45,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=436634.0, ans=0.125 2024-06-21 18:51:46,571 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.073e+02 2.187e+02 2.390e+02 3.210e+02, threshold=4.375e+02, percent-clipped=0.0 2024-06-21 18:51:55,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=436670.6666666667, ans=0.0 2024-06-21 18:51:58,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=436670.6666666667, ans=0.0 2024-06-21 18:52:00,176 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.88 vs. limit=22.5 2024-06-21 18:52:01,742 INFO [train.py:1028] (0/2) Epoch 24, batch 5500, loss[loss=0.2178, simple_loss=0.2503, pruned_loss=0.09266, over 12169.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2296, pruned_loss=0.06212, over 2562858.46 frames. 
], batch size: 240, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:52:08,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=436707.3333333333, ans=0.125 2024-06-21 18:52:19,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.26 vs. limit=10.0 2024-06-21 18:52:35,163 INFO [train.py:1028] (0/2) Epoch 24, batch 5550, loss[loss=0.1735, simple_loss=0.2326, pruned_loss=0.0572, over 13283.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2292, pruned_loss=0.0619, over 2567409.92 frames. ], batch size: 43, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:52:35,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436780.6666666667, ans=0.1 2024-06-21 18:52:36,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=436780.6666666667, ans=0.125 2024-06-21 18:52:44,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=436799.0, ans=0.0 2024-06-21 18:52:45,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436799.0, ans=0.1 2024-06-21 18:52:51,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=436817.3333333333, ans=0.2 2024-06-21 18:52:53,316 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.068e+02 2.190e+02 2.424e+02 3.307e+02, threshold=4.379e+02, percent-clipped=0.0 2024-06-21 18:53:01,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=436854.0, ans=0.125 2024-06-21 18:53:07,634 INFO [train.py:1028] (0/2) Epoch 24, batch 5600, loss[loss=0.1592, simple_loss=0.2121, pruned_loss=0.05313, over 13177.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2286, pruned_loss=0.06182, over 2570497.98 frames. ], batch size: 89, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:53:15,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=436890.6666666667, ans=0.125 2024-06-21 18:53:20,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=436909.0, ans=0.125 2024-06-21 18:53:49,589 INFO [train.py:1028] (0/2) Epoch 24, batch 5650, loss[loss=0.1797, simple_loss=0.2295, pruned_loss=0.06497, over 12561.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2288, pruned_loss=0.06147, over 2576628.70 frames. 
], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:53:50,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436964.0, ans=0.1 2024-06-21 18:53:54,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=436964.0, ans=0.125 2024-06-21 18:53:58,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=436982.3333333333, ans=10.0 2024-06-21 18:54:07,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=437000.6666666667, ans=0.125 2024-06-21 18:54:08,346 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.103e+02 2.231e+02 2.388e+02 2.995e+02, threshold=4.462e+02, percent-clipped=0.0 2024-06-21 18:54:17,558 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:54:18,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2024-06-21 18:54:22,668 INFO [train.py:1028] (0/2) Epoch 24, batch 5700, loss[loss=0.1622, simple_loss=0.2214, pruned_loss=0.0515, over 13256.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.229, pruned_loss=0.06171, over 2580245.85 frames. ], batch size: 63, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:54:24,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=437055.6666666667, ans=0.95 2024-06-21 18:54:27,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=437055.6666666667, ans=0.125 2024-06-21 18:54:37,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=437092.3333333333, ans=0.05 2024-06-21 18:54:40,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=437092.3333333333, ans=0.2 2024-06-21 18:54:45,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=437110.6666666667, ans=0.0 2024-06-21 18:54:46,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=437110.6666666667, ans=22.5 2024-06-21 18:54:49,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=437129.0, ans=0.125 2024-06-21 18:54:49,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=437129.0, ans=0.09899494936611666 2024-06-21 18:54:54,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=437129.0, ans=0.125 2024-06-21 18:54:55,711 INFO [train.py:1028] (0/2) Epoch 24, batch 5750, loss[loss=0.2066, simple_loss=0.2566, pruned_loss=0.07831, over 12741.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2298, pruned_loss=0.06207, over 2580918.85 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:54:59,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.69 vs. limit=22.5 2024-06-21 18:55:02,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=12.0 2024-06-21 18:55:05,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437165.6666666667, ans=0.1 2024-06-21 18:55:13,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=437184.0, ans=0.95 2024-06-21 18:55:14,456 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.072e+02 2.176e+02 2.316e+02 2.997e+02, threshold=4.353e+02, percent-clipped=0.0 2024-06-21 18:55:26,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=437220.6666666667, ans=0.125 2024-06-21 18:55:32,061 INFO [train.py:1028] (0/2) Epoch 24, batch 5800, loss[loss=0.1785, simple_loss=0.2283, pruned_loss=0.06439, over 12705.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2313, pruned_loss=0.06312, over 2579712.64 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:55:33,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=437239.0, ans=0.0 2024-06-21 18:55:37,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.31 vs. limit=22.5 2024-06-21 18:55:40,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=437239.0, ans=0.125 2024-06-21 18:55:41,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=437257.3333333333, ans=0.125 2024-06-21 18:55:41,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.80 vs. limit=15.0 2024-06-21 18:55:42,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=437257.3333333333, ans=0.0 2024-06-21 18:55:43,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.15 vs. limit=15.0 2024-06-21 18:55:47,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=437257.3333333333, ans=0.04949747468305833 2024-06-21 18:56:00,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=437294.0, ans=0.0 2024-06-21 18:56:01,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=437312.3333333333, ans=0.125 2024-06-21 18:56:08,262 INFO [train.py:1028] (0/2) Epoch 24, batch 5850, loss[loss=0.2141, simple_loss=0.259, pruned_loss=0.08462, over 12553.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.2335, pruned_loss=0.06398, over 2578569.03 frames. 
], batch size: 202, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:56:11,889 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=15.0 2024-06-21 18:56:15,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=437349.0, ans=0.2 2024-06-21 18:56:17,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.74 vs. limit=15.0 2024-06-21 18:56:17,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.80 vs. limit=15.0 2024-06-21 18:56:20,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=437349.0, ans=0.125 2024-06-21 18:56:26,887 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.176e+02 2.343e+02 2.576e+02 3.487e+02, threshold=4.686e+02, percent-clipped=0.0 2024-06-21 18:56:27,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=437367.3333333333, ans=15.0 2024-06-21 18:56:34,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=437404.0, ans=0.5 2024-06-21 18:56:35,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=437404.0, ans=0.0 2024-06-21 18:56:36,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=437404.0, ans=0.125 2024-06-21 18:56:39,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=437404.0, ans=0.0 2024-06-21 18:56:41,105 INFO [train.py:1028] (0/2) Epoch 24, batch 5900, loss[loss=0.1708, simple_loss=0.2205, pruned_loss=0.06057, over 13059.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2351, pruned_loss=0.0643, over 2577401.92 frames. ], batch size: 121, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:56:41,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437422.3333333333, ans=0.1 2024-06-21 18:56:47,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=437440.6666666667, ans=0.125 2024-06-21 18:56:48,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2024-06-21 18:56:51,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.86 vs. limit=15.0 2024-06-21 18:56:52,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.87 vs. 
limit=15.0 2024-06-21 18:56:55,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=437459.0, ans=0.125 2024-06-21 18:56:56,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=437459.0, ans=0.125 2024-06-21 18:56:58,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=437459.0, ans=0.125 2024-06-21 18:57:03,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=437477.3333333333, ans=0.125 2024-06-21 18:57:14,151 INFO [train.py:1028] (0/2) Epoch 24, batch 5950, loss[loss=0.1703, simple_loss=0.2112, pruned_loss=0.06474, over 13095.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2357, pruned_loss=0.0645, over 2582194.84 frames. ], batch size: 121, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:57:18,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=437514.0, ans=0.125 2024-06-21 18:57:18,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=437514.0, ans=0.125 2024-06-21 18:57:38,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=437550.6666666667, ans=0.5 2024-06-21 18:57:39,227 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.181e+02 2.407e+02 2.594e+02 3.795e+02, threshold=4.814e+02, percent-clipped=0.0 2024-06-21 18:57:53,721 INFO [train.py:1028] (0/2) Epoch 24, batch 6000, loss[loss=0.2444, simple_loss=0.2868, pruned_loss=0.1011, over 12280.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2375, pruned_loss=0.06529, over 2576890.09 frames. ], batch size: 240, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:57:53,722 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 18:58:01,877 INFO [train.py:1060] (0/2) Epoch 24, validation: loss=0.1893, simple_loss=0.2514, pruned_loss=0.06354, over 351949.00 frames. 2024-06-21 18:58:01,878 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 18:58:19,981 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:58:21,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=437660.6666666667, ans=0.025 2024-06-21 18:58:29,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437679.0, ans=0.1 2024-06-21 18:58:34,917 INFO [train.py:1028] (0/2) Epoch 24, batch 6050, loss[loss=0.1778, simple_loss=0.2331, pruned_loss=0.06126, over 12900.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2388, pruned_loss=0.06581, over 2579472.10 frames. 
], batch size: 39, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:58:41,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=437715.6666666667, ans=0.0 2024-06-21 18:58:43,409 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 18:58:47,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=437715.6666666667, ans=0.125 2024-06-21 18:58:51,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.73 vs. limit=10.0 2024-06-21 18:58:53,841 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.170e+02 2.300e+02 2.500e+02 4.242e+02, threshold=4.599e+02, percent-clipped=0.0 2024-06-21 18:58:56,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=437752.3333333333, ans=0.0 2024-06-21 18:58:58,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=437752.3333333333, ans=0.0 2024-06-21 18:59:08,374 INFO [train.py:1028] (0/2) Epoch 24, batch 6100, loss[loss=0.167, simple_loss=0.2174, pruned_loss=0.05829, over 13117.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2397, pruned_loss=0.06599, over 2581407.29 frames. ], batch size: 121, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:59:09,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.77 vs. limit=22.5 2024-06-21 18:59:17,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=437807.3333333333, ans=0.125 2024-06-21 18:59:24,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=437825.6666666667, ans=0.0 2024-06-21 18:59:32,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=437844.0, ans=0.125 2024-06-21 18:59:48,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=437880.6666666667, ans=0.125 2024-06-21 18:59:48,704 INFO [train.py:1028] (0/2) Epoch 24, batch 6150, loss[loss=0.1861, simple_loss=0.236, pruned_loss=0.06812, over 10966.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2405, pruned_loss=0.06626, over 2579528.26 frames. 
], batch size: 304, lr: 2.41e-03, grad_scale: 32.0 2024-06-21 18:59:49,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=437880.6666666667, ans=0.0 2024-06-21 19:00:05,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=437917.3333333333, ans=0.0 2024-06-21 19:00:07,532 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.211e+02 2.349e+02 2.648e+02 4.121e+02, threshold=4.697e+02, percent-clipped=0.0 2024-06-21 19:00:16,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=437954.0, ans=0.0 2024-06-21 19:00:19,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=437954.0, ans=0.0 2024-06-21 19:00:23,087 INFO [train.py:1028] (0/2) Epoch 24, batch 6200, loss[loss=0.2279, simple_loss=0.2855, pruned_loss=0.08509, over 13268.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.242, pruned_loss=0.06677, over 2575875.72 frames. ], batch size: 89, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:00:31,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=437990.6666666667, ans=0.125 2024-06-21 19:00:53,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.39 vs. limit=10.0 2024-06-21 19:00:55,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=438045.6666666667, ans=0.1 2024-06-21 19:00:59,500 INFO [train.py:1028] (0/2) Epoch 24, batch 6250, loss[loss=0.204, simple_loss=0.2541, pruned_loss=0.07691, over 13233.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2435, pruned_loss=0.06748, over 2568730.75 frames. ], batch size: 83, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:01:18,533 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.200e+02 2.375e+02 2.548e+02 4.082e+02, threshold=4.750e+02, percent-clipped=0.0 2024-06-21 19:01:21,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=438119.0, ans=0.0 2024-06-21 19:01:36,519 INFO [train.py:1028] (0/2) Epoch 24, batch 6300, loss[loss=0.1926, simple_loss=0.2502, pruned_loss=0.0675, over 11084.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2451, pruned_loss=0.06789, over 2564161.58 frames. ], batch size: 16, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:01:59,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.62 vs. limit=22.5 2024-06-21 19:02:05,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=438210.6666666667, ans=0.2 2024-06-21 19:02:14,271 INFO [train.py:1028] (0/2) Epoch 24, batch 6350, loss[loss=0.2217, simple_loss=0.2757, pruned_loss=0.08382, over 12523.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2463, pruned_loss=0.06788, over 2573480.94 frames. ], batch size: 202, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:02:24,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.20 vs. 
limit=15.0 2024-06-21 19:02:27,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=438284.0, ans=0.0 2024-06-21 19:02:32,208 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.174e+02 2.374e+02 2.694e+02 3.876e+02, threshold=4.748e+02, percent-clipped=0.0 2024-06-21 19:02:32,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=15.0 2024-06-21 19:02:41,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=438320.6666666667, ans=0.2 2024-06-21 19:02:41,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=438320.6666666667, ans=0.0 2024-06-21 19:02:44,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.36 vs. limit=15.0 2024-06-21 19:02:47,053 INFO [train.py:1028] (0/2) Epoch 24, batch 6400, loss[loss=0.1674, simple_loss=0.2257, pruned_loss=0.05457, over 13170.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.248, pruned_loss=0.06859, over 2574553.76 frames. ], batch size: 67, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:02:47,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=438339.0, ans=10.0 2024-06-21 19:02:56,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=438357.3333333333, ans=0.025 2024-06-21 19:02:57,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=438357.3333333333, ans=0.09899494936611666 2024-06-21 19:02:58,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=438357.3333333333, ans=0.0 2024-06-21 19:02:59,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=438375.6666666667, ans=0.2 2024-06-21 19:03:13,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=438412.3333333333, ans=0.125 2024-06-21 19:03:19,873 INFO [train.py:1028] (0/2) Epoch 24, batch 6450, loss[loss=0.2349, simple_loss=0.285, pruned_loss=0.09237, over 12626.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2492, pruned_loss=0.06878, over 2580877.93 frames. ], batch size: 202, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:03:32,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=438467.3333333333, ans=0.125 2024-06-21 19:03:41,251 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.263e+02 2.564e+02 2.841e+02 4.670e+02, threshold=5.128e+02, percent-clipped=0.0 2024-06-21 19:03:42,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=438485.6666666667, ans=0.2 2024-06-21 19:03:43,126 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. 
limit=15.0 2024-06-21 19:03:45,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=438485.6666666667, ans=0.025 2024-06-21 19:03:55,789 INFO [train.py:1028] (0/2) Epoch 24, batch 6500, loss[loss=0.2146, simple_loss=0.2609, pruned_loss=0.08415, over 10896.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2504, pruned_loss=0.06888, over 2584477.50 frames. ], batch size: 305, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:04:09,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=438540.6666666667, ans=0.0 2024-06-21 19:04:11,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=438540.6666666667, ans=0.1 2024-06-21 19:04:15,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.32 vs. limit=15.0 2024-06-21 19:04:23,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=438577.3333333333, ans=0.125 2024-06-21 19:04:32,367 INFO [train.py:1028] (0/2) Epoch 24, batch 6550, loss[loss=0.1896, simple_loss=0.2541, pruned_loss=0.06252, over 12683.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2514, pruned_loss=0.06905, over 2588610.62 frames. ], batch size: 22, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:04:32,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=438614.0, ans=10.0 2024-06-21 19:04:32,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=438614.0, ans=0.125 2024-06-21 19:04:40,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=438632.3333333333, ans=0.0 2024-06-21 19:04:45,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=438650.6666666667, ans=0.5 2024-06-21 19:04:48,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=438650.6666666667, ans=0.125 2024-06-21 19:04:50,801 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.902e+02 2.206e+02 2.383e+02 2.546e+02 3.191e+02, threshold=4.766e+02, percent-clipped=0.0 2024-06-21 19:04:52,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=438669.0, ans=0.0 2024-06-21 19:05:04,995 INFO [train.py:1028] (0/2) Epoch 24, batch 6600, loss[loss=0.1815, simple_loss=0.2404, pruned_loss=0.06134, over 13246.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2516, pruned_loss=0.06898, over 2591086.71 frames. ], batch size: 72, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:05:10,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438724.0, ans=0.1 2024-06-21 19:05:12,998 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.99 vs. 
limit=15.0 2024-06-21 19:05:22,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438742.3333333333, ans=0.1 2024-06-21 19:05:35,318 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:05:37,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=438797.3333333333, ans=0.2 2024-06-21 19:05:38,369 INFO [train.py:1028] (0/2) Epoch 24, batch 6650, loss[loss=0.2146, simple_loss=0.2627, pruned_loss=0.08326, over 12988.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2531, pruned_loss=0.06955, over 2585535.10 frames. ], batch size: 158, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:05:38,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=438797.3333333333, ans=0.125 2024-06-21 19:05:50,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=438815.6666666667, ans=0.125 2024-06-21 19:05:56,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=438834.0, ans=0.1 2024-06-21 19:06:04,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.238e+02 2.444e+02 2.679e+02 4.277e+02, threshold=4.887e+02, percent-clipped=0.0 2024-06-21 19:06:05,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=438852.3333333333, ans=0.025 2024-06-21 19:06:08,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438852.3333333333, ans=0.1 2024-06-21 19:06:09,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=438852.3333333333, ans=0.02 2024-06-21 19:06:09,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=438852.3333333333, ans=0.2 2024-06-21 19:06:10,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2024-06-21 19:06:11,784 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=15.0 2024-06-21 19:06:18,435 INFO [train.py:1028] (0/2) Epoch 24, batch 6700, loss[loss=0.2306, simple_loss=0.28, pruned_loss=0.09067, over 12719.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2544, pruned_loss=0.07023, over 2585135.84 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:06:19,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.87 vs. 
limit=15.0 2024-06-21 19:06:40,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=438944.0, ans=0.125 2024-06-21 19:06:44,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=438962.3333333333, ans=0.125 2024-06-21 19:06:51,786 INFO [train.py:1028] (0/2) Epoch 24, batch 6750, loss[loss=0.2423, simple_loss=0.2914, pruned_loss=0.09662, over 12191.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2556, pruned_loss=0.07104, over 2578455.30 frames. ], batch size: 241, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:06:58,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=438999.0, ans=0.0 2024-06-21 19:07:04,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439017.3333333333, ans=0.1 2024-06-21 19:07:06,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439017.3333333333, ans=0.1 2024-06-21 19:07:09,573 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.275e+02 2.404e+02 2.555e+02 3.190e+02, threshold=4.807e+02, percent-clipped=0.0 2024-06-21 19:07:11,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439035.6666666667, ans=0.125 2024-06-21 19:07:19,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439054.0, ans=0.1 2024-06-21 19:07:24,087 INFO [train.py:1028] (0/2) Epoch 24, batch 6800, loss[loss=0.2135, simple_loss=0.2745, pruned_loss=0.07625, over 13210.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2563, pruned_loss=0.0713, over 2580372.62 frames. ], batch size: 67, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:07:25,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=439072.3333333333, ans=0.025 2024-06-21 19:07:25,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.07 vs. limit=10.0 2024-06-21 19:07:48,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2024-06-21 19:07:51,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=439127.3333333333, ans=0.125 2024-06-21 19:08:00,960 INFO [train.py:1028] (0/2) Epoch 24, batch 6850, loss[loss=0.2174, simple_loss=0.2848, pruned_loss=0.07503, over 13226.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2571, pruned_loss=0.07129, over 2584047.16 frames. 
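
Note: the [scaling.py:214] ScheduledFloat entries above and throughout this log print regularization knobs (skip rates, balancer probabilities, dropout rates, bypass scale bounds) whose values follow a schedule keyed on batch_count instead of staying fixed. A minimal sketch of such a schedule, assuming simple piecewise-linear interpolation (the class and breakpoints below are illustrative, not icefall's exact ScheduledFloat API):

    from bisect import bisect_right

    class PiecewiseLinearSchedule:
        """Value that varies with the training batch count (sketch)."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs sorted by batch_count,
            # e.g. (0, 0.2), (4000, 0.05), (16000, 0.0)
            self.xs = [x for x, _ in points]
            self.ys = [y for _, y in points]

        def __call__(self, batch_count):
            i = bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            t = (batch_count - self.xs[i - 1]) / (self.xs[i] - self.xs[i - 1])
            return self.ys[i - 1] + t * (self.ys[i] - self.ys[i - 1])

    conv_skip_rate = PiecewiseLinearSchedule((0, 0.2), (4000, 0.05), (16000, 0.0))
    print(conv_skip_rate(438339.0))  # this deep into training the rate is 0.0

This matches what the log shows: by batch_count ~438k most skip rates have decayed to 0.0, while quantities such as bypass scale_min sit at their final constants.
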
], batch size: 63, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:08:05,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=439164.0, ans=0.0 2024-06-21 19:08:22,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.329e+02 2.497e+02 2.818e+02 3.585e+02, threshold=4.994e+02, percent-clipped=0.0 2024-06-21 19:08:37,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=439255.6666666667, ans=0.125 2024-06-21 19:08:37,661 INFO [train.py:1028] (0/2) Epoch 24, batch 6900, loss[loss=0.2185, simple_loss=0.2758, pruned_loss=0.08056, over 13275.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2576, pruned_loss=0.07135, over 2585975.07 frames. ], batch size: 49, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:08:47,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=439274.0, ans=0.0 2024-06-21 19:08:54,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.02 vs. limit=15.0 2024-06-21 19:08:59,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=439310.6666666667, ans=0.0 2024-06-21 19:09:01,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.53 vs. limit=22.5 2024-06-21 19:09:05,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=439329.0, ans=0.2 2024-06-21 19:09:06,102 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2024-06-21 19:09:11,175 INFO [train.py:1028] (0/2) Epoch 24, batch 6950, loss[loss=0.163, simple_loss=0.2167, pruned_loss=0.05463, over 11411.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2587, pruned_loss=0.07178, over 2581240.43 frames. ], batch size: 16, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:09:14,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=439347.3333333333, ans=0.0 2024-06-21 19:09:24,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. 
limit=6.0 2024-06-21 19:09:27,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=439384.0, ans=0.2 2024-06-21 19:09:29,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=439384.0, ans=0.1 2024-06-21 19:09:29,562 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.233e+02 2.405e+02 2.651e+02 3.460e+02, threshold=4.809e+02, percent-clipped=0.0 2024-06-21 19:09:39,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=439420.6666666667, ans=0.0 2024-06-21 19:09:40,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=439420.6666666667, ans=0.0 2024-06-21 19:09:42,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439420.6666666667, ans=0.1 2024-06-21 19:09:44,202 INFO [train.py:1028] (0/2) Epoch 24, batch 7000, loss[loss=0.2183, simple_loss=0.2745, pruned_loss=0.08104, over 12954.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2585, pruned_loss=0.07137, over 2576426.67 frames. ], batch size: 158, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:09:44,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=439439.0, ans=0.2 2024-06-21 19:09:48,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=439439.0, ans=0.1 2024-06-21 19:10:00,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=439475.6666666667, ans=0.125 2024-06-21 19:10:23,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=439512.3333333333, ans=0.0 2024-06-21 19:10:24,342 INFO [train.py:1028] (0/2) Epoch 24, batch 7050, loss[loss=0.2292, simple_loss=0.277, pruned_loss=0.09068, over 12769.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2597, pruned_loss=0.07157, over 2583047.72 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:10:32,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=439549.0, ans=0.125 2024-06-21 19:10:34,250 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.83 vs. limit=15.0 2024-06-21 19:10:42,074 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.248e+02 2.420e+02 2.628e+02 3.867e+02, threshold=4.841e+02, percent-clipped=0.0 2024-06-21 19:10:42,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=439585.6666666667, ans=0.125 2024-06-21 19:10:48,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=439585.6666666667, ans=0.125 2024-06-21 19:10:56,695 INFO [train.py:1028] (0/2) Epoch 24, batch 7100, loss[loss=0.229, simple_loss=0.2896, pruned_loss=0.08422, over 13129.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2602, pruned_loss=0.07177, over 2576440.66 frames. 
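
Note: the [optim.py:487] WARNING lines report a five-number summary of recent gradient norms (labeled "quartiles"), the clipping threshold in effect, and the fraction of recent batches clipped; percent-clipped=0.0 throughout this stretch means no gradient exceeded the threshold. In these logs the threshold is exactly twice the reported median (e.g. 2.374e+02 -> 4.748e+02, 2.564e+02 -> 5.128e+02), consistent with a 2x-median rule over a sliding window, though the window size below is an assumption:

    import collections
    import torch

    recent_norms = collections.deque(maxlen=1000)  # sliding window (illustrative)

    def clip_by_recent_norms(params, scale=2.0):
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads]))
        recent_norms.append(norm.item())
        q = torch.quantile(torch.tensor(list(recent_norms)),
                           torch.tensor([0.25, 0.50, 0.75]))
        threshold = scale * q[1]           # 2x the median norm
        if norm > threshold:
            for g in grads:
                g.mul_(threshold / norm)   # rescale every grad in place
        return norm, threshold             # what the periodic WARNING summarizes
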
], batch size: 112, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:10:59,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=439622.3333333333, ans=0.0 2024-06-21 19:11:07,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=439640.6666666667, ans=0.0 2024-06-21 19:11:14,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=439659.0, ans=0.125 2024-06-21 19:11:22,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=439677.3333333333, ans=0.125 2024-06-21 19:11:23,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=439695.6666666667, ans=0.0 2024-06-21 19:11:24,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=439695.6666666667, ans=0.2 2024-06-21 19:11:29,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=439695.6666666667, ans=0.125 2024-06-21 19:11:30,979 INFO [train.py:1028] (0/2) Epoch 24, batch 7150, loss[loss=0.2212, simple_loss=0.2797, pruned_loss=0.0814, over 12564.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2612, pruned_loss=0.07179, over 2574751.93 frames. ], batch size: 202, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:11:40,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=439732.3333333333, ans=0.0 2024-06-21 19:11:40,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=439732.3333333333, ans=0.025 2024-06-21 19:11:41,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=439732.3333333333, ans=0.0 2024-06-21 19:11:49,311 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.961e+02 2.248e+02 2.454e+02 2.638e+02 3.580e+02, threshold=4.908e+02, percent-clipped=0.0 2024-06-21 19:11:49,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=439750.6666666667, ans=0.0 2024-06-21 19:11:58,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.82 vs. 
limit=15.0 2024-06-21 19:11:59,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=439769.0, ans=0.0 2024-06-21 19:12:00,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=439769.0, ans=0.05 2024-06-21 19:12:04,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=439787.3333333333, ans=0.125 2024-06-21 19:12:06,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=439787.3333333333, ans=0.125 2024-06-21 19:12:06,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439787.3333333333, ans=0.125 2024-06-21 19:12:07,990 INFO [train.py:1028] (0/2) Epoch 24, batch 7200, loss[loss=0.2262, simple_loss=0.2898, pruned_loss=0.08134, over 13179.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2621, pruned_loss=0.07218, over 2579364.85 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:12:12,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.16 vs. limit=15.0 2024-06-21 19:12:15,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=439824.0, ans=0.125 2024-06-21 19:12:24,069 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.01 vs. limit=15.0 2024-06-21 19:12:28,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.38 vs. limit=15.0 2024-06-21 19:12:32,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=439860.6666666667, ans=0.125 2024-06-21 19:12:35,960 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:12:36,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=439879.0, ans=0.1 2024-06-21 19:12:44,288 INFO [train.py:1028] (0/2) Epoch 24, batch 7250, loss[loss=0.1836, simple_loss=0.2486, pruned_loss=0.05927, over 12960.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2623, pruned_loss=0.07225, over 2579697.31 frames. 
], batch size: 36, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:12:48,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=439897.3333333333, ans=0.0 2024-06-21 19:12:51,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=439915.6666666667, ans=0.2 2024-06-21 19:13:02,716 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.253e+02 2.371e+02 2.585e+02 4.023e+02, threshold=4.742e+02, percent-clipped=0.0 2024-06-21 19:13:02,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=439934.0, ans=0.125 2024-06-21 19:13:06,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=439952.3333333333, ans=0.125 2024-06-21 19:13:16,883 INFO [train.py:1028] (0/2) Epoch 24, batch 7300, loss[loss=0.2142, simple_loss=0.2739, pruned_loss=0.07723, over 12965.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2635, pruned_loss=0.0726, over 2580252.86 frames. ], batch size: 36, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:13:20,281 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-240000.pt 2024-06-21 19:13:27,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=439989.0, ans=0.0 2024-06-21 19:13:27,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=439989.0, ans=0.0 2024-06-21 19:13:35,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=440025.6666666667, ans=0.125 2024-06-21 19:13:36,051 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:13:52,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=440062.3333333333, ans=0.2 2024-06-21 19:13:55,787 INFO [train.py:1028] (0/2) Epoch 24, batch 7350, loss[loss=0.1944, simple_loss=0.2616, pruned_loss=0.06364, over 13323.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.264, pruned_loss=0.07285, over 2581319.45 frames. 
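
Note: the [scaling.py:1023] Whitening lines track a per-module statistic against a limit; the metric measures how far the module's output covariance is from isotropic ("white"). One simple proxy equals 1.0 exactly when the covariance is a multiple of the identity and grows as the eigenvalue spectrum becomes uneven; this is a plausible reconstruction, not necessarily the formula icefall uses:

    import torch

    def whiteness_metric(x):
        """x: (num_frames, num_channels) activations; 1.0 iff covariance = c*I."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        d = cov.shape[0]
        # Equals d * sum(eig^2) / (sum(eig))^2 >= 1, with equality iff cov = c*I.
        return ((cov ** 2).sum() / (cov.diagonal().mean() ** 2 * d)).item()

    print(whiteness_metric(torch.randn(10000, 384)))  # ~1.0 for near-white input

The limit is enforced in the backward pass by the Whiten module; the log lines themselves just record the measurements.
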
], batch size: 46, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:14:08,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440099.0, ans=0.1 2024-06-21 19:14:17,625 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.269e+02 2.422e+02 2.630e+02 3.965e+02, threshold=4.845e+02, percent-clipped=0.0 2024-06-21 19:14:19,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=440135.6666666667, ans=0.0 2024-06-21 19:14:24,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=440135.6666666667, ans=0.2 2024-06-21 19:14:27,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=440154.0, ans=0.1 2024-06-21 19:14:28,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=440154.0, ans=0.125 2024-06-21 19:14:33,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=440154.0, ans=0.125 2024-06-21 19:14:35,640 INFO [train.py:1028] (0/2) Epoch 24, batch 7400, loss[loss=0.2291, simple_loss=0.2958, pruned_loss=0.08119, over 13304.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2638, pruned_loss=0.07264, over 2585840.94 frames. ], batch size: 63, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:14:43,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=440190.6666666667, ans=0.125 2024-06-21 19:14:44,222 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0 2024-06-21 19:14:44,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=440190.6666666667, ans=0.0 2024-06-21 19:14:47,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440190.6666666667, ans=0.1 2024-06-21 19:14:47,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=440190.6666666667, ans=0.125 2024-06-21 19:14:48,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=440209.0, ans=0.125 2024-06-21 19:15:02,462 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:15:03,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=440245.6666666667, ans=0.2 2024-06-21 19:15:07,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=440245.6666666667, ans=0.2 2024-06-21 19:15:09,021 INFO [train.py:1028] (0/2) Epoch 24, batch 7450, loss[loss=0.1931, simple_loss=0.2526, pruned_loss=0.06682, over 12549.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2634, pruned_loss=0.07237, over 2579776.94 frames. 
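
Note: each [train.py:1028] batch line reports three objectives: simple_loss from a cheap linear joiner (which also supplies the alignment bounds used for pruning), pruned_loss from the full joiner evaluated only inside that pruned band, and loss, their weighted combination; tot_loss[...] is the same quantity averaged over the recent multi-million-frame window. The combination step as a sketch, with the weights left as hyperparameters (names illustrative):

    def combine_losses(simple_loss, pruned_loss, simple_scale, pruned_scale=1.0):
        # The "loss=" number printed per batch is this weighted sum.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

The per-batch numbers here are consistent with simple_scale=0.5 and pruned_scale=1.0: for batch 7400 above, 0.5 * 0.2958 + 0.08119 ≈ 0.2291, the logged loss.
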
], batch size: 29, lr: 2.40e-03, grad_scale: 32.0 2024-06-21 19:15:09,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=440264.0, ans=0.125 2024-06-21 19:15:26,124 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:15:27,881 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.236e+02 2.388e+02 2.547e+02 3.014e+02, threshold=4.777e+02, percent-clipped=0.0 2024-06-21 19:15:35,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=440319.0, ans=0.025 2024-06-21 19:15:41,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440337.3333333333, ans=0.1 2024-06-21 19:15:42,799 INFO [train.py:1028] (0/2) Epoch 24, batch 7500, loss[loss=0.2125, simple_loss=0.2543, pruned_loss=0.08537, over 10632.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2649, pruned_loss=0.07314, over 2577244.85 frames. ], batch size: 303, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:16:07,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=440410.6666666667, ans=0.125 2024-06-21 19:16:19,257 INFO [train.py:1028] (0/2) Epoch 24, batch 7550, loss[loss=0.2237, simple_loss=0.2765, pruned_loss=0.08547, over 12938.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2662, pruned_loss=0.07409, over 2577377.24 frames. ], batch size: 158, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:16:26,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=440465.6666666667, ans=0.125 2024-06-21 19:16:26,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=440465.6666666667, ans=0.0 2024-06-21 19:16:33,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440484.0, ans=0.0 2024-06-21 19:16:40,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=440484.0, ans=0.125 2024-06-21 19:16:41,402 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.386e+02 2.562e+02 2.861e+02 3.776e+02, threshold=5.123e+02, percent-clipped=0.0 2024-06-21 19:16:41,939 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.11 vs. limit=15.0 2024-06-21 19:16:46,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=440502.3333333333, ans=0.09899494936611666 2024-06-21 19:16:53,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440520.6666666667, ans=0.0 2024-06-21 19:16:54,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=440520.6666666667, ans=0.125 2024-06-21 19:16:55,787 INFO [train.py:1028] (0/2) Epoch 24, batch 7600, loss[loss=0.206, simple_loss=0.2624, pruned_loss=0.07478, over 13222.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2673, pruned_loss=0.07446, over 2577955.36 frames. 
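
Note: the [checkpoint.py:75] line above records a periodic checkpoint keyed to the global batch index (checkpoint-240000.pt), saved on a fixed batch cadence rather than at epoch boundaries so a long epoch can be resumed mid-way. A minimal sketch of such a save (the dict field names are illustrative):

    import torch

    def save_batch_checkpoint(exp_dir, batch_idx, model, optimizer,
                              scheduler, scaler):
        torch.save(
            {
                "batch_idx_train": batch_idx,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
                "grad_scaler": scaler.state_dict(),
            },
            f"{exp_dir}/checkpoint-{batch_idx}.pt",
        )

    # called on a fixed cadence, e.g.:
    # if batch_idx % save_every_n == 0:
    #     save_batch_checkpoint("zipformer/exp", batch_idx, ...)
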
], batch size: 83, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:16:59,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=440539.0, ans=0.125 2024-06-21 19:17:04,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=440557.3333333333, ans=0.0 2024-06-21 19:17:07,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=440557.3333333333, ans=0.125 2024-06-21 19:17:10,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=440575.6666666667, ans=0.125 2024-06-21 19:17:20,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=15.0 2024-06-21 19:17:23,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=440612.3333333333, ans=0.0 2024-06-21 19:17:27,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=440612.3333333333, ans=0.07 2024-06-21 19:17:29,689 INFO [train.py:1028] (0/2) Epoch 24, batch 7650, loss[loss=0.2299, simple_loss=0.2866, pruned_loss=0.08656, over 13013.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2676, pruned_loss=0.07454, over 2572855.35 frames. ], batch size: 33, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:17:44,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=440667.3333333333, ans=0.025 2024-06-21 19:17:46,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=440667.3333333333, ans=0.125 2024-06-21 19:17:48,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.282e+02 2.496e+02 2.783e+02 3.716e+02, threshold=4.992e+02, percent-clipped=0.0 2024-06-21 19:17:59,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=440704.0, ans=0.025 2024-06-21 19:18:06,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=440722.3333333333, ans=10.0 2024-06-21 19:18:06,835 INFO [train.py:1028] (0/2) Epoch 24, batch 7700, loss[loss=0.2137, simple_loss=0.2847, pruned_loss=0.07128, over 13221.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2676, pruned_loss=0.07458, over 2569554.86 frames. 
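
Note: grad_scale in the batch lines is the dynamic fp16 loss scale: it holds at 32.0 through batch 7450 and doubles to 64.0 at batch 7500 above, the signature of a dynamic scaler that grows after a long run of overflow-free steps and shrinks on overflow. The generic pattern with PyTorch's stock GradScaler (icefall wraps its own variant, so treat this as the idea rather than the exact code path):

    import torch

    scaler = torch.cuda.amp.GradScaler()    # dynamic loss scaling for fp16

    def training_step(model, batch, optimizer):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)
        scaler.scale(loss).backward()       # backward on the scaled loss
        scaler.step(optimizer)              # unscales; skips step on inf/nan
        scaler.update()                     # grows the scale after stable steps
        return loss.detach()
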
], batch size: 63, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:18:07,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=440722.3333333333, ans=0.5 2024-06-21 19:18:08,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=440722.3333333333, ans=0.0 2024-06-21 19:18:11,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=440722.3333333333, ans=0.2 2024-06-21 19:18:22,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=440759.0, ans=0.07 2024-06-21 19:18:23,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=440759.0, ans=0.2 2024-06-21 19:18:24,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=440759.0, ans=0.125 2024-06-21 19:18:34,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=440777.3333333333, ans=0.05 2024-06-21 19:18:41,788 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2024-06-21 19:18:45,240 INFO [train.py:1028] (0/2) Epoch 24, batch 7750, loss[loss=0.1921, simple_loss=0.252, pruned_loss=0.06616, over 13247.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2681, pruned_loss=0.07492, over 2573371.74 frames. ], batch size: 72, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:18:58,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=440850.6666666667, ans=0.2 2024-06-21 19:19:03,972 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.321e+02 2.505e+02 2.717e+02 3.883e+02, threshold=5.009e+02, percent-clipped=0.0 2024-06-21 19:19:18,795 INFO [train.py:1028] (0/2) Epoch 24, batch 7800, loss[loss=0.2088, simple_loss=0.2616, pruned_loss=0.07806, over 13169.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2688, pruned_loss=0.07504, over 2578551.90 frames. 
], batch size: 95, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:19:19,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=440905.6666666667, ans=0.125 2024-06-21 19:19:23,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=440905.6666666667, ans=0.2 2024-06-21 19:19:25,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=440924.0, ans=0.125 2024-06-21 19:19:29,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=440924.0, ans=0.125 2024-06-21 19:19:43,568 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:19:48,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=440979.0, ans=0.09899494936611666 2024-06-21 19:19:49,750 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=440979.0, ans=0.1 2024-06-21 19:19:52,752 INFO [train.py:1028] (0/2) Epoch 24, batch 7850, loss[loss=0.2298, simple_loss=0.2984, pruned_loss=0.08057, over 10941.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.27, pruned_loss=0.07567, over 2572332.94 frames. ], batch size: 16, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:19:55,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=440997.3333333333, ans=0.2 2024-06-21 19:19:56,255 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.925e+01 2024-06-21 19:19:57,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440997.3333333333, ans=0.125 2024-06-21 19:20:13,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=441034.0, ans=0.2 2024-06-21 19:20:14,035 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.311e+02 2.472e+02 2.673e+02 3.323e+02, threshold=4.944e+02, percent-clipped=0.0 2024-06-21 19:20:15,015 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.03 vs. limit=15.0 2024-06-21 19:20:15,386 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:20:32,127 INFO [train.py:1028] (0/2) Epoch 24, batch 7900, loss[loss=0.1974, simple_loss=0.2572, pruned_loss=0.06883, over 13196.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2702, pruned_loss=0.07583, over 2572636.48 frames. 
], batch size: 77, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:20:34,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441089.0, ans=0.125 2024-06-21 19:20:42,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=441107.3333333333, ans=0.025 2024-06-21 19:20:45,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=441125.6666666667, ans=0.125 2024-06-21 19:20:52,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=441144.0, ans=0.0 2024-06-21 19:20:55,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=441144.0, ans=0.2 2024-06-21 19:20:56,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=441144.0, ans=0.025 2024-06-21 19:20:56,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=441144.0, ans=0.125 2024-06-21 19:20:58,266 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=15.0 2024-06-21 19:21:02,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=441162.3333333333, ans=0.125 2024-06-21 19:21:05,963 INFO [train.py:1028] (0/2) Epoch 24, batch 7950, loss[loss=0.2091, simple_loss=0.2601, pruned_loss=0.07903, over 10582.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2706, pruned_loss=0.07609, over 2575803.97 frames. ], batch size: 303, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:21:24,544 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.362e+02 2.522e+02 2.856e+02 3.555e+02, threshold=5.043e+02, percent-clipped=0.0 2024-06-21 19:21:36,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441254.0, ans=0.1 2024-06-21 19:21:38,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=441272.3333333333, ans=0.125 2024-06-21 19:21:39,133 INFO [train.py:1028] (0/2) Epoch 24, batch 8000, loss[loss=0.1986, simple_loss=0.2695, pruned_loss=0.06391, over 12641.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2713, pruned_loss=0.07605, over 2572600.16 frames. 
], batch size: 29, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:21:44,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=441272.3333333333, ans=0.025 2024-06-21 19:21:50,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=441290.6666666667, ans=0.125 2024-06-21 19:21:54,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=441309.0, ans=0.0 2024-06-21 19:21:55,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=441309.0, ans=0.0 2024-06-21 19:21:56,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=441309.0, ans=0.04949747468305833 2024-06-21 19:22:00,066 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:22:00,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=441309.0, ans=0.0 2024-06-21 19:22:04,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441327.3333333333, ans=0.125 2024-06-21 19:22:04,984 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.70 vs. limit=15.0 2024-06-21 19:22:07,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=441345.6666666667, ans=0.125 2024-06-21 19:22:09,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441345.6666666667, ans=0.1 2024-06-21 19:22:15,015 INFO [train.py:1028] (0/2) Epoch 24, batch 8050, loss[loss=0.2014, simple_loss=0.2639, pruned_loss=0.0694, over 13211.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2708, pruned_loss=0.0757, over 2571196.38 frames. ], batch size: 83, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:22:19,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.63 vs. limit=12.0 2024-06-21 19:22:31,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=441400.6666666667, ans=0.125 2024-06-21 19:22:32,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.86 vs. limit=22.5 2024-06-21 19:22:33,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=441400.6666666667, ans=0.125 2024-06-21 19:22:36,846 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.270e+02 2.388e+02 2.672e+02 3.699e+02, threshold=4.775e+02, percent-clipped=0.0 2024-06-21 19:22:38,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=441419.0, ans=22.5 2024-06-21 19:22:50,884 INFO [train.py:1028] (0/2) Epoch 24, batch 8100, loss[loss=0.2356, simple_loss=0.2881, pruned_loss=0.09158, over 13114.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2715, pruned_loss=0.0761, over 2576134.63 frames. 
], batch size: 112, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:22:53,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=441455.6666666667, ans=0.125 2024-06-21 19:22:55,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2024-06-21 19:23:04,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=441492.3333333333, ans=0.0 2024-06-21 19:23:14,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=441510.6666666667, ans=0.0 2024-06-21 19:23:16,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=441510.6666666667, ans=0.95 2024-06-21 19:23:24,521 INFO [train.py:1028] (0/2) Epoch 24, batch 8150, loss[loss=0.2011, simple_loss=0.2614, pruned_loss=0.07042, over 13147.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2712, pruned_loss=0.07543, over 2579774.49 frames. ], batch size: 121, lr: 2.40e-03, grad_scale: 64.0 2024-06-21 19:23:32,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=441565.6666666667, ans=0.05 2024-06-21 19:23:33,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.90 vs. limit=15.0 2024-06-21 19:23:35,596 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.88 vs. limit=22.5 2024-06-21 19:23:43,110 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.292e+02 2.415e+02 2.546e+02 3.613e+02, threshold=4.829e+02, percent-clipped=0.0 2024-06-21 19:23:48,730 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2024-06-21 19:23:57,709 INFO [train.py:1028] (0/2) Epoch 24, batch 8200, loss[loss=0.2075, simple_loss=0.2667, pruned_loss=0.07418, over 13140.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2714, pruned_loss=0.07524, over 2582924.08 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:23:59,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=441639.0, ans=0.0 2024-06-21 19:24:00,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=441639.0, ans=0.1 2024-06-21 19:24:14,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=441675.6666666667, ans=0.125 2024-06-21 19:24:30,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=441712.3333333333, ans=0.125 2024-06-21 19:24:37,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=441730.6666666667, ans=0.0 2024-06-21 19:24:38,395 INFO [train.py:1028] (0/2) Epoch 24, batch 8250, loss[loss=0.2069, simple_loss=0.2729, pruned_loss=0.0705, over 13249.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2714, pruned_loss=0.07498, over 2582581.48 frames. 
], batch size: 52, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:24:41,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=441730.6666666667, ans=0.125 2024-06-21 19:24:43,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2024-06-21 19:24:55,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=441767.3333333333, ans=0.125 2024-06-21 19:24:56,425 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.287e+02 2.417e+02 2.601e+02 3.005e+02, threshold=4.834e+02, percent-clipped=0.0 2024-06-21 19:25:01,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=441785.6666666667, ans=0.0 2024-06-21 19:25:02,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=441785.6666666667, ans=0.1 2024-06-21 19:25:10,783 INFO [train.py:1028] (0/2) Epoch 24, batch 8300, loss[loss=0.2218, simple_loss=0.2743, pruned_loss=0.08465, over 13146.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2706, pruned_loss=0.07458, over 2579965.73 frames. ], batch size: 103, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:25:20,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=441840.6666666667, ans=0.0 2024-06-21 19:25:43,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=441895.6666666667, ans=0.05 2024-06-21 19:25:44,435 INFO [train.py:1028] (0/2) Epoch 24, batch 8350, loss[loss=0.2181, simple_loss=0.2747, pruned_loss=0.08071, over 13178.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2701, pruned_loss=0.07432, over 2580670.05 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:25:44,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=441914.0, ans=0.125 2024-06-21 19:25:48,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=441914.0, ans=0.2 2024-06-21 19:25:53,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441932.3333333333, ans=0.1 2024-06-21 19:26:06,395 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.337e+02 2.510e+02 2.765e+02 3.654e+02, threshold=5.020e+02, percent-clipped=0.0 2024-06-21 19:26:09,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=441969.0, ans=0.0 2024-06-21 19:26:10,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=441969.0, ans=0.125 2024-06-21 19:26:14,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=441987.3333333333, ans=0.125 2024-06-21 19:26:21,055 INFO [train.py:1028] (0/2) Epoch 24, batch 8400, loss[loss=0.22, simple_loss=0.2786, pruned_loss=0.08073, over 12950.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2703, pruned_loss=0.07458, over 2577487.50 frames. 
], batch size: 39, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:26:29,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2024-06-21 19:26:38,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.81 vs. limit=15.0 2024-06-21 19:26:51,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=442079.0, ans=0.0 2024-06-21 19:26:55,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442079.0, ans=0.1 2024-06-21 19:26:56,814 INFO [train.py:1028] (0/2) Epoch 24, batch 8450, loss[loss=0.2216, simple_loss=0.2731, pruned_loss=0.08504, over 13138.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2718, pruned_loss=0.07546, over 2578402.14 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:26:57,879 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.36 vs. limit=15.0 2024-06-21 19:27:03,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=442115.6666666667, ans=0.125 2024-06-21 19:27:07,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=442115.6666666667, ans=0.125 2024-06-21 19:27:11,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=442134.0, ans=0.125 2024-06-21 19:27:14,707 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.350e+02 2.526e+02 2.716e+02 3.213e+02, threshold=5.052e+02, percent-clipped=0.0 2024-06-21 19:27:16,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=442152.3333333333, ans=0.125 2024-06-21 19:27:17,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2024-06-21 19:27:23,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=442170.6666666667, ans=0.0 2024-06-21 19:27:29,612 INFO [train.py:1028] (0/2) Epoch 24, batch 8500, loss[loss=0.211, simple_loss=0.2708, pruned_loss=0.07558, over 12684.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2729, pruned_loss=0.07589, over 2577809.77 frames. ], batch size: 29, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:27:32,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442189.0, ans=0.125 2024-06-21 19:27:33,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=442189.0, ans=0.125 2024-06-21 19:27:37,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.01 vs. 
limit=15.0 2024-06-21 19:27:37,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=442207.3333333333, ans=0.125 2024-06-21 19:27:39,003 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=15.0 2024-06-21 19:27:39,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442207.3333333333, ans=0.1 2024-06-21 19:27:42,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442207.3333333333, ans=0.1 2024-06-21 19:27:46,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.38 vs. limit=22.5 2024-06-21 19:27:52,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=442244.0, ans=0.125 2024-06-21 19:27:53,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=442244.0, ans=0.025 2024-06-21 19:28:01,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=442262.3333333333, ans=0.125 2024-06-21 19:28:03,156 INFO [train.py:1028] (0/2) Epoch 24, batch 8550, loss[loss=0.2132, simple_loss=0.2722, pruned_loss=0.07714, over 12609.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2725, pruned_loss=0.07573, over 2575698.33 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:28:18,752 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:28:21,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442317.3333333333, ans=0.125 2024-06-21 19:28:24,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=442317.3333333333, ans=0.125 2024-06-21 19:28:25,090 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.373e+02 2.480e+02 2.671e+02 3.416e+02, threshold=4.961e+02, percent-clipped=0.0 2024-06-21 19:28:27,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=442335.6666666667, ans=0.0 2024-06-21 19:28:28,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=442335.6666666667, ans=0.125 2024-06-21 19:28:28,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=442335.6666666667, ans=0.02 2024-06-21 19:28:37,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=442354.0, ans=0.125 2024-06-21 19:28:43,571 INFO [train.py:1028] (0/2) Epoch 24, batch 8600, loss[loss=0.2089, simple_loss=0.2725, pruned_loss=0.07263, over 13132.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2733, pruned_loss=0.0759, over 2572701.98 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:28:47,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. 
limit=15.0 2024-06-21 19:28:50,228 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:28:54,994 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 19:28:56,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=442390.6666666667, ans=0.125 2024-06-21 19:28:57,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=442409.0, ans=0.0 2024-06-21 19:29:02,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=442409.0, ans=0.125 2024-06-21 19:29:05,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=442427.3333333333, ans=0.0 2024-06-21 19:29:06,088 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-21 19:29:12,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=442445.6666666667, ans=0.0 2024-06-21 19:29:16,611 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.76 vs. limit=15.0 2024-06-21 19:29:17,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=442464.0, ans=0.125 2024-06-21 19:29:17,665 INFO [train.py:1028] (0/2) Epoch 24, batch 8650, loss[loss=0.2098, simple_loss=0.2724, pruned_loss=0.07357, over 13201.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2733, pruned_loss=0.07554, over 2576356.78 frames. ], batch size: 103, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:29:17,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=442464.0, ans=0.2 2024-06-21 19:29:21,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=442464.0, ans=0.2 2024-06-21 19:29:22,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442464.0, ans=0.0 2024-06-21 19:29:24,625 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.78 vs. 
limit=15.0 2024-06-21 19:29:24,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=442482.3333333333, ans=0.125 2024-06-21 19:29:26,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=442482.3333333333, ans=0.125 2024-06-21 19:29:31,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=442500.6666666667, ans=0.0 2024-06-21 19:29:33,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=442500.6666666667, ans=0.125 2024-06-21 19:29:36,080 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.348e+02 2.523e+02 2.682e+02 3.493e+02, threshold=5.046e+02, percent-clipped=0.0 2024-06-21 19:29:37,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=442519.0, ans=0.0 2024-06-21 19:29:38,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=442519.0, ans=0.125 2024-06-21 19:29:39,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=442519.0, ans=0.0 2024-06-21 19:29:41,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=442519.0, ans=0.0 2024-06-21 19:29:44,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=442537.3333333333, ans=0.0 2024-06-21 19:29:50,431 INFO [train.py:1028] (0/2) Epoch 24, batch 8700, loss[loss=0.2403, simple_loss=0.31, pruned_loss=0.0853, over 13162.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2736, pruned_loss=0.0759, over 2572460.34 frames. ], batch size: 59, lr: 2.39e-03, grad_scale: 64.0 2024-06-21 19:29:51,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442555.6666666667, ans=0.125 2024-06-21 19:29:55,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=442555.6666666667, ans=0.125 2024-06-21 19:30:02,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=442574.0, ans=0.125 2024-06-21 19:30:09,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=442592.3333333333, ans=0.0 2024-06-21 19:30:17,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2024-06-21 19:30:17,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=442610.6666666667, ans=0.07 2024-06-21 19:30:27,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442629.0, ans=0.1 2024-06-21 19:30:28,364 INFO [train.py:1028] (0/2) Epoch 24, batch 8750, loss[loss=0.2132, simple_loss=0.2655, pruned_loss=0.08044, over 13110.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2734, pruned_loss=0.07621, over 2568804.52 frames. 
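
Note: the [scaling.py:1119] WithLoss lines report the accumulated value of an auxiliary penalty attached to the self_attn_weights modules; loss-sum=0.000e+00, which dominates here, means the penalty was inactive over the reporting window (one entry earlier in this stretch shows a small nonzero sum of 2.925e+01). A sketch of the wrapper pattern, assuming a pass-through module that records a penalty for the trainer to collect (all names illustrative; icefall's actual mechanism differs in details):

    import torch
    import torch.nn as nn

    class WithAuxLoss(nn.Module):
        """Identity on the forward path; records an auxiliary penalty."""
        def __init__(self, penalty_fn):
            super().__init__()
            self.penalty_fn = penalty_fn
            self.aux_loss = torch.tensor(0.0)

        def forward(self, x):
            if self.training:
                self.aux_loss = self.penalty_fn(x)  # often exactly 0.0
            return x

    # trainer side: total = main_loss + sum(m.aux_loss for m in aux_modules)
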
2024-06-21 19:30:28,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.94 vs. limit=10.0
2024-06-21 19:30:29,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=442647.3333333333, ans=0.125
2024-06-21 19:30:39,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. limit=15.0
2024-06-21 19:30:46,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=442684.0, ans=0.0
2024-06-21 19:30:51,107 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.348e+02 2.544e+02 2.717e+02 3.609e+02, threshold=5.088e+02, percent-clipped=0.0
2024-06-21 19:30:59,505 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.70 vs. limit=15.0
2024-06-21 19:31:01,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442720.6666666667, ans=0.1
2024-06-21 19:31:05,537 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=442720.6666666667, ans=0.125
2024-06-21 19:31:06,657 INFO [train.py:1028] (0/2) Epoch 24, batch 8800, loss[loss=0.2149, simple_loss=0.2833, pruned_loss=0.07321, over 13247.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2735, pruned_loss=0.07592, over 2573781.93 frames. ], batch size: 72, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:31:06,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442739.0, ans=0.125
2024-06-21 19:31:12,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=442739.0, ans=0.125
2024-06-21 19:31:18,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=442757.3333333333, ans=0.0
2024-06-21 19:31:22,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442775.6666666667, ans=0.1
2024-06-21 19:31:30,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=442794.0, ans=0.0
2024-06-21 19:31:31,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442794.0, ans=0.125
2024-06-21 19:31:40,887 INFO [train.py:1028] (0/2) Epoch 24, batch 8850, loss[loss=0.2463, simple_loss=0.2952, pruned_loss=0.09872, over 12502.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2738, pruned_loss=0.07678, over 2562702.88 frames. ], batch size: 202, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:31:41,453 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0
2024-06-21 19:31:42,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442830.6666666667, ans=0.0
2024-06-21 19:31:49,019 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.50 vs. limit=15.0
2024-06-21 19:31:49,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=442849.0, ans=0.1
2024-06-21 19:31:54,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=442867.3333333333, ans=0.0
2024-06-21 19:32:03,223 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.320e+02 2.490e+02 2.681e+02 3.797e+02, threshold=4.979e+02, percent-clipped=0.0
2024-06-21 19:32:03,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=442867.3333333333, ans=0.0
2024-06-21 19:32:15,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.64 vs. limit=22.5
2024-06-21 19:32:17,995 INFO [train.py:1028] (0/2) Epoch 24, batch 8900, loss[loss=0.2161, simple_loss=0.2795, pruned_loss=0.07635, over 12920.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2744, pruned_loss=0.07715, over 2560975.72 frames. ], batch size: 33, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:32:25,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=442940.6666666667, ans=0.0
2024-06-21 19:32:40,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=442959.0, ans=0.035
2024-06-21 19:32:47,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=442995.6666666667, ans=0.2
2024-06-21 19:32:55,400 INFO [train.py:1028] (0/2) Epoch 24, batch 8950, loss[loss=0.22, simple_loss=0.2779, pruned_loss=0.08105, over 12577.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2746, pruned_loss=0.07666, over 2563361.20 frames. ], batch size: 202, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:32:55,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=443014.0, ans=0.07
2024-06-21 19:33:06,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=443032.3333333333, ans=0.125
2024-06-21 19:33:09,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=443050.6666666667, ans=0.125
2024-06-21 19:33:14,112 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.407e+02 2.540e+02 2.749e+02 4.410e+02, threshold=5.081e+02, percent-clipped=0.0
2024-06-21 19:33:29,188 INFO [train.py:1028] (0/2) Epoch 24, batch 9000, loss[loss=0.1861, simple_loss=0.2483, pruned_loss=0.06194, over 13316.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2745, pruned_loss=0.07631, over 2568202.28 frames. ], batch size: 46, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:33:29,189 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 19:33:37,067 INFO [train.py:1060] (0/2) Epoch 24, validation: loss=0.1894, simple_loss=0.2515, pruned_loss=0.06369, over 351949.00 frames.
2024-06-21 19:33:37,067 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-21 19:33:38,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=443105.6666666667, ans=0.5
2024-06-21 19:33:47,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=443124.0, ans=0.0
2024-06-21 19:34:06,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443179.0, ans=0.125
2024-06-21 19:34:08,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443197.3333333333, ans=0.1
2024-06-21 19:34:09,205 INFO [train.py:1028] (0/2) Epoch 24, batch 9050, loss[loss=0.2184, simple_loss=0.2763, pruned_loss=0.08025, over 11589.00 frames. ], tot_loss[loss=0.214, simple_loss=0.275, pruned_loss=0.07648, over 2568650.33 frames. ], batch size: 17, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:34:20,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=443215.6666666667, ans=0.125
2024-06-21 19:34:23,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=443234.0, ans=0.05
2024-06-21 19:34:26,974 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.374e+02 2.536e+02 2.733e+02 3.653e+02, threshold=5.072e+02, percent-clipped=0.0
2024-06-21 19:34:35,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=443270.6666666667, ans=0.0
2024-06-21 19:34:43,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.44 vs. limit=22.5
2024-06-21 19:34:44,612 INFO [train.py:1028] (0/2) Epoch 24, batch 9100, loss[loss=0.239, simple_loss=0.3056, pruned_loss=0.08623, over 13052.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2748, pruned_loss=0.07632, over 2568836.91 frames. ], batch size: 71, lr: 2.39e-03, grad_scale: 64.0
2024-06-21 19:34:48,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=443289.0, ans=0.0
2024-06-21 19:34:48,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=443289.0, ans=0.0
2024-06-21 19:34:52,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=443307.3333333333, ans=0.025
2024-06-21 19:35:00,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443325.6666666667, ans=0.125
2024-06-21 19:35:02,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=443325.6666666667, ans=0.125
2024-06-21 19:35:03,749 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.33 vs. limit=15.0
2024-06-21 19:35:05,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=443344.0, ans=0.0
2024-06-21 19:35:07,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443344.0, ans=0.125
2024-06-21 19:35:12,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=443362.3333333333, ans=0.125
2024-06-21 19:35:16,383 INFO [train.py:1028] (0/2) Epoch 24, batch 9150, loss[loss=0.2075, simple_loss=0.2748, pruned_loss=0.07008, over 13172.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2746, pruned_loss=0.07647, over 2568991.56 frames. ], batch size: 77, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:35:17,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=443380.6666666667, ans=0.0
2024-06-21 19:35:26,186 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:35:29,507 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:35:35,045 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.280e+02 2.401e+02 2.548e+02 3.542e+02, threshold=4.801e+02, percent-clipped=0.0
2024-06-21 19:35:41,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.34 vs. limit=15.0
2024-06-21 19:35:41,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.84 vs. limit=10.0
2024-06-21 19:35:51,267 INFO [train.py:1028] (0/2) Epoch 24, batch 9200, loss[loss=0.2442, simple_loss=0.3093, pruned_loss=0.08956, over 12963.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2742, pruned_loss=0.07572, over 2570918.94 frames. ], batch size: 36, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:36:09,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=443509.0, ans=0.2
2024-06-21 19:36:16,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.80 vs. limit=6.0
2024-06-21 19:36:17,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=443545.6666666667, ans=0.125
2024-06-21 19:36:19,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443545.6666666667, ans=0.125
2024-06-21 19:36:22,655 INFO [train.py:1028] (0/2) Epoch 24, batch 9250, loss[loss=0.1895, simple_loss=0.2558, pruned_loss=0.0616, over 13267.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2742, pruned_loss=0.07571, over 2573098.55 frames. ], batch size: 67, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:36:38,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=443600.6666666667, ans=0.125
2024-06-21 19:36:38,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=443600.6666666667, ans=0.125
2024-06-21 19:36:41,026 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.833e+02 2.307e+02 2.462e+02 2.633e+02 3.050e+02, threshold=4.924e+02, percent-clipped=0.0
2024-06-21 19:36:54,465 INFO [train.py:1028] (0/2) Epoch 24, batch 9300, loss[loss=0.202, simple_loss=0.2691, pruned_loss=0.06749, over 12985.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2738, pruned_loss=0.07547, over 2570659.74 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:36:56,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=443655.6666666667, ans=0.125
2024-06-21 19:36:59,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=443655.6666666667, ans=0.05
2024-06-21 19:37:04,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=443674.0, ans=0.025
2024-06-21 19:37:06,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0
2024-06-21 19:37:10,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=443692.3333333333, ans=0.125
2024-06-21 19:37:11,215 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=443692.3333333333, ans=0.2
2024-06-21 19:37:21,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=443729.0, ans=0.125
2024-06-21 19:37:21,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=443729.0, ans=0.125
2024-06-21 19:37:22,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=443729.0, ans=0.125
2024-06-21 19:37:24,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=443729.0, ans=0.0
2024-06-21 19:37:26,194 INFO [train.py:1028] (0/2) Epoch 24, batch 9350, loss[loss=0.2018, simple_loss=0.262, pruned_loss=0.07076, over 12583.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2736, pruned_loss=0.07536, over 2568073.48 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:37:44,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.333e+02 2.515e+02 2.682e+02 3.328e+02, threshold=5.029e+02, percent-clipped=0.0
2024-06-21 19:37:52,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=443820.6666666667, ans=0.05
2024-06-21 19:37:56,646 INFO [train.py:1028] (0/2) Epoch 24, batch 9400, loss[loss=0.2222, simple_loss=0.2896, pruned_loss=0.07738, over 13203.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.274, pruned_loss=0.07564, over 2566816.94 frames. ], batch size: 52, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:37:58,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=443839.0, ans=0.2
2024-06-21 19:38:01,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=443839.0, ans=0.125
2024-06-21 19:38:06,367 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.88 vs. limit=22.5
2024-06-21 19:38:06,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=443857.3333333333, ans=0.0
2024-06-21 19:38:06,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=443857.3333333333, ans=0.125
2024-06-21 19:38:09,859 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.55 vs. limit=15.0
2024-06-21 19:38:10,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=443875.6666666667, ans=0.125
2024-06-21 19:38:17,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443894.0, ans=0.1
2024-06-21 19:38:22,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=443894.0, ans=0.125
2024-06-21 19:38:23,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.46 vs. limit=22.5
2024-06-21 19:38:28,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=443912.3333333333, ans=0.125
2024-06-21 19:38:29,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=22.5
2024-06-21 19:38:29,753 INFO [train.py:1028] (0/2) Epoch 24, batch 9450, loss[loss=0.2148, simple_loss=0.2754, pruned_loss=0.07706, over 12867.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2746, pruned_loss=0.07591, over 2568049.86 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:38:37,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=443949.0, ans=0.125
2024-06-21 19:38:42,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=443967.3333333333, ans=0.125
2024-06-21 19:38:45,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443967.3333333333, ans=0.1
2024-06-21 19:38:47,476 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.355e+02 2.491e+02 2.682e+02 3.504e+02, threshold=4.982e+02, percent-clipped=0.0
2024-06-21 19:39:02,561 INFO [train.py:1028] (0/2) Epoch 24, batch 9500, loss[loss=0.1858, simple_loss=0.2553, pruned_loss=0.05821, over 13272.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2746, pruned_loss=0.07576, over 2577774.12 frames. ], batch size: 43, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:39:10,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=444040.6666666667, ans=0.0
2024-06-21 19:39:22,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=444077.3333333333, ans=0.025
2024-06-21 19:39:25,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=444077.3333333333, ans=0.125
2024-06-21 19:39:25,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0
2024-06-21 19:39:28,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=444095.6666666667, ans=0.0
2024-06-21 19:39:33,301 INFO [train.py:1028] (0/2) Epoch 24, batch 9550, loss[loss=0.2, simple_loss=0.261, pruned_loss=0.06945, over 12879.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2746, pruned_loss=0.07595, over 2573049.55 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:39:34,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=444114.0, ans=0.2
2024-06-21 19:39:39,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=444132.3333333333, ans=0.2
2024-06-21 19:39:44,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=444132.3333333333, ans=0.125
2024-06-21 19:39:51,593 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.316e+02 2.414e+02 2.591e+02 3.190e+02, threshold=4.829e+02, percent-clipped=0.0
2024-06-21 19:40:00,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=444187.3333333333, ans=0.07
2024-06-21 19:40:04,492 INFO [train.py:1028] (0/2) Epoch 24, batch 9600, loss[loss=0.211, simple_loss=0.2668, pruned_loss=0.07763, over 10482.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2744, pruned_loss=0.07593, over 2570310.61 frames. ], batch size: 303, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:40:11,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.19 vs. limit=15.0
2024-06-21 19:40:18,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=444242.3333333333, ans=0.07
2024-06-21 19:40:23,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5
2024-06-21 19:40:28,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=444279.0, ans=0.125
2024-06-21 19:40:33,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=444279.0, ans=0.125
2024-06-21 19:40:35,419 INFO [train.py:1028] (0/2) Epoch 24, batch 9650, loss[loss=0.22, simple_loss=0.2706, pruned_loss=0.08465, over 13092.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2742, pruned_loss=0.07619, over 2560643.79 frames. ], batch size: 132, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:40:36,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=444297.3333333333, ans=0.2
2024-06-21 19:40:40,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=444297.3333333333, ans=0.0
2024-06-21 19:40:48,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=444315.6666666667, ans=0.125
2024-06-21 19:40:55,000 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0
2024-06-21 19:40:55,247 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.348e+02 2.570e+02 2.816e+02 3.829e+02, threshold=5.139e+02, percent-clipped=0.0
2024-06-21 19:40:59,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=444352.3333333333, ans=0.0
2024-06-21 19:41:08,328 INFO [train.py:1028] (0/2) Epoch 24, batch 9700, loss[loss=0.2089, simple_loss=0.2625, pruned_loss=0.0777, over 12973.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2732, pruned_loss=0.07584, over 2556309.65 frames. ], batch size: 144, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:41:08,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=444389.0, ans=0.125
2024-06-21 19:41:29,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=444444.0, ans=0.0
2024-06-21 19:41:35,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444462.3333333333, ans=0.1
2024-06-21 19:41:36,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=444462.3333333333, ans=0.0
2024-06-21 19:41:41,432 INFO [train.py:1028] (0/2) Epoch 24, batch 9750, loss[loss=0.202, simple_loss=0.2648, pruned_loss=0.06963, over 13092.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2721, pruned_loss=0.07518, over 2552383.02 frames. ], batch size: 132, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:41:57,468 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:41:57,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=444517.3333333333, ans=0.125
2024-06-21 19:41:59,672 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.741e+02 2.253e+02 2.395e+02 2.635e+02 3.379e+02, threshold=4.790e+02, percent-clipped=0.0
2024-06-21 19:42:03,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=444535.6666666667, ans=0.125
2024-06-21 19:42:05,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=444554.0, ans=0.025
2024-06-21 19:42:12,636 INFO [train.py:1028] (0/2) Epoch 24, batch 9800, loss[loss=0.2052, simple_loss=0.2687, pruned_loss=0.07088, over 12979.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2713, pruned_loss=0.07459, over 2545087.34 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:42:13,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=444572.3333333333, ans=0.95
2024-06-21 19:42:16,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=444572.3333333333, ans=0.2
2024-06-21 19:42:38,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=444645.6666666667, ans=0.95
2024-06-21 19:42:43,910 INFO [train.py:1028] (0/2) Epoch 24, batch 9850, loss[loss=0.1991, simple_loss=0.2582, pruned_loss=0.06998, over 13069.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.271, pruned_loss=0.07445, over 2538092.69 frames. ], batch size: 102, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:42:44,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.97 vs. limit=22.5
2024-06-21 19:42:49,852 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.36 vs. limit=15.0
2024-06-21 19:43:02,983 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.325e+02 2.452e+02 2.627e+02 3.314e+02, threshold=4.903e+02, percent-clipped=0.0
2024-06-21 19:43:08,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=444737.3333333333, ans=0.025
2024-06-21 19:43:12,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=444737.3333333333, ans=0.125
2024-06-21 19:43:15,389 INFO [train.py:1028] (0/2) Epoch 24, batch 9900, loss[loss=0.2006, simple_loss=0.2682, pruned_loss=0.06647, over 12944.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2708, pruned_loss=0.07467, over 2531208.05 frames. ], batch size: 39, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:43:22,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=444774.0, ans=0.0
2024-06-21 19:43:22,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=444774.0, ans=0.125
2024-06-21 19:43:41,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=444829.0, ans=0.025
2024-06-21 19:43:47,500 INFO [train.py:1028] (0/2) Epoch 24, batch 9950, loss[loss=0.2214, simple_loss=0.2789, pruned_loss=0.08194, over 12982.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2704, pruned_loss=0.07505, over 2527060.40 frames. ], batch size: 30, lr: 2.39e-03, grad_scale: 16.0
2024-06-21 19:44:06,209 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.304e+02 2.432e+02 2.633e+02 3.200e+02, threshold=4.863e+02, percent-clipped=0.0
2024-06-21 19:44:10,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=444902.3333333333, ans=0.0
2024-06-21 19:44:19,364 INFO [train.py:1028] (0/2) Epoch 24, batch 10000, loss[loss=0.2, simple_loss=0.2718, pruned_loss=0.06409, over 12466.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2705, pruned_loss=0.07525, over 2488745.48 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:44:20,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=444939.0, ans=0.2
2024-06-21 19:44:20,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.57 vs. limit=15.0
2024-06-21 19:44:25,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=444957.3333333333, ans=0.0
2024-06-21 19:44:39,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444994.0, ans=0.1
2024-06-21 19:44:42,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=444994.0, ans=0.0
2024-06-21 19:44:42,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=444994.0, ans=0.0
2024-06-21 19:44:45,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=445012.3333333333, ans=0.0
2024-06-21 19:44:50,221 INFO [train.py:1028] (0/2) Epoch 24, batch 10050, loss[loss=0.1941, simple_loss=0.2618, pruned_loss=0.06315, over 12677.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2711, pruned_loss=0.076, over 2447019.84 frames. ], batch size: 22, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:44:57,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=445049.0, ans=0.0
2024-06-21 19:45:05,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445067.3333333333, ans=0.1
2024-06-21 19:45:08,050 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.338e+02 2.538e+02 2.839e+02 4.307e+02, threshold=5.076e+02, percent-clipped=0.0
2024-06-21 19:45:09,001 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.00 vs. limit=22.5
2024-06-21 19:45:13,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.58 vs. limit=22.5
2024-06-21 19:45:16,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=445104.0, ans=0.0
2024-06-21 19:45:20,747 INFO [train.py:1028] (0/2) Epoch 24, batch 10100, loss[loss=0.173, simple_loss=0.2345, pruned_loss=0.05571, over 11009.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2698, pruned_loss=0.07529, over 2426466.82 frames. ], batch size: 16, lr: 2.39e-03, grad_scale: 32.0
2024-06-21 19:45:24,132 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:45:25,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=445122.3333333333, ans=0.0
2024-06-21 19:45:26,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=445140.6666666667, ans=0.05
2024-06-21 19:45:28,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.88 vs. limit=15.0
2024-06-21 19:45:34,421 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-24.pt
2024-06-21 19:47:37,740 INFO [train.py:1028] (0/2) Epoch 25, batch 0, loss[loss=0.1671, simple_loss=0.2315, pruned_loss=0.05139, over 12923.00 frames. ], tot_loss[loss=0.1671, simple_loss=0.2315, pruned_loss=0.05139, over 12923.00 frames. ], batch size: 36, lr: 2.34e-03, grad_scale: 32.0
2024-06-21 19:47:37,742 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 19:47:44,613 INFO [train.py:1060] (0/2) Epoch 25, validation: loss=0.1898, simple_loss=0.2523, pruned_loss=0.06367, over 351949.00 frames.
2024-06-21 19:47:44,613 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-21 19:48:07,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=445210.3333333333, ans=0.0
2024-06-21 19:48:09,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=445210.3333333333, ans=0.125
2024-06-21 19:48:15,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=445228.6666666667, ans=0.125
2024-06-21 19:48:20,335 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=15.0
2024-06-21 19:48:21,487 INFO [train.py:1028] (0/2) Epoch 25, batch 50, loss[loss=0.1899, simple_loss=0.2582, pruned_loss=0.06078, over 12973.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2546, pruned_loss=0.07071, over 574812.09 frames. ], batch size: 30, lr: 2.34e-03, grad_scale: 32.0
2024-06-21 19:48:22,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.99 vs. limit=12.0
2024-06-21 19:48:29,105 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.162e+02 2.321e+02 2.491e+02 3.123e+02, threshold=4.643e+02, percent-clipped=0.0
2024-06-21 19:48:34,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=445283.6666666667, ans=0.09899494936611666
2024-06-21 19:48:42,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=445302.0, ans=0.125
2024-06-21 19:48:54,832 INFO [train.py:1028] (0/2) Epoch 25, batch 100, loss[loss=0.1936, simple_loss=0.2526, pruned_loss=0.06728, over 13272.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2536, pruned_loss=0.0701, over 1017535.54 frames. ], batch size: 46, lr: 2.34e-03, grad_scale: 32.0
2024-06-21 19:49:06,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=445357.0, ans=0.125
2024-06-21 19:49:09,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=445375.3333333333, ans=0.09899494936611666
2024-06-21 19:49:18,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.34 vs. limit=15.0
2024-06-21 19:49:24,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.98 vs. limit=10.0
2024-06-21 19:49:25,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.39 vs. limit=12.0
2024-06-21 19:49:26,578 INFO [train.py:1028] (0/2) Epoch 25, batch 150, loss[loss=0.175, simple_loss=0.2395, pruned_loss=0.05525, over 12569.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2522, pruned_loss=0.06824, over 1365435.88 frames. ], batch size: 29, lr: 2.34e-03, grad_scale: 32.0
2024-06-21 19:49:33,976 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.146e+02 2.262e+02 2.491e+02 3.058e+02, threshold=4.524e+02, percent-clipped=0.0
2024-06-21 19:49:34,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=445448.6666666667, ans=0.125
2024-06-21 19:49:36,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=445448.6666666667, ans=0.2
2024-06-21 19:49:37,491 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:49:38,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=445467.0, ans=0.0
2024-06-21 19:49:53,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=445485.3333333333, ans=0.125
2024-06-21 19:50:01,357 INFO [train.py:1028] (0/2) Epoch 25, batch 200, loss[loss=0.2158, simple_loss=0.2644, pruned_loss=0.08361, over 12585.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2521, pruned_loss=0.06803, over 1635182.72 frames. ], batch size: 202, lr: 2.34e-03, grad_scale: 32.0
2024-06-21 19:50:04,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.75 vs. limit=10.0
2024-06-21 19:50:18,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=445558.6666666667, ans=0.0
2024-06-21 19:50:18,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=445558.6666666667, ans=0.0
2024-06-21 19:50:24,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=445577.0, ans=0.0
2024-06-21 19:50:29,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=445595.3333333333, ans=0.125
2024-06-21 19:50:30,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=445595.3333333333, ans=0.125
2024-06-21 19:50:33,268 INFO [train.py:1028] (0/2) Epoch 25, batch 250, loss[loss=0.1791, simple_loss=0.2331, pruned_loss=0.06256, over 13053.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2519, pruned_loss=0.06788, over 1846680.11 frames. ], batch size: 144, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:50:35,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=445613.6666666667, ans=0.2
2024-06-21 19:50:38,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=445613.6666666667, ans=0.125
2024-06-21 19:50:40,876 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.168e+02 2.263e+02 2.467e+02 3.296e+02, threshold=4.526e+02, percent-clipped=0.0
2024-06-21 19:50:53,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=445668.6666666667, ans=0.125
2024-06-21 19:50:55,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=445668.6666666667, ans=0.0
2024-06-21 19:51:04,190 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5
2024-06-21 19:51:09,414 INFO [train.py:1028] (0/2) Epoch 25, batch 300, loss[loss=0.2204, simple_loss=0.2743, pruned_loss=0.08329, over 13170.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2518, pruned_loss=0.06737, over 2009980.72 frames. ], batch size: 112, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:51:11,077 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.46 vs. limit=15.0
2024-06-21 19:51:16,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=445723.6666666667, ans=0.125
2024-06-21 19:51:20,275 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:51:25,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445742.0, ans=0.1
2024-06-21 19:51:30,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=445760.3333333333, ans=0.125
2024-06-21 19:51:32,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=445760.3333333333, ans=0.125
2024-06-21 19:51:35,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=445778.6666666667, ans=0.125
2024-06-21 19:51:35,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=445778.6666666667, ans=0.025
2024-06-21 19:51:39,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=445778.6666666667, ans=0.125
2024-06-21 19:51:41,700 INFO [train.py:1028] (0/2) Epoch 25, batch 350, loss[loss=0.1792, simple_loss=0.2466, pruned_loss=0.05587, over 12906.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2514, pruned_loss=0.06714, over 2139107.28 frames. ], batch size: 33, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:51:52,150 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.926e+02 2.191e+02 2.350e+02 2.550e+02 3.526e+02, threshold=4.700e+02, percent-clipped=0.0
2024-06-21 19:51:54,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=445815.3333333333, ans=0.015
2024-06-21 19:51:59,579 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.13 vs. limit=15.0
2024-06-21 19:52:07,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=445852.0, ans=0.125
2024-06-21 19:52:16,175 INFO [train.py:1028] (0/2) Epoch 25, batch 400, loss[loss=0.1787, simple_loss=0.2459, pruned_loss=0.05577, over 13307.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2518, pruned_loss=0.06727, over 2239687.64 frames. ], batch size: 63, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:52:25,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=445907.0, ans=0.125
2024-06-21 19:52:26,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=445907.0, ans=0.025
2024-06-21 19:52:27,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=445907.0, ans=0.125
2024-06-21 19:52:29,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=445925.3333333333, ans=0.09899494936611666
2024-06-21 19:52:31,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=445925.3333333333, ans=0.2
2024-06-21 19:52:32,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=445925.3333333333, ans=0.125
2024-06-21 19:52:33,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0
2024-06-21 19:52:34,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.78 vs. limit=22.5
2024-06-21 19:52:34,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=445943.6666666667, ans=0.2
2024-06-21 19:52:38,571 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 19:52:43,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=445962.0, ans=0.125
2024-06-21 19:52:43,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=445962.0, ans=0.2
2024-06-21 19:52:48,270 INFO [train.py:1028] (0/2) Epoch 25, batch 450, loss[loss=0.1931, simple_loss=0.2589, pruned_loss=0.06364, over 13300.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2519, pruned_loss=0.06742, over 2314410.01 frames. ], batch size: 67, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:52:56,059 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.152e+02 2.270e+02 2.439e+02 3.581e+02, threshold=4.541e+02, percent-clipped=0.0
2024-06-21 19:53:10,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=446035.3333333333, ans=0.2
2024-06-21 19:53:23,767 INFO [train.py:1028] (0/2) Epoch 25, batch 500, loss[loss=0.1893, simple_loss=0.242, pruned_loss=0.06832, over 13122.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2524, pruned_loss=0.06746, over 2376347.41 frames. ], batch size: 121, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:53:33,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=446090.3333333333, ans=0.0
2024-06-21 19:53:58,818 INFO [train.py:1028] (0/2) Epoch 25, batch 550, loss[loss=0.1903, simple_loss=0.2398, pruned_loss=0.07041, over 12944.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2518, pruned_loss=0.06734, over 2421363.92 frames. ], batch size: 158, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:53:59,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=446163.6666666667, ans=0.0
2024-06-21 19:54:03,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5
2024-06-21 19:54:06,545 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.233e+02 2.339e+02 2.564e+02 3.148e+02, threshold=4.678e+02, percent-clipped=0.0
2024-06-21 19:54:08,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=446182.0, ans=0.125
2024-06-21 19:54:12,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0
2024-06-21 19:54:18,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=446218.6666666667, ans=0.0
2024-06-21 19:54:26,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=446237.0, ans=0.125
2024-06-21 19:54:30,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.11 vs. limit=22.5
2024-06-21 19:54:30,299 INFO [train.py:1028] (0/2) Epoch 25, batch 600, loss[loss=0.1935, simple_loss=0.234, pruned_loss=0.07646, over 13033.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2517, pruned_loss=0.06741, over 2460694.03 frames. ], batch size: 144, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:54:35,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=446273.6666666667, ans=0.1
2024-06-21 19:54:36,356 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0
2024-06-21 19:54:47,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=446292.0, ans=0.025
2024-06-21 19:54:48,450 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0
2024-06-21 19:54:54,820 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.78 vs. limit=15.0
2024-06-21 19:55:02,406 INFO [train.py:1028] (0/2) Epoch 25, batch 650, loss[loss=0.1919, simple_loss=0.2597, pruned_loss=0.06205, over 13183.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2517, pruned_loss=0.06709, over 2491782.12 frames. ], batch size: 59, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:55:02,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0
2024-06-21 19:55:09,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=446347.0, ans=0.0
2024-06-21 19:55:14,018 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.166e+02 2.281e+02 2.405e+02 3.067e+02, threshold=4.562e+02, percent-clipped=0.0
2024-06-21 19:55:15,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=446365.3333333333, ans=0.125
2024-06-21 19:55:17,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=446365.3333333333, ans=0.125
2024-06-21 19:55:23,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=446383.6666666667, ans=0.0
2024-06-21 19:55:29,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.17 vs. limit=15.0
2024-06-21 19:55:30,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=446402.0, ans=0.125
2024-06-21 19:55:32,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=446420.3333333333, ans=0.0
2024-06-21 19:55:33,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=446420.3333333333, ans=0.2
2024-06-21 19:55:35,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=446420.3333333333, ans=0.0
2024-06-21 19:55:38,481 INFO [train.py:1028] (0/2) Epoch 25, batch 700, loss[loss=0.2007, simple_loss=0.2657, pruned_loss=0.06783, over 13303.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2516, pruned_loss=0.06712, over 2513649.42 frames. ], batch size: 46, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:55:56,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=446475.3333333333, ans=0.125
2024-06-21 19:56:03,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=446493.6666666667, ans=0.0
2024-06-21 19:56:08,353 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=446512.0, ans=0.125
2024-06-21 19:56:13,169 INFO [train.py:1028] (0/2) Epoch 25, batch 750, loss[loss=0.165, simple_loss=0.2274, pruned_loss=0.05128, over 13192.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2514, pruned_loss=0.06664, over 2529166.21 frames. ], batch size: 63, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:56:13,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446530.3333333333, ans=0.1
2024-06-21 19:56:16,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=446530.3333333333, ans=0.0
2024-06-21 19:56:20,589 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.134e+02 2.285e+02 2.445e+02 2.856e+02, threshold=4.570e+02, percent-clipped=0.0
2024-06-21 19:56:32,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=446585.3333333333, ans=0.125
2024-06-21 19:56:45,036 INFO [train.py:1028] (0/2) Epoch 25, batch 800, loss[loss=0.1985, simple_loss=0.2657, pruned_loss=0.06561, over 12893.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2514, pruned_loss=0.06674, over 2541933.17 frames. ], batch size: 36, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:56:55,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=446640.3333333333, ans=0.1
2024-06-21 19:56:59,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=446658.6666666667, ans=0.0
2024-06-21 19:57:17,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=446695.3333333333, ans=0.125
2024-06-21 19:57:19,789 INFO [train.py:1028] (0/2) Epoch 25, batch 850, loss[loss=0.1862, simple_loss=0.2485, pruned_loss=0.06198, over 13139.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2518, pruned_loss=0.06674, over 2552883.37 frames. ], batch size: 95, lr: 2.33e-03, grad_scale: 32.0
2024-06-21 19:57:27,245 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.149e+02 2.247e+02 2.412e+02 3.418e+02, threshold=4.493e+02, percent-clipped=0.0
2024-06-21 19:57:27,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=446732.0, ans=0.0
2024-06-21 19:57:27,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=446732.0, ans=0.125
2024-06-21 19:57:39,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=446768.6666666667, ans=0.0
2024-06-21 19:57:41,141 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=446768.6666666667, ans=0.09899494936611666
2024-06-21 19:57:54,362 INFO [train.py:1028] (0/2) Epoch 25, batch 900, loss[loss=0.1799, simple_loss=0.2402, pruned_loss=0.05977, over 12935.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2511, pruned_loss=0.0665, over 2557931.83 frames. ], batch size: 36, lr: 2.33e-03, grad_scale: 32.0
], batch size: 36, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:57:57,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=446805.3333333333, ans=0.125 2024-06-21 19:57:59,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=446805.3333333333, ans=0.2 2024-06-21 19:58:01,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=446823.6666666667, ans=0.125 2024-06-21 19:58:14,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=446860.3333333333, ans=0.1 2024-06-21 19:58:14,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=446860.3333333333, ans=0.04949747468305833 2024-06-21 19:58:16,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=446860.3333333333, ans=0.125 2024-06-21 19:58:25,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=446878.6666666667, ans=0.1 2024-06-21 19:58:27,188 INFO [train.py:1028] (0/2) Epoch 25, batch 950, loss[loss=0.1866, simple_loss=0.2435, pruned_loss=0.06489, over 12961.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.251, pruned_loss=0.06656, over 2560769.97 frames. ], batch size: 39, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:58:27,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446897.0, ans=0.1 2024-06-21 19:58:34,688 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.180e+02 2.343e+02 2.592e+02 3.113e+02, threshold=4.686e+02, percent-clipped=0.0 2024-06-21 19:58:38,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=446915.3333333333, ans=0.0 2024-06-21 19:58:39,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2024-06-21 19:58:46,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=446952.0, ans=0.1 2024-06-21 19:58:50,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=446952.0, ans=0.125 2024-06-21 19:58:52,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=446970.3333333333, ans=0.1 2024-06-21 19:58:53,064 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2024-06-21 19:58:53,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446970.3333333333, ans=0.125 2024-06-21 19:58:59,275 INFO [train.py:1028] (0/2) Epoch 25, batch 1000, loss[loss=0.2135, simple_loss=0.268, pruned_loss=0.07947, over 13265.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.251, pruned_loss=0.06699, over 2562727.55 frames. 
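The Whitening lines compare a per-module `metric` against a `limit`, with channels optionally split into groups (num_groups=4 for the whiten_keys entries above). A plausible reading, stated here as an assumption rather than the documented definition: the metric measures how far the channel covariance of an activation is from a multiple of the identity, equal to 1.0 for perfectly "white" features and growing as variance concentrates in a few directions, with a corrective penalty active only while metric exceeds limit. One such metric:

```python
# Illustrative sketch, assuming one plausible whitening metric:
#   metric = num_channels * trace(C @ C) / trace(C)**2
# where C is the channel covariance. This equals 1.0 when C is a
# multiple of the identity and grows as the spectrum becomes uneven.
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels are split into groups.
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        cov = xg.t() @ xg / n          # per-group channel covariance
        k = cov.shape[0]
        metrics.append(k * (cov * cov).sum() / cov.trace() ** 2)
    return torch.stack(metrics).mean().item()


x = torch.randn(1000, 512)
print(f"metric={whitening_metric(x):.2f} vs. limit=15.0")  # close to 1 for white noise
```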
], batch size: 49, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:59:02,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.36 vs. limit=22.5 2024-06-21 19:59:09,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=447007.0, ans=0.125 2024-06-21 19:59:11,058 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.02 vs. limit=15.0 2024-06-21 19:59:17,197 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.445e+01 2024-06-21 19:59:32,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=447062.0, ans=0.125 2024-06-21 19:59:34,309 INFO [train.py:1028] (0/2) Epoch 25, batch 1050, loss[loss=0.1844, simple_loss=0.247, pruned_loss=0.06089, over 13157.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2516, pruned_loss=0.06707, over 2565423.03 frames. ], batch size: 77, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 19:59:44,915 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.193e+02 2.284e+02 2.427e+02 3.148e+02, threshold=4.569e+02, percent-clipped=0.0 2024-06-21 19:59:55,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=447117.0, ans=0.0 2024-06-21 19:59:56,486 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2024-06-21 20:00:08,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=447153.6666666667, ans=0.125 2024-06-21 20:00:09,364 INFO [train.py:1028] (0/2) Epoch 25, batch 1100, loss[loss=0.1894, simple_loss=0.2552, pruned_loss=0.0618, over 13265.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2525, pruned_loss=0.06723, over 2571876.67 frames. ], batch size: 52, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:00:15,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2024-06-21 20:00:18,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=447190.3333333333, ans=0.0 2024-06-21 20:00:24,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=447208.6666666667, ans=0.2 2024-06-21 20:00:27,225 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.46 vs. limit=15.0 2024-06-21 20:00:42,358 INFO [train.py:1028] (0/2) Epoch 25, batch 1150, loss[loss=0.2016, simple_loss=0.2606, pruned_loss=0.07129, over 13270.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2522, pruned_loss=0.06732, over 2572657.71 frames. 
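Each train.py:1028 line reports three per-batch numbers: loss, simple_loss, and pruned_loss. Throughout this section they satisfy, to rounding, loss = 0.5 * simple_loss + pruned_loss, so the headline loss appears to be a fixed weighted combination of the simple and pruned transducer losses; the 0.5 weight is inferred from the logged values, not read from the code. Checking the batch-1000 entry just above:

```python
# Sanity-check the inferred combination on the batch-1000 entry above.
loss, simple_loss, pruned_loss = 0.2135, 0.2680, 0.07947
assert abs(loss - (0.5 * simple_loss + pruned_loss)) < 5e-4  # 0.21347 ~= 0.2135
```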
], batch size: 52, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:00:47,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=447263.6666666667, ans=0.125 2024-06-21 20:00:49,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=447282.0, ans=0.0 2024-06-21 20:00:50,152 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.182e+02 2.315e+02 2.456e+02 2.888e+02, threshold=4.629e+02, percent-clipped=0.0 2024-06-21 20:00:59,762 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.86 vs. limit=22.5 2024-06-21 20:01:09,368 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-244000.pt 2024-06-21 20:01:17,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=447318.6666666667, ans=0.0 2024-06-21 20:01:19,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=447337.0, ans=0.2 2024-06-21 20:01:20,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=447337.0, ans=0.125 2024-06-21 20:01:20,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=447337.0, ans=0.07 2024-06-21 20:01:26,154 INFO [train.py:1028] (0/2) Epoch 25, batch 1200, loss[loss=0.1882, simple_loss=0.26, pruned_loss=0.05818, over 13123.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2519, pruned_loss=0.0674, over 2574762.11 frames. ], batch size: 77, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:01:27,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=447355.3333333333, ans=0.125 2024-06-21 20:01:29,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=447355.3333333333, ans=0.1 2024-06-21 20:01:31,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=447355.3333333333, ans=0.0 2024-06-21 20:01:35,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=447373.6666666667, ans=0.1 2024-06-21 20:01:51,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=447428.6666666667, ans=0.125 2024-06-21 20:01:54,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=447428.6666666667, ans=0.125 2024-06-21 20:02:00,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.41 vs. limit=15.0 2024-06-21 20:02:01,696 INFO [train.py:1028] (0/2) Epoch 25, batch 1250, loss[loss=0.1656, simple_loss=0.2253, pruned_loss=0.05298, over 13171.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2515, pruned_loss=0.06694, over 2583668.66 frames. ], batch size: 112, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:02:07,222 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
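The checkpoint.py:75 entry above saves checkpoint-244000.pt, named by the global batch index rather than the epoch, i.e. periodic mid-epoch checkpointing. A minimal sketch, assuming a fixed save-every-N-batches rule and standard torch.save serialization; the exact contents of the real checkpoint dict are assumptions.

```python
# Illustrative sketch of batch-indexed checkpointing with torch.save.
from pathlib import Path

import torch


def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int) -> None:
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    filename = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
    print(f"Saving checkpoint to {filename}")
```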
limit=15.0 2024-06-21 20:02:08,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=447465.3333333333, ans=0.125 2024-06-21 20:02:09,330 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.197e+02 2.286e+02 2.506e+02 3.473e+02, threshold=4.573e+02, percent-clipped=0.0 2024-06-21 20:02:10,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.60 vs. limit=22.5 2024-06-21 20:02:19,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=447483.6666666667, ans=0.125 2024-06-21 20:02:21,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=447502.0, ans=0.125 2024-06-21 20:02:30,707 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:02:30,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=447520.3333333333, ans=0.125 2024-06-21 20:02:33,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.46 vs. limit=15.0 2024-06-21 20:02:33,759 INFO [train.py:1028] (0/2) Epoch 25, batch 1300, loss[loss=0.1988, simple_loss=0.2534, pruned_loss=0.07208, over 12752.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2518, pruned_loss=0.06704, over 2584240.21 frames. ], batch size: 176, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:02:53,992 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2024-06-21 20:02:55,721 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:03:05,820 INFO [train.py:1028] (0/2) Epoch 25, batch 1350, loss[loss=0.1952, simple_loss=0.2555, pruned_loss=0.06745, over 13194.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2514, pruned_loss=0.06682, over 2586156.85 frames. ], batch size: 59, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:03:13,717 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.171e+02 2.310e+02 2.528e+02 3.139e+02, threshold=4.621e+02, percent-clipped=0.0 2024-06-21 20:03:23,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=447667.0, ans=0.0 2024-06-21 20:03:26,788 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.24 vs. limit=10.0 2024-06-21 20:03:30,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=447685.3333333333, ans=0.125 2024-06-21 20:03:31,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=447685.3333333333, ans=0.125 2024-06-21 20:03:31,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.64 vs. 
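The WithLoss lines report the accumulated value of an auxiliary loss attached to a named tensor, here attention weights; it is 0.000e+00 most of the time and occasionally positive (loss-sum=4.445e+01 a few entries back), suggesting a penalty that only fires when the tensor's statistics leave an allowed region. The sketch below shows that general pattern with a made-up criterion (attention rows becoming too peaked); the actual penalty used in scaling.py is not specified by the log and everything here is an assumption.

```python
# Illustrative sketch: an auxiliary penalty attached to a tensor, logged
# as a running sum. The criterion below is a made-up example of the
# pattern, not the actual one.
import torch


def attention_entropy_penalty(attn: torch.Tensor, min_entropy: float = 1.0) -> torch.Tensor:
    # attn: (..., num_queries, num_keys); rows are probability distributions.
    entropy = -(attn.clamp_min(1e-20).log() * attn).sum(dim=-1)
    # Zero while entropy is high enough; positive only when rows collapse.
    return (min_entropy - entropy).clamp_min(0.0).sum()


attn = torch.softmax(torch.randn(2, 4, 10, 10), dim=-1)
loss_sum = float(attention_entropy_penalty(attn))
print(f"WithLoss: name=self_attn_weights, loss-sum={loss_sum:.3e}")
```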
limit=15.0 2024-06-21 20:03:36,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447703.6666666667, ans=0.1 2024-06-21 20:03:41,222 INFO [train.py:1028] (0/2) Epoch 25, batch 1400, loss[loss=0.2141, simple_loss=0.2794, pruned_loss=0.07442, over 12775.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2519, pruned_loss=0.06721, over 2587966.75 frames. ], batch size: 26, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:03:43,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=447722.0, ans=0.125 2024-06-21 20:03:49,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=447740.3333333333, ans=0.125 2024-06-21 20:03:50,391 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.08 vs. limit=15.0 2024-06-21 20:04:06,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=447777.0, ans=15.0 2024-06-21 20:04:16,148 INFO [train.py:1028] (0/2) Epoch 25, batch 1450, loss[loss=0.183, simple_loss=0.2366, pruned_loss=0.06471, over 13128.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.251, pruned_loss=0.06708, over 2588240.42 frames. ], batch size: 121, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:04:16,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447813.6666666667, ans=0.0 2024-06-21 20:04:24,120 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.201e+02 2.312e+02 2.519e+02 3.489e+02, threshold=4.625e+02, percent-clipped=0.0 2024-06-21 20:04:33,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.09 vs. limit=10.0 2024-06-21 20:04:45,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=447887.0, ans=0.0 2024-06-21 20:04:48,746 INFO [train.py:1028] (0/2) Epoch 25, batch 1500, loss[loss=0.199, simple_loss=0.2573, pruned_loss=0.07039, over 13211.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2512, pruned_loss=0.06715, over 2590124.13 frames. ], batch size: 83, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:04:53,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=447905.3333333333, ans=0.125 2024-06-21 20:04:58,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2024-06-21 20:05:01,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.20 vs. 
limit=15.0 2024-06-21 20:05:03,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=447942.0, ans=0.0 2024-06-21 20:05:07,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=447960.3333333333, ans=0.125 2024-06-21 20:05:17,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=447978.6666666667, ans=0.2 2024-06-21 20:05:17,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=447978.6666666667, ans=0.125 2024-06-21 20:05:21,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=447978.6666666667, ans=0.125 2024-06-21 20:05:24,102 INFO [train.py:1028] (0/2) Epoch 25, batch 1550, loss[loss=0.1872, simple_loss=0.2477, pruned_loss=0.06336, over 13041.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.252, pruned_loss=0.06756, over 2585141.96 frames. ], batch size: 102, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:05:27,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=447997.0, ans=0.0 2024-06-21 20:05:32,039 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.179e+02 2.296e+02 2.461e+02 3.419e+02, threshold=4.591e+02, percent-clipped=0.0 2024-06-21 20:05:38,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=448033.6666666667, ans=0.0 2024-06-21 20:05:41,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=448033.6666666667, ans=0.1 2024-06-21 20:05:49,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=448052.0, ans=0.0 2024-06-21 20:05:59,280 INFO [train.py:1028] (0/2) Epoch 25, batch 1600, loss[loss=0.1781, simple_loss=0.2355, pruned_loss=0.06031, over 13158.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2522, pruned_loss=0.06767, over 2579008.85 frames. ], batch size: 77, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:06:16,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448125.3333333333, ans=0.1 2024-06-21 20:06:16,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=448125.3333333333, ans=0.2 2024-06-21 20:06:18,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=448143.6666666667, ans=0.125 2024-06-21 20:06:25,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=448162.0, ans=0.125 2024-06-21 20:06:31,163 INFO [train.py:1028] (0/2) Epoch 25, batch 1650, loss[loss=0.2009, simple_loss=0.2527, pruned_loss=0.07456, over 13185.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2522, pruned_loss=0.06778, over 2575373.99 frames. 
], batch size: 95, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:06:37,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=448198.6666666667, ans=0.125 2024-06-21 20:06:37,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=448198.6666666667, ans=0.125 2024-06-21 20:06:37,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=448198.6666666667, ans=0.09899494936611666 2024-06-21 20:06:38,836 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.191e+02 2.307e+02 2.446e+02 3.117e+02, threshold=4.613e+02, percent-clipped=0.0 2024-06-21 20:06:39,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=448198.6666666667, ans=0.2 2024-06-21 20:06:46,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=448217.0, ans=0.2 2024-06-21 20:06:46,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.48 vs. limit=15.0 2024-06-21 20:06:48,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=448217.0, ans=0.0 2024-06-21 20:06:54,819 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2024-06-21 20:06:56,515 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:07:01,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2024-06-21 20:07:03,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448272.0, ans=0.1 2024-06-21 20:07:04,506 INFO [train.py:1028] (0/2) Epoch 25, batch 1700, loss[loss=0.2091, simple_loss=0.2755, pruned_loss=0.07129, over 12405.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2519, pruned_loss=0.06736, over 2581092.12 frames. ], batch size: 25, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:07:09,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=448272.0, ans=0.0 2024-06-21 20:07:14,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.24 vs. limit=22.5 2024-06-21 20:07:16,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=448290.3333333333, ans=0.125 2024-06-21 20:07:16,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=448290.3333333333, ans=0.125 2024-06-21 20:07:19,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.30 vs. 
limit=12.0 2024-06-21 20:07:38,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448345.3333333333, ans=0.1 2024-06-21 20:07:39,952 INFO [train.py:1028] (0/2) Epoch 25, batch 1750, loss[loss=0.2152, simple_loss=0.2747, pruned_loss=0.07781, over 12524.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2526, pruned_loss=0.06762, over 2581797.71 frames. ], batch size: 22, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:07:41,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=448363.6666666667, ans=0.125 2024-06-21 20:07:47,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=448382.0, ans=0.125 2024-06-21 20:07:47,588 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.245e+02 2.363e+02 2.492e+02 3.175e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-21 20:08:02,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=448400.3333333333, ans=0.125 2024-06-21 20:08:03,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=448418.6666666667, ans=0.0 2024-06-21 20:08:11,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=448437.0, ans=0.125 2024-06-21 20:08:12,725 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.45 vs. limit=15.0 2024-06-21 20:08:16,652 INFO [train.py:1028] (0/2) Epoch 25, batch 1800, loss[loss=0.1911, simple_loss=0.2524, pruned_loss=0.06489, over 13247.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2533, pruned_loss=0.06826, over 2582515.00 frames. ], batch size: 67, lr: 2.33e-03, grad_scale: 32.0 2024-06-21 20:08:16,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=448455.3333333333, ans=0.0 2024-06-21 20:08:17,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2024-06-21 20:08:24,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=448473.6666666667, ans=0.125 2024-06-21 20:08:37,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=448510.3333333333, ans=0.125 2024-06-21 20:08:42,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=448528.6666666667, ans=0.025 2024-06-21 20:08:49,373 INFO [train.py:1028] (0/2) Epoch 25, batch 1850, loss[loss=0.1981, simple_loss=0.2526, pruned_loss=0.07175, over 13194.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2529, pruned_loss=0.06807, over 2584348.93 frames. 
], batch size: 83, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:08:56,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=448565.3333333333, ans=0.025 2024-06-21 20:08:57,064 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.166e+02 2.270e+02 2.447e+02 3.342e+02, threshold=4.540e+02, percent-clipped=0.0 2024-06-21 20:08:59,891 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:09:02,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448583.6666666667, ans=0.1 2024-06-21 20:09:07,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=448583.6666666667, ans=0.0 2024-06-21 20:09:07,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2024-06-21 20:09:08,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=448602.0, ans=0.125 2024-06-21 20:09:15,861 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2024-06-21 20:09:18,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0 2024-06-21 20:09:22,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448620.3333333333, ans=0.1 2024-06-21 20:09:23,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=448620.3333333333, ans=0.0 2024-06-21 20:09:24,884 INFO [train.py:1028] (0/2) Epoch 25, batch 1900, loss[loss=0.2045, simple_loss=0.2594, pruned_loss=0.07484, over 13162.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2525, pruned_loss=0.06802, over 2586909.11 frames. ], batch size: 95, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:09:45,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=448693.6666666667, ans=0.0 2024-06-21 20:09:48,650 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.28 vs. limit=15.0 2024-06-21 20:09:55,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=448712.0, ans=0.125 2024-06-21 20:10:01,199 INFO [train.py:1028] (0/2) Epoch 25, batch 1950, loss[loss=0.1751, simple_loss=0.233, pruned_loss=0.05861, over 13291.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.252, pruned_loss=0.06812, over 2592664.38 frames. 
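grad_scale doubles from 32.0 to 64.0 at this point, then drops back to 32.0 and later 16.0 further down the section, which is the signature of dynamic fp16 loss scaling: the scale grows after a run of overflow-free steps and is cut when inf/nan gradients appear. A minimal sketch with PyTorch's stock GradScaler; the training script may wire this up differently.

```python
# Illustrative AMP training step with dynamic loss scaling.
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # scale grows ~2x periodically, halves on inf/nan


def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with autocast():
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips step on overflow
    scaler.update()                # adjust the scale for the next step
    return loss.detach(), scaler.get_scale()
```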
], batch size: 52, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:10:02,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=448730.3333333333, ans=0.2 2024-06-21 20:10:09,072 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.243e+02 2.372e+02 2.540e+02 2.975e+02, threshold=4.744e+02, percent-clipped=0.0 2024-06-21 20:10:21,404 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:10:24,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=448785.3333333333, ans=0.125 2024-06-21 20:10:25,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=448785.3333333333, ans=0.0 2024-06-21 20:10:26,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448803.6666666667, ans=0.1 2024-06-21 20:10:33,612 INFO [train.py:1028] (0/2) Epoch 25, batch 2000, loss[loss=0.1974, simple_loss=0.2568, pruned_loss=0.06899, over 12538.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2524, pruned_loss=0.06819, over 2588074.89 frames. ], batch size: 22, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:10:39,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=448840.3333333333, ans=0.0 2024-06-21 20:11:05,013 INFO [train.py:1028] (0/2) Epoch 25, batch 2050, loss[loss=0.1904, simple_loss=0.2517, pruned_loss=0.06451, over 12684.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2529, pruned_loss=0.06854, over 2584207.69 frames. ], batch size: 29, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:11:05,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448913.6666666667, ans=0.1 2024-06-21 20:11:05,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.52 vs. limit=6.0 2024-06-21 20:11:15,606 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.180e+02 2.292e+02 2.432e+02 2.900e+02, threshold=4.584e+02, percent-clipped=0.0 2024-06-21 20:11:21,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=448950.3333333333, ans=0.1 2024-06-21 20:11:32,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=448968.6666666667, ans=0.125 2024-06-21 20:11:35,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=448987.0, ans=0.1 2024-06-21 20:11:37,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=448987.0, ans=0.125 2024-06-21 20:11:40,298 INFO [train.py:1028] (0/2) Epoch 25, batch 2100, loss[loss=0.1972, simple_loss=0.259, pruned_loss=0.06773, over 13198.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.253, pruned_loss=0.06808, over 2586127.31 frames. 
], batch size: 59, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:11:43,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449005.3333333333, ans=0.125 2024-06-21 20:11:47,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.26 vs. limit=22.5 2024-06-21 20:11:47,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449023.6666666667, ans=0.1 2024-06-21 20:11:57,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=449042.0, ans=0.0 2024-06-21 20:11:57,800 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.99 vs. limit=15.0 2024-06-21 20:12:01,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.72 vs. limit=12.0 2024-06-21 20:12:07,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=449060.3333333333, ans=15.0 2024-06-21 20:12:08,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=449060.3333333333, ans=0.0 2024-06-21 20:12:12,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.40 vs. limit=15.0 2024-06-21 20:12:15,825 INFO [train.py:1028] (0/2) Epoch 25, batch 2150, loss[loss=0.1726, simple_loss=0.2334, pruned_loss=0.05589, over 13305.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2528, pruned_loss=0.06787, over 2587970.85 frames. ], batch size: 52, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:12:23,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=449115.3333333333, ans=0.07 2024-06-21 20:12:23,825 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.181e+02 2.277e+02 2.467e+02 3.878e+02, threshold=4.555e+02, percent-clipped=0.0 2024-06-21 20:12:30,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=449133.6666666667, ans=0.125 2024-06-21 20:12:34,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=449133.6666666667, ans=0.0 2024-06-21 20:12:41,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=449170.3333333333, ans=0.0 2024-06-21 20:12:45,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.83 vs. limit=15.0 2024-06-21 20:12:48,598 INFO [train.py:1028] (0/2) Epoch 25, batch 2200, loss[loss=0.1984, simple_loss=0.2478, pruned_loss=0.07452, over 13271.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2525, pruned_loss=0.06799, over 2588195.96 frames. 
], batch size: 83, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:12:50,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=449188.6666666667, ans=0.0 2024-06-21 20:12:55,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.11 vs. limit=15.0 2024-06-21 20:13:05,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=449225.3333333333, ans=0.5 2024-06-21 20:13:23,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.50 vs. limit=22.5 2024-06-21 20:13:23,268 INFO [train.py:1028] (0/2) Epoch 25, batch 2250, loss[loss=0.1809, simple_loss=0.2412, pruned_loss=0.06035, over 13236.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2519, pruned_loss=0.06759, over 2585709.10 frames. ], batch size: 63, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:13:27,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=449280.3333333333, ans=0.04949747468305833 2024-06-21 20:13:31,113 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.225e+02 2.391e+02 2.732e+02 3.836e+02, threshold=4.783e+02, percent-clipped=0.0 2024-06-21 20:13:37,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=449317.0, ans=0.125 2024-06-21 20:13:37,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=449317.0, ans=0.0 2024-06-21 20:13:37,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449317.0, ans=0.125 2024-06-21 20:13:39,389 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2024-06-21 20:13:39,894 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:13:40,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=449317.0, ans=0.0 2024-06-21 20:13:46,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=449335.3333333333, ans=0.125 2024-06-21 20:13:48,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.96 vs. 
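Each training line carries both the current batch's loss ("over ~13k frames") and tot_loss over roughly 2.5 million frames. The fractional frame totals (e.g. 2585709.10 above) indicate the running statistics are decayed rather than plain cumulative sums, so tot_loss behaves like a frame-weighted moving average. A sketch of that accumulation; the decay constant is an assumption.

```python
# Illustrative sketch: frame-weighted running loss with exponential decay.
# Fractional "over N frames" totals fall out of the decayed frame count.
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: int) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss_sum
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)
```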
limit=12.0 2024-06-21 20:13:49,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=449353.6666666667, ans=0.0 2024-06-21 20:13:50,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449353.6666666667, ans=0.1 2024-06-21 20:13:51,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=449353.6666666667, ans=0.125 2024-06-21 20:13:58,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=449353.6666666667, ans=0.125 2024-06-21 20:14:00,829 INFO [train.py:1028] (0/2) Epoch 25, batch 2300, loss[loss=0.173, simple_loss=0.2325, pruned_loss=0.05676, over 12810.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2516, pruned_loss=0.06714, over 2579956.05 frames. ], batch size: 33, lr: 2.33e-03, grad_scale: 64.0 2024-06-21 20:14:09,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=449390.3333333333, ans=0.09899494936611666 2024-06-21 20:14:09,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=449390.3333333333, ans=0.2 2024-06-21 20:14:12,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=449390.3333333333, ans=0.1 2024-06-21 20:14:17,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2024-06-21 20:14:24,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=449427.0, ans=0.2 2024-06-21 20:14:25,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=449427.0, ans=0.1 2024-06-21 20:14:29,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=449445.3333333333, ans=0.125 2024-06-21 20:14:33,758 INFO [train.py:1028] (0/2) Epoch 25, batch 2350, loss[loss=0.1916, simple_loss=0.2567, pruned_loss=0.0633, over 13205.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2519, pruned_loss=0.06733, over 2583483.79 frames. ], batch size: 67, lr: 2.32e-03, grad_scale: 64.0 2024-06-21 20:14:41,882 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.190e+02 2.326e+02 2.570e+02 3.067e+02, threshold=4.652e+02, percent-clipped=0.0 2024-06-21 20:14:43,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=449482.0, ans=0.125 2024-06-21 20:15:05,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.08 vs. limit=22.5 2024-06-21 20:15:05,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449555.3333333333, ans=0.125 2024-06-21 20:15:06,100 INFO [train.py:1028] (0/2) Epoch 25, batch 2400, loss[loss=0.1878, simple_loss=0.2436, pruned_loss=0.06598, over 13287.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2515, pruned_loss=0.06725, over 2586639.27 frames. 
], batch size: 46, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:15:06,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=449555.3333333333, ans=0.0 2024-06-21 20:15:28,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449610.3333333333, ans=0.1 2024-06-21 20:15:35,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=449628.6666666667, ans=0.2 2024-06-21 20:15:40,750 INFO [train.py:1028] (0/2) Epoch 25, batch 2450, loss[loss=0.1788, simple_loss=0.234, pruned_loss=0.06182, over 13257.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2507, pruned_loss=0.06729, over 2582663.11 frames. ], batch size: 63, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:15:51,880 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.191e+02 2.307e+02 2.431e+02 3.621e+02, threshold=4.615e+02, percent-clipped=0.0 2024-06-21 20:16:06,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.68 vs. limit=6.0 2024-06-21 20:16:06,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449702.0, ans=0.1 2024-06-21 20:16:08,955 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=449720.3333333333, ans=0.125 2024-06-21 20:16:12,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.32 vs. limit=22.5 2024-06-21 20:16:13,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=449720.3333333333, ans=0.2 2024-06-21 20:16:13,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=449720.3333333333, ans=0.0 2024-06-21 20:16:15,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=449738.6666666667, ans=0.0 2024-06-21 20:16:16,460 INFO [train.py:1028] (0/2) Epoch 25, batch 2500, loss[loss=0.2051, simple_loss=0.2563, pruned_loss=0.07691, over 13202.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2496, pruned_loss=0.06696, over 2586129.05 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:16:20,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=449738.6666666667, ans=0.0 2024-06-21 20:16:23,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=449757.0, ans=0.125 2024-06-21 20:16:28,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=449757.0, ans=0.125 2024-06-21 20:16:35,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=449793.6666666667, ans=0.04949747468305833 2024-06-21 20:16:49,362 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.49 vs. 
limit=15.0 2024-06-21 20:16:49,594 INFO [train.py:1028] (0/2) Epoch 25, batch 2550, loss[loss=0.1898, simple_loss=0.2557, pruned_loss=0.06196, over 12727.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2487, pruned_loss=0.06662, over 2586320.58 frames. ], batch size: 22, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:16:58,157 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.175e+02 2.285e+02 2.422e+02 2.710e+02, threshold=4.570e+02, percent-clipped=0.0 2024-06-21 20:17:00,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=449848.6666666667, ans=0.125 2024-06-21 20:17:02,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=449867.0, ans=0.125 2024-06-21 20:17:09,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449867.0, ans=0.1 2024-06-21 20:17:21,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449903.6666666667, ans=0.125 2024-06-21 20:17:22,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=449903.6666666667, ans=0.0 2024-06-21 20:17:25,837 INFO [train.py:1028] (0/2) Epoch 25, batch 2600, loss[loss=0.1763, simple_loss=0.241, pruned_loss=0.05578, over 13221.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2475, pruned_loss=0.06629, over 2586512.07 frames. ], batch size: 52, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:17:26,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=449922.0, ans=0.125 2024-06-21 20:17:44,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=449958.6666666667, ans=0.125 2024-06-21 20:17:54,923 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.24 vs. limit=22.5 2024-06-21 20:17:56,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=449995.3333333333, ans=0.125 2024-06-21 20:18:00,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=449995.3333333333, ans=0.125 2024-06-21 20:18:01,718 INFO [train.py:1028] (0/2) Epoch 25, batch 2650, loss[loss=0.1971, simple_loss=0.2489, pruned_loss=0.07264, over 13046.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2466, pruned_loss=0.06623, over 2586905.53 frames. ], batch size: 144, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:18:08,028 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.65 vs. limit=15.0 2024-06-21 20:18:10,279 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.166e+02 2.299e+02 2.413e+02 2.978e+02, threshold=4.598e+02, percent-clipped=0.0 2024-06-21 20:18:12,740 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.21 vs. 
limit=15.0 2024-06-21 20:18:33,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=450087.0, ans=0.125 2024-06-21 20:18:34,599 INFO [train.py:1028] (0/2) Epoch 25, batch 2700, loss[loss=0.1762, simple_loss=0.2319, pruned_loss=0.06028, over 13268.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2452, pruned_loss=0.06621, over 2585346.41 frames. ], batch size: 89, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:18:36,724 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:18:45,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.24 vs. limit=22.5 2024-06-21 20:18:53,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=450160.3333333333, ans=0.0 2024-06-21 20:18:55,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=450160.3333333333, ans=0.0 2024-06-21 20:19:09,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=450178.6666666667, ans=0.1 2024-06-21 20:19:10,387 INFO [train.py:1028] (0/2) Epoch 25, batch 2750, loss[loss=0.185, simple_loss=0.2558, pruned_loss=0.05713, over 13249.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2441, pruned_loss=0.06546, over 2581926.80 frames. ], batch size: 43, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:19:13,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=450197.0, ans=0.125 2024-06-21 20:19:13,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=450197.0, ans=0.125 2024-06-21 20:19:18,810 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.132e+02 2.307e+02 2.548e+02 5.189e+02, threshold=4.614e+02, percent-clipped=1.0 2024-06-21 20:19:21,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2024-06-21 20:19:23,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450233.6666666667, ans=0.1 2024-06-21 20:19:23,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.13 vs. limit=15.0 2024-06-21 20:19:26,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=450233.6666666667, ans=0.0 2024-06-21 20:19:45,488 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:19:47,287 INFO [train.py:1028] (0/2) Epoch 25, batch 2800, loss[loss=0.2, simple_loss=0.2459, pruned_loss=0.07709, over 10853.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2436, pruned_loss=0.0656, over 2580427.05 frames. 
], batch size: 303, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:19:49,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=450288.6666666667, ans=0.125 2024-06-21 20:20:03,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.57 vs. limit=15.0 2024-06-21 20:20:04,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=450325.3333333333, ans=0.125 2024-06-21 20:20:06,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=450343.6666666667, ans=0.125 2024-06-21 20:20:07,772 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.93 vs. limit=10.0 2024-06-21 20:20:13,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450362.0, ans=0.1 2024-06-21 20:20:13,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.77 vs. limit=22.5 2024-06-21 20:20:17,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=450362.0, ans=0.0 2024-06-21 20:20:19,464 INFO [train.py:1028] (0/2) Epoch 25, batch 2850, loss[loss=0.1893, simple_loss=0.247, pruned_loss=0.06579, over 13281.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2425, pruned_loss=0.06528, over 2578246.42 frames. ], batch size: 49, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:20:27,482 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.281e+02 2.409e+02 2.625e+02 3.138e+02, threshold=4.817e+02, percent-clipped=0.0 2024-06-21 20:20:31,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=450417.0, ans=0.09899494936611666 2024-06-21 20:20:38,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=450435.3333333333, ans=0.02 2024-06-21 20:20:47,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450453.6666666667, ans=0.125 2024-06-21 20:20:51,225 INFO [train.py:1028] (0/2) Epoch 25, batch 2900, loss[loss=0.1857, simple_loss=0.244, pruned_loss=0.06374, over 13180.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2413, pruned_loss=0.06493, over 2586086.68 frames. 
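Batch sizes in this section range from 22 (long utterances) up to 303 just above (batch 2800, many short utterances), while per-batch frame totals stay comparable. That is the signature of duration-constrained batching: utterances are packed into a batch until a total-duration budget is hit. A minimal sketch of the core idea, not lhotse's DynamicBucketingSampler.

```python
# Illustrative sketch: group utterances so each batch stays under a
# total-duration budget; short-utterance batches get large batch sizes.
from typing import Iterable, Iterator, List, Tuple


def duration_batches(
    cuts: Iterable[Tuple[str, float]],  # (utterance_id, duration in seconds)
    max_duration: float,
) -> Iterator[List[str]]:
    batch: List[str] = []
    total = 0.0
    for cut_id, dur in cuts:
        if batch and total + dur > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut_id)
        total += dur
    if batch:
        yield batch
```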
], batch size: 55, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:20:59,767 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:21:08,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=450508.6666666667, ans=0.125 2024-06-21 20:21:11,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=450508.6666666667, ans=0.2 2024-06-21 20:21:13,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450508.6666666667, ans=0.1 2024-06-21 20:21:16,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=450527.0, ans=0.0 2024-06-21 20:21:28,674 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2024-06-21 20:21:29,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=450545.3333333333, ans=0.125 2024-06-21 20:21:30,683 INFO [train.py:1028] (0/2) Epoch 25, batch 2950, loss[loss=0.1792, simple_loss=0.2379, pruned_loss=0.06028, over 13289.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2413, pruned_loss=0.06488, over 2579427.38 frames. ], batch size: 43, lr: 2.32e-03, grad_scale: 32.0 2024-06-21 20:21:34,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=450563.6666666667, ans=0.125 2024-06-21 20:21:35,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=450563.6666666667, ans=0.0 2024-06-21 20:21:36,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=450582.0, ans=0.2 2024-06-21 20:21:39,407 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.116e+02 2.230e+02 2.402e+02 3.505e+02, threshold=4.460e+02, percent-clipped=0.0 2024-06-21 20:21:45,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=450600.3333333333, ans=0.025 2024-06-21 20:21:47,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=450600.3333333333, ans=0.0 2024-06-21 20:21:47,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=450600.3333333333, ans=0.125 2024-06-21 20:21:51,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=450618.6666666667, ans=0.05 2024-06-21 20:21:57,223 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.06 vs. limit=10.0 2024-06-21 20:21:58,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=450637.0, ans=0.125 2024-06-21 20:22:00,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.33 vs. 
2024-06-21 20:22:04,233 INFO [train.py:1028] (0/2) Epoch 25, batch 3000, loss[loss=0.1695, simple_loss=0.231, pruned_loss=0.05401, over 13188.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2404, pruned_loss=0.06446, over 2576813.42 frames. ], batch size: 59, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:22:04,234 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 20:22:12,045 INFO [train.py:1060] (0/2) Epoch 25, validation: loss=0.1883, simple_loss=0.2506, pruned_loss=0.06299, over 351949.00 frames.
2024-06-21 20:22:12,045 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-21 20:22:27,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=450692.0, ans=0.2
2024-06-21 20:22:29,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=450692.0, ans=0.125
2024-06-21 20:22:30,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=450692.0, ans=0.0
2024-06-21 20:22:38,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.30 vs. limit=6.0
2024-06-21 20:22:45,244 INFO [train.py:1028] (0/2) Epoch 25, batch 3050, loss[loss=0.1783, simple_loss=0.2337, pruned_loss=0.06143, over 13317.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2404, pruned_loss=0.06505, over 2577344.24 frames. ], batch size: 46, lr: 2.32e-03, grad_scale: 16.0
2024-06-21 20:22:46,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=450747.0, ans=0.125
2024-06-21 20:22:48,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=450747.0, ans=0.125
2024-06-21 20:22:57,504 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.160e+02 2.353e+02 2.529e+02 3.412e+02, threshold=4.707e+02, percent-clipped=0.0
2024-06-21 20:22:59,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=450765.3333333333, ans=0.0
2024-06-21 20:23:01,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=450783.6666666667, ans=0.125
2024-06-21 20:23:02,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=450783.6666666667, ans=0.125
2024-06-21 20:23:04,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=450783.6666666667, ans=0.125
2024-06-21 20:23:17,299 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.10 vs. limit=15.0
2024-06-21 20:23:21,617 INFO [train.py:1028] (0/2) Epoch 25, batch 3100, loss[loss=0.1719, simple_loss=0.2242, pruned_loss=0.05982, over 13026.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2403, pruned_loss=0.06502, over 2578251.96 frames. ], batch size: 144, lr: 2.32e-03, grad_scale: 16.0
2024-06-21 20:23:43,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=450875.3333333333, ans=0.125
2024-06-21 20:23:51,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0
2024-06-21 20:23:53,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=450912.0, ans=0.125
2024-06-21 20:23:55,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=450912.0, ans=0.1
2024-06-21 20:23:57,914 INFO [train.py:1028] (0/2) Epoch 25, batch 3150, loss[loss=0.1919, simple_loss=0.2378, pruned_loss=0.07302, over 12899.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2395, pruned_loss=0.06459, over 2580755.53 frames. ], batch size: 158, lr: 2.32e-03, grad_scale: 16.0
2024-06-21 20:24:07,159 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.158e+02 2.321e+02 2.501e+02 3.137e+02, threshold=4.641e+02, percent-clipped=0.0
2024-06-21 20:24:08,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450948.6666666667, ans=0.125
2024-06-21 20:24:18,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=450985.3333333333, ans=0.125
2024-06-21 20:24:24,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=451003.6666666667, ans=0.025
2024-06-21 20:24:26,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451003.6666666667, ans=0.1
2024-06-21 20:24:26,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=451003.6666666667, ans=0.0
2024-06-21 20:24:31,444 INFO [train.py:1028] (0/2) Epoch 25, batch 3200, loss[loss=0.1761, simple_loss=0.2359, pruned_loss=0.05815, over 13156.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2384, pruned_loss=0.06386, over 2580977.77 frames. ], batch size: 55, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:24:34,397 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:24:36,761 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:24:40,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=451040.3333333333, ans=0.0
2024-06-21 20:24:49,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.69 vs. limit=12.0
2024-06-21 20:24:52,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=451077.0, ans=0.125
2024-06-21 20:25:01,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=451095.3333333333, ans=0.2
2024-06-21 20:25:01,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=451095.3333333333, ans=0.2
2024-06-21 20:25:06,680 INFO [train.py:1028] (0/2) Epoch 25, batch 3250, loss[loss=0.168, simple_loss=0.2272, pruned_loss=0.05443, over 13298.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2375, pruned_loss=0.06349, over 2584888.94 frames. ], batch size: 72, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:25:16,493 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.170e+02 2.282e+02 2.473e+02 3.444e+02, threshold=4.564e+02, percent-clipped=0.0
2024-06-21 20:25:35,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=15.0
2024-06-21 20:25:41,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=451187.0, ans=0.125
2024-06-21 20:25:43,790 INFO [train.py:1028] (0/2) Epoch 25, batch 3300, loss[loss=0.198, simple_loss=0.2443, pruned_loss=0.07587, over 12759.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2369, pruned_loss=0.06313, over 2581957.63 frames. ], batch size: 176, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:25:45,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=451205.3333333333, ans=0.2
2024-06-21 20:26:04,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=451260.3333333333, ans=0.0
2024-06-21 20:26:10,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=451278.6666666667, ans=0.2
2024-06-21 20:26:15,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=451297.0, ans=0.125
2024-06-21 20:26:16,272 INFO [train.py:1028] (0/2) Epoch 25, batch 3350, loss[loss=0.1756, simple_loss=0.2293, pruned_loss=0.06097, over 12869.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2366, pruned_loss=0.06326, over 2576905.32 frames. ], batch size: 158, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:26:21,057 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.52 vs. limit=12.0
2024-06-21 20:26:25,569 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 2.123e+02 2.235e+02 2.440e+02 3.127e+02, threshold=4.470e+02, percent-clipped=0.0
2024-06-21 20:26:29,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.92 vs. limit=12.0
2024-06-21 20:26:46,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=451370.3333333333, ans=0.0
2024-06-21 20:26:48,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=451370.3333333333, ans=0.0
2024-06-21 20:26:51,794 INFO [train.py:1028] (0/2) Epoch 25, batch 3400, loss[loss=0.1959, simple_loss=0.259, pruned_loss=0.0664, over 12591.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2363, pruned_loss=0.06356, over 2575856.57 frames. ], batch size: 22, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:26:53,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451388.6666666667, ans=0.1
2024-06-21 20:27:03,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=451407.0, ans=0.125
2024-06-21 20:27:04,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=451425.3333333333, ans=0.125
2024-06-21 20:27:16,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0
2024-06-21 20:27:28,464 INFO [train.py:1028] (0/2) Epoch 25, batch 3450, loss[loss=0.2004, simple_loss=0.25, pruned_loss=0.0754, over 12723.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.2356, pruned_loss=0.0633, over 2576381.90 frames. ], batch size: 176, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:27:30,270 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0
2024-06-21 20:27:36,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=451498.6666666667, ans=0.125
2024-06-21 20:27:37,859 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.157e+02 2.281e+02 2.461e+02 3.280e+02, threshold=4.561e+02, percent-clipped=0.0
2024-06-21 20:27:48,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=451535.3333333333, ans=0.1
2024-06-21 20:27:51,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.69 vs. limit=5.0
2024-06-21 20:28:00,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=451572.0, ans=0.125
2024-06-21 20:28:01,146 INFO [train.py:1028] (0/2) Epoch 25, batch 3500, loss[loss=0.1776, simple_loss=0.2298, pruned_loss=0.06268, over 12850.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2356, pruned_loss=0.06311, over 2575290.41 frames. ], batch size: 33, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:28:02,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=451572.0, ans=0.125
2024-06-21 20:28:05,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=451572.0, ans=0.125
2024-06-21 20:28:06,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=451572.0, ans=0.125
2024-06-21 20:28:13,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=451590.3333333333, ans=0.0
2024-06-21 20:28:17,290 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.631e+01
2024-06-21 20:28:19,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=451608.6666666667, ans=0.2
2024-06-21 20:28:25,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=451627.0, ans=0.07
2024-06-21 20:28:31,737 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:28:33,781 INFO [train.py:1028] (0/2) Epoch 25, batch 3550, loss[loss=0.1681, simple_loss=0.2176, pruned_loss=0.05932, over 13150.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2352, pruned_loss=0.06279, over 2577224.29 frames. ], batch size: 95, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:28:36,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451663.6666666667, ans=0.1
2024-06-21 20:28:42,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.101e+02 2.217e+02 2.330e+02 3.028e+02, threshold=4.434e+02, percent-clipped=0.0
2024-06-21 20:28:53,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=451700.3333333333, ans=0.0
2024-06-21 20:29:00,574 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:29:08,185 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.99 vs. limit=22.5
2024-06-21 20:29:09,132 INFO [train.py:1028] (0/2) Epoch 25, batch 3600, loss[loss=0.1759, simple_loss=0.2394, pruned_loss=0.05619, over 13346.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.235, pruned_loss=0.06282, over 2580619.68 frames. ], batch size: 49, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:29:22,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451773.6666666667, ans=0.1
2024-06-21 20:29:23,826 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=15.0
2024-06-21 20:29:31,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=451792.0, ans=0.09899494936611666
2024-06-21 20:29:36,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451810.3333333333, ans=0.1
2024-06-21 20:29:44,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=451847.0, ans=10.0
2024-06-21 20:29:44,993 INFO [train.py:1028] (0/2) Epoch 25, batch 3650, loss[loss=0.1782, simple_loss=0.2267, pruned_loss=0.06485, over 13019.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.235, pruned_loss=0.06258, over 2579435.83 frames. ], batch size: 102, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:29:53,759 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.166e+02 2.291e+02 2.425e+02 3.154e+02, threshold=4.582e+02, percent-clipped=0.0
2024-06-21 20:29:55,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451865.3333333333, ans=0.1
2024-06-21 20:30:06,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=451902.0, ans=0.1
2024-06-21 20:30:09,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=451902.0, ans=0.125
2024-06-21 20:30:13,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=451920.3333333333, ans=0.125
2024-06-21 20:30:17,831 INFO [train.py:1028] (0/2) Epoch 25, batch 3700, loss[loss=0.1709, simple_loss=0.2247, pruned_loss=0.0585, over 13134.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2345, pruned_loss=0.06251, over 2583752.04 frames. ], batch size: 72, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:30:18,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.22 vs. limit=22.5
2024-06-21 20:30:30,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=451957.0, ans=0.0
2024-06-21 20:30:37,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.59 vs. limit=15.0
2024-06-21 20:30:48,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=452012.0, ans=0.125
2024-06-21 20:30:51,466 INFO [train.py:1028] (0/2) Epoch 25, batch 3750, loss[loss=0.1652, simple_loss=0.2286, pruned_loss=0.05088, over 12297.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2332, pruned_loss=0.06202, over 2585796.99 frames. ], batch size: 22, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:31:02,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=452048.6666666667, ans=0.0
2024-06-21 20:31:04,646 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.092e+02 2.221e+02 2.408e+02 3.356e+02, threshold=4.441e+02, percent-clipped=0.0
2024-06-21 20:31:09,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=452067.0, ans=0.025
2024-06-21 20:31:09,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.15 vs. limit=15.0
2024-06-21 20:31:12,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=452067.0, ans=0.125
2024-06-21 20:31:13,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.03 vs. limit=15.0
2024-06-21 20:31:15,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=452085.3333333333, ans=0.07
2024-06-21 20:31:20,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=452085.3333333333, ans=0.025
2024-06-21 20:31:31,734 INFO [train.py:1028] (0/2) Epoch 25, batch 3800, loss[loss=0.1686, simple_loss=0.2195, pruned_loss=0.05884, over 13180.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2328, pruned_loss=0.06167, over 2583391.21 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:31:34,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=452122.0, ans=0.05
2024-06-21 20:32:05,220 INFO [train.py:1028] (0/2) Epoch 25, batch 3850, loss[loss=0.1894, simple_loss=0.2413, pruned_loss=0.06875, over 13030.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2325, pruned_loss=0.0613, over 2582123.54 frames. ], batch size: 144, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:32:11,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0
2024-06-21 20:32:14,737 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.147e+02 2.332e+02 2.596e+02 3.213e+02, threshold=4.663e+02, percent-clipped=0.0
2024-06-21 20:32:27,988 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:32:29,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=452268.6666666667, ans=0.0
2024-06-21 20:32:30,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=452287.0, ans=0.125
2024-06-21 20:32:38,118 INFO [train.py:1028] (0/2) Epoch 25, batch 3900, loss[loss=0.1818, simple_loss=0.2316, pruned_loss=0.06596, over 13189.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2318, pruned_loss=0.06122, over 2586362.95 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:32:39,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=452305.3333333333, ans=0.1
2024-06-21 20:32:58,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=452360.3333333333, ans=0.09899494936611666
2024-06-21 20:33:08,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=452378.6666666667, ans=0.2
2024-06-21 20:33:10,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=452378.6666666667, ans=0.015
2024-06-21 20:33:11,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.95 vs. limit=10.0
2024-06-21 20:33:14,701 INFO [train.py:1028] (0/2) Epoch 25, batch 3950, loss[loss=0.1838, simple_loss=0.2255, pruned_loss=0.07104, over 13077.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2311, pruned_loss=0.06093, over 2588185.80 frames. ], batch size: 132, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:33:24,073 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.120e+02 2.228e+02 2.476e+02 3.835e+02, threshold=4.456e+02, percent-clipped=0.0
2024-06-21 20:33:24,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452415.3333333333, ans=0.1
2024-06-21 20:33:36,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=452433.6666666667, ans=0.125
2024-06-21 20:33:38,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=452452.0, ans=0.0
2024-06-21 20:33:47,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=452470.3333333333, ans=0.0
2024-06-21 20:33:49,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=452470.3333333333, ans=0.015
2024-06-21 20:33:51,786 INFO [train.py:1028] (0/2) Epoch 25, batch 4000, loss[loss=0.1984, simple_loss=0.2533, pruned_loss=0.0717, over 12966.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2308, pruned_loss=0.06085, over 2582865.29 frames. ], batch size: 39, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:33:57,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=452488.6666666667, ans=0.2
2024-06-21 20:33:57,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=452488.6666666667, ans=0.0
2024-06-21 20:34:11,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452543.6666666667, ans=0.1
2024-06-21 20:34:14,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=452543.6666666667, ans=0.0
2024-06-21 20:34:17,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=452543.6666666667, ans=0.125
2024-06-21 20:34:17,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.57 vs. limit=22.5
2024-06-21 20:34:18,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=452562.0, ans=0.0
2024-06-21 20:34:24,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452580.3333333333, ans=0.1
2024-06-21 20:34:25,445 INFO [train.py:1028] (0/2) Epoch 25, batch 4050, loss[loss=0.1972, simple_loss=0.2376, pruned_loss=0.07841, over 10981.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2308, pruned_loss=0.06098, over 2579946.34 frames. ], batch size: 303, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:34:32,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=452598.6666666667, ans=0.2
2024-06-21 20:34:35,011 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.122e+02 2.253e+02 2.490e+02 3.072e+02, threshold=4.506e+02, percent-clipped=0.0
2024-06-21 20:34:35,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=452598.6666666667, ans=0.125
2024-06-21 20:34:38,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=452617.0, ans=0.125
2024-06-21 20:34:58,983 INFO [train.py:1028] (0/2) Epoch 25, batch 4100, loss[loss=0.1695, simple_loss=0.2123, pruned_loss=0.06334, over 13206.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2303, pruned_loss=0.06113, over 2576597.80 frames. ], batch size: 103, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:35:06,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=452672.0, ans=0.125
2024-06-21 20:35:16,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452708.6666666667, ans=0.1
2024-06-21 20:35:27,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=452727.0, ans=0.0
2024-06-21 20:35:36,407 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.50 vs. limit=12.0
2024-06-21 20:35:40,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=452763.6666666667, ans=0.0
2024-06-21 20:35:40,910 INFO [train.py:1028] (0/2) Epoch 25, batch 4150, loss[loss=0.161, simple_loss=0.2224, pruned_loss=0.04978, over 13162.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2302, pruned_loss=0.06112, over 2575916.67 frames. ], batch size: 55, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:35:50,444 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.204e+02 2.371e+02 2.503e+02 3.689e+02, threshold=4.742e+02, percent-clipped=0.0
2024-06-21 20:35:51,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=452782.0, ans=0.025
2024-06-21 20:36:01,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=452818.6666666667, ans=0.5
2024-06-21 20:36:03,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=452818.6666666667, ans=0.125
2024-06-21 20:36:11,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=452837.0, ans=0.0
2024-06-21 20:36:14,656 INFO [train.py:1028] (0/2) Epoch 25, batch 4200, loss[loss=0.201, simple_loss=0.2458, pruned_loss=0.07813, over 13091.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2305, pruned_loss=0.06133, over 2578195.95 frames. ], batch size: 102, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:36:17,369 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0
2024-06-21 20:36:17,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452855.3333333333, ans=0.1
2024-06-21 20:36:27,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=452892.0, ans=0.125
2024-06-21 20:36:31,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.75 vs. limit=12.0
2024-06-21 20:36:36,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=452910.3333333333, ans=0.125
2024-06-21 20:36:38,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=452910.3333333333, ans=0.04949747468305833
2024-06-21 20:36:41,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.50 vs. limit=15.0
2024-06-21 20:36:42,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=452928.6666666667, ans=0.125
2024-06-21 20:36:47,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.06 vs. limit=15.0
2024-06-21 20:36:48,108 INFO [train.py:1028] (0/2) Epoch 25, batch 4250, loss[loss=0.1861, simple_loss=0.249, pruned_loss=0.06157, over 13315.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.23, pruned_loss=0.06079, over 2580303.61 frames. ], batch size: 46, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:36:54,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=452965.3333333333, ans=0.125
2024-06-21 20:36:57,486 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.069e+02 2.210e+02 2.379e+02 4.416e+02, threshold=4.421e+02, percent-clipped=0.0
2024-06-21 20:36:59,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=452965.3333333333, ans=0.125
2024-06-21 20:37:24,410 INFO [train.py:1028] (0/2) Epoch 25, batch 4300, loss[loss=0.1756, simple_loss=0.2304, pruned_loss=0.06044, over 13253.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2299, pruned_loss=0.06084, over 2581566.37 frames. ], batch size: 59, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:37:27,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=453038.6666666667, ans=0.2
2024-06-21 20:37:45,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=453075.3333333333, ans=0.015
2024-06-21 20:37:49,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=453093.6666666667, ans=0.2
2024-06-21 20:37:50,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=453093.6666666667, ans=0.05
2024-06-21 20:37:52,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453093.6666666667, ans=0.1
2024-06-21 20:37:55,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=453112.0, ans=0.0
2024-06-21 20:37:59,937 INFO [train.py:1028] (0/2) Epoch 25, batch 4350, loss[loss=0.1696, simple_loss=0.2299, pruned_loss=0.05468, over 13221.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2296, pruned_loss=0.06079, over 2585756.29 frames. ], batch size: 59, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:38:04,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=453130.3333333333, ans=0.125
2024-06-21 20:38:08,739 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.121e+02 2.264e+02 2.544e+02 3.532e+02, threshold=4.528e+02, percent-clipped=0.0
2024-06-21 20:38:15,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=453167.0, ans=0.125
2024-06-21 20:38:21,940 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.26 vs. limit=22.5
2024-06-21 20:38:24,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=453203.6666666667, ans=0.035
2024-06-21 20:38:24,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=453203.6666666667, ans=0.125
2024-06-21 20:38:26,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453203.6666666667, ans=0.1
2024-06-21 20:38:32,033 INFO [train.py:1028] (0/2) Epoch 25, batch 4400, loss[loss=0.1632, simple_loss=0.209, pruned_loss=0.05867, over 13256.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.229, pruned_loss=0.06096, over 2585902.45 frames. ], batch size: 83, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:38:35,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=453222.0, ans=0.0
2024-06-21 20:38:37,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453222.0, ans=0.1
2024-06-21 20:38:54,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0
2024-06-21 20:38:57,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=453277.0, ans=0.09899494936611666
2024-06-21 20:39:05,235 INFO [train.py:1028] (0/2) Epoch 25, batch 4450, loss[loss=0.1652, simple_loss=0.2302, pruned_loss=0.05006, over 12895.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2295, pruned_loss=0.06136, over 2580498.89 frames. ], batch size: 33, lr: 2.32e-03, grad_scale: 32.0
2024-06-21 20:39:05,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=453313.6666666667, ans=0.125
2024-06-21 20:39:06,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=453313.6666666667, ans=0.125
2024-06-21 20:39:18,120 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.51 vs. limit=12.0
2024-06-21 20:39:18,433 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.048e+02 2.153e+02 2.317e+02 3.246e+02, threshold=4.305e+02, percent-clipped=0.0
2024-06-21 20:39:30,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=453368.6666666667, ans=0.0
2024-06-21 20:39:37,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=453368.6666666667, ans=0.04949747468305833
2024-06-21 20:39:45,126 INFO [train.py:1028] (0/2) Epoch 25, batch 4500, loss[loss=0.1759, simple_loss=0.2262, pruned_loss=0.06283, over 13255.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2295, pruned_loss=0.06108, over 2584949.96 frames. ], batch size: 89, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:39:52,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=453423.6666666667, ans=0.125
2024-06-21 20:40:04,178 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:40:06,503 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=15.0
2024-06-21 20:40:07,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=453460.3333333333, ans=0.1
2024-06-21 20:40:08,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=453460.3333333333, ans=0.2
2024-06-21 20:40:10,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=453478.6666666667, ans=0.0
2024-06-21 20:40:17,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=453478.6666666667, ans=0.0
2024-06-21 20:40:18,173 INFO [train.py:1028] (0/2) Epoch 25, batch 4550, loss[loss=0.1705, simple_loss=0.2323, pruned_loss=0.0543, over 13202.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2295, pruned_loss=0.06101, over 2588293.50 frames. ], batch size: 52, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:40:25,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=453515.3333333333, ans=0.2
2024-06-21 20:40:27,636 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.141e+02 2.299e+02 2.529e+02 3.471e+02, threshold=4.598e+02, percent-clipped=0.0
2024-06-21 20:40:41,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=453552.0, ans=0.2
2024-06-21 20:40:50,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=453570.3333333333, ans=0.125
2024-06-21 20:40:51,871 INFO [train.py:1028] (0/2) Epoch 25, batch 4600, loss[loss=0.1798, simple_loss=0.2274, pruned_loss=0.06604, over 12502.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2297, pruned_loss=0.06105, over 2583672.56 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:40:54,147 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:40:58,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=453607.0, ans=0.125
2024-06-21 20:41:00,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=453607.0, ans=0.0
2024-06-21 20:41:21,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=453662.0, ans=0.0
2024-06-21 20:41:31,492 INFO [train.py:1028] (0/2) Epoch 25, batch 4650, loss[loss=0.1855, simple_loss=0.2277, pruned_loss=0.0716, over 13112.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2295, pruned_loss=0.06109, over 2587282.66 frames. ], batch size: 132, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:41:36,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=453680.3333333333, ans=0.2
2024-06-21 20:41:37,323 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0
2024-06-21 20:41:40,801 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.797e+02 2.102e+02 2.240e+02 2.514e+02 3.063e+02, threshold=4.479e+02, percent-clipped=0.0
2024-06-21 20:41:42,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=453698.6666666667, ans=0.125
2024-06-21 20:41:49,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=453717.0, ans=0.025
2024-06-21 20:41:53,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=453735.3333333333, ans=0.0
2024-06-21 20:41:56,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=453735.3333333333, ans=0.0
2024-06-21 20:42:04,960 INFO [train.py:1028] (0/2) Epoch 25, batch 4700, loss[loss=0.1837, simple_loss=0.2381, pruned_loss=0.06467, over 12462.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2293, pruned_loss=0.06099, over 2582794.26 frames. ], batch size: 25, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:42:16,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=453790.3333333333, ans=0.025
2024-06-21 20:42:20,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=453808.6666666667, ans=0.0
2024-06-21 20:42:27,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=453827.0, ans=0.125
2024-06-21 20:42:33,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=453845.3333333333, ans=0.125
2024-06-21 20:42:35,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453845.3333333333, ans=0.1
2024-06-21 20:42:38,040 INFO [train.py:1028] (0/2) Epoch 25, batch 4750, loss[loss=0.1913, simple_loss=0.2405, pruned_loss=0.07099, over 12538.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2289, pruned_loss=0.06096, over 2580492.89 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:42:41,230 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.20 vs. limit=22.5
2024-06-21 20:42:46,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=453882.0, ans=0.0
2024-06-21 20:42:47,468 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.137e+02 2.275e+02 2.431e+02 3.911e+02, threshold=4.549e+02, percent-clipped=0.0
2024-06-21 20:42:49,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=453882.0, ans=0.1
2024-06-21 20:42:50,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453882.0, ans=0.1
2024-06-21 20:42:53,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.28 vs. limit=15.0
2024-06-21 20:42:53,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.12 vs. limit=22.5
2024-06-21 20:42:54,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=453900.3333333333, ans=0.0
2024-06-21 20:42:57,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0
2024-06-21 20:43:00,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=453918.6666666667, ans=0.2
2024-06-21 20:43:03,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=453918.6666666667, ans=0.2
2024-06-21 20:43:15,050 INFO [train.py:1028] (0/2) Epoch 25, batch 4800, loss[loss=0.1662, simple_loss=0.2205, pruned_loss=0.05597, over 13277.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2282, pruned_loss=0.06072, over 2577614.88 frames. ], batch size: 63, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:43:17,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=453955.3333333333, ans=0.0
2024-06-21 20:43:17,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0
2024-06-21 20:43:25,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=453973.6666666667, ans=0.125
2024-06-21 20:43:28,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.62 vs. limit=15.0
2024-06-21 20:43:38,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=454010.3333333333, ans=0.04949747468305833
2024-06-21 20:43:38,833 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.85 vs. limit=10.0
2024-06-21 20:43:49,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=454028.6666666667, ans=0.0
2024-06-21 20:43:50,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=454047.0, ans=0.05
2024-06-21 20:43:51,125 INFO [train.py:1028] (0/2) Epoch 25, batch 4850, loss[loss=0.1614, simple_loss=0.2131, pruned_loss=0.05483, over 13226.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.2281, pruned_loss=0.06058, over 2575071.99 frames. ], batch size: 89, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:43:53,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.29 vs. limit=15.0
2024-06-21 20:43:53,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=454047.0, ans=0.125
2024-06-21 20:43:55,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=454047.0, ans=0.0
2024-06-21 20:43:58,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=454065.3333333333, ans=0.0
2024-06-21 20:44:00,624 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.130e+02 2.237e+02 2.416e+02 3.039e+02, threshold=4.474e+02, percent-clipped=0.0
2024-06-21 20:44:22,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=454120.3333333333, ans=0.125
2024-06-21 20:44:25,089 INFO [train.py:1028] (0/2) Epoch 25, batch 4900, loss[loss=0.1673, simple_loss=0.225, pruned_loss=0.05479, over 13248.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2286, pruned_loss=0.06062, over 2575625.71 frames. ], batch size: 59, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:44:25,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=454138.6666666667, ans=0.0
2024-06-21 20:44:29,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=454138.6666666667, ans=0.125
2024-06-21 20:44:31,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=454157.0, ans=0.125
2024-06-21 20:44:32,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=454157.0, ans=0.0
2024-06-21 20:44:32,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=454157.0, ans=0.0
2024-06-21 20:44:36,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=454157.0, ans=0.0
2024-06-21 20:44:43,458 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.16 vs. limit=22.5
2024-06-21 20:44:44,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=454193.6666666667, ans=0.125
2024-06-21 20:44:51,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=454212.0, ans=0.125
2024-06-21 20:44:51,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=454212.0, ans=0.125
2024-06-21 20:44:57,912 INFO [train.py:1028] (0/2) Epoch 25, batch 4950, loss[loss=0.1933, simple_loss=0.2344, pruned_loss=0.07615, over 10995.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2291, pruned_loss=0.06116, over 2570306.76 frames. ], batch size: 303, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:44:59,938 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:45:02,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=454230.3333333333, ans=0.0
2024-06-21 20:45:04,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=454248.6666666667, ans=0.1
2024-06-21 20:45:05,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=454248.6666666667, ans=0.0
2024-06-21 20:45:07,048 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.062e+02 2.224e+02 2.350e+02 3.118e+02, threshold=4.448e+02, percent-clipped=0.0
2024-06-21 20:45:11,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=454248.6666666667, ans=0.125
2024-06-21 20:45:13,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=454267.0, ans=0.0
2024-06-21 20:45:32,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.95 vs. limit=15.0
2024-06-21 20:45:33,689 INFO [train.py:1028] (0/2) Epoch 25, batch 5000, loss[loss=0.1814, simple_loss=0.2266, pruned_loss=0.06804, over 13116.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2291, pruned_loss=0.0612, over 2574568.05 frames. ], batch size: 95, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:45:40,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=454322.0, ans=0.025
2024-06-21 20:45:55,431 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 20:45:56,291 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=22.5
2024-06-21 20:45:58,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=454377.0, ans=0.1
2024-06-21 20:46:00,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=454377.0, ans=0.1
2024-06-21 20:46:01,140 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.33 vs. limit=22.5
2024-06-21 20:46:05,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=454395.3333333333, ans=0.0
2024-06-21 20:46:08,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=454395.3333333333, ans=0.035
2024-06-21 20:46:10,308 INFO [train.py:1028] (0/2) Epoch 25, batch 5050, loss[loss=0.1674, simple_loss=0.2307, pruned_loss=0.052, over 12821.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.229, pruned_loss=0.061, over 2573443.06 frames. ], batch size: 36, lr: 2.31e-03, grad_scale: 64.0
2024-06-21 20:46:11,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=454413.6666666667, ans=0.125
2024-06-21 20:46:16,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.06 vs. limit=15.0
2024-06-21 20:46:19,491 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.103e+02 2.227e+02 2.382e+02 3.210e+02, threshold=4.455e+02, percent-clipped=0.0
2024-06-21 20:46:28,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=454450.3333333333, ans=0.125
2024-06-21 20:46:43,186 INFO [train.py:1028] (0/2) Epoch 25, batch 5100, loss[loss=0.1686, simple_loss=0.2234, pruned_loss=0.05694, over 12953.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.229, pruned_loss=0.06135, over 2569986.14 frames. ], batch size: 39, lr: 2.31e-03, grad_scale: 64.0
2024-06-21 20:46:43,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=454505.3333333333, ans=0.0
2024-06-21 20:46:48,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=454505.3333333333, ans=0.2
2024-06-21 20:46:58,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=454542.0, ans=0.2
2024-06-21 20:47:00,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.84 vs. limit=10.0
2024-06-21 20:47:09,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=454560.3333333333, ans=0.125
2024-06-21 20:47:10,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=454560.3333333333, ans=0.1
2024-06-21 20:47:22,226 INFO [train.py:1028] (0/2) Epoch 25, batch 5150, loss[loss=0.1754, simple_loss=0.2222, pruned_loss=0.06431, over 13092.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2292, pruned_loss=0.06147, over 2573131.45 frames. ], batch size: 132, lr: 2.31e-03, grad_scale: 64.0
2024-06-21 20:47:23,382 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.76 vs. limit=15.0
2024-06-21 20:47:23,846 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=454597.0, ans=0.125
2024-06-21 20:47:23,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=454597.0, ans=0.125
2024-06-21 20:47:25,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=454597.0, ans=0.125
2024-06-21 20:47:31,389 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.828e+02 2.139e+02 2.274e+02 2.485e+02 3.522e+02, threshold=4.548e+02, percent-clipped=0.0
2024-06-21 20:47:46,067 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-248000.pt
2024-06-21 20:47:59,315 INFO [train.py:1028] (0/2) Epoch 25, batch 5200, loss[loss=0.1926, simple_loss=0.2395, pruned_loss=0.07288, over 13180.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2292, pruned_loss=0.06135, over 2576342.31 frames. ], batch size: 95, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:48:00,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.40 vs. limit=22.5
2024-06-21 20:48:09,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=454707.0, ans=0.02
2024-06-21 20:48:13,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=454725.3333333333, ans=10.0
2024-06-21 20:48:21,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=454743.6666666667, ans=0.5
2024-06-21 20:48:30,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0
2024-06-21 20:48:31,871 INFO [train.py:1028] (0/2) Epoch 25, batch 5250, loss[loss=0.178, simple_loss=0.2269, pruned_loss=0.0646, over 13307.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.229, pruned_loss=0.06133, over 2572179.20 frames. ], batch size: 52, lr: 2.31e-03, grad_scale: 32.0
2024-06-21 20:48:40,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=454798.6666666667, ans=0.125
2024-06-21 20:48:40,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=454798.6666666667, ans=0.125
2024-06-21 20:48:41,228 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.140e+02 2.307e+02 2.518e+02 3.065e+02, threshold=4.614e+02, percent-clipped=0.0
2024-06-21 20:48:43,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0
2024-06-21 20:48:45,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=454817.0, ans=0.125 2024-06-21 20:48:49,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=454817.0, ans=0.025 2024-06-21 20:48:55,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=454835.3333333333, ans=0.125 2024-06-21 20:48:57,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=454853.6666666667, ans=0.2 2024-06-21 20:49:01,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=454853.6666666667, ans=0.2 2024-06-21 20:49:03,944 INFO [train.py:1028] (0/2) Epoch 25, batch 5300, loss[loss=0.1762, simple_loss=0.2226, pruned_loss=0.06488, over 13057.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2285, pruned_loss=0.06118, over 2568748.85 frames. ], batch size: 144, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:49:20,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=454908.6666666667, ans=0.125 2024-06-21 20:49:29,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=454927.0, ans=0.125 2024-06-21 20:49:42,725 INFO [train.py:1028] (0/2) Epoch 25, batch 5350, loss[loss=0.1701, simple_loss=0.2168, pruned_loss=0.06168, over 11836.00 frames. ], tot_loss[loss=0.175, simple_loss=0.2282, pruned_loss=0.06091, over 2575143.12 frames. ], batch size: 17, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:49:52,080 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.784e+02 2.057e+02 2.158e+02 2.308e+02 3.442e+02, threshold=4.315e+02, percent-clipped=0.0 2024-06-21 20:49:52,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454982.0, ans=0.125 2024-06-21 20:49:54,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=455000.3333333333, ans=0.0 2024-06-21 20:50:14,219 INFO [train.py:1028] (0/2) Epoch 25, batch 5400, loss[loss=0.1818, simple_loss=0.2248, pruned_loss=0.0694, over 12129.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2282, pruned_loss=0.06134, over 2567165.40 frames. ], batch size: 240, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:50:30,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=455092.0, ans=0.025 2024-06-21 20:50:31,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2024-06-21 20:50:36,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=455110.3333333333, ans=0.0 2024-06-21 20:50:41,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=455128.6666666667, ans=0.035 2024-06-21 20:50:46,782 INFO [train.py:1028] (0/2) Epoch 25, batch 5450, loss[loss=0.1738, simple_loss=0.2262, pruned_loss=0.06071, over 12389.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2284, pruned_loss=0.06125, over 2570731.50 frames.
], batch size: 25, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:50:52,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455165.3333333333, ans=0.1 2024-06-21 20:50:56,525 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.115e+02 2.252e+02 2.391e+02 3.560e+02, threshold=4.505e+02, percent-clipped=0.0 2024-06-21 20:51:06,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455183.6666666667, ans=0.1 2024-06-21 20:51:07,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=455183.6666666667, ans=0.2 2024-06-21 20:51:11,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=455202.0, ans=0.025 2024-06-21 20:51:20,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=455220.3333333333, ans=0.2 2024-06-21 20:51:25,897 INFO [train.py:1028] (0/2) Epoch 25, batch 5500, loss[loss=0.194, simple_loss=0.2385, pruned_loss=0.07473, over 12269.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2282, pruned_loss=0.06106, over 2563068.99 frames. ], batch size: 241, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:51:31,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=455257.0, ans=0.125 2024-06-21 20:51:41,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=455275.3333333333, ans=0.125 2024-06-21 20:51:42,155 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2024-06-21 20:51:58,612 INFO [train.py:1028] (0/2) Epoch 25, batch 5550, loss[loss=0.1668, simple_loss=0.228, pruned_loss=0.05281, over 13287.00 frames. ], tot_loss[loss=0.1747, simple_loss=0.228, pruned_loss=0.06074, over 2566721.34 frames. ], batch size: 43, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:52:02,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=455330.3333333333, ans=0.0 2024-06-21 20:52:02,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455330.3333333333, ans=0.1 2024-06-21 20:52:03,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=455330.3333333333, ans=0.125 2024-06-21 20:52:06,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=455348.6666666667, ans=0.125 2024-06-21 20:52:08,386 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.728e+02 2.083e+02 2.224e+02 2.415e+02 3.176e+02, threshold=4.448e+02, percent-clipped=0.0 2024-06-21 20:52:12,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=455367.0, ans=0.0 2024-06-21 20:52:25,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.65 vs. 
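limit=10.0

The many ScheduledFloat entries are periodic dumps of scheduled hyperparameters: each named quantity (a dropout probability, a skip rate, a balancer bound, ...) is interpolated from a piecewise-linear schedule over batch_count, and ans is its current value. A minimal sketch of such a schedule; the real class in icefall's scaling.py carries more machinery (defaults, arithmetic on schedules) than shown here:

class ScheduledFloat:
    # Piecewise-linear in batch_count, e.g. ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    # decays from 0.3 to 0.1 over the first 20k batches, then stays flat --
    # which is why values this deep into training (ans=0.2 above) no longer move.
    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) pairs
        self.batch_count = 0.0        # advanced by the training loop

    def __float__(self):
        x = self.batch_count
        x0, y0 = self.points[0]
        if x <= x0:
            return float(y0)
        for x1, y1 in self.points[1:]:
            if x <= x1:
                return float(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
            x0, y0 = x1, y1
        return float(y0)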
2024-06-21 20:52:29,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=455403.6666666667, ans=0.2 2024-06-21 20:52:30,742 INFO [train.py:1028] (0/2) Epoch 25, batch 5600, loss[loss=0.1709, simple_loss=0.2212, pruned_loss=0.06032, over 13222.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2279, pruned_loss=0.06099, over 2568992.92 frames. ], batch size: 89, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:52:42,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.17 vs. limit=10.0 2024-06-21 20:52:44,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.55 vs. limit=5.0 2024-06-21 20:52:47,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=455458.6666666667, ans=0.025 2024-06-21 20:53:07,384 INFO [train.py:1028] (0/2) Epoch 25, batch 5650, loss[loss=0.185, simple_loss=0.2349, pruned_loss=0.06756, over 12519.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.2281, pruned_loss=0.06053, over 2573673.98 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:53:08,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=455513.6666666667, ans=0.125 2024-06-21 20:53:11,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=455513.6666666667, ans=0.0 2024-06-21 20:53:20,516 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.850e+02 2.089e+02 2.205e+02 2.368e+02 2.955e+02, threshold=4.409e+02, percent-clipped=0.0 2024-06-21 20:53:29,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=455568.6666666667, ans=0.125 2024-06-21 20:53:31,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=455568.6666666667, ans=0.2 2024-06-21 20:53:31,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=455568.6666666667, ans=0.125 2024-06-21 20:53:34,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=455568.6666666667, ans=0.0 2024-06-21 20:53:39,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=455587.0, ans=0.125 2024-06-21 20:53:39,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=455587.0, ans=0.0 2024-06-21 20:53:40,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2024-06-21 20:53:43,734 INFO [train.py:1028] (0/2) Epoch 25, batch 5700, loss[loss=0.1616, simple_loss=0.2224, pruned_loss=0.05042, over 13260.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.228, pruned_loss=0.06063, over 2577569.42 frames.
], batch size: 63, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:53:54,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=455623.6666666667, ans=0.2 2024-06-21 20:54:05,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=455660.3333333333, ans=0.125 2024-06-21 20:54:15,647 INFO [train.py:1028] (0/2) Epoch 25, batch 5750, loss[loss=0.194, simple_loss=0.2459, pruned_loss=0.07109, over 12770.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2287, pruned_loss=0.0608, over 2577441.62 frames. ], batch size: 176, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:54:25,595 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.042e+02 2.208e+02 2.378e+02 3.358e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 20:54:32,418 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:54:33,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=455733.6666666667, ans=0.125 2024-06-21 20:54:48,745 INFO [train.py:1028] (0/2) Epoch 25, batch 5800, loss[loss=0.1952, simple_loss=0.2495, pruned_loss=0.0704, over 12761.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2307, pruned_loss=0.06179, over 2576743.97 frames. ], batch size: 176, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:54:49,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=455788.6666666667, ans=0.125 2024-06-21 20:54:56,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=455788.6666666667, ans=0.125 2024-06-21 20:55:21,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.24 vs. limit=15.0 2024-06-21 20:55:21,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=455862.0, ans=0.0 2024-06-21 20:55:28,085 INFO [train.py:1028] (0/2) Epoch 25, batch 5850, loss[loss=0.1835, simple_loss=0.2361, pruned_loss=0.06548, over 12522.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2317, pruned_loss=0.06236, over 2575676.21 frames. 
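], batch size: 202, lr: 2.31e-03, grad_scale: 32.0

Throughout these train.py:1028 entries the printed loss decomposes exactly as 0.5 * simple_loss + pruned_loss: in the batch 5850 totals just above, 0.5 * 0.2317 + 0.06236 = 0.1782. A one-line restatement, assuming the warm-up-dependent weighting of the simple (trivial-joiner) transducer loss in icefall's train.py has reached its final value by epoch 25:

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # e.g. Epoch 25, batch 5850 above: 0.5 * 0.2317 + 0.06236 = 0.1782
    return simple_loss_scale * simple_loss + pruned_loss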
2024-06-21 20:55:28,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=455880.3333333333, ans=0.125 2024-06-21 20:55:35,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=455898.6666666667, ans=0.1 2024-06-21 20:55:35,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=455898.6666666667, ans=0.125 2024-06-21 20:55:37,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.106e+02 2.254e+02 2.391e+02 2.959e+02, threshold=4.507e+02, percent-clipped=0.0 2024-06-21 20:55:50,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=455935.3333333333, ans=0.07 2024-06-21 20:55:52,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455935.3333333333, ans=0.1 2024-06-21 20:56:01,454 INFO [train.py:1028] (0/2) Epoch 25, batch 5900, loss[loss=0.1721, simple_loss=0.2254, pruned_loss=0.05944, over 13064.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2332, pruned_loss=0.06272, over 2575220.78 frames. ], batch size: 121, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:56:05,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=455972.0, ans=0.125 2024-06-21 20:56:14,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=456008.6666666667, ans=0.125 2024-06-21 20:56:17,483 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.57 vs. limit=15.0 2024-06-21 20:56:18,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=456008.6666666667, ans=0.125 2024-06-21 20:56:25,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=456027.0, ans=0.0 2024-06-21 20:56:27,710 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 20:56:30,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=456045.3333333333, ans=0.0 2024-06-21 20:56:32,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=456045.3333333333, ans=0.07 2024-06-21 20:56:33,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=456045.3333333333, ans=0.125 2024-06-21 20:56:33,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=456045.3333333333, ans=0.0 2024-06-21 20:56:34,391 INFO [train.py:1028] (0/2) Epoch 25, batch 5950, loss[loss=0.1543, simple_loss=0.207, pruned_loss=0.05082, over 13092.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2347, pruned_loss=0.06322, over 2580632.02 frames.
], batch size: 121, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:56:43,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=456082.0, ans=0.125 2024-06-21 20:56:44,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.212e+02 2.411e+02 2.606e+02 3.285e+02, threshold=4.821e+02, percent-clipped=0.0 2024-06-21 20:57:06,910 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=22.5 2024-06-21 20:57:11,845 INFO [train.py:1028] (0/2) Epoch 25, batch 6000, loss[loss=0.2123, simple_loss=0.2564, pruned_loss=0.0841, over 12174.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2364, pruned_loss=0.06393, over 2573830.92 frames. ], batch size: 240, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:57:11,846 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 20:57:20,654 INFO [train.py:1060] (0/2) Epoch 25, validation: loss=0.1898, simple_loss=0.2515, pruned_loss=0.06411, over 351949.00 frames. 2024-06-21 20:57:20,654 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 20:57:22,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=456155.3333333333, ans=0.125 2024-06-21 20:57:36,476 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=15.0 2024-06-21 20:57:45,936 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.58 vs. limit=6.0 2024-06-21 20:57:54,437 INFO [train.py:1028] (0/2) Epoch 25, batch 6050, loss[loss=0.2155, simple_loss=0.2881, pruned_loss=0.07146, over 12925.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2381, pruned_loss=0.06437, over 2577315.26 frames. ], batch size: 39, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:57:54,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=456247.0, ans=0.0 2024-06-21 20:58:04,168 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.179e+02 2.294e+02 2.441e+02 3.429e+02, threshold=4.587e+02, percent-clipped=0.0 2024-06-21 20:58:07,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=456283.6666666667, ans=0.0 2024-06-21 20:58:09,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=456283.6666666667, ans=0.125 2024-06-21 20:58:14,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=456302.0, ans=0.125 2024-06-21 20:58:26,980 INFO [train.py:1028] (0/2) Epoch 25, batch 6100, loss[loss=0.1743, simple_loss=0.2276, pruned_loss=0.06045, over 13128.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2387, pruned_loss=0.06444, over 2579353.03 frames. 
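], batch size: 121, lr: 2.31e-03, grad_scale: 32.0

The batch 6000 block above is a periodic validation pass (train.py:1051/1060): training pauses, the whole dev set is scored (the frame count, 351949.00, is identical on every validation), and frame-weighted averages are printed along with peak GPU memory. A minimal sketch of such a pass; compute_loss here is an assumed helper with the same loss semantics as the training loop:

import torch

def compute_validation_loss(model, valid_dl, compute_loss):
    # compute_loss(model, batch) -> (loss, simple_loss, pruned_loss, num_frames)
    model.eval()
    totals = torch.zeros(3)
    num_frames = 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, simple, pruned, frames = compute_loss(model, batch)
            totals += frames * torch.tensor([loss.item(), simple.item(), pruned.item()])
            num_frames += frames
    model.train()
    return totals / num_frames  # frame-weighted averages, as printed above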
2024-06-21 20:58:28,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=456338.6666666667, ans=0.2 2024-06-21 20:58:33,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=456357.0, ans=0.125 2024-06-21 20:58:33,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=456357.0, ans=0.2 2024-06-21 20:58:35,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=456357.0, ans=0.025 2024-06-21 20:58:46,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456393.6666666667, ans=0.1 2024-06-21 20:58:49,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=456393.6666666667, ans=0.125 2024-06-21 20:58:53,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456412.0, ans=0.1 2024-06-21 20:58:55,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=456412.0, ans=0.125 2024-06-21 20:58:55,836 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2024-06-21 20:58:58,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=456412.0, ans=0.0 2024-06-21 20:59:00,280 INFO [train.py:1028] (0/2) Epoch 25, batch 6150, loss[loss=0.1912, simple_loss=0.2319, pruned_loss=0.07522, over 10845.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2402, pruned_loss=0.06494, over 2577458.59 frames. ], batch size: 304, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:59:06,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=456430.3333333333, ans=0.125 2024-06-21 20:59:07,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2024-06-21 20:59:17,777 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.216e+02 2.434e+02 2.756e+02 3.812e+02, threshold=4.867e+02, percent-clipped=0.0 2024-06-21 20:59:22,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=456467.0, ans=0.125 2024-06-21 20:59:41,390 INFO [train.py:1028] (0/2) Epoch 25, batch 6200, loss[loss=0.2056, simple_loss=0.264, pruned_loss=0.07358, over 13293.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2418, pruned_loss=0.06559, over 2575857.85 frames.
], batch size: 89, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 20:59:41,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=456522.0, ans=0.025 2024-06-21 20:59:42,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=456522.0, ans=0.125 2024-06-21 20:59:44,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=456522.0, ans=0.125 2024-06-21 20:59:50,466 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.01 vs. limit=15.0 2024-06-21 20:59:58,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.78 vs. limit=22.5 2024-06-21 21:00:12,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=456595.3333333333, ans=0.2 2024-06-21 21:00:15,671 INFO [train.py:1028] (0/2) Epoch 25, batch 6250, loss[loss=0.2006, simple_loss=0.2583, pruned_loss=0.07149, over 13212.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2434, pruned_loss=0.0662, over 2569987.57 frames. ], batch size: 83, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:00:15,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=456613.6666666667, ans=0.0 2024-06-21 21:00:21,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456613.6666666667, ans=0.1 2024-06-21 21:00:21,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2024-06-21 21:00:26,238 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.281e+02 2.459e+02 2.826e+02 4.417e+02, threshold=4.918e+02, percent-clipped=0.0 2024-06-21 21:00:26,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=456632.0, ans=0.125 2024-06-21 21:00:28,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=456632.0, ans=0.125 2024-06-21 21:00:37,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=456668.6666666667, ans=0.0 2024-06-21 21:00:38,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2024-06-21 21:00:42,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=456687.0, ans=10.0 2024-06-21 21:00:49,385 INFO [train.py:1028] (0/2) Epoch 25, batch 6300, loss[loss=0.19, simple_loss=0.2461, pruned_loss=0.06692, over 11220.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2444, pruned_loss=0.06625, over 2565010.81 frames. ], batch size: 16, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:00:55,844 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.84 vs. 
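limit=22.5

The Whitening entries compare a per-module statistic against a scheduled limit: the metric is 1.0 when the channel covariance of a module's output is isotropic (already "white") and grows as a few directions start to dominate, and a corrective penalty activates only once metric exceeds limit, so lines like 14.84 vs. 22.5 above are purely informational. A plausible reconstruction of such a scale-invariant metric; the exact estimator in icefall's scaling.py may differ:

import torch

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels). Returns E[eig^2] / (E[eig])^2 of the
    # within-group covariance: 1.0 iff the covariance is a multiple of I.
    num_channels = x.shape[-1]
    cpg = num_channels // num_groups                          # channels per group
    x = x.reshape(-1, num_groups, cpg).transpose(0, 1)        # (groups, frames, cpg)
    covar = torch.matmul(x.transpose(1, 2), x) / x.shape[1]   # (groups, cpg, cpg)
    mean_eig = covar.diagonal(dim1=1, dim2=2).mean()          # trace / dim
    mean_sq_eig = (covar ** 2).sum(dim=(1, 2)).mean() / cpg   # Frobenius^2 / dim
    return mean_sq_eig / (mean_eig ** 2 + 1e-20)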
2024-06-21 21:01:30,610 INFO [train.py:1028] (0/2) Epoch 25, batch 6350, loss[loss=0.2179, simple_loss=0.2729, pruned_loss=0.08145, over 12494.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2453, pruned_loss=0.06594, over 2574359.08 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:01:30,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=456797.0, ans=0.2 2024-06-21 21:01:35,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456797.0, ans=0.1 2024-06-21 21:01:40,805 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.03 vs. limit=15.0 2024-06-21 21:01:40,980 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.217e+02 2.393e+02 2.606e+02 3.577e+02, threshold=4.786e+02, percent-clipped=0.0 2024-06-21 21:01:42,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=456815.3333333333, ans=0.125 2024-06-21 21:01:47,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.95 vs. limit=15.0 2024-06-21 21:01:52,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=15.0 2024-06-21 21:01:54,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=456852.0, ans=0.0 2024-06-21 21:02:04,504 INFO [train.py:1028] (0/2) Epoch 25, batch 6400, loss[loss=0.174, simple_loss=0.2304, pruned_loss=0.05885, over 13227.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2471, pruned_loss=0.06676, over 2574409.26 frames. ], batch size: 67, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:02:07,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.41 vs. limit=15.0 2024-06-21 21:02:07,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2024-06-21 21:02:16,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=456907.0, ans=0.025 2024-06-21 21:02:26,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=456943.6666666667, ans=0.0 2024-06-21 21:02:37,664 INFO [train.py:1028] (0/2) Epoch 25, batch 6450, loss[loss=0.2082, simple_loss=0.2634, pruned_loss=0.07648, over 12529.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2489, pruned_loss=0.06757, over 2580580.82 frames. ], batch size: 202, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:02:45,168 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:02:47,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.292e+02 2.450e+02 2.726e+02 3.688e+02, threshold=4.900e+02, percent-clipped=0.0 2024-06-21 21:02:51,487 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.29 vs.
limit=10.0 2024-06-21 21:02:58,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=457035.3333333333, ans=0.025 2024-06-21 21:03:02,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=457035.3333333333, ans=0.0 2024-06-21 21:03:03,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=457035.3333333333, ans=0.125 2024-06-21 21:03:04,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=457053.6666666667, ans=0.125 2024-06-21 21:03:10,842 INFO [train.py:1028] (0/2) Epoch 25, batch 6500, loss[loss=0.2157, simple_loss=0.2579, pruned_loss=0.08672, over 10680.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2496, pruned_loss=0.0674, over 2583368.07 frames. ], batch size: 303, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:03:13,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=457072.0, ans=0.125 2024-06-21 21:03:32,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457108.6666666667, ans=0.1 2024-06-21 21:03:35,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457108.6666666667, ans=0.1 2024-06-21 21:03:40,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=457127.0, ans=0.125 2024-06-21 21:03:40,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457127.0, ans=0.1 2024-06-21 21:03:50,681 INFO [train.py:1028] (0/2) Epoch 25, batch 6550, loss[loss=0.178, simple_loss=0.2359, pruned_loss=0.06, over 12769.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2506, pruned_loss=0.06777, over 2587601.27 frames. ], batch size: 22, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:03:50,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457163.6666666667, ans=0.1 2024-06-21 21:03:51,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=457163.6666666667, ans=0.1 2024-06-21 21:04:00,508 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.254e+02 2.352e+02 2.591e+02 3.160e+02, threshold=4.705e+02, percent-clipped=0.0 2024-06-21 21:04:01,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=457182.0, ans=0.125 2024-06-21 21:04:06,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.14 vs. 
limit=22.5 2024-06-21 21:04:07,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=457200.3333333333, ans=0.025 2024-06-21 21:04:13,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=457218.6666666667, ans=0.09899494936611666 2024-06-21 21:04:14,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2024-06-21 21:04:16,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.38 vs. limit=15.0 2024-06-21 21:04:23,576 INFO [train.py:1028] (0/2) Epoch 25, batch 6600, loss[loss=0.2057, simple_loss=0.2654, pruned_loss=0.07298, over 13284.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2511, pruned_loss=0.06784, over 2589666.35 frames. ], batch size: 72, lr: 2.31e-03, grad_scale: 32.0 2024-06-21 21:04:24,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.62 vs. limit=15.0 2024-06-21 21:04:27,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2024-06-21 21:04:28,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457255.3333333333, ans=0.1 2024-06-21 21:04:30,403 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457273.6666666667, ans=0.1 2024-06-21 21:04:39,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=457292.0, ans=0.125 2024-06-21 21:04:45,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=457310.3333333333, ans=0.2 2024-06-21 21:04:52,251 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2024-06-21 21:04:55,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=457328.6666666667, ans=0.07 2024-06-21 21:04:57,512 INFO [train.py:1028] (0/2) Epoch 25, batch 6650, loss[loss=0.2098, simple_loss=0.2654, pruned_loss=0.07715, over 12881.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2524, pruned_loss=0.06817, over 2584093.56 frames. 
], batch size: 158, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:05:00,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=457347.0, ans=0.0 2024-06-21 21:05:07,862 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.274e+02 2.509e+02 2.793e+02 3.537e+02, threshold=5.019e+02, percent-clipped=0.0 2024-06-21 21:05:11,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=457383.6666666667, ans=0.0 2024-06-21 21:05:13,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=457383.6666666667, ans=0.125 2024-06-21 21:05:17,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=457402.0, ans=0.125 2024-06-21 21:05:39,549 INFO [train.py:1028] (0/2) Epoch 25, batch 6700, loss[loss=0.1883, simple_loss=0.2458, pruned_loss=0.06541, over 12739.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2527, pruned_loss=0.0685, over 2584564.62 frames. ], batch size: 176, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:05:51,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=457457.0, ans=15.0 2024-06-21 21:05:52,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=457475.3333333333, ans=0.125 2024-06-21 21:05:55,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=457475.3333333333, ans=0.04949747468305833 2024-06-21 21:06:07,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=457512.0, ans=0.95 2024-06-21 21:06:13,147 INFO [train.py:1028] (0/2) Epoch 25, batch 6750, loss[loss=0.2635, simple_loss=0.3033, pruned_loss=0.1119, over 12321.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2539, pruned_loss=0.06943, over 2578718.78 frames. ], batch size: 241, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:06:20,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=457548.6666666667, ans=0.125 2024-06-21 21:06:22,803 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.346e+02 2.496e+02 2.734e+02 3.276e+02, threshold=4.993e+02, percent-clipped=0.0 2024-06-21 21:06:33,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457585.3333333333, ans=0.1 2024-06-21 21:06:34,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457585.3333333333, ans=0.1 2024-06-21 21:06:34,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=457585.3333333333, ans=0.0 2024-06-21 21:06:43,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2024-06-21 21:06:45,950 INFO [train.py:1028] (0/2) Epoch 25, batch 6800, loss[loss=0.1789, simple_loss=0.2384, pruned_loss=0.05964, over 13271.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2546, pruned_loss=0.06929, over 2580539.05 frames. 
], batch size: 67, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:06:52,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=457640.3333333333, ans=0.125 2024-06-21 21:07:14,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.23 vs. limit=22.5 2024-06-21 21:07:19,208 INFO [train.py:1028] (0/2) Epoch 25, batch 6850, loss[loss=0.1953, simple_loss=0.2619, pruned_loss=0.06429, over 13261.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2553, pruned_loss=0.06937, over 2584355.07 frames. ], batch size: 63, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:07:19,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457713.6666666667, ans=0.1 2024-06-21 21:07:28,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=457732.0, ans=0.0 2024-06-21 21:07:32,392 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.335e+02 2.514e+02 2.756e+02 3.546e+02, threshold=5.027e+02, percent-clipped=0.0 2024-06-21 21:07:32,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.40 vs. limit=22.5 2024-06-21 21:07:59,947 INFO [train.py:1028] (0/2) Epoch 25, batch 6900, loss[loss=0.2119, simple_loss=0.2776, pruned_loss=0.0731, over 13013.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2568, pruned_loss=0.07, over 2585496.42 frames. ], batch size: 48, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:08:01,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=457805.3333333333, ans=0.0 2024-06-21 21:08:02,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=457805.3333333333, ans=10.0 2024-06-21 21:08:03,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=457805.3333333333, ans=0.125 2024-06-21 21:08:04,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.09 vs. limit=15.0 2024-06-21 21:08:24,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.05 vs. limit=15.0 2024-06-21 21:08:30,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=457878.6666666667, ans=0.2 2024-06-21 21:08:33,514 INFO [train.py:1028] (0/2) Epoch 25, batch 6950, loss[loss=0.1812, simple_loss=0.2547, pruned_loss=0.05384, over 11633.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2577, pruned_loss=0.07009, over 2579798.07 frames. 
], batch size: 17, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:08:43,114 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.867e+02 2.330e+02 2.504e+02 2.691e+02 3.500e+02, threshold=5.008e+02, percent-clipped=0.0 2024-06-21 21:08:46,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=457933.6666666667, ans=0.125 2024-06-21 21:08:47,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=457933.6666666667, ans=0.125 2024-06-21 21:08:53,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457952.0, ans=0.1 2024-06-21 21:08:57,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=457952.0, ans=0.05 2024-06-21 21:09:04,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.14 vs. limit=15.0 2024-06-21 21:09:05,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=457970.3333333333, ans=0.2 2024-06-21 21:09:06,335 INFO [train.py:1028] (0/2) Epoch 25, batch 7000, loss[loss=0.2075, simple_loss=0.2566, pruned_loss=0.07925, over 12908.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2576, pruned_loss=0.06978, over 2575642.58 frames. ], batch size: 158, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:09:06,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=457988.6666666667, ans=0.125 2024-06-21 21:09:10,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=457988.6666666667, ans=0.125 2024-06-21 21:09:12,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=458007.0, ans=0.125 2024-06-21 21:09:13,027 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.300e+00 2024-06-21 21:09:17,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458007.0, ans=0.1 2024-06-21 21:09:23,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=458025.3333333333, ans=0.0 2024-06-21 21:09:26,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=458025.3333333333, ans=0.0 2024-06-21 21:09:34,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458043.6666666667, ans=0.1 2024-06-21 21:09:39,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=458043.6666666667, ans=0.125 2024-06-21 21:09:39,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=458043.6666666667, ans=10.0 2024-06-21 21:09:47,791 INFO [train.py:1028] (0/2) Epoch 25, batch 7050, loss[loss=0.187, simple_loss=0.2476, pruned_loss=0.06325, over 12753.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2588, pruned_loss=0.07018, over 2582161.78 frames. 
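], batch size: 176, lr: 2.30e-03, grad_scale: 32.0

The WithLoss entries (scaling.py:1119) track auxiliary penalties attached to specific tensors, here attention weights; the penalty is usually inactive (loss-sum=0.000e+00) but occasionally fires, as in the loss-sum=4.300e+00 reading above. A hedged sketch of the straight-through trick such a hook can use, passing the tensor through unchanged while injecting the penalty's gradient into the backward pass; icefall's actual implementation may differ in its details:

import torch

class WithLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, loss, name):
        ctx.loss_shape = loss.shape  # the value logged as loss-sum
        return x  # forward value is unchanged

    @staticmethod
    def backward(ctx, grad_output):
        # d(objective)/d(loss) = 1: the penalty joins backprop with weight 1,
        # regardless of what consumes x downstream.
        ones = torch.ones(ctx.loss_shape, dtype=grad_output.dtype, device=grad_output.device)
        return grad_output, ones, None

# Usage sketch: attn_weights = WithLoss.apply(attn_weights, penalty, "self_attn_weights")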
2024-06-21 21:09:52,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=458080.3333333333, ans=15.0 2024-06-21 21:09:55,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=458098.6666666667, ans=0.2 2024-06-21 21:09:57,377 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.373e+02 2.536e+02 2.844e+02 4.084e+02, threshold=5.073e+02, percent-clipped=0.0 2024-06-21 21:09:57,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=458098.6666666667, ans=0.125 2024-06-21 21:09:58,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.45 vs. limit=15.0 2024-06-21 21:09:58,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=458098.6666666667, ans=0.0 2024-06-21 21:10:00,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=458117.0, ans=0.0 2024-06-21 21:10:02,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=458117.0, ans=0.125 2024-06-21 21:10:09,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=458135.3333333333, ans=0.125 2024-06-21 21:10:11,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=458135.3333333333, ans=0.125 2024-06-21 21:10:11,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458135.3333333333, ans=0.1 2024-06-21 21:10:13,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.16 vs. limit=22.5 2024-06-21 21:10:19,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458153.6666666667, ans=0.1 2024-06-21 21:10:20,306 INFO [train.py:1028] (0/2) Epoch 25, batch 7100, loss[loss=0.2201, simple_loss=0.2831, pruned_loss=0.07861, over 13144.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2591, pruned_loss=0.07057, over 2574534.28 frames.
], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:10:21,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=458172.0, ans=0.125 2024-06-21 21:10:27,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=458190.3333333333, ans=0.125 2024-06-21 21:10:27,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=458190.3333333333, ans=0.1 2024-06-21 21:10:29,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=458190.3333333333, ans=0.0 2024-06-21 21:10:33,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=458208.6666666667, ans=0.2 2024-06-21 21:10:43,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=15.0 2024-06-21 21:10:43,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=15.0 2024-06-21 21:10:50,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=458245.3333333333, ans=0.2 2024-06-21 21:10:53,975 INFO [train.py:1028] (0/2) Epoch 25, batch 7150, loss[loss=0.2395, simple_loss=0.2914, pruned_loss=0.09385, over 12560.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2594, pruned_loss=0.07043, over 2573899.97 frames. ], batch size: 202, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:10:54,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=458263.6666666667, ans=0.0 2024-06-21 21:11:00,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=458282.0, ans=0.125 2024-06-21 21:11:00,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=458282.0, ans=0.05 2024-06-21 21:11:03,839 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.248e+02 2.445e+02 2.669e+02 3.399e+02, threshold=4.891e+02, percent-clipped=0.0 2024-06-21 21:11:07,644 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.95 vs. limit=22.5 2024-06-21 21:11:09,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=458300.3333333333, ans=0.0 2024-06-21 21:11:20,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=458337.0, ans=0.0 2024-06-21 21:11:22,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458337.0, ans=0.1 2024-06-21 21:11:27,346 INFO [train.py:1028] (0/2) Epoch 25, batch 7200, loss[loss=0.2201, simple_loss=0.2817, pruned_loss=0.0792, over 13197.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2606, pruned_loss=0.07092, over 2578273.78 frames. 
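], batch size: 112, lr: 2.30e-03, grad_scale: 64.0

grad_scale in these entries is the dynamic fp16 loss scale: it was halved from 64.0 to 32.0 around batch 5200 (the usual reaction to an inf/nan gradient) and has just grown back to 64.0 at batch 7200 above after a long overflow-free stretch. A minimal sketch of that dynamic with PyTorch's stock scaler; the growth interval is illustrative, not necessarily this run's setting:

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,       # the value logged after the halving at batch 5200
    growth_factor=2.0,     # doubles the scale (32.0 -> 64.0 here) ...
    backoff_factor=0.5,    # ... and halves it when gradients overflow
    growth_interval=2000,  # after this many clean steps (illustrative)
)

# Schematic training step:
#   with torch.cuda.amp.autocast():
#       loss = compute_loss(model, batch)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()        # grows or backs off; scaler.get_scale() is what's logged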
2024-06-21 21:11:34,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=458373.6666666667, ans=0.125 2024-06-21 21:11:34,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.92 vs. limit=15.0 2024-06-21 21:11:37,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=458373.6666666667, ans=0.125 2024-06-21 21:11:38,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=458373.6666666667, ans=0.125 2024-06-21 21:11:47,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458392.0, ans=0.1 2024-06-21 21:12:07,339 INFO [train.py:1028] (0/2) Epoch 25, batch 7250, loss[loss=0.1865, simple_loss=0.2551, pruned_loss=0.05891, over 12889.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2608, pruned_loss=0.07076, over 2578486.10 frames. ], batch size: 36, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:12:08,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=458447.0, ans=0.125 2024-06-21 21:12:17,399 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.294e+02 2.462e+02 2.674e+02 3.557e+02, threshold=4.924e+02, percent-clipped=0.0 2024-06-21 21:12:22,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=458483.6666666667, ans=0.125 2024-06-21 21:12:26,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458502.0, ans=0.1 2024-06-21 21:12:28,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=458502.0, ans=0.1 2024-06-21 21:12:32,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458502.0, ans=0.1 2024-06-21 21:12:39,945 INFO [train.py:1028] (0/2) Epoch 25, batch 7300, loss[loss=0.1963, simple_loss=0.2652, pruned_loss=0.06366, over 12965.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2622, pruned_loss=0.07131, over 2578516.53 frames.
], batch size: 36, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:12:40,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=458538.6666666667, ans=0.2 2024-06-21 21:12:47,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=458557.0, ans=0.1 2024-06-21 21:12:48,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=458557.0, ans=0.125 2024-06-21 21:12:54,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=458575.3333333333, ans=0.125 2024-06-21 21:12:56,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=458575.3333333333, ans=0.0 2024-06-21 21:12:57,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=458575.3333333333, ans=0.0 2024-06-21 21:12:58,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458575.3333333333, ans=0.1 2024-06-21 21:13:12,792 INFO [train.py:1028] (0/2) Epoch 25, batch 7350, loss[loss=0.2107, simple_loss=0.2687, pruned_loss=0.0764, over 13289.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2632, pruned_loss=0.07189, over 2581439.27 frames. ], batch size: 46, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:13:15,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=458630.3333333333, ans=0.025 2024-06-21 21:13:21,021 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.13 vs. limit=12.0 2024-06-21 21:13:22,549 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.351e+02 2.469e+02 2.728e+02 3.618e+02, threshold=4.938e+02, percent-clipped=0.0 2024-06-21 21:13:23,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=458648.6666666667, ans=0.125 2024-06-21 21:13:32,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=458685.3333333333, ans=0.125 2024-06-21 21:13:35,360 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.43 vs. limit=10.0 2024-06-21 21:13:37,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2024-06-21 21:13:45,857 INFO [train.py:1028] (0/2) Epoch 25, batch 7400, loss[loss=0.2099, simple_loss=0.2754, pruned_loss=0.07222, over 13292.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2635, pruned_loss=0.07176, over 2586334.94 frames. 
], batch size: 63, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:13:52,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=458722.0, ans=0.2 2024-06-21 21:13:55,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458740.3333333333, ans=0.1 2024-06-21 21:14:03,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=458740.3333333333, ans=0.125 2024-06-21 21:14:03,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=458740.3333333333, ans=0.125 2024-06-21 21:14:11,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.49 vs. limit=10.0 2024-06-21 21:14:16,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.86 vs. limit=15.0 2024-06-21 21:14:19,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=458795.3333333333, ans=0.2 2024-06-21 21:14:21,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=458795.3333333333, ans=0.125 2024-06-21 21:14:22,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.36 vs. limit=22.5 2024-06-21 21:14:26,861 INFO [train.py:1028] (0/2) Epoch 25, batch 7450, loss[loss=0.1827, simple_loss=0.2395, pruned_loss=0.06289, over 12680.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2633, pruned_loss=0.07145, over 2579404.13 frames. ], batch size: 29, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:14:28,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=458813.6666666667, ans=0.2 2024-06-21 21:14:31,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=458813.6666666667, ans=0.0 2024-06-21 21:14:34,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.24 vs. limit=15.0 2024-06-21 21:14:37,004 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.318e+02 2.456e+02 2.708e+02 4.154e+02, threshold=4.912e+02, percent-clipped=0.0 2024-06-21 21:14:49,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=458868.6666666667, ans=0.125 2024-06-21 21:14:56,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=458887.0, ans=0.125 2024-06-21 21:14:56,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=458887.0, ans=0.04949747468305833 2024-06-21 21:15:00,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=458905.3333333333, ans=0.0 2024-06-21 21:15:00,499 INFO [train.py:1028] (0/2) Epoch 25, batch 7500, loss[loss=0.2239, simple_loss=0.2683, pruned_loss=0.08976, over 10538.00 frames. 
], tot_loss[loss=0.204, simple_loss=0.2639, pruned_loss=0.07208, over 2576397.84 frames. ], batch size: 303, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:15:03,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=458905.3333333333, ans=0.125 2024-06-21 21:15:13,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=458942.0, ans=0.025 2024-06-21 21:15:19,159 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2024-06-21 21:15:19,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=458942.0, ans=0.0 2024-06-21 21:15:20,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=458960.3333333333, ans=0.125 2024-06-21 21:15:21,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458960.3333333333, ans=0.1 2024-06-21 21:15:23,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458960.3333333333, ans=0.1 2024-06-21 21:15:25,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=458960.3333333333, ans=0.2 2024-06-21 21:15:30,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=458978.6666666667, ans=0.125 2024-06-21 21:15:33,412 INFO [train.py:1028] (0/2) Epoch 25, batch 7550, loss[loss=0.2168, simple_loss=0.2623, pruned_loss=0.08566, over 12930.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2641, pruned_loss=0.07263, over 2575062.82 frames. ], batch size: 158, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:15:43,311 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.394e+02 2.563e+02 2.773e+02 3.568e+02, threshold=5.125e+02, percent-clipped=0.0 2024-06-21 21:15:44,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=459015.3333333333, ans=0.125 2024-06-21 21:16:13,326 INFO [train.py:1028] (0/2) Epoch 25, batch 7600, loss[loss=0.2211, simple_loss=0.2779, pruned_loss=0.08213, over 13217.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2651, pruned_loss=0.07304, over 2573811.47 frames. ], batch size: 83, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:16:16,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.28 vs. limit=15.0 2024-06-21 21:16:27,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=459125.3333333333, ans=10.0 2024-06-21 21:16:39,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=459162.0, ans=0.0 2024-06-21 21:16:39,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=459162.0, ans=0.0 2024-06-21 21:16:46,606 INFO [train.py:1028] (0/2) Epoch 25, batch 7650, loss[loss=0.2111, simple_loss=0.2759, pruned_loss=0.07311, over 13001.00 frames. 
], tot_loss[loss=0.2058, simple_loss=0.2655, pruned_loss=0.07308, over 2569504.30 frames. ], batch size: 33, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:16:50,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=459180.3333333333, ans=0.125 2024-06-21 21:16:50,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=459180.3333333333, ans=0.125 2024-06-21 21:16:56,684 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.347e+02 2.534e+02 2.831e+02 4.293e+02, threshold=5.068e+02, percent-clipped=0.0 2024-06-21 21:16:58,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=459198.6666666667, ans=0.0 2024-06-21 21:17:04,144 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:17:06,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.62 vs. limit=22.5 2024-06-21 21:17:07,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=459235.3333333333, ans=0.2 2024-06-21 21:17:10,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=459235.3333333333, ans=0.125 2024-06-21 21:17:17,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=459253.6666666667, ans=0.125 2024-06-21 21:17:19,864 INFO [train.py:1028] (0/2) Epoch 25, batch 7700, loss[loss=0.2044, simple_loss=0.2706, pruned_loss=0.06911, over 13270.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.266, pruned_loss=0.07316, over 2566820.40 frames. ], batch size: 63, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:17:20,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459272.0, ans=0.1 2024-06-21 21:17:33,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=459308.6666666667, ans=0.2 2024-06-21 21:17:49,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=459327.0, ans=0.125 2024-06-21 21:17:49,651 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:17:50,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=15.0 2024-06-21 21:17:53,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.39 vs. limit=15.0 2024-06-21 21:17:59,895 INFO [train.py:1028] (0/2) Epoch 25, batch 7750, loss[loss=0.1887, simple_loss=0.2609, pruned_loss=0.05823, over 13225.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2666, pruned_loss=0.07347, over 2570576.85 frames. 
], batch size: 72, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:18:09,770 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.324e+02 2.449e+02 2.639e+02 3.398e+02, threshold=4.899e+02, percent-clipped=0.0 2024-06-21 21:18:12,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=459400.3333333333, ans=0.0 2024-06-21 21:18:13,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=459400.3333333333, ans=0.125 2024-06-21 21:18:18,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.89 vs. limit=15.0 2024-06-21 21:18:19,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=459418.6666666667, ans=0.125 2024-06-21 21:18:21,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=459418.6666666667, ans=0.125 2024-06-21 21:18:23,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=459418.6666666667, ans=0.025 2024-06-21 21:18:32,761 INFO [train.py:1028] (0/2) Epoch 25, batch 7800, loss[loss=0.2102, simple_loss=0.2618, pruned_loss=0.0793, over 13125.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2667, pruned_loss=0.0732, over 2576080.80 frames. ], batch size: 95, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:18:39,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459473.6666666667, ans=0.125 2024-06-21 21:18:41,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459473.6666666667, ans=0.1 2024-06-21 21:18:58,485 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2024-06-21 21:19:01,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=459528.6666666667, ans=0.125 2024-06-21 21:19:05,707 INFO [train.py:1028] (0/2) Epoch 25, batch 7850, loss[loss=0.2025, simple_loss=0.2601, pruned_loss=0.07246, over 11685.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2672, pruned_loss=0.0735, over 2571406.92 frames. ], batch size: 17, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:19:12,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.16 vs. limit=15.0 2024-06-21 21:19:15,598 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.383e+02 2.508e+02 2.750e+02 3.302e+02, threshold=5.017e+02, percent-clipped=0.0 2024-06-21 21:19:22,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=459583.6666666667, ans=0.0 2024-06-21 21:19:23,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=459583.6666666667, ans=0.0 2024-06-21 21:19:31,620 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=10.44 vs. 
limit=15.0 2024-06-21 21:19:45,193 INFO [train.py:1028] (0/2) Epoch 25, batch 7900, loss[loss=0.2136, simple_loss=0.2756, pruned_loss=0.07576, over 13147.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2678, pruned_loss=0.07377, over 2570073.84 frames. ], batch size: 77, lr: 2.30e-03, grad_scale: 64.0 2024-06-21 21:19:47,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=459638.6666666667, ans=0.0 2024-06-21 21:19:48,995 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=15.0 2024-06-21 21:19:50,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=459638.6666666667, ans=0.0 2024-06-21 21:19:54,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=459657.0, ans=0.07 2024-06-21 21:20:03,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=459675.3333333333, ans=0.0 2024-06-21 21:20:14,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=459712.0, ans=0.025 2024-06-21 21:20:18,595 INFO [train.py:1028] (0/2) Epoch 25, batch 7950, loss[loss=0.233, simple_loss=0.2768, pruned_loss=0.0946, over 10336.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2679, pruned_loss=0.07396, over 2572826.45 frames. ], batch size: 303, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:20:23,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=459730.3333333333, ans=0.2 2024-06-21 21:20:25,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=459748.6666666667, ans=0.0 2024-06-21 21:20:26,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=459748.6666666667, ans=0.125 2024-06-21 21:20:29,504 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.353e+02 2.495e+02 2.656e+02 3.689e+02, threshold=4.991e+02, percent-clipped=0.0 2024-06-21 21:20:31,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459748.6666666667, ans=0.1 2024-06-21 21:20:31,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=459748.6666666667, ans=0.0 2024-06-21 21:20:32,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=459767.0, ans=0.125 2024-06-21 21:20:40,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=459785.3333333333, ans=0.125 2024-06-21 21:20:52,359 INFO [train.py:1028] (0/2) Epoch 25, batch 8000, loss[loss=0.1862, simple_loss=0.2537, pruned_loss=0.05936, over 12658.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2683, pruned_loss=0.07399, over 2569620.93 frames. ], batch size: 29, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:20:54,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=12.0 2024-06-21 21:20:58,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=459840.3333333333, ans=0.025 2024-06-21 21:21:07,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=459858.6666666667, ans=0.2 2024-06-21 21:21:14,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.58 vs. limit=22.5 2024-06-21 21:21:17,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=459877.0, ans=0.0 2024-06-21 21:21:25,835 INFO [train.py:1028] (0/2) Epoch 25, batch 8050, loss[loss=0.2157, simple_loss=0.2788, pruned_loss=0.07636, over 13248.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2682, pruned_loss=0.07379, over 2570755.79 frames. ], batch size: 83, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:21:40,126 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.356e+02 2.506e+02 2.722e+02 3.626e+02, threshold=5.012e+02, percent-clipped=0.0 2024-06-21 21:21:49,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.31 vs. limit=22.5 2024-06-21 21:21:53,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.21 vs. limit=15.0 2024-06-21 21:21:53,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=459968.6666666667, ans=0.125 2024-06-21 21:22:04,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=460005.3333333333, ans=0.035 2024-06-21 21:22:05,344 INFO [train.py:1028] (0/2) Epoch 25, batch 8100, loss[loss=0.2044, simple_loss=0.2667, pruned_loss=0.07106, over 13121.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2687, pruned_loss=0.07389, over 2576132.88 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:22:13,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=460023.6666666667, ans=0.0 2024-06-21 21:22:23,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=460042.0, ans=0.125 2024-06-21 21:22:27,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=460060.3333333333, ans=0.125 2024-06-21 21:22:33,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=460078.6666666667, ans=0.125 2024-06-21 21:22:39,088 INFO [train.py:1028] (0/2) Epoch 25, batch 8150, loss[loss=0.1844, simple_loss=0.238, pruned_loss=0.06539, over 13139.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2688, pruned_loss=0.07362, over 2579509.43 frames. ], batch size: 121, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:22:41,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=460097.0, ans=0.125 2024-06-21 21:22:44,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.54 vs. 
limit=15.0 2024-06-21 21:22:47,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460115.3333333333, ans=0.0 2024-06-21 21:22:50,133 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.342e+02 2.457e+02 2.669e+02 3.414e+02, threshold=4.913e+02, percent-clipped=0.0 2024-06-21 21:22:50,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=460115.3333333333, ans=0.2 2024-06-21 21:22:50,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.95 vs. limit=15.0 2024-06-21 21:22:52,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=460133.6666666667, ans=0.025 2024-06-21 21:23:10,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=460170.3333333333, ans=0.125 2024-06-21 21:23:12,957 INFO [train.py:1028] (0/2) Epoch 25, batch 8200, loss[loss=0.2171, simple_loss=0.279, pruned_loss=0.07759, over 13140.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2695, pruned_loss=0.07393, over 2583231.87 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:23:18,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=460188.6666666667, ans=0.125 2024-06-21 21:23:25,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=460207.0, ans=0.1 2024-06-21 21:23:30,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=460225.3333333333, ans=0.1 2024-06-21 21:23:34,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=460243.6666666667, ans=0.125 2024-06-21 21:23:34,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=460243.6666666667, ans=0.2 2024-06-21 21:23:35,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=460243.6666666667, ans=0.125 2024-06-21 21:23:46,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2024-06-21 21:23:49,567 INFO [train.py:1028] (0/2) Epoch 25, batch 8250, loss[loss=0.1975, simple_loss=0.263, pruned_loss=0.06597, over 13219.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2697, pruned_loss=0.07408, over 2582911.21 frames. 
], batch size: 52, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:23:55,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=460280.3333333333, ans=0.02 2024-06-21 21:23:57,174 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460280.3333333333, ans=0.0 2024-06-21 21:24:03,330 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.331e+02 2.503e+02 2.715e+02 4.068e+02, threshold=5.006e+02, percent-clipped=0.0 2024-06-21 21:24:08,122 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.92 vs. limit=15.0 2024-06-21 21:24:25,359 INFO [train.py:1028] (0/2) Epoch 25, batch 8300, loss[loss=0.2223, simple_loss=0.2807, pruned_loss=0.08196, over 13125.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2696, pruned_loss=0.0739, over 2580317.30 frames. ], batch size: 103, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:24:31,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=460390.3333333333, ans=0.125 2024-06-21 21:24:31,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=460390.3333333333, ans=0.125 2024-06-21 21:24:36,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=460390.3333333333, ans=0.2 2024-06-21 21:24:49,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=460427.0, ans=0.035 2024-06-21 21:24:51,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=460445.3333333333, ans=0.125 2024-06-21 21:24:58,463 INFO [train.py:1028] (0/2) Epoch 25, batch 8350, loss[loss=0.2439, simple_loss=0.2982, pruned_loss=0.09481, over 13178.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2698, pruned_loss=0.07367, over 2582319.82 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:25:06,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=460482.0, ans=0.125 2024-06-21 21:25:09,290 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.364e+02 2.492e+02 2.693e+02 3.725e+02, threshold=4.984e+02, percent-clipped=0.0 2024-06-21 21:25:12,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=460500.3333333333, ans=0.125 2024-06-21 21:25:32,364 INFO [train.py:1028] (0/2) Epoch 25, batch 8400, loss[loss=0.2186, simple_loss=0.2765, pruned_loss=0.08034, over 12995.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2702, pruned_loss=0.07418, over 2577606.13 frames. 
], batch size: 39, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:25:32,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=460555.3333333333, ans=0.1 2024-06-21 21:26:06,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=460628.6666666667, ans=0.0 2024-06-21 21:26:09,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460628.6666666667, ans=0.0 2024-06-21 21:26:09,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=460628.6666666667, ans=0.125 2024-06-21 21:26:11,625 INFO [train.py:1028] (0/2) Epoch 25, batch 8450, loss[loss=0.221, simple_loss=0.2876, pruned_loss=0.07717, over 13151.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2707, pruned_loss=0.07422, over 2579019.03 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:26:22,116 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.348e+02 2.633e+02 2.920e+02 3.877e+02, threshold=5.267e+02, percent-clipped=0.0 2024-06-21 21:26:28,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=460683.6666666667, ans=0.125 2024-06-21 21:26:32,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=460702.0, ans=0.1 2024-06-21 21:26:34,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.25 vs. limit=15.0 2024-06-21 21:26:35,490 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:26:36,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=460702.0, ans=0.125 2024-06-21 21:26:36,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=460702.0, ans=0.025 2024-06-21 21:26:44,762 INFO [train.py:1028] (0/2) Epoch 25, batch 8500, loss[loss=0.2037, simple_loss=0.2708, pruned_loss=0.06823, over 12536.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2719, pruned_loss=0.07472, over 2577855.09 frames. ], batch size: 29, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:26:46,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=15.0 2024-06-21 21:26:48,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460738.6666666667, ans=0.1 2024-06-21 21:26:58,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=460775.3333333333, ans=0.0 2024-06-21 21:27:13,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=460812.0, ans=0.125 2024-06-21 21:27:15,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=460812.0, ans=0.125 2024-06-21 21:27:16,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=22.5 2024-06-21 21:27:17,876 INFO [train.py:1028] (0/2) Epoch 25, batch 8550, loss[loss=0.2043, simple_loss=0.2656, pruned_loss=0.07153, over 12498.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2712, pruned_loss=0.07428, over 2575087.76 frames. ], batch size: 22, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:27:17,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=460830.3333333333, ans=0.125 2024-06-21 21:27:21,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=460830.3333333333, ans=0.025 2024-06-21 21:27:21,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460830.3333333333, ans=0.1 2024-06-21 21:27:23,228 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:27:28,309 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.383e+02 2.565e+02 2.842e+02 3.625e+02, threshold=5.131e+02, percent-clipped=0.0 2024-06-21 21:27:29,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=460848.6666666667, ans=0.0 2024-06-21 21:27:37,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=460885.3333333333, ans=0.0 2024-06-21 21:27:47,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=460903.6666666667, ans=0.025 2024-06-21 21:27:57,348 INFO [train.py:1028] (0/2) Epoch 25, batch 8600, loss[loss=0.1939, simple_loss=0.2508, pruned_loss=0.0685, over 13145.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2718, pruned_loss=0.07437, over 2572454.45 frames. ], batch size: 112, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:28:01,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=460922.0, ans=0.0 2024-06-21 21:28:09,707 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.73 vs. 
limit=22.5 2024-06-21 21:28:20,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=460977.0, ans=0.0 2024-06-21 21:28:28,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460995.3333333333, ans=0.0 2024-06-21 21:28:29,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=460995.3333333333, ans=0.0 2024-06-21 21:28:30,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=461013.6666666667, ans=0.125 2024-06-21 21:28:30,757 INFO [train.py:1028] (0/2) Epoch 25, batch 8650, loss[loss=0.1801, simple_loss=0.2353, pruned_loss=0.06242, over 13046.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2715, pruned_loss=0.07406, over 2575631.60 frames. ], batch size: 102, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:28:31,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=461013.6666666667, ans=0.2 2024-06-21 21:28:38,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2024-06-21 21:28:41,147 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.301e+02 2.464e+02 2.674e+02 4.058e+02, threshold=4.927e+02, percent-clipped=0.0 2024-06-21 21:28:43,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.71 vs. limit=15.0 2024-06-21 21:29:00,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=461087.0, ans=0.025 2024-06-21 21:29:00,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=461087.0, ans=0.2 2024-06-21 21:29:03,641 INFO [train.py:1028] (0/2) Epoch 25, batch 8700, loss[loss=0.2199, simple_loss=0.2852, pruned_loss=0.0773, over 13222.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2718, pruned_loss=0.07451, over 2572750.57 frames. ], batch size: 59, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:29:05,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=461105.3333333333, ans=0.125 2024-06-21 21:29:08,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=461105.3333333333, ans=0.025 2024-06-21 21:29:15,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=461123.6666666667, ans=0.125 2024-06-21 21:29:16,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=461142.0, ans=0.0 2024-06-21 21:29:18,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=461142.0, ans=0.2 2024-06-21 21:29:22,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.77 vs. 
limit=15.0 2024-06-21 21:29:33,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=461178.6666666667, ans=0.125 2024-06-21 21:29:37,057 INFO [train.py:1028] (0/2) Epoch 25, batch 8750, loss[loss=0.2025, simple_loss=0.259, pruned_loss=0.07295, over 13058.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2719, pruned_loss=0.07468, over 2569753.98 frames. ], batch size: 121, lr: 2.30e-03, grad_scale: 32.0 2024-06-21 21:29:44,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=461197.0, ans=0.125 2024-06-21 21:29:46,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=461215.3333333333, ans=0.125 2024-06-21 21:29:52,031 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.27 vs. limit=10.0 2024-06-21 21:29:54,219 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.404e+02 2.582e+02 2.752e+02 5.680e+02, threshold=5.165e+02, percent-clipped=1.0 2024-06-21 21:29:58,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=461233.6666666667, ans=0.125 2024-06-21 21:29:58,967 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.50 vs. limit=12.0 2024-06-21 21:30:06,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=461252.0, ans=0.125 2024-06-21 21:30:08,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=461252.0, ans=0.125 2024-06-21 21:30:14,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=461270.3333333333, ans=0.0 2024-06-21 21:30:17,542 INFO [train.py:1028] (0/2) Epoch 25, batch 8800, loss[loss=0.2124, simple_loss=0.2777, pruned_loss=0.07351, over 13088.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2721, pruned_loss=0.07474, over 2573386.95 frames. ], batch size: 71, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:30:30,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=461307.0, ans=0.125 2024-06-21 21:30:32,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=461325.3333333333, ans=0.0 2024-06-21 21:30:39,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=461343.6666666667, ans=0.09899494936611666 2024-06-21 21:30:43,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=461343.6666666667, ans=0.0 2024-06-21 21:30:44,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=461343.6666666667, ans=0.125 2024-06-21 21:30:52,390 INFO [train.py:1028] (0/2) Epoch 25, batch 8850, loss[loss=0.218, simple_loss=0.2739, pruned_loss=0.08112, over 12526.00 frames. ], tot_loss[loss=0.211, simple_loss=0.272, pruned_loss=0.075, over 2562101.10 frames. 
], batch size: 202, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:30:59,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=461398.6666666667, ans=0.0 2024-06-21 21:31:03,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.387e+02 2.507e+02 2.660e+02 3.786e+02, threshold=5.013e+02, percent-clipped=0.0 2024-06-21 21:31:04,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=461398.6666666667, ans=0.125 2024-06-21 21:31:07,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=461417.0, ans=0.125 2024-06-21 21:31:08,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=461417.0, ans=0.125 2024-06-21 21:31:13,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461435.3333333333, ans=0.1 2024-06-21 21:31:16,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=461435.3333333333, ans=0.125 2024-06-21 21:31:26,610 INFO [train.py:1028] (0/2) Epoch 25, batch 8900, loss[loss=0.2313, simple_loss=0.2982, pruned_loss=0.08217, over 12922.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2724, pruned_loss=0.0753, over 2560068.52 frames. ], batch size: 33, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:31:30,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461472.0, ans=0.1 2024-06-21 21:31:43,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461508.6666666667, ans=0.125 2024-06-21 21:31:43,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=461508.6666666667, ans=0.025 2024-06-21 21:31:45,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2024-06-21 21:31:47,226 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2024-06-21 21:32:04,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=461545.3333333333, ans=0.125 2024-06-21 21:32:07,167 INFO [train.py:1028] (0/2) Epoch 25, batch 8950, loss[loss=0.2329, simple_loss=0.288, pruned_loss=0.0889, over 12514.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2731, pruned_loss=0.0751, over 2560372.44 frames. ], batch size: 202, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:32:17,831 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.356e+02 2.522e+02 2.764e+02 3.436e+02, threshold=5.045e+02, percent-clipped=0.0 2024-06-21 21:32:35,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. 
limit=15.0 2024-06-21 21:32:39,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=461637.0, ans=0.0 2024-06-21 21:32:41,106 INFO [train.py:1028] (0/2) Epoch 25, batch 9000, loss[loss=0.2225, simple_loss=0.2858, pruned_loss=0.07956, over 13336.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2734, pruned_loss=0.0751, over 2567568.13 frames. ], batch size: 46, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:32:41,108 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 21:32:49,003 INFO [train.py:1060] (0/2) Epoch 25, validation: loss=0.19, simple_loss=0.2509, pruned_loss=0.06457, over 351949.00 frames. 2024-06-21 21:32:49,004 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 21:32:51,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=461655.3333333333, ans=0.125 2024-06-21 21:33:08,492 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2024-06-21 21:33:12,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461710.3333333333, ans=0.1 2024-06-21 21:33:22,022 INFO [train.py:1028] (0/2) Epoch 25, batch 9050, loss[loss=0.199, simple_loss=0.2629, pruned_loss=0.06757, over 11594.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2737, pruned_loss=0.07535, over 2567113.55 frames. ], batch size: 17, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:33:26,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=461747.0, ans=0.125 2024-06-21 21:33:32,179 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.342e+02 2.449e+02 2.640e+02 3.471e+02, threshold=4.898e+02, percent-clipped=0.0 2024-06-21 21:33:32,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.21 vs. limit=10.0 2024-06-21 21:33:33,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2024-06-21 21:33:54,079 INFO [train.py:1028] (0/2) Epoch 25, batch 9100, loss[loss=0.2062, simple_loss=0.2719, pruned_loss=0.07028, over 13237.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2734, pruned_loss=0.0752, over 2568984.08 frames. ], batch size: 72, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:33:57,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=461838.6666666667, ans=0.125 2024-06-21 21:34:04,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=461857.0, ans=0.2 2024-06-21 21:34:06,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=461875.3333333333, ans=0.125 2024-06-21 21:34:07,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.18 vs. 
limit=15.0 2024-06-21 21:34:08,195 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:34:09,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=461875.3333333333, ans=0.2 2024-06-21 21:34:19,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=461912.0, ans=0.025 2024-06-21 21:34:25,785 INFO [train.py:1028] (0/2) Epoch 25, batch 9150, loss[loss=0.1971, simple_loss=0.2633, pruned_loss=0.06543, over 13201.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2731, pruned_loss=0.07486, over 2570016.54 frames. ], batch size: 77, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:34:33,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=461948.6666666667, ans=0.125 2024-06-21 21:34:37,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=461948.6666666667, ans=0.125 2024-06-21 21:34:37,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0 2024-06-21 21:34:38,770 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.320e+02 2.479e+02 2.654e+02 3.424e+02, threshold=4.957e+02, percent-clipped=0.0 2024-06-21 21:34:44,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=461967.0, ans=0.125 2024-06-21 21:34:52,344 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-252000.pt 2024-06-21 21:35:00,007 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:35:08,387 INFO [train.py:1028] (0/2) Epoch 25, batch 9200, loss[loss=0.1889, simple_loss=0.251, pruned_loss=0.06342, over 12817.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2731, pruned_loss=0.07462, over 2572018.10 frames. ], batch size: 36, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:35:09,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=462022.0, ans=0.0 2024-06-21 21:35:11,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462022.0, ans=0.1 2024-06-21 21:35:16,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=462040.3333333333, ans=0.2 2024-06-21 21:35:23,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=462058.6666666667, ans=0.0 2024-06-21 21:35:27,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=462077.0, ans=0.125 2024-06-21 21:35:34,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=462095.3333333333, ans=0.125 2024-06-21 21:35:39,848 INFO [train.py:1028] (0/2) Epoch 25, batch 9250, loss[loss=0.2295, simple_loss=0.2927, pruned_loss=0.08316, over 13196.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2733, pruned_loss=0.07448, over 2575480.68 frames. 
], batch size: 67, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:35:49,868 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.363e+02 2.490e+02 2.609e+02 3.303e+02, threshold=4.981e+02, percent-clipped=0.0 2024-06-21 21:35:54,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=462150.3333333333, ans=0.125 2024-06-21 21:35:59,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=462168.6666666667, ans=0.0 2024-06-21 21:36:03,659 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.665e+01 2024-06-21 21:36:11,413 INFO [train.py:1028] (0/2) Epoch 25, batch 9300, loss[loss=0.1763, simple_loss=0.2394, pruned_loss=0.05659, over 12867.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2732, pruned_loss=0.07458, over 2570673.03 frames. ], batch size: 39, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:36:14,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462205.3333333333, ans=0.1 2024-06-21 21:36:15,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=462205.3333333333, ans=0.125 2024-06-21 21:36:16,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=462205.3333333333, ans=0.125 2024-06-21 21:36:17,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462223.6666666667, ans=0.1 2024-06-21 21:36:24,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=462242.0, ans=0.125 2024-06-21 21:36:38,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=462278.6666666667, ans=0.125 2024-06-21 21:36:42,718 INFO [train.py:1028] (0/2) Epoch 25, batch 9350, loss[loss=0.1847, simple_loss=0.2588, pruned_loss=0.05526, over 12487.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2735, pruned_loss=0.07467, over 2567517.93 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:36:46,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=15.0 2024-06-21 21:36:52,661 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.359e+02 2.535e+02 2.821e+02 3.794e+02, threshold=5.070e+02, percent-clipped=0.0 2024-06-21 21:36:57,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=462333.6666666667, ans=0.125 2024-06-21 21:37:11,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=462370.3333333333, ans=0.125 2024-06-21 21:37:13,598 INFO [train.py:1028] (0/2) Epoch 25, batch 9400, loss[loss=0.205, simple_loss=0.2716, pruned_loss=0.06915, over 13283.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2737, pruned_loss=0.07474, over 2566823.83 frames. 
], batch size: 52, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:37:16,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=462388.6666666667, ans=0.125 2024-06-21 21:37:31,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=462443.6666666667, ans=0.0 2024-06-21 21:37:38,343 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.36 vs. limit=15.0 2024-06-21 21:37:43,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462462.0, ans=0.1 2024-06-21 21:37:44,250 INFO [train.py:1028] (0/2) Epoch 25, batch 9450, loss[loss=0.2062, simple_loss=0.2739, pruned_loss=0.06922, over 12642.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2745, pruned_loss=0.07522, over 2566796.40 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:37:45,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=462480.3333333333, ans=0.125 2024-06-21 21:37:54,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=462498.6666666667, ans=0.0 2024-06-21 21:37:56,794 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.336e+02 2.451e+02 2.684e+02 4.201e+02, threshold=4.901e+02, percent-clipped=0.0 2024-06-21 21:37:57,131 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2024-06-21 21:38:08,372 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:38:19,571 INFO [train.py:1028] (0/2) Epoch 25, batch 9500, loss[loss=0.2039, simple_loss=0.272, pruned_loss=0.06791, over 13272.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2737, pruned_loss=0.07448, over 2576976.31 frames. ], batch size: 43, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:38:30,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=462590.3333333333, ans=0.07 2024-06-21 21:38:32,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=462608.6666666667, ans=0.07 2024-06-21 21:38:39,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=462627.0, ans=0.125 2024-06-21 21:38:40,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=462627.0, ans=0.125 2024-06-21 21:38:50,606 INFO [train.py:1028] (0/2) Epoch 25, batch 9550, loss[loss=0.1915, simple_loss=0.2567, pruned_loss=0.06315, over 12958.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2737, pruned_loss=0.07483, over 2572035.61 frames. 
], batch size: 39, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:38:54,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=462663.6666666667, ans=0.125 2024-06-21 21:39:00,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=462682.0, ans=0.125 2024-06-21 21:39:00,547 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.319e+02 2.466e+02 2.704e+02 3.652e+02, threshold=4.931e+02, percent-clipped=0.0 2024-06-21 21:39:03,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=462700.3333333333, ans=0.125 2024-06-21 21:39:04,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=15.0 2024-06-21 21:39:05,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=462700.3333333333, ans=0.2 2024-06-21 21:39:08,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=462700.3333333333, ans=0.1 2024-06-21 21:39:10,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.17 vs. limit=15.0 2024-06-21 21:39:21,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=462755.3333333333, ans=0.025 2024-06-21 21:39:21,614 INFO [train.py:1028] (0/2) Epoch 25, batch 9600, loss[loss=0.2144, simple_loss=0.2655, pruned_loss=0.08169, over 10581.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2728, pruned_loss=0.07444, over 2569622.82 frames. ], batch size: 304, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:39:21,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.50 vs. limit=22.5 2024-06-21 21:39:23,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=462755.3333333333, ans=0.07 2024-06-21 21:39:26,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=462755.3333333333, ans=0.125 2024-06-21 21:39:34,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=462792.0, ans=0.125 2024-06-21 21:39:37,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=462792.0, ans=0.1 2024-06-21 21:39:41,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2024-06-21 21:39:47,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=462828.6666666667, ans=0.0 2024-06-21 21:39:50,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462828.6666666667, ans=0.1 2024-06-21 21:39:52,493 INFO [train.py:1028] (0/2) Epoch 25, batch 9650, loss[loss=0.2158, simple_loss=0.2727, pruned_loss=0.07949, over 13123.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2732, pruned_loss=0.07518, over 2561499.12 frames. 
], batch size: 132, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:39:52,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=462847.0, ans=0.09899494936611666 2024-06-21 21:39:56,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=462847.0, ans=0.125 2024-06-21 21:40:02,681 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.399e+02 2.610e+02 2.903e+02 4.224e+02, threshold=5.220e+02, percent-clipped=0.0 2024-06-21 21:40:09,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=462883.6666666667, ans=22.5 2024-06-21 21:40:10,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=462902.0, ans=0.0 2024-06-21 21:40:20,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=462920.3333333333, ans=0.125 2024-06-21 21:40:22,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=462920.3333333333, ans=0.125 2024-06-21 21:40:23,740 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:40:24,892 INFO [train.py:1028] (0/2) Epoch 25, batch 9700, loss[loss=0.2207, simple_loss=0.2719, pruned_loss=0.08468, over 13064.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2736, pruned_loss=0.07558, over 2558097.27 frames. ], batch size: 144, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:40:26,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=462938.6666666667, ans=0.0 2024-06-21 21:40:37,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=462957.0, ans=0.125 2024-06-21 21:40:49,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=462993.6666666667, ans=0.0 2024-06-21 21:40:50,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=462993.6666666667, ans=0.025 2024-06-21 21:40:57,212 INFO [train.py:1028] (0/2) Epoch 25, batch 9750, loss[loss=0.2049, simple_loss=0.2586, pruned_loss=0.07564, over 13105.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2728, pruned_loss=0.07498, over 2554829.88 frames. 
], batch size: 132, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:40:57,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=463030.3333333333, ans=0.0 2024-06-21 21:41:07,845 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.309e+02 2.515e+02 2.704e+02 3.548e+02, threshold=5.030e+02, percent-clipped=0.0 2024-06-21 21:41:08,591 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:41:08,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=463048.6666666667, ans=0.0 2024-06-21 21:41:14,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=463067.0, ans=0.2 2024-06-21 21:41:19,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2024-06-21 21:41:20,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=463085.3333333333, ans=0.1 2024-06-21 21:41:27,815 INFO [train.py:1028] (0/2) Epoch 25, batch 9800, loss[loss=0.2101, simple_loss=0.275, pruned_loss=0.07257, over 12990.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2723, pruned_loss=0.07466, over 2547136.81 frames. ], batch size: 39, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:41:28,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463122.0, ans=0.1 2024-06-21 21:41:38,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=463140.3333333333, ans=0.125 2024-06-21 21:41:41,677 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:41:49,271 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2024-06-21 21:41:53,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=463195.3333333333, ans=0.0 2024-06-21 21:41:58,167 INFO [train.py:1028] (0/2) Epoch 25, batch 9850, loss[loss=0.2082, simple_loss=0.2652, pruned_loss=0.07553, over 12998.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2716, pruned_loss=0.07433, over 2539233.10 frames. 
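The dense scaling.py:214 records are worth decoding once: each names a module hyperparameter (a dropout p, a balancer prob, a skip rate, a bypass scale_min, and so on), the current batch_count, and the value in effect ("ans"). These are scheduled constants: rather than being fixed, each follows a schedule keyed on batch_count, which is why the same name keeps reappearing with the counter advancing. A minimal sketch of a piecewise-linear schedule of that kind, with illustrative breakpoints; this is a generic stand-in, not the actual icefall ScheduledFloat class:

def scheduled_float(batch_count, points):
    """points: [(batch_count, value), ...] sorted by batch_count.
    Linear interpolation between breakpoints, clamped at both ends."""
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return points[-1][1]

# e.g. a dropout probability decaying from 0.3 to 0.1 over the first 20k
# batches, then held; by batch_count ~462847 it sits at its final value:
print(scheduled_float(462847.0, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1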
], batch size: 102, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:42:06,521 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:42:07,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=463232.0, ans=0.0 2024-06-21 21:42:09,558 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.392e+02 2.535e+02 2.772e+02 3.494e+02, threshold=5.070e+02, percent-clipped=0.0 2024-06-21 21:42:20,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=463268.6666666667, ans=0.125 2024-06-21 21:42:27,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=463287.0, ans=0.0 2024-06-21 21:42:30,725 INFO [train.py:1028] (0/2) Epoch 25, batch 9900, loss[loss=0.1986, simple_loss=0.2684, pruned_loss=0.06444, over 12905.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2709, pruned_loss=0.07444, over 2531957.40 frames. ], batch size: 39, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:42:32,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=463305.3333333333, ans=0.025 2024-06-21 21:42:34,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463305.3333333333, ans=0.125 2024-06-21 21:42:44,268 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.65 vs. limit=22.5 2024-06-21 21:42:50,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.11 vs. limit=15.0 2024-06-21 21:42:54,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.00 vs. limit=15.0 2024-06-21 21:42:55,225 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2024-06-21 21:42:56,141 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:43:01,626 INFO [train.py:1028] (0/2) Epoch 25, batch 9950, loss[loss=0.217, simple_loss=0.272, pruned_loss=0.08096, over 12703.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2703, pruned_loss=0.07456, over 2527009.03 frames. ], batch size: 29, lr: 2.29e-03, grad_scale: 16.0 2024-06-21 21:43:02,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.03 vs. 
limit=22.5 2024-06-21 21:43:13,800 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.373e+02 2.506e+02 2.700e+02 3.986e+02, threshold=5.012e+02, percent-clipped=0.0 2024-06-21 21:43:17,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=463433.6666666667, ans=0.0 2024-06-21 21:43:17,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=463433.6666666667, ans=0.0 2024-06-21 21:43:27,245 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2024-06-21 21:43:33,329 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.98 vs. limit=22.5 2024-06-21 21:43:35,025 INFO [train.py:1028] (0/2) Epoch 25, batch 10000, loss[loss=0.242, simple_loss=0.2983, pruned_loss=0.09287, over 12754.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.271, pruned_loss=0.07526, over 2488363.70 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:43:37,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=463488.6666666667, ans=0.2 2024-06-21 21:43:42,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=463507.0, ans=0.0 2024-06-21 21:43:43,209 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.33 vs. limit=22.5 2024-06-21 21:43:45,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=463507.0, ans=0.125 2024-06-21 21:43:57,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=463543.6666666667, ans=0.09899494936611666 2024-06-21 21:44:07,419 INFO [train.py:1028] (0/2) Epoch 25, batch 10050, loss[loss=0.2385, simple_loss=0.2963, pruned_loss=0.09028, over 12655.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2714, pruned_loss=0.07599, over 2444945.02 frames. ], batch size: 22, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:44:08,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=463580.3333333333, ans=0.125 2024-06-21 21:44:17,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.424e+02 2.644e+02 2.914e+02 3.949e+02, threshold=5.287e+02, percent-clipped=0.0 2024-06-21 21:44:21,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=463617.0, ans=0.125 2024-06-21 21:44:21,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.74 vs. limit=6.0 2024-06-21 21:44:37,292 INFO [train.py:1028] (0/2) Epoch 25, batch 10100, loss[loss=0.1765, simple_loss=0.2447, pruned_loss=0.05413, over 10996.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2704, pruned_loss=0.07521, over 2424405.20 frames. 
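The scaling.py:1023 "Whitening" records (for example metric=21.98 vs. limit=22.5 just above) are activation diagnostics: each module measures how far the covariance of its output channels is from a multiple of the identity, and the whitening constraint only intervenes when the metric crosses its limit. A sketch of one natural formulation of such a metric, the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue computed via traces; this is an assumption about the exact formula, not a copy of icefall's scaling.py:

import torch

def whitening_metric(x, num_groups=1):
    """x: (num_frames, num_channels). Returns ~1.0 when the within-group
    channel covariance is close to a multiple of the identity (whitened),
    up to num_channels/num_groups when variance sits in one direction."""
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, d)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n                   # (g, d, d)
    sum_eig_sq = (cov * cov).sum(dim=(1, 2))       # trace(C @ C) = sum(eig^2)
    sum_eig = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1)       # trace(C)
    d = cov.shape[-1]
    return (d * sum_eig_sq / sum_eig ** 2).mean()

x = torch.randn(2000, 192)         # roughly isotropic input
print(float(whitening_metric(x)))  # close to 1.0
x[:, 0] *= 10.0                    # concentrate variance in one channel
print(float(whitening_metric(x)))  # well above limits like 15.0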
], batch size: 16, lr: 2.29e-03, grad_scale: 32.0 2024-06-21 21:44:45,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=463690.3333333333, ans=0.125 2024-06-21 21:44:46,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463690.3333333333, ans=0.1 2024-06-21 21:44:50,599 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-25.pt 2024-06-21 21:46:45,746 INFO [train.py:1028] (0/2) Epoch 26, batch 0, loss[loss=0.1897, simple_loss=0.248, pruned_loss=0.06568, over 12920.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.248, pruned_loss=0.06568, over 12920.00 frames. ], batch size: 36, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:46:45,747 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 21:46:52,773 INFO [train.py:1060] (0/2) Epoch 26, validation: loss=0.1908, simple_loss=0.2527, pruned_loss=0.06451, over 351949.00 frames. 2024-06-21 21:46:52,773 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 21:46:55,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=463703.1666666667, ans=0.0 2024-06-21 21:47:01,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=463721.5, ans=0.0 2024-06-21 21:47:06,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=463739.8333333333, ans=0.2 2024-06-21 21:47:09,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=463739.8333333333, ans=0.125 2024-06-21 21:47:29,319 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 2.190e+02 2.373e+02 2.531e+02 3.649e+02, threshold=4.746e+02, percent-clipped=0.0 2024-06-21 21:47:29,347 INFO [train.py:1028] (0/2) Epoch 26, batch 50, loss[loss=0.1883, simple_loss=0.2458, pruned_loss=0.06535, over 12619.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2538, pruned_loss=0.06939, over 573675.22 frames. 
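Two things happen at this boundary: the epoch-25 weights are written to zipformer/exp/epoch-25.pt, and the first batch of epoch 26 triggers a full pass over the dev set ("Computing validation loss"), which is why the validation record always reports the same 351949.00 frames. A generic sketch of that validation step, with a dummy model and loader rather than the actual icefall routine:

import torch

def validate(model, dev_loader, loss_fn, device="cpu"):
    """Frame-weighted average loss over the whole dev set."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0
    with torch.no_grad():                     # no gradients during validation
        for feats, targets, num_frames in dev_loader:
            loss = loss_fn(model(feats.to(device)), targets.to(device))
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames              # the per-frame loss that gets logged

model = torch.nn.Linear(80, 10)
dev_loader = [(torch.randn(4, 80), torch.randint(0, 10, (4,)), 4) for _ in range(3)]
print(validate(model, dev_loader, torch.nn.functional.cross_entropy))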
], batch size: 29, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:47:30,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=463794.8333333333, ans=0.0 2024-06-21 21:47:33,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=463794.8333333333, ans=0.125 2024-06-21 21:47:33,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=463794.8333333333, ans=0.1 2024-06-21 21:47:33,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=463794.8333333333, ans=0.2 2024-06-21 21:47:36,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=463813.1666666667, ans=0.0 2024-06-21 21:47:47,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=463831.5, ans=0.0 2024-06-21 21:47:49,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=463849.8333333333, ans=0.025 2024-06-21 21:47:55,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=463868.1666666667, ans=0.0 2024-06-21 21:47:57,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=463868.1666666667, ans=0.0 2024-06-21 21:47:58,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=463868.1666666667, ans=0.2 2024-06-21 21:48:00,989 INFO [train.py:1028] (0/2) Epoch 26, batch 100, loss[loss=0.1809, simple_loss=0.2462, pruned_loss=0.05786, over 13327.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2508, pruned_loss=0.06772, over 1016026.63 frames. ], batch size: 46, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:48:12,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=463904.8333333333, ans=0.95 2024-06-21 21:48:23,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463941.5, ans=0.1 2024-06-21 21:48:26,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=463941.5, ans=0.125 2024-06-21 21:48:35,286 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.199e+02 2.354e+02 2.540e+02 3.565e+02, threshold=4.708e+02, percent-clipped=0.0 2024-06-21 21:48:35,314 INFO [train.py:1028] (0/2) Epoch 26, batch 150, loss[loss=0.1776, simple_loss=0.2333, pruned_loss=0.06093, over 12697.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2496, pruned_loss=0.06609, over 1363793.73 frames. ], batch size: 29, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:48:36,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=463978.1666666667, ans=0.1 2024-06-21 21:48:38,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=463978.1666666667, ans=0.125 2024-06-21 21:48:39,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. 
limit=12.0 2024-06-21 21:48:40,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2024-06-21 21:48:43,944 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=12.0 2024-06-21 21:48:55,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.05 vs. limit=15.0 2024-06-21 21:48:56,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464033.1666666667, ans=0.1 2024-06-21 21:48:58,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=464033.1666666667, ans=0.0 2024-06-21 21:49:00,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=464051.5, ans=0.125 2024-06-21 21:49:00,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=464051.5, ans=0.0 2024-06-21 21:49:07,635 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=464051.5, ans=0.125 2024-06-21 21:49:10,121 INFO [train.py:1028] (0/2) Epoch 26, batch 200, loss[loss=0.2351, simple_loss=0.2806, pruned_loss=0.09481, over 12525.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2503, pruned_loss=0.0666, over 1634010.38 frames. ], batch size: 202, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:49:25,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=464106.5, ans=0.125 2024-06-21 21:49:37,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464143.1666666667, ans=0.1 2024-06-21 21:49:42,632 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.198e+02 2.315e+02 2.428e+02 3.898e+02, threshold=4.630e+02, percent-clipped=0.0 2024-06-21 21:49:42,660 INFO [train.py:1028] (0/2) Epoch 26, batch 250, loss[loss=0.1667, simple_loss=0.2202, pruned_loss=0.05657, over 13084.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2505, pruned_loss=0.06619, over 1845602.93 frames. ], batch size: 144, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:49:52,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=464179.8333333333, ans=0.125 2024-06-21 21:49:56,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=464198.1666666667, ans=0.125 2024-06-21 21:49:57,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=464198.1666666667, ans=0.125 2024-06-21 21:49:57,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=464198.1666666667, ans=0.2 2024-06-21 21:50:18,008 INFO [train.py:1028] (0/2) Epoch 26, batch 300, loss[loss=0.1851, simple_loss=0.2401, pruned_loss=0.06509, over 13148.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2509, pruned_loss=0.06659, over 2008267.43 frames. 
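The frame counts attached to tot_loss explain themselves once you track them across epoch 26: they climb from 573675 at batch 50 toward roughly 2.6 million and then plateau, the signature of an exponentially decaying running average rather than a whole-epoch sum. With a decay of 1 - 1/200 the window saturates at about 200 batches, and 200 times the typical ~12.9k frames per batch lands right at the observed plateau; the same decay reproduces the logged 573675 after 50 batches to within a fraction of a percent. A sketch of that accumulator (the 1/200 decay is inferred from the logged numbers, not read out of the training code):

def running_totals(batches, decay=1.0 - 1.0 / 200):
    """batches: iterable of (loss_sum_over_frames, num_frames).
    Yields the smoothed per-frame loss and the decayed frame count."""
    loss_sum, frame_sum = 0.0, 0.0
    for batch_loss, batch_frames in batches:
        loss_sum = loss_sum * decay + batch_loss
        frame_sum = frame_sum * decay + batch_frames
        yield loss_sum / frame_sum, frame_sum  # what tot_loss[... over N frames] reports

# the frame count saturates near batch_frames / (1 - decay) = 200 * 12900:
stream = [(0.19 * 12900, 12900)] * 2000
for i, (avg, frames) in enumerate(running_totals(stream)):
    if i in (49, 1999):
        print(f"batch {i + 1}: tot_loss={avg:.4f} over {frames:.0f} frames")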
], batch size: 112, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:50:30,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.90 vs. limit=10.0 2024-06-21 21:50:38,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=464308.1666666667, ans=0.125 2024-06-21 21:50:43,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=464308.1666666667, ans=0.2 2024-06-21 21:50:43,164 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:50:43,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=464308.1666666667, ans=0.125 2024-06-21 21:50:46,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=464326.5, ans=0.125 2024-06-21 21:50:50,771 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.277e+02 2.408e+02 2.668e+02 4.399e+02, threshold=4.816e+02, percent-clipped=0.0 2024-06-21 21:50:50,801 INFO [train.py:1028] (0/2) Epoch 26, batch 350, loss[loss=0.1652, simple_loss=0.2288, pruned_loss=0.05083, over 12912.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2502, pruned_loss=0.06601, over 2137395.71 frames. ], batch size: 33, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:51:15,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=464399.8333333333, ans=0.125 2024-06-21 21:51:18,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=464399.8333333333, ans=0.125 2024-06-21 21:51:20,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464418.1666666667, ans=0.1 2024-06-21 21:51:27,383 INFO [train.py:1028] (0/2) Epoch 26, batch 400, loss[loss=0.189, simple_loss=0.2562, pruned_loss=0.06084, over 13267.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.25, pruned_loss=0.06589, over 2238528.15 frames. 
], batch size: 63, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:51:30,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=464436.5, ans=0.0 2024-06-21 21:51:36,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=464454.8333333333, ans=0.0 2024-06-21 21:51:41,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464473.1666666667, ans=0.1 2024-06-21 21:51:44,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=464473.1666666667, ans=0.0 2024-06-21 21:51:48,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=464491.5, ans=0.025 2024-06-21 21:51:48,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=464491.5, ans=6.0 2024-06-21 21:51:58,304 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.194e+02 2.344e+02 2.535e+02 3.433e+02, threshold=4.688e+02, percent-clipped=0.0 2024-06-21 21:51:58,332 INFO [train.py:1028] (0/2) Epoch 26, batch 450, loss[loss=0.1907, simple_loss=0.2501, pruned_loss=0.06567, over 13246.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2499, pruned_loss=0.0659, over 2314000.35 frames. ], batch size: 67, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:52:04,350 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.34 vs. limit=10.0 2024-06-21 21:52:19,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=464564.8333333333, ans=0.0 2024-06-21 21:52:26,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=464601.5, ans=0.2 2024-06-21 21:52:31,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=464601.5, ans=0.125 2024-06-21 21:52:33,850 INFO [train.py:1028] (0/2) Epoch 26, batch 500, loss[loss=0.194, simple_loss=0.2435, pruned_loss=0.07226, over 13081.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2505, pruned_loss=0.06593, over 2376210.24 frames. ], batch size: 121, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:52:34,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=464619.8333333333, ans=0.07 2024-06-21 21:52:40,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=464638.1666666667, ans=0.0 2024-06-21 21:52:40,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=15.0 2024-06-21 21:53:09,057 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.231e+02 2.357e+02 2.477e+02 2.920e+02, threshold=4.714e+02, percent-clipped=0.0 2024-06-21 21:53:09,085 INFO [train.py:1028] (0/2) Epoch 26, batch 550, loss[loss=0.1836, simple_loss=0.2437, pruned_loss=0.06172, over 13026.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2501, pruned_loss=0.06567, over 2421063.66 frames. 
], batch size: 158, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:53:29,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=464766.5, ans=0.125 2024-06-21 21:53:41,370 INFO [train.py:1028] (0/2) Epoch 26, batch 600, loss[loss=0.1735, simple_loss=0.2273, pruned_loss=0.05988, over 13020.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2496, pruned_loss=0.06565, over 2458361.93 frames. ], batch size: 144, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:53:46,149 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2024-06-21 21:53:47,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=464821.5, ans=0.125 2024-06-21 21:53:55,988 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0 2024-06-21 21:54:14,138 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.203e+02 2.380e+02 2.561e+02 3.662e+02, threshold=4.760e+02, percent-clipped=0.0 2024-06-21 21:54:14,171 INFO [train.py:1028] (0/2) Epoch 26, batch 650, loss[loss=0.1952, simple_loss=0.2562, pruned_loss=0.06708, over 13222.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2498, pruned_loss=0.06557, over 2490186.99 frames. ], batch size: 59, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:54:16,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464894.8333333333, ans=0.1 2024-06-21 21:54:23,670 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.11 vs. limit=15.0 2024-06-21 21:54:26,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=464913.1666666667, ans=0.125 2024-06-21 21:54:32,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=464931.5, ans=0.125 2024-06-21 21:54:37,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=464949.8333333333, ans=0.2 2024-06-21 21:54:43,539 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.26 vs. limit=22.5 2024-06-21 21:54:46,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=464968.1666666667, ans=0.0 2024-06-21 21:54:49,684 INFO [train.py:1028] (0/2) Epoch 26, batch 700, loss[loss=0.2039, simple_loss=0.2677, pruned_loss=0.07007, over 13237.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2495, pruned_loss=0.06558, over 2512352.00 frames. 
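The three numbers in each loss record are not independent: throughout this section the reported loss equals half the simple (non-pruned) transducer loss plus the pruned loss, i.e. loss = 0.5 * simple_loss + pruned_loss, to within rounding. The 0.5 weight is inferred from the logged numbers themselves rather than from the training code; a quick check against nearby records:

# loss = 0.5 * simple_loss + pruned_loss, verified on logged values:
records = [
    (0.2109, 0.2728, 0.07444),  # Epoch 25, batch 9600
    (0.1907, 0.2501, 0.06567),  # Epoch 26, batch 550
    (0.1905, 0.2496, 0.06565),  # Epoch 26, batch 600
]
for loss, simple, pruned in records:
    assert abs(loss - (0.5 * simple + pruned)) < 5e-4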
], batch size: 46, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:55:00,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=465004.8333333333, ans=0.125 2024-06-21 21:55:06,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=465023.1666666667, ans=0.125 2024-06-21 21:55:14,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=465041.5, ans=0.0 2024-06-21 21:55:22,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=465059.8333333333, ans=0.0 2024-06-21 21:55:24,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=465059.8333333333, ans=0.2 2024-06-21 21:55:25,148 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.274e+02 2.382e+02 2.523e+02 3.095e+02, threshold=4.764e+02, percent-clipped=0.0 2024-06-21 21:55:25,177 INFO [train.py:1028] (0/2) Epoch 26, batch 750, loss[loss=0.1803, simple_loss=0.247, pruned_loss=0.05675, over 13221.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2499, pruned_loss=0.06577, over 2528234.30 frames. ], batch size: 63, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:55:26,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=465078.1666666667, ans=0.125 2024-06-21 21:55:30,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=465078.1666666667, ans=0.125 2024-06-21 21:55:49,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=465133.1666666667, ans=0.0 2024-06-21 21:55:50,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465151.5, ans=0.1 2024-06-21 21:55:51,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465151.5, ans=0.1 2024-06-21 21:55:57,357 INFO [train.py:1028] (0/2) Epoch 26, batch 800, loss[loss=0.1772, simple_loss=0.2404, pruned_loss=0.05698, over 12928.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2504, pruned_loss=0.06579, over 2541171.76 frames. 
], batch size: 36, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:56:02,789 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 21:56:03,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=465188.1666666667, ans=0.125 2024-06-21 21:56:05,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=465188.1666666667, ans=0.2 2024-06-21 21:56:11,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=465206.5, ans=0.125 2024-06-21 21:56:19,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=465224.8333333333, ans=0.0 2024-06-21 21:56:19,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=465224.8333333333, ans=0.025 2024-06-21 21:56:34,745 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.308e+02 2.485e+02 2.667e+02 3.454e+02, threshold=4.970e+02, percent-clipped=0.0 2024-06-21 21:56:34,776 INFO [train.py:1028] (0/2) Epoch 26, batch 850, loss[loss=0.175, simple_loss=0.2306, pruned_loss=0.05966, over 13166.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2504, pruned_loss=0.06567, over 2552806.37 frames. ], batch size: 95, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:56:36,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=465261.5, ans=0.0 2024-06-21 21:56:42,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=465279.8333333333, ans=0.125 2024-06-21 21:56:43,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=465279.8333333333, ans=0.025 2024-06-21 21:56:53,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=465316.5, ans=0.125 2024-06-21 21:57:07,135 INFO [train.py:1028] (0/2) Epoch 26, batch 900, loss[loss=0.1968, simple_loss=0.2563, pruned_loss=0.06868, over 12992.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2501, pruned_loss=0.06594, over 2557448.57 frames. 
], batch size: 36, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:57:07,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465353.1666666667, ans=0.1 2024-06-21 21:57:07,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=465353.1666666667, ans=0.125 2024-06-21 21:57:14,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=465371.5, ans=0.125 2024-06-21 21:57:14,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=465371.5, ans=0.125 2024-06-21 21:57:19,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=465389.8333333333, ans=0.09899494936611666 2024-06-21 21:57:28,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465389.8333333333, ans=0.0 2024-06-21 21:57:42,922 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.244e+02 2.489e+02 2.752e+02 3.970e+02, threshold=4.978e+02, percent-clipped=0.0 2024-06-21 21:57:42,950 INFO [train.py:1028] (0/2) Epoch 26, batch 950, loss[loss=0.1799, simple_loss=0.2481, pruned_loss=0.0558, over 13181.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2499, pruned_loss=0.06571, over 2561079.08 frames. ], batch size: 40, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:57:52,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.08 vs. limit=22.5 2024-06-21 21:57:54,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=465463.1666666667, ans=0.125 2024-06-21 21:57:57,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465481.5, ans=0.1 2024-06-21 21:58:14,750 INFO [train.py:1028] (0/2) Epoch 26, batch 1000, loss[loss=0.1901, simple_loss=0.2495, pruned_loss=0.0654, over 13291.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2494, pruned_loss=0.06574, over 2562696.83 frames. ], batch size: 49, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:58:20,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.11 vs. limit=22.5 2024-06-21 21:58:27,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465573.1666666667, ans=0.1 2024-06-21 21:58:32,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=465573.1666666667, ans=0.0 2024-06-21 21:58:40,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.68 vs. limit=15.0 2024-06-21 21:58:49,547 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.233e+02 2.348e+02 2.500e+02 3.277e+02, threshold=4.697e+02, percent-clipped=0.0 2024-06-21 21:58:49,575 INFO [train.py:1028] (0/2) Epoch 26, batch 1050, loss[loss=0.1787, simple_loss=0.2463, pruned_loss=0.05557, over 13186.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2494, pruned_loss=0.0657, over 2565478.21 frames. 
], batch size: 77, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:58:58,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=465646.5, ans=0.125 2024-06-21 21:59:09,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=465683.1666666667, ans=0.125 2024-06-21 21:59:12,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=465683.1666666667, ans=0.0 2024-06-21 21:59:14,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=465701.5, ans=10.0 2024-06-21 21:59:25,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=465719.8333333333, ans=0.125 2024-06-21 21:59:25,684 INFO [train.py:1028] (0/2) Epoch 26, batch 1100, loss[loss=0.1871, simple_loss=0.2529, pruned_loss=0.06064, over 13316.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.25, pruned_loss=0.06575, over 2570226.51 frames. ], batch size: 52, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:59:25,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=465719.8333333333, ans=0.04949747468305833 2024-06-21 21:59:27,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=465719.8333333333, ans=0.125 2024-06-21 21:59:28,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=465719.8333333333, ans=0.125 2024-06-21 21:59:30,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=465719.8333333333, ans=0.125 2024-06-21 21:59:30,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=465719.8333333333, ans=0.0 2024-06-21 21:59:32,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=465738.1666666667, ans=0.0 2024-06-21 21:59:34,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=465738.1666666667, ans=0.125 2024-06-21 21:59:52,883 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0 2024-06-21 21:59:54,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=465793.1666666667, ans=0.2 2024-06-21 21:59:58,428 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.223e+02 2.340e+02 2.467e+02 3.194e+02, threshold=4.679e+02, percent-clipped=0.0 2024-06-21 21:59:58,457 INFO [train.py:1028] (0/2) Epoch 26, batch 1150, loss[loss=0.1771, simple_loss=0.2427, pruned_loss=0.05579, over 13244.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2499, pruned_loss=0.06574, over 2571324.02 frames. 
], batch size: 52, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 21:59:59,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=465811.5, ans=0.2 2024-06-21 22:00:00,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=465811.5, ans=0.125 2024-06-21 22:00:09,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=465829.8333333333, ans=0.125 2024-06-21 22:00:16,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465848.1666666667, ans=0.1 2024-06-21 22:00:18,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.96 vs. limit=22.5 2024-06-21 22:00:22,553 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2024-06-21 22:00:24,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=465884.8333333333, ans=0.2 2024-06-21 22:00:35,127 INFO [train.py:1028] (0/2) Epoch 26, batch 1200, loss[loss=0.198, simple_loss=0.2609, pruned_loss=0.0676, over 13169.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2501, pruned_loss=0.06594, over 2573945.95 frames. ], batch size: 77, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:00:36,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=465903.1666666667, ans=0.1 2024-06-21 22:00:57,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.24 vs. limit=15.0 2024-06-21 22:00:59,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=465958.1666666667, ans=0.125 2024-06-21 22:01:02,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=465976.5, ans=0.125 2024-06-21 22:01:08,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=465994.8333333333, ans=0.0 2024-06-21 22:01:08,474 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.277e+02 2.418e+02 2.581e+02 3.681e+02, threshold=4.836e+02, percent-clipped=0.0 2024-06-21 22:01:08,503 INFO [train.py:1028] (0/2) Epoch 26, batch 1250, loss[loss=0.1854, simple_loss=0.2449, pruned_loss=0.063, over 13210.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2502, pruned_loss=0.06586, over 2583176.26 frames. ], batch size: 112, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:01:29,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=466031.5, ans=0.0 2024-06-21 22:01:33,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=466049.8333333333, ans=0.2 2024-06-21 22:01:45,332 INFO [train.py:1028] (0/2) Epoch 26, batch 1300, loss[loss=0.2048, simple_loss=0.2553, pruned_loss=0.07708, over 12742.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2503, pruned_loss=0.06578, over 2584287.83 frames. 
], batch size: 176, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:01:48,312 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2024-06-21 22:01:54,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=15.0 2024-06-21 22:01:56,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466104.8333333333, ans=0.1 2024-06-21 22:01:56,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=466104.8333333333, ans=0.125 2024-06-21 22:02:18,980 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.197e+02 2.333e+02 2.561e+02 3.124e+02, threshold=4.666e+02, percent-clipped=0.0 2024-06-21 22:02:19,010 INFO [train.py:1028] (0/2) Epoch 26, batch 1350, loss[loss=0.193, simple_loss=0.2647, pruned_loss=0.06061, over 13170.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2508, pruned_loss=0.066, over 2585823.26 frames. ], batch size: 59, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:02:28,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466196.5, ans=0.1 2024-06-21 22:02:46,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=466233.1666666667, ans=0.0 2024-06-21 22:02:48,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=466251.5, ans=0.125 2024-06-21 22:02:51,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=466251.5, ans=10.0 2024-06-21 22:02:52,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=466251.5, ans=0.2 2024-06-21 22:02:55,800 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=12.0 2024-06-21 22:02:55,920 INFO [train.py:1028] (0/2) Epoch 26, batch 1400, loss[loss=0.1883, simple_loss=0.2503, pruned_loss=0.06317, over 12853.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2507, pruned_loss=0.06602, over 2586351.08 frames. ], batch size: 26, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:02:58,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=466269.8333333333, ans=0.125 2024-06-21 22:03:00,541 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.24 vs. limit=22.5 2024-06-21 22:03:05,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.01 vs. 
limit=15.0 2024-06-21 22:03:18,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466324.8333333333, ans=0.1 2024-06-21 22:03:24,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=466343.1666666667, ans=0.125 2024-06-21 22:03:32,564 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.244e+02 2.394e+02 2.571e+02 3.266e+02, threshold=4.788e+02, percent-clipped=0.0 2024-06-21 22:03:32,596 INFO [train.py:1028] (0/2) Epoch 26, batch 1450, loss[loss=0.1901, simple_loss=0.2505, pruned_loss=0.06482, over 13096.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2504, pruned_loss=0.066, over 2586684.38 frames. ], batch size: 121, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:03:34,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=466361.5, ans=0.125 2024-06-21 22:03:35,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=466361.5, ans=0.0 2024-06-21 22:04:05,770 INFO [train.py:1028] (0/2) Epoch 26, batch 1500, loss[loss=0.1833, simple_loss=0.2457, pruned_loss=0.06048, over 13223.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2511, pruned_loss=0.0665, over 2589526.87 frames. ], batch size: 83, lr: 2.24e-03, grad_scale: 32.0 2024-06-21 22:04:09,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=466453.1666666667, ans=0.07 2024-06-21 22:04:11,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466471.5, ans=0.0 2024-06-21 22:04:16,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466471.5, ans=0.0 2024-06-21 22:04:37,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=466526.5, ans=0.025 2024-06-21 22:04:38,690 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.224e+02 2.425e+02 2.566e+02 2.949e+02, threshold=4.850e+02, percent-clipped=0.0 2024-06-21 22:04:38,720 INFO [train.py:1028] (0/2) Epoch 26, batch 1550, loss[loss=0.2197, simple_loss=0.2672, pruned_loss=0.08612, over 13093.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2513, pruned_loss=0.06663, over 2583976.09 frames. 
], batch size: 102, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:04:43,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466544.8333333333, ans=0.1 2024-06-21 22:04:47,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=466563.1666666667, ans=0.125 2024-06-21 22:04:47,858 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:04:49,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=466563.1666666667, ans=0.125 2024-06-21 22:04:50,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=466563.1666666667, ans=0.125 2024-06-21 22:04:59,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=466581.5, ans=0.125 2024-06-21 22:05:01,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.85 vs. limit=10.0 2024-06-21 22:05:06,220 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2024-06-21 22:05:07,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.63 vs. limit=6.0 2024-06-21 22:05:11,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=466618.1666666667, ans=0.1 2024-06-21 22:05:14,167 INFO [train.py:1028] (0/2) Epoch 26, batch 1600, loss[loss=0.1918, simple_loss=0.2526, pruned_loss=0.06548, over 13162.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2516, pruned_loss=0.06661, over 2579036.05 frames. ], batch size: 77, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:05:17,671 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2024-06-21 22:05:27,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2024-06-21 22:05:28,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=466654.8333333333, ans=0.0 2024-06-21 22:05:36,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=466691.5, ans=0.0 2024-06-21 22:05:42,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.12 vs. limit=15.0 2024-06-21 22:05:45,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2024-06-21 22:05:49,158 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.258e+02 2.358e+02 2.543e+02 3.752e+02, threshold=4.715e+02, percent-clipped=0.0 2024-06-21 22:05:49,187 INFO [train.py:1028] (0/2) Epoch 26, batch 1650, loss[loss=0.1799, simple_loss=0.233, pruned_loss=0.06337, over 13200.00 frames. 
], tot_loss[loss=0.1926, simple_loss=0.2515, pruned_loss=0.06681, over 2575310.74 frames. ], batch size: 95, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:06:00,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466746.5, ans=0.1 2024-06-21 22:06:08,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=466764.8333333333, ans=0.125 2024-06-21 22:06:20,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=466801.5, ans=0.125 2024-06-21 22:06:22,580 INFO [train.py:1028] (0/2) Epoch 26, batch 1700, loss[loss=0.179, simple_loss=0.2471, pruned_loss=0.05549, over 12244.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2509, pruned_loss=0.06605, over 2580648.30 frames. ], batch size: 25, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:06:29,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=466838.1666666667, ans=0.02 2024-06-21 22:06:33,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=466838.1666666667, ans=0.2 2024-06-21 22:06:34,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2024-06-21 22:06:49,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=466874.8333333333, ans=0.125 2024-06-21 22:06:51,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=466874.8333333333, ans=0.125 2024-06-21 22:06:52,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=466893.1666666667, ans=0.125 2024-06-21 22:06:59,984 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.224e+02 2.459e+02 2.575e+02 3.376e+02, threshold=4.918e+02, percent-clipped=0.0 2024-06-21 22:07:00,012 INFO [train.py:1028] (0/2) Epoch 26, batch 1750, loss[loss=0.1946, simple_loss=0.2605, pruned_loss=0.06431, over 12509.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2506, pruned_loss=0.06563, over 2581904.51 frames. ], batch size: 22, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:07:03,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466911.5, ans=0.1 2024-06-21 22:07:30,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=466984.8333333333, ans=0.0 2024-06-21 22:07:35,630 INFO [train.py:1028] (0/2) Epoch 26, batch 1800, loss[loss=0.1737, simple_loss=0.2353, pruned_loss=0.05606, over 13275.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2511, pruned_loss=0.06623, over 2583083.43 frames. ], batch size: 67, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:07:38,700 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.78 vs. limit=12.0 2024-06-21 22:07:42,241 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.05 vs. 
limit=15.0 2024-06-21 22:07:43,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467021.5, ans=0.1 2024-06-21 22:07:43,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=467021.5, ans=0.125 2024-06-21 22:07:44,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=467021.5, ans=0.125 2024-06-21 22:07:58,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=467058.1666666667, ans=0.025 2024-06-21 22:08:02,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=467076.5, ans=0.125 2024-06-21 22:08:06,250 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2024-06-21 22:08:08,539 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.221e+02 2.327e+02 2.523e+02 3.559e+02, threshold=4.654e+02, percent-clipped=0.0 2024-06-21 22:08:08,570 INFO [train.py:1028] (0/2) Epoch 26, batch 1850, loss[loss=0.204, simple_loss=0.2613, pruned_loss=0.07332, over 13193.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2509, pruned_loss=0.06587, over 2584286.41 frames. ], batch size: 83, lr: 2.24e-03, grad_scale: 64.0 2024-06-21 22:08:11,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=467094.8333333333, ans=0.125 2024-06-21 22:08:13,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=467094.8333333333, ans=0.0 2024-06-21 22:08:15,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=467113.1666666667, ans=0.025 2024-06-21 22:08:22,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=467131.5, ans=0.125 2024-06-21 22:08:27,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=467149.8333333333, ans=0.04949747468305833 2024-06-21 22:08:41,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=467168.1666666667, ans=0.0 2024-06-21 22:08:43,661 INFO [train.py:1028] (0/2) Epoch 26, batch 1900, loss[loss=0.1831, simple_loss=0.2412, pruned_loss=0.06253, over 13193.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2504, pruned_loss=0.06579, over 2586220.08 frames. 
2024-06-21 22:08:50,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=467204.8333333333, ans=0.0
2024-06-21 22:09:03,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=467241.5, ans=0.125
2024-06-21 22:09:10,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=467259.8333333333, ans=0.125
2024-06-21 22:09:15,672 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.217e+02 2.364e+02 2.500e+02 3.346e+02, threshold=4.728e+02, percent-clipped=0.0
2024-06-21 22:09:15,703 INFO [train.py:1028] (0/2) Epoch 26, batch 1950, loss[loss=0.1942, simple_loss=0.26, pruned_loss=0.06416, over 13295.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2497, pruned_loss=0.06564, over 2591601.43 frames. ], batch size: 52, lr: 2.24e-03, grad_scale: 64.0
2024-06-21 22:09:23,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=467278.1666666667, ans=0.125
2024-06-21 22:09:26,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=467296.5, ans=0.125
2024-06-21 22:09:33,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0
2024-06-21 22:09:41,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=467333.1666666667, ans=0.125
2024-06-21 22:09:51,435 INFO [train.py:1028] (0/2) Epoch 26, batch 2000, loss[loss=0.1779, simple_loss=0.245, pruned_loss=0.05542, over 12642.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2496, pruned_loss=0.0657, over 2587349.10 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:09:53,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=467369.8333333333, ans=0.2
2024-06-21 22:09:59,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=467388.1666666667, ans=0.2
2024-06-21 22:10:05,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=467406.5, ans=0.025
2024-06-21 22:10:13,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=467424.8333333333, ans=0.04949747468305833
2024-06-21 22:10:24,013 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.263e+02 2.384e+02 2.469e+02 3.091e+02, threshold=4.769e+02, percent-clipped=0.0
2024-06-21 22:10:24,043 INFO [train.py:1028] (0/2) Epoch 26, batch 2050, loss[loss=0.2037, simple_loss=0.2699, pruned_loss=0.0688, over 12841.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.25, pruned_loss=0.06587, over 2583963.88 frames. ], batch size: 29, lr: 2.23e-03, grad_scale: 64.0
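The ScheduledFloat records trace hyperparameters (dropout probabilities, skip rates, balancer limits) whose logged value "ans" is a deterministic function of batch_count. A simplified stand-in for scaling.py's ScheduledFloat, sketched as piecewise-linear interpolation over (batch_count, value) breakpoints; the example schedule below is an assumption for illustration, not the recipe's actual breakpoints:

    class PiecewiseLinear:
        """Value interpolated linearly between (batch_count, value) breakpoints."""

        def __init__(self, *points):
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    frac = (batch_count - x0) / (x1 - x0)
                    return y0 + frac * (y1 - y0)

    # hypothetical schedule: decay from 0.3 to 0.1 over the first 20k batches,
    # constant afterwards -- at batch_count ~467k it would log ans=0.1
    dropout_p = PiecewiseLinear((0.0, 0.3), (20000.0, 0.1))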
2024-06-21 22:10:28,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=467461.5, ans=0.125
2024-06-21 22:10:36,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=467498.1666666667, ans=0.0
2024-06-21 22:10:50,423 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.13 vs. limit=22.5
2024-06-21 22:10:52,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=467534.8333333333, ans=0.0
2024-06-21 22:10:56,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.77 vs. limit=15.0
2024-06-21 22:10:59,914 INFO [train.py:1028] (0/2) Epoch 26, batch 2100, loss[loss=0.2013, simple_loss=0.2646, pruned_loss=0.06902, over 13178.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2501, pruned_loss=0.06547, over 2586079.78 frames. ], batch size: 59, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:11:00,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5
2024-06-21 22:11:14,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=467589.8333333333, ans=0.1
2024-06-21 22:11:15,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=467589.8333333333, ans=0.0
2024-06-21 22:11:20,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=467608.1666666667, ans=0.125
2024-06-21 22:11:34,921 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.240e+02 2.368e+02 2.518e+02 3.356e+02, threshold=4.736e+02, percent-clipped=0.0
2024-06-21 22:11:34,949 INFO [train.py:1028] (0/2) Epoch 26, batch 2150, loss[loss=0.1815, simple_loss=0.2511, pruned_loss=0.05601, over 13320.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2499, pruned_loss=0.06528, over 2588501.98 frames. ], batch size: 52, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:11:49,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=467681.5, ans=0.2
2024-06-21 22:11:57,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=467699.8333333333, ans=0.125
2024-06-21 22:11:57,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0
2024-06-21 22:12:01,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=467718.1666666667, ans=0.2
2024-06-21 22:12:05,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=467718.1666666667, ans=0.95
2024-06-21 22:12:07,649 INFO [train.py:1028] (0/2) Epoch 26, batch 2200, loss[loss=0.198, simple_loss=0.2543, pruned_loss=0.07086, over 13208.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2505, pruned_loss=0.06569, over 2588257.41 frames. ], batch size: 83, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:12:08,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=467736.5, ans=0.125
2024-06-21 22:12:14,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=467754.8333333333, ans=0.0
2024-06-21 22:12:22,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0
2024-06-21 22:12:30,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=467791.5, ans=0.125
2024-06-21 22:12:37,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=467809.8333333333, ans=0.0
2024-06-21 22:12:38,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=467809.8333333333, ans=0.07
2024-06-21 22:12:39,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.201e+02 2.362e+02 2.509e+02 2.984e+02, threshold=4.725e+02, percent-clipped=0.0
2024-06-21 22:12:39,822 INFO [train.py:1028] (0/2) Epoch 26, batch 2250, loss[loss=0.1724, simple_loss=0.2358, pruned_loss=0.05449, over 13229.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2504, pruned_loss=0.06541, over 2586936.39 frames. ], batch size: 63, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:12:46,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=467846.5, ans=0.125
2024-06-21 22:12:46,850 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.78 vs. limit=22.5
2024-06-21 22:12:57,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=467864.8333333333, ans=0.125
2024-06-21 22:13:13,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.44 vs. limit=10.0
2024-06-21 22:13:15,114 INFO [train.py:1028] (0/2) Epoch 26, batch 2300, loss[loss=0.18, simple_loss=0.2504, pruned_loss=0.05486, over 12857.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2507, pruned_loss=0.06554, over 2581416.12 frames. ], batch size: 33, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:13:37,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0
2024-06-21 22:13:40,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=15.0
2024-06-21 22:13:42,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=467974.8333333333, ans=0.125
2024-06-21 22:13:50,149 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.279e+02 2.418e+02 2.670e+02 3.520e+02, threshold=4.836e+02, percent-clipped=0.0
2024-06-21 22:13:50,177 INFO [train.py:1028] (0/2) Epoch 26, batch 2350, loss[loss=0.2048, simple_loss=0.2725, pruned_loss=0.06855, over 13249.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2508, pruned_loss=0.0658, over 2585240.70 frames. ], batch size: 67, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:14:07,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=468048.1666666667, ans=0.125
2024-06-21 22:14:11,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.21 vs. limit=15.0
2024-06-21 22:14:12,118 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.588e+01
2024-06-21 22:14:16,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=468084.8333333333, ans=0.0
2024-06-21 22:14:20,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.61 vs. limit=22.5
2024-06-21 22:14:22,245 INFO [train.py:1028] (0/2) Epoch 26, batch 2400, loss[loss=0.189, simple_loss=0.2533, pruned_loss=0.06238, over 13321.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2505, pruned_loss=0.06578, over 2587839.60 frames. ], batch size: 46, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:14:24,162 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=468103.1666666667, ans=0.125
2024-06-21 22:14:30,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=468121.5, ans=0.2
2024-06-21 22:14:34,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0
2024-06-21 22:14:35,061 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:14:43,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=468158.1666666667, ans=0.125
2024-06-21 22:14:49,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=468176.5, ans=10.0
2024-06-21 22:14:56,635 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.233e+02 2.358e+02 2.594e+02 3.691e+02, threshold=4.715e+02, percent-clipped=0.0
2024-06-21 22:14:56,671 INFO [train.py:1028] (0/2) Epoch 26, batch 2450, loss[loss=0.186, simple_loss=0.2469, pruned_loss=0.06257, over 13265.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2503, pruned_loss=0.06627, over 2583968.74 frames. ], batch size: 63, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:15:04,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=468213.1666666667, ans=0.05
2024-06-21 22:15:17,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=468231.5, ans=0.05
2024-06-21 22:15:18,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=15.0
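The Whitening records compare a per-module covariance statistic against a limit (metric=... vs. limit=...); the whiten modules nudge activations back toward an isotropic covariance when the metric grows too large. One plausible definition of such a metric is the eigenvalue dispersion of the channel covariance, which equals 1.0 for perfectly white features and grows when a few directions dominate; whether scaling.py:1023 computes exactly this is an assumption, so treat the sketch as illustrative only:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); channels are split into groups
        n, c = x.shape
        d = c // num_groups
        x = x.reshape(n, num_groups, d).transpose(0, 1)   # (groups, frames, d)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / n      # per-group covariance
        # eigenvalue dispersion: mean(lambda^2) / mean(lambda)^2
        # = (trace(C^2)/d) / (trace(C)/d)^2, >= 1, == 1 iff C is isotropic
        num = (cov * cov).sum(dim=(1, 2)) / d
        den = (cov.diagonal(dim1=1, dim2=2).sum(dim=1) / d) ** 2
        return (num / den).mean().item()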
2024-06-21 22:15:20,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=468249.8333333333, ans=0.0
2024-06-21 22:15:21,489 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.036e+01
2024-06-21 22:15:28,139 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=468268.1666666667, ans=0.04949747468305833
2024-06-21 22:15:32,654 INFO [train.py:1028] (0/2) Epoch 26, batch 2500, loss[loss=0.186, simple_loss=0.2391, pruned_loss=0.0664, over 13217.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2492, pruned_loss=0.06584, over 2587437.86 frames. ], batch size: 83, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:15:45,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=468323.1666666667, ans=0.125
2024-06-21 22:15:54,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=468341.5, ans=0.125
2024-06-21 22:15:57,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=468341.5, ans=0.025
2024-06-21 22:15:59,935 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.61 vs. limit=22.5
2024-06-21 22:16:00,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=468359.8333333333, ans=0.07
2024-06-21 22:16:02,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=468359.8333333333, ans=0.1
2024-06-21 22:16:05,101 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.166e+02 2.320e+02 2.590e+02 3.730e+02, threshold=4.641e+02, percent-clipped=0.0
2024-06-21 22:16:05,129 INFO [train.py:1028] (0/2) Epoch 26, batch 2550, loss[loss=0.1903, simple_loss=0.2567, pruned_loss=0.06202, over 12746.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2484, pruned_loss=0.06592, over 2587812.58 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 64.0
2024-06-21 22:16:14,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=468396.5, ans=0.05
2024-06-21 22:16:14,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=468396.5, ans=0.09899494936611666
2024-06-21 22:16:39,960 INFO [train.py:1028] (0/2) Epoch 26, batch 2600, loss[loss=0.1936, simple_loss=0.2584, pruned_loss=0.06441, over 13265.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2476, pruned_loss=0.06585, over 2586456.50 frames. ], batch size: 52, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:16:42,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=468469.8333333333, ans=0.0
2024-06-21 22:16:50,721 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=468488.1666666667, ans=0.0
2024-06-21 22:16:51,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=468488.1666666667, ans=0.95
2024-06-21 22:16:55,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=468506.5, ans=0.125
2024-06-21 22:16:56,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=468506.5, ans=0.0
2024-06-21 22:16:57,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=468506.5, ans=0.125
2024-06-21 22:17:16,175 INFO [train.py:1028] (0/2) Epoch 26, batch 2650, loss[loss=0.2016, simple_loss=0.2499, pruned_loss=0.07662, over 13018.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2466, pruned_loss=0.06575, over 2587156.97 frames. ], batch size: 144, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:17:16,833 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.200e+02 2.333e+02 2.444e+02 3.258e+02, threshold=4.665e+02, percent-clipped=0.0
2024-06-21 22:17:26,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=15.0
2024-06-21 22:17:33,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=468598.1666666667, ans=0.04949747468305833
2024-06-21 22:17:33,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=468598.1666666667, ans=0.125
2024-06-21 22:17:33,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468598.1666666667, ans=0.1
2024-06-21 22:17:36,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=15.0
2024-06-21 22:17:37,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=468616.5, ans=0.125
2024-06-21 22:17:38,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0
2024-06-21 22:17:39,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0
2024-06-21 22:17:39,738 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0
2024-06-21 22:17:43,686 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:17:49,680 INFO [train.py:1028] (0/2) Epoch 26, batch 2700, loss[loss=0.1778, simple_loss=0.2294, pruned_loss=0.06307, over 13241.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.245, pruned_loss=0.0653, over 2585536.53 frames. ], batch size: 89, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:18:07,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=468689.8333333333, ans=0.125
2024-06-21 22:18:10,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=468708.1666666667, ans=0.125
2024-06-21 22:18:17,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=468726.5, ans=0.125
2024-06-21 22:18:22,271 INFO [train.py:1028] (0/2) Epoch 26, batch 2750, loss[loss=0.1913, simple_loss=0.2424, pruned_loss=0.07013, over 13274.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2438, pruned_loss=0.06449, over 2581834.86 frames. ], batch size: 43, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:18:22,827 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.207e+02 2.359e+02 2.626e+02 3.748e+02, threshold=4.717e+02, percent-clipped=0.0
2024-06-21 22:18:33,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=468763.1666666667, ans=0.0
2024-06-21 22:18:36,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=468763.1666666667, ans=0.035
2024-06-21 22:18:39,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=468781.5, ans=0.0
2024-06-21 22:18:39,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=468781.5, ans=0.125
2024-06-21 22:18:42,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=468781.5, ans=0.1
2024-06-21 22:18:47,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0
2024-06-21 22:19:00,974 INFO [train.py:1028] (0/2) Epoch 26, batch 2800, loss[loss=0.199, simple_loss=0.2419, pruned_loss=0.07808, over 10745.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2433, pruned_loss=0.06458, over 2578391.35 frames. ], batch size: 303, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:19:01,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.65 vs. limit=10.0
2024-06-21 22:19:32,924 INFO [train.py:1028] (0/2) Epoch 26, batch 2850, loss[loss=0.1755, simple_loss=0.2344, pruned_loss=0.05824, over 13016.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2426, pruned_loss=0.06468, over 2576305.84 frames. ], batch size: 48, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:19:33,532 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.225e+02 2.347e+02 2.498e+02 2.969e+02, threshold=4.693e+02, percent-clipped=0.0
2024-06-21 22:19:36,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=468928.1666666667, ans=0.2
2024-06-21 22:19:42,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=468946.5, ans=0.2
2024-06-21 22:19:47,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=468964.8333333333, ans=0.125
2024-06-21 22:19:56,211 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.37 vs. limit=15.0
2024-06-21 22:19:57,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=468983.1666666667, ans=0.2
2024-06-21 22:20:01,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=469001.5, ans=0.125
2024-06-21 22:20:05,509 INFO [train.py:1028] (0/2) Epoch 26, batch 2900, loss[loss=0.1917, simple_loss=0.2543, pruned_loss=0.06455, over 13109.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2414, pruned_loss=0.06429, over 2584658.83 frames. ], batch size: 55, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:20:11,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=469038.1666666667, ans=0.125
2024-06-21 22:20:21,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=469056.5, ans=0.1
2024-06-21 22:20:36,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=469093.1666666667, ans=0.125
2024-06-21 22:20:40,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=469093.1666666667, ans=0.125
2024-06-21 22:20:42,095 INFO [train.py:1028] (0/2) Epoch 26, batch 2950, loss[loss=0.1781, simple_loss=0.2379, pruned_loss=0.05917, over 13285.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.241, pruned_loss=0.06398, over 2578567.92 frames. ], batch size: 43, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:20:42,627 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.224e+02 2.380e+02 2.563e+02 3.217e+02, threshold=4.759e+02, percent-clipped=0.0
2024-06-21 22:20:43,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=15.0
2024-06-21 22:20:47,891 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.29 vs. limit=22.5
2024-06-21 22:21:02,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=469148.1666666667, ans=0.125
2024-06-21 22:21:02,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=469148.1666666667, ans=0.125
2024-06-21 22:21:05,065 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:21:05,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=469148.1666666667, ans=0.0
2024-06-21 22:21:08,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.26 vs. limit=22.5
2024-06-21 22:21:20,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.59 vs. limit=6.0
2024-06-21 22:21:20,249 INFO [train.py:1028] (0/2) Epoch 26, batch 3000, loss[loss=0.1773, simple_loss=0.2382, pruned_loss=0.0582, over 13188.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2397, pruned_loss=0.06336, over 2577650.01 frames. ], batch size: 59, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:21:20,250 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 22:21:28,194 INFO [train.py:1060] (0/2) Epoch 26, validation: loss=0.19, simple_loss=0.2513, pruned_loss=0.06441, over 351949.00 frames.
2024-06-21 22:21:28,195 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-21 22:21:37,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=469221.5, ans=0.0
2024-06-21 22:21:39,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.94 vs. limit=22.5
2024-06-21 22:21:51,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=469258.1666666667, ans=0.125
2024-06-21 22:21:56,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=469276.5, ans=0.0
2024-06-21 22:22:00,435 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5
2024-06-21 22:22:01,452 INFO [train.py:1028] (0/2) Epoch 26, batch 3050, loss[loss=0.189, simple_loss=0.2451, pruned_loss=0.06645, over 13282.00 frames. ], tot_loss[loss=0.1827, simple_loss=0.2388, pruned_loss=0.06333, over 2578704.36 frames. ], batch size: 46, lr: 2.23e-03, grad_scale: 32.0
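The train.py:1051/1060/1061 records above show the periodic validation pass: training pauses, the full dev set (the same 351949.00 frames each time) is scored without gradients, and the CUDA high-water mark is reported. A condensed sketch of that interleaving; compute_loss is a hypothetical stand-in for the recipe's loss function, and the frame-weighted averaging is an assumption:

    import torch

    def run_validation(model, valid_dl, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                # compute_loss: the recipe's loss function (stand-in name here);
                # returns the summed loss and the number of frames in the batch
                loss_sum, num_frames = compute_loss(model, batch, device)
                tot_loss += loss_sum.item()
                tot_frames += num_frames
        model.train()
        avg_loss = tot_loss / tot_frames   # printed as "validation: loss=..."
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return avg_loss, mem_mb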
2024-06-21 22:22:02,077 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.239e+02 2.394e+02 2.560e+02 3.892e+02, threshold=4.788e+02, percent-clipped=0.0
2024-06-21 22:22:02,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469294.8333333333, ans=0.125
2024-06-21 22:22:14,367 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-256000.pt
2024-06-21 22:22:28,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469349.8333333333, ans=0.125
2024-06-21 22:22:40,947 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.82 vs. limit=22.5
2024-06-21 22:22:41,832 INFO [train.py:1028] (0/2) Epoch 26, batch 3100, loss[loss=0.1713, simple_loss=0.2253, pruned_loss=0.05869, over 13035.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.238, pruned_loss=0.06291, over 2579196.70 frames. ], batch size: 144, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:22:54,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0
2024-06-21 22:23:09,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=469441.5, ans=0.2
2024-06-21 22:23:19,829 INFO [train.py:1028] (0/2) Epoch 26, batch 3150, loss[loss=0.1735, simple_loss=0.2275, pruned_loss=0.0597, over 12957.00 frames. ], tot_loss[loss=0.1807, simple_loss=0.237, pruned_loss=0.0622, over 2580164.03 frames. ], batch size: 158, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:23:20,438 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.203e+02 2.359e+02 2.601e+02 3.409e+02, threshold=4.718e+02, percent-clipped=0.0
2024-06-21 22:23:22,684 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=469478.1666666667, ans=0.0
2024-06-21 22:23:26,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=469496.5, ans=0.125
2024-06-21 22:23:43,926 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.90 vs. limit=10.0
2024-06-21 22:23:47,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469551.5, ans=0.1
2024-06-21 22:23:52,638 INFO [train.py:1028] (0/2) Epoch 26, batch 3200, loss[loss=0.1672, simple_loss=0.2237, pruned_loss=0.0554, over 13120.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2363, pruned_loss=0.06198, over 2581293.40 frames. ], batch size: 55, lr: 2.23e-03, grad_scale: 32.0
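The checkpoint.py:75 record shows a batch-indexed save (checkpoint-256000.pt) landing mid-epoch, separate from any per-epoch saves. A sketch of interval-based checkpointing of this kind; the interval parameter and the dict layout are assumptions, not icefall's exact checkpoint format:

    import torch

    def maybe_save_checkpoint(model, optimizer, params):
        # assumed: save whenever the global batch index is a multiple of a
        # fixed interval, naming the file by that index (cf. checkpoint-256000.pt)
        if params.batch_idx_train % params.save_every_n == 0:
            filename = params.exp_dir / f"checkpoint-{params.batch_idx_train}.pt"
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "batch_idx_train": params.batch_idx_train,
                },
                filename,
            )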
2024-06-21 22:23:54,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=469569.8333333333, ans=15.0
2024-06-21 22:23:55,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469569.8333333333, ans=0.1
2024-06-21 22:23:56,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=469569.8333333333, ans=0.125
2024-06-21 22:24:00,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=469588.1666666667, ans=0.125
2024-06-21 22:24:04,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=469588.1666666667, ans=0.09899494936611666
2024-06-21 22:24:17,273 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:24:19,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.07 vs. limit=6.0
2024-06-21 22:24:21,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469643.1666666667, ans=0.1
2024-06-21 22:24:24,603 INFO [train.py:1028] (0/2) Epoch 26, batch 3250, loss[loss=0.1691, simple_loss=0.2258, pruned_loss=0.05614, over 13276.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2364, pruned_loss=0.0623, over 2585529.09 frames. ], batch size: 72, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:24:25,176 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.136e+02 2.250e+02 2.442e+02 3.025e+02, threshold=4.499e+02, percent-clipped=0.0
2024-06-21 22:24:33,414 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0
2024-06-21 22:24:33,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=469679.8333333333, ans=0.04949747468305833
2024-06-21 22:24:35,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469679.8333333333, ans=0.125
2024-06-21 22:24:46,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=469698.1666666667, ans=0.2
2024-06-21 22:24:51,195 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:24:58,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=469716.5, ans=0.125
2024-06-21 22:25:01,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=469734.8333333333, ans=0.2
2024-06-21 22:25:01,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=469734.8333333333, ans=0.125
2024-06-21 22:25:01,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.62 vs. limit=15.0
2024-06-21 22:25:03,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=469734.8333333333, ans=0.0
2024-06-21 22:25:06,057 INFO [train.py:1028] (0/2) Epoch 26, batch 3300, loss[loss=0.1724, simple_loss=0.2244, pruned_loss=0.06021, over 12695.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.2359, pruned_loss=0.06192, over 2581892.43 frames. ], batch size: 176, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:25:17,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.51 vs. limit=10.0
2024-06-21 22:25:26,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=469808.1666666667, ans=0.09899494936611666
2024-06-21 22:25:33,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=469826.5, ans=0.2
2024-06-21 22:25:35,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.34 vs. limit=15.0
2024-06-21 22:25:38,752 INFO [train.py:1028] (0/2) Epoch 26, batch 3350, loss[loss=0.1765, simple_loss=0.2258, pruned_loss=0.0636, over 12964.00 frames. ], tot_loss[loss=0.1803, simple_loss=0.236, pruned_loss=0.06232, over 2576609.84 frames. ], batch size: 158, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:25:39,431 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.183e+02 2.310e+02 2.482e+02 3.430e+02, threshold=4.620e+02, percent-clipped=0.0
2024-06-21 22:25:49,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=469863.1666666667, ans=0.2
2024-06-21 22:25:49,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=469863.1666666667, ans=0.125
2024-06-21 22:25:54,250 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0
2024-06-21 22:25:54,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=469881.5, ans=0.125
2024-06-21 22:26:05,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469918.1666666667, ans=0.1
2024-06-21 22:26:10,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=469918.1666666667, ans=0.125
2024-06-21 22:26:11,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=469918.1666666667, ans=0.125
2024-06-21 22:26:12,375 INFO [train.py:1028] (0/2) Epoch 26, batch 3400, loss[loss=0.1866, simple_loss=0.2573, pruned_loss=0.05793, over 12645.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2354, pruned_loss=0.06227, over 2576192.94 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:26:19,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=469936.5, ans=0.125
2024-06-21 22:26:27,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=469954.8333333333, ans=0.125
2024-06-21 22:26:28,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=469973.1666666667, ans=0.125
2024-06-21 22:26:41,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=469991.5, ans=0.0
2024-06-21 22:26:45,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=470009.8333333333, ans=0.125
2024-06-21 22:26:49,274 INFO [train.py:1028] (0/2) Epoch 26, batch 3450, loss[loss=0.1822, simple_loss=0.2278, pruned_loss=0.06833, over 12786.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2343, pruned_loss=0.06169, over 2577951.93 frames. ], batch size: 176, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:26:49,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.32 vs. limit=10.0
2024-06-21 22:26:49,826 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.873e+02 2.200e+02 2.336e+02 2.535e+02 3.349e+02, threshold=4.673e+02, percent-clipped=0.0
2024-06-21 22:27:04,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470046.5, ans=0.1
2024-06-21 22:27:18,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=470101.5, ans=0.0
2024-06-21 22:27:24,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470101.5, ans=0.1
2024-06-21 22:27:24,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=470101.5, ans=0.025
2024-06-21 22:27:26,092 INFO [train.py:1028] (0/2) Epoch 26, batch 3500, loss[loss=0.1877, simple_loss=0.242, pruned_loss=0.06668, over 12991.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2334, pruned_loss=0.06114, over 2576736.28 frames. ], batch size: 33, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:27:29,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0
2024-06-21 22:27:29,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=470119.8333333333, ans=0.025
2024-06-21 22:27:32,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=470138.1666666667, ans=0.125
2024-06-21 22:27:37,745 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:27:52,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0
2024-06-21 22:27:54,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=470193.1666666667, ans=0.0
2024-06-21 22:27:59,147 INFO [train.py:1028] (0/2) Epoch 26, batch 3550, loss[loss=0.1743, simple_loss=0.2228, pruned_loss=0.06291, over 13149.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2329, pruned_loss=0.06093, over 2576833.27 frames. ], batch size: 95, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:27:59,701 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.178e+02 2.269e+02 2.418e+02 3.289e+02, threshold=4.539e+02, percent-clipped=0.0
2024-06-21 22:27:59,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=470211.5, ans=0.0
2024-06-21 22:28:09,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=470229.8333333333, ans=0.125
2024-06-21 22:28:14,008 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0
2024-06-21 22:28:34,101 INFO [train.py:1028] (0/2) Epoch 26, batch 3600, loss[loss=0.1708, simple_loss=0.2275, pruned_loss=0.05704, over 13327.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2321, pruned_loss=0.06071, over 2580595.45 frames. ], batch size: 49, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:28:48,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=470339.8333333333, ans=0.0
2024-06-21 22:28:48,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=470339.8333333333, ans=0.125
2024-06-21 22:28:49,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=470339.8333333333, ans=0.125
2024-06-21 22:28:51,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=470339.8333333333, ans=0.025
2024-06-21 22:29:09,721 INFO [train.py:1028] (0/2) Epoch 26, batch 3650, loss[loss=0.1975, simple_loss=0.2464, pruned_loss=0.07429, over 13019.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2326, pruned_loss=0.06077, over 2578595.72 frames. ], batch size: 102, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:29:10,295 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.131e+02 2.299e+02 2.507e+02 3.321e+02, threshold=4.598e+02, percent-clipped=0.0
2024-06-21 22:29:11,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=470394.8333333333, ans=0.1
2024-06-21 22:29:11,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=470394.8333333333, ans=0.125
2024-06-21 22:29:13,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=470394.8333333333, ans=0.09899494936611666
2024-06-21 22:29:26,668 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.08 vs. limit=15.0
2024-06-21 22:29:42,023 INFO [train.py:1028] (0/2) Epoch 26, batch 3700, loss[loss=0.1645, simple_loss=0.2215, pruned_loss=0.05375, over 13258.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2321, pruned_loss=0.06062, over 2584191.50 frames. ], batch size: 72, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:29:50,406 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.30 vs. limit=10.0
2024-06-21 22:29:50,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=470504.8333333333, ans=0.0
2024-06-21 22:29:58,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=470523.1666666667, ans=0.025
2024-06-21 22:29:59,440 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.14 vs. limit=22.5
2024-06-21 22:30:02,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=470541.5, ans=0.125
2024-06-21 22:30:09,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=470559.8333333333, ans=0.125
2024-06-21 22:30:12,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=470559.8333333333, ans=0.07
2024-06-21 22:30:15,079 INFO [train.py:1028] (0/2) Epoch 26, batch 3750, loss[loss=0.171, simple_loss=0.2315, pruned_loss=0.05521, over 12418.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2313, pruned_loss=0.06019, over 2585410.54 frames. ], batch size: 22, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:30:15,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=470578.1666666667, ans=0.0
2024-06-21 22:30:15,735 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.153e+02 2.246e+02 2.425e+02 3.051e+02, threshold=4.493e+02, percent-clipped=0.0
2024-06-21 22:30:28,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=470596.5, ans=0.95
2024-06-21 22:30:29,914 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0
2024-06-21 22:30:37,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=470633.1666666667, ans=0.125
2024-06-21 22:30:37,958 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:30:40,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=15.0
2024-06-21 22:30:41,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=470633.1666666667, ans=0.0
2024-06-21 22:30:49,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=470651.5, ans=0.0
2024-06-21 22:30:50,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=470651.5, ans=0.2
2024-06-21 22:30:52,026 INFO [train.py:1028] (0/2) Epoch 26, batch 3800, loss[loss=0.1705, simple_loss=0.2276, pruned_loss=0.05671, over 13251.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.231, pruned_loss=0.06, over 2584073.67 frames. ], batch size: 83, lr: 2.23e-03, grad_scale: 32.0
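In the train.py:1028 records, loss[...] describes the current batch alone, while tot_loss[...] moves slowly and is reported "over" a fractional frame count (e.g. 2584191.50 frames), which is consistent with a frame-weighted running average with exponential forgetting. A sketch of such a tracker; the decay mechanism and constant are illustrative assumptions, not a claim about train.py's exact bookkeeping:

    class RunningLoss:
        """Frame-weighted running average with exponential forgetting."""

        def __init__(self, decay: float = 0.999):  # assumed per-batch decay
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss_sum: float, batch_frames: float) -> None:
            # decayed sums make the effective frame total fractional,
            # matching the "over N.NN frames" values in the log
            self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
            self.frames = self.frames * self.decay + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)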
2024-06-21 22:31:12,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=470706.5, ans=0.0
2024-06-21 22:31:29,562 INFO [train.py:1028] (0/2) Epoch 26, batch 3850, loss[loss=0.1816, simple_loss=0.2271, pruned_loss=0.06808, over 13073.00 frames. ], tot_loss[loss=0.175, simple_loss=0.2306, pruned_loss=0.0597, over 2583149.77 frames. ], batch size: 144, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:31:30,211 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.893e+02 2.132e+02 2.259e+02 2.452e+02 3.350e+02, threshold=4.518e+02, percent-clipped=0.0
2024-06-21 22:31:34,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=470761.5, ans=0.125
2024-06-21 22:31:45,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=470798.1666666667, ans=0.125
2024-06-21 22:31:50,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=470816.5, ans=0.125
2024-06-21 22:31:56,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=470834.8333333333, ans=0.125
2024-06-21 22:31:57,223 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0
2024-06-21 22:31:59,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=470834.8333333333, ans=0.125
2024-06-21 22:32:02,319 INFO [train.py:1028] (0/2) Epoch 26, batch 3900, loss[loss=0.1652, simple_loss=0.2254, pruned_loss=0.05252, over 13224.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2307, pruned_loss=0.05994, over 2586738.06 frames. ], batch size: 83, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:32:02,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=470853.1666666667, ans=0.125
2024-06-21 22:32:10,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=470871.5, ans=0.025
2024-06-21 22:32:10,701 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=12.0
2024-06-21 22:32:16,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=470889.8333333333, ans=0.125
2024-06-21 22:32:18,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=470889.8333333333, ans=0.125
2024-06-21 22:32:24,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=470908.1666666667, ans=0.125
2024-06-21 22:32:35,184 INFO [train.py:1028] (0/2) Epoch 26, batch 3950, loss[loss=0.1808, simple_loss=0.2239, pruned_loss=0.06878, over 13114.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2299, pruned_loss=0.05951, over 2588509.80 frames. ], batch size: 132, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:32:39,143 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.738e+02 2.149e+02 2.289e+02 2.472e+02 4.018e+02, threshold=4.578e+02, percent-clipped=0.0
2024-06-21 22:32:45,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=470963.1666666667, ans=0.125
2024-06-21 22:32:49,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=470963.1666666667, ans=0.0
2024-06-21 22:33:14,733 INFO [train.py:1028] (0/2) Epoch 26, batch 4000, loss[loss=0.1945, simple_loss=0.2498, pruned_loss=0.06966, over 12929.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2296, pruned_loss=0.05969, over 2583554.69 frames. ], batch size: 39, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:33:27,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=471054.8333333333, ans=0.2
2024-06-21 22:33:48,302 INFO [train.py:1028] (0/2) Epoch 26, batch 4050, loss[loss=0.1832, simple_loss=0.2277, pruned_loss=0.06934, over 10890.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2296, pruned_loss=0.05994, over 2580486.74 frames. ], batch size: 304, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:33:48,871 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.148e+02 2.250e+02 2.404e+02 3.385e+02, threshold=4.501e+02, percent-clipped=0.0
2024-06-21 22:33:49,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=471128.1666666667, ans=0.09899494936611666
2024-06-21 22:33:52,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=471128.1666666667, ans=0.1
2024-06-21 22:34:02,332 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 22:34:02,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=15.0
2024-06-21 22:34:03,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=471164.8333333333, ans=0.04949747468305833
2024-06-21 22:34:21,082 INFO [train.py:1028] (0/2) Epoch 26, batch 4100, loss[loss=0.1886, simple_loss=0.2401, pruned_loss=0.06852, over 13088.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2303, pruned_loss=0.06045, over 2577232.84 frames. ], batch size: 102, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:34:21,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471219.8333333333, ans=0.125
2024-06-21 22:34:28,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.71 vs. limit=15.0
2024-06-21 22:34:40,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=471256.5, ans=0.2
2024-06-21 22:34:52,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471293.1666666667, ans=0.1
2024-06-21 22:34:53,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0
2024-06-21 22:34:57,969 INFO [train.py:1028] (0/2) Epoch 26, batch 4150, loss[loss=0.1744, simple_loss=0.2393, pruned_loss=0.05476, over 13140.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2304, pruned_loss=0.06055, over 2574849.79 frames. ], batch size: 55, lr: 2.23e-03, grad_scale: 32.0
2024-06-21 22:34:58,601 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.812e+02 2.243e+02 2.353e+02 2.480e+02 3.313e+02, threshold=4.706e+02, percent-clipped=0.0
2024-06-21 22:34:59,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471311.5, ans=0.1
2024-06-21 22:35:00,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=471311.5, ans=0.04949747468305833
2024-06-21 22:35:19,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=471348.1666666667, ans=0.125
2024-06-21 22:35:21,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=471366.5, ans=0.125
2024-06-21 22:35:22,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.22 vs. limit=22.5
2024-06-21 22:35:32,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=471384.8333333333, ans=0.125
2024-06-21 22:35:32,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=471384.8333333333, ans=0.125
2024-06-21 22:35:34,221 INFO [train.py:1028] (0/2) Epoch 26, batch 4200, loss[loss=0.1691, simple_loss=0.2228, pruned_loss=0.05768, over 13062.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2299, pruned_loss=0.06034, over 2577685.12 frames. ], batch size: 102, lr: 2.23e-03, grad_scale: 32.0
], batch size: 102, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:35:45,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=471421.5, ans=0.0 2024-06-21 22:35:52,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471439.8333333333, ans=0.1 2024-06-21 22:35:56,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=471458.1666666667, ans=0.125 2024-06-21 22:35:57,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=471458.1666666667, ans=0.125 2024-06-21 22:36:01,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=471476.5, ans=0.2 2024-06-21 22:36:04,572 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2024-06-21 22:36:04,990 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=471476.5, ans=0.2 2024-06-21 22:36:06,848 INFO [train.py:1028] (0/2) Epoch 26, batch 4250, loss[loss=0.1584, simple_loss=0.2248, pruned_loss=0.04606, over 13299.00 frames. ], tot_loss[loss=0.175, simple_loss=0.2299, pruned_loss=0.0601, over 2580525.47 frames. ], batch size: 46, lr: 2.23e-03, grad_scale: 32.0 2024-06-21 22:36:07,465 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.178e+02 2.345e+02 2.576e+02 3.363e+02, threshold=4.690e+02, percent-clipped=0.0 2024-06-21 22:36:12,385 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=15.0 2024-06-21 22:36:14,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=471513.1666666667, ans=0.025 2024-06-21 22:36:14,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=471513.1666666667, ans=0.025 2024-06-21 22:36:14,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471513.1666666667, ans=0.1 2024-06-21 22:36:18,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=471513.1666666667, ans=0.125 2024-06-21 22:36:33,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=471568.1666666667, ans=0.2 2024-06-21 22:36:39,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.20 vs. limit=15.0 2024-06-21 22:36:42,526 INFO [train.py:1028] (0/2) Epoch 26, batch 4300, loss[loss=0.1756, simple_loss=0.2387, pruned_loss=0.05625, over 13211.00 frames. ], tot_loss[loss=0.1747, simple_loss=0.2294, pruned_loss=0.06, over 2581472.92 frames. 
], batch size: 59, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:36:52,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=471604.8333333333, ans=0.125 2024-06-21 22:36:59,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=471623.1666666667, ans=0.0 2024-06-21 22:37:12,958 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:37:19,361 INFO [train.py:1028] (0/2) Epoch 26, batch 4350, loss[loss=0.181, simple_loss=0.2329, pruned_loss=0.06457, over 13189.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.229, pruned_loss=0.05977, over 2586398.49 frames. ], batch size: 59, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:37:19,906 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.146e+02 2.258e+02 2.415e+02 2.906e+02, threshold=4.516e+02, percent-clipped=0.0 2024-06-21 22:37:21,640 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2024-06-21 22:37:23,517 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2024-06-21 22:37:31,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=471714.8333333333, ans=0.025 2024-06-21 22:37:32,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=471714.8333333333, ans=0.0 2024-06-21 22:37:46,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=471751.5, ans=0.025 2024-06-21 22:37:52,161 INFO [train.py:1028] (0/2) Epoch 26, batch 4400, loss[loss=0.1735, simple_loss=0.2285, pruned_loss=0.05928, over 13212.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2297, pruned_loss=0.06032, over 2587122.79 frames. ], batch size: 83, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:37:54,755 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:37:55,105 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.25 vs. limit=15.0 2024-06-21 22:37:56,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=471769.8333333333, ans=0.125 2024-06-21 22:38:11,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471824.8333333333, ans=0.125 2024-06-21 22:38:19,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2024-06-21 22:38:24,673 INFO [train.py:1028] (0/2) Epoch 26, batch 4450, loss[loss=0.1708, simple_loss=0.23, pruned_loss=0.05582, over 12936.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2296, pruned_loss=0.06034, over 2581414.69 frames. 
], batch size: 33, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:38:25,336 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.804e+02 2.098e+02 2.208e+02 2.373e+02 2.901e+02, threshold=4.415e+02, percent-clipped=0.0 2024-06-21 22:38:48,925 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:38:49,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=471916.5, ans=0.2 2024-06-21 22:38:50,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=471916.5, ans=0.0 2024-06-21 22:38:54,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=471934.8333333333, ans=0.125 2024-06-21 22:39:00,358 INFO [train.py:1028] (0/2) Epoch 26, batch 4500, loss[loss=0.1786, simple_loss=0.2291, pruned_loss=0.06402, over 13243.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2287, pruned_loss=0.05971, over 2585404.38 frames. ], batch size: 89, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:39:04,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2024-06-21 22:39:05,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=471953.1666666667, ans=0.125 2024-06-21 22:39:15,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471971.5, ans=0.1 2024-06-21 22:39:22,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=471989.8333333333, ans=0.125 2024-06-21 22:39:29,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=472026.5, ans=0.125 2024-06-21 22:39:36,730 INFO [train.py:1028] (0/2) Epoch 26, batch 4550, loss[loss=0.1864, simple_loss=0.2443, pruned_loss=0.06428, over 13239.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.2286, pruned_loss=0.05964, over 2588416.72 frames. 
], batch size: 52, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:39:37,279 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.134e+02 2.272e+02 2.428e+02 2.828e+02, threshold=4.543e+02, percent-clipped=0.0 2024-06-21 22:39:38,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=472044.8333333333, ans=0.125 2024-06-21 22:39:38,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=472044.8333333333, ans=0.2 2024-06-21 22:39:50,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=472081.5, ans=0.125 2024-06-21 22:39:54,652 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=472081.5, ans=0.125 2024-06-21 22:39:58,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=472099.8333333333, ans=0.125 2024-06-21 22:40:04,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=472118.1666666667, ans=0.125 2024-06-21 22:40:09,498 INFO [train.py:1028] (0/2) Epoch 26, batch 4600, loss[loss=0.192, simple_loss=0.2403, pruned_loss=0.0719, over 12572.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2284, pruned_loss=0.05955, over 2583724.66 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:40:10,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=472136.5, ans=0.0 2024-06-21 22:40:11,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.49 vs. limit=15.0 2024-06-21 22:40:18,121 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.00 vs. limit=15.0 2024-06-21 22:40:18,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472154.8333333333, ans=0.1 2024-06-21 22:40:25,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=472173.1666666667, ans=0.2 2024-06-21 22:40:27,441 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:40:28,258 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. 
limit=15.0 2024-06-21 22:40:41,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472209.8333333333, ans=0.1 2024-06-21 22:40:41,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=472209.8333333333, ans=0.2 2024-06-21 22:40:44,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=472209.8333333333, ans=0.125 2024-06-21 22:40:44,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=472228.1666666667, ans=0.025 2024-06-21 22:40:45,268 INFO [train.py:1028] (0/2) Epoch 26, batch 4650, loss[loss=0.1772, simple_loss=0.2227, pruned_loss=0.06586, over 13047.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2273, pruned_loss=0.05915, over 2587188.40 frames. ], batch size: 132, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:40:45,837 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.125e+02 2.236e+02 2.357e+02 3.184e+02, threshold=4.472e+02, percent-clipped=0.0 2024-06-21 22:40:55,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=472246.5, ans=0.125 2024-06-21 22:41:09,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=472283.1666666667, ans=0.125 2024-06-21 22:41:21,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472319.8333333333, ans=0.1 2024-06-21 22:41:22,023 INFO [train.py:1028] (0/2) Epoch 26, batch 4700, loss[loss=0.173, simple_loss=0.2343, pruned_loss=0.0559, over 13026.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2272, pruned_loss=0.05896, over 2584104.84 frames. ], batch size: 26, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:41:22,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=472319.8333333333, ans=0.0 2024-06-21 22:41:30,381 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:41:45,035 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=15.0 2024-06-21 22:41:54,965 INFO [train.py:1028] (0/2) Epoch 26, batch 4750, loss[loss=0.1854, simple_loss=0.237, pruned_loss=0.06686, over 12534.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2276, pruned_loss=0.05935, over 2581208.26 frames. 
], batch size: 202, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:41:55,748 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.225e+02 2.414e+02 2.680e+02 3.339e+02, threshold=4.827e+02, percent-clipped=0.0 2024-06-21 22:41:55,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=472411.5, ans=0.125 2024-06-21 22:41:56,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=472411.5, ans=0.0 2024-06-21 22:42:02,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=472429.8333333333, ans=0.2 2024-06-21 22:42:14,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=472466.5, ans=0.125 2024-06-21 22:42:28,292 INFO [train.py:1028] (0/2) Epoch 26, batch 4800, loss[loss=0.1594, simple_loss=0.2191, pruned_loss=0.04989, over 13248.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2272, pruned_loss=0.05893, over 2577770.18 frames. ], batch size: 63, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:42:38,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=472521.5, ans=0.0 2024-06-21 22:42:40,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=472521.5, ans=0.125 2024-06-21 22:42:57,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=472576.5, ans=0.2 2024-06-21 22:43:07,642 INFO [train.py:1028] (0/2) Epoch 26, batch 4850, loss[loss=0.1629, simple_loss=0.2171, pruned_loss=0.05431, over 13262.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2269, pruned_loss=0.05873, over 2574786.09 frames. ], batch size: 89, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:43:07,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=472594.8333333333, ans=0.125 2024-06-21 22:43:08,323 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.124e+02 2.257e+02 2.407e+02 3.075e+02, threshold=4.514e+02, percent-clipped=0.0 2024-06-21 22:43:17,455 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=472613.1666666667, ans=0.0 2024-06-21 22:43:20,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.65 vs. limit=15.0 2024-06-21 22:43:22,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=472631.5, ans=0.125 2024-06-21 22:43:25,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=472631.5, ans=0.95 2024-06-21 22:43:31,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=472649.8333333333, ans=0.0 2024-06-21 22:43:40,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=472668.1666666667, ans=0.0 2024-06-21 22:43:40,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.22 vs. 
limit=15.0 2024-06-21 22:43:41,792 INFO [train.py:1028] (0/2) Epoch 26, batch 4900, loss[loss=0.1756, simple_loss=0.2319, pruned_loss=0.05963, over 13199.00 frames. ], tot_loss[loss=0.1723, simple_loss=0.2269, pruned_loss=0.05886, over 2575298.50 frames. ], batch size: 59, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:43:48,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472704.8333333333, ans=0.1 2024-06-21 22:43:49,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=472704.8333333333, ans=0.0 2024-06-21 22:43:49,685 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.31 vs. limit=22.5 2024-06-21 22:43:50,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=472704.8333333333, ans=0.025 2024-06-21 22:44:07,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472741.5, ans=0.1 2024-06-21 22:44:09,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=472759.8333333333, ans=0.125 2024-06-21 22:44:15,375 INFO [train.py:1028] (0/2) Epoch 26, batch 4950, loss[loss=0.1788, simple_loss=0.2209, pruned_loss=0.06839, over 11179.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2268, pruned_loss=0.05908, over 2570005.34 frames. ], batch size: 304, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:44:15,978 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.172e+02 2.333e+02 2.669e+02 3.582e+02, threshold=4.667e+02, percent-clipped=0.0 2024-06-21 22:44:16,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=472778.1666666667, ans=0.0 2024-06-21 22:44:17,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=472778.1666666667, ans=0.0 2024-06-21 22:44:19,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=472778.1666666667, ans=0.2 2024-06-21 22:44:22,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472796.5, ans=0.1 2024-06-21 22:44:43,590 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.04 vs. limit=22.5 2024-06-21 22:44:44,636 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:44:53,172 INFO [train.py:1028] (0/2) Epoch 26, batch 5000, loss[loss=0.1651, simple_loss=0.2115, pruned_loss=0.0594, over 13156.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2267, pruned_loss=0.0591, over 2574571.85 frames. ], batch size: 95, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:44:56,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=472869.8333333333, ans=0.125 2024-06-21 22:44:59,832 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. 
limit=6.0 2024-06-21 22:45:01,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=15.0 2024-06-21 22:45:03,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=22.5 2024-06-21 22:45:15,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=472906.5, ans=0.125 2024-06-21 22:45:22,221 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:45:24,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=472943.1666666667, ans=0.2 2024-06-21 22:45:24,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=15.0 2024-06-21 22:45:31,738 INFO [train.py:1028] (0/2) Epoch 26, batch 5050, loss[loss=0.1828, simple_loss=0.238, pruned_loss=0.06375, over 13192.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2275, pruned_loss=0.05937, over 2574723.75 frames. ], batch size: 37, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:45:32,328 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.820e+02 2.111e+02 2.272e+02 2.459e+02 3.080e+02, threshold=4.543e+02, percent-clipped=0.0 2024-06-21 22:45:36,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=472961.5, ans=0.0 2024-06-21 22:45:42,687 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=472979.8333333333, ans=0.125 2024-06-21 22:45:47,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=472998.1666666667, ans=0.0 2024-06-21 22:45:47,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=472998.1666666667, ans=0.125 2024-06-21 22:45:53,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=473016.5, ans=0.05 2024-06-21 22:45:55,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=473016.5, ans=0.125 2024-06-21 22:45:55,852 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:46:00,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=473034.8333333333, ans=0.0 2024-06-21 22:46:05,534 INFO [train.py:1028] (0/2) Epoch 26, batch 5100, loss[loss=0.1916, simple_loss=0.2445, pruned_loss=0.06934, over 12984.00 frames. ], tot_loss[loss=0.1735, simple_loss=0.2277, pruned_loss=0.05966, over 2570815.01 frames. ], batch size: 39, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:46:07,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473053.1666666667, ans=0.1 2024-06-21 22:46:14,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.18 vs. 
limit=15.0 2024-06-21 22:46:25,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=473108.1666666667, ans=0.125 2024-06-21 22:46:26,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=473108.1666666667, ans=0.125 2024-06-21 22:46:33,961 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.02 vs. limit=12.0 2024-06-21 22:46:41,355 INFO [train.py:1028] (0/2) Epoch 26, batch 5150, loss[loss=0.1575, simple_loss=0.2112, pruned_loss=0.05193, over 13106.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2273, pruned_loss=0.05972, over 2571918.58 frames. ], batch size: 132, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:46:42,018 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.184e+02 2.383e+02 2.587e+02 3.699e+02, threshold=4.767e+02, percent-clipped=0.0 2024-06-21 22:46:45,038 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2024-06-21 22:46:47,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.83 vs. limit=15.0 2024-06-21 22:46:52,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=473163.1666666667, ans=0.0 2024-06-21 22:46:56,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=473181.5, ans=0.125 2024-06-21 22:46:58,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=473181.5, ans=0.0 2024-06-21 22:47:05,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=473199.8333333333, ans=0.2 2024-06-21 22:47:15,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=473218.1666666667, ans=0.0 2024-06-21 22:47:17,934 INFO [train.py:1028] (0/2) Epoch 26, batch 5200, loss[loss=0.17, simple_loss=0.2223, pruned_loss=0.05884, over 13163.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2272, pruned_loss=0.05963, over 2575684.72 frames. ], batch size: 95, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:47:28,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473254.8333333333, ans=0.1 2024-06-21 22:47:29,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=473254.8333333333, ans=0.0 2024-06-21 22:47:33,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=473273.1666666667, ans=0.025 2024-06-21 22:47:36,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=473273.1666666667, ans=0.0 2024-06-21 22:47:36,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=473273.1666666667, ans=0.0 2024-06-21 22:47:37,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.85 vs. 
limit=15.0 2024-06-21 22:47:48,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.72 vs. limit=22.5 2024-06-21 22:47:51,365 INFO [train.py:1028] (0/2) Epoch 26, batch 5250, loss[loss=0.1833, simple_loss=0.2409, pruned_loss=0.06281, over 13248.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2272, pruned_loss=0.0596, over 2572678.22 frames. ], batch size: 52, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:47:51,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473328.1666666667, ans=0.125 2024-06-21 22:47:51,951 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.154e+02 2.268e+02 2.455e+02 3.354e+02, threshold=4.537e+02, percent-clipped=0.0 2024-06-21 22:47:58,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=473346.5, ans=0.125 2024-06-21 22:48:09,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473364.8333333333, ans=0.125 2024-06-21 22:48:23,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=473419.8333333333, ans=0.0 2024-06-21 22:48:24,267 INFO [train.py:1028] (0/2) Epoch 26, batch 5300, loss[loss=0.1646, simple_loss=0.2129, pruned_loss=0.05817, over 13036.00 frames. ], tot_loss[loss=0.1733, simple_loss=0.2273, pruned_loss=0.05963, over 2569487.61 frames. ], batch size: 144, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:48:26,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=473419.8333333333, ans=0.125 2024-06-21 22:48:34,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2024-06-21 22:48:42,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=473456.5, ans=0.025 2024-06-21 22:48:47,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=473474.8333333333, ans=0.2 2024-06-21 22:49:05,215 INFO [train.py:1028] (0/2) Epoch 26, batch 5350, loss[loss=0.169, simple_loss=0.2314, pruned_loss=0.05331, over 12001.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2265, pruned_loss=0.05925, over 2576560.28 frames. 
], batch size: 17, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:49:05,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=473511.5, ans=0.125 2024-06-21 22:49:05,812 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.088e+02 2.226e+02 2.386e+02 3.248e+02, threshold=4.451e+02, percent-clipped=0.0 2024-06-21 22:49:09,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=473511.5, ans=0.1 2024-06-21 22:49:11,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=473511.5, ans=0.0 2024-06-21 22:49:31,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473584.8333333333, ans=0.1 2024-06-21 22:49:38,172 INFO [train.py:1028] (0/2) Epoch 26, batch 5400, loss[loss=0.1799, simple_loss=0.2228, pruned_loss=0.06855, over 12260.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2267, pruned_loss=0.05973, over 2568825.35 frames. ], batch size: 240, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:49:43,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473603.1666666667, ans=0.1 2024-06-21 22:49:58,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=473658.1666666667, ans=0.04949747468305833 2024-06-21 22:50:03,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=473658.1666666667, ans=0.2 2024-06-21 22:50:10,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=473694.8333333333, ans=0.125 2024-06-21 22:50:11,036 INFO [train.py:1028] (0/2) Epoch 26, batch 5450, loss[loss=0.1671, simple_loss=0.2281, pruned_loss=0.05307, over 13008.00 frames. ], tot_loss[loss=0.1733, simple_loss=0.227, pruned_loss=0.05978, over 2572314.68 frames. ], batch size: 26, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:50:11,717 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.165e+02 2.344e+02 2.492e+02 3.563e+02, threshold=4.689e+02, percent-clipped=0.0 2024-06-21 22:50:20,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=473713.1666666667, ans=15.0 2024-06-21 22:50:20,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.86 vs. limit=15.0 2024-06-21 22:50:25,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.70 vs. 
limit=10.0 2024-06-21 22:50:28,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=473731.5, ans=0.2 2024-06-21 22:50:36,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=473749.8333333333, ans=0.0 2024-06-21 22:50:45,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=473768.1666666667, ans=0.0 2024-06-21 22:50:47,692 INFO [train.py:1028] (0/2) Epoch 26, batch 5500, loss[loss=0.202, simple_loss=0.2442, pruned_loss=0.07991, over 12244.00 frames. ], tot_loss[loss=0.1735, simple_loss=0.2274, pruned_loss=0.05982, over 2566125.86 frames. ], batch size: 240, lr: 2.22e-03, grad_scale: 64.0 2024-06-21 22:51:00,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473804.8333333333, ans=0.1 2024-06-21 22:51:09,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=473823.1666666667, ans=0.0 2024-06-21 22:51:11,899 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 22:51:15,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=473841.5, ans=0.0 2024-06-21 22:51:23,894 INFO [train.py:1028] (0/2) Epoch 26, batch 5550, loss[loss=0.1677, simple_loss=0.2258, pruned_loss=0.05483, over 13229.00 frames. ], tot_loss[loss=0.173, simple_loss=0.227, pruned_loss=0.05956, over 2568774.25 frames. ], batch size: 43, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:51:25,211 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.135e+02 2.262e+02 2.450e+02 3.251e+02, threshold=4.524e+02, percent-clipped=0.0 2024-06-21 22:51:40,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=473914.8333333333, ans=0.2 2024-06-21 22:51:44,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=473933.1666666667, ans=0.125 2024-06-21 22:51:46,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=473933.1666666667, ans=0.125 2024-06-21 22:51:48,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=473933.1666666667, ans=0.125 2024-06-21 22:51:52,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=473951.5, ans=0.2 2024-06-21 22:51:52,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=473951.5, ans=0.125 2024-06-21 22:51:55,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=473951.5, ans=0.125 2024-06-21 22:51:56,205 INFO [train.py:1028] (0/2) Epoch 26, batch 5600, loss[loss=0.1716, simple_loss=0.2195, pruned_loss=0.06189, over 13245.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2266, pruned_loss=0.05936, over 2570772.07 frames. 
], batch size: 89, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:52:13,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=474006.5, ans=0.0 2024-06-21 22:52:30,254 INFO [train.py:1028] (0/2) Epoch 26, batch 5650, loss[loss=0.1773, simple_loss=0.2268, pruned_loss=0.06384, over 12585.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2267, pruned_loss=0.05934, over 2574568.30 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:52:31,505 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.123e+02 2.212e+02 2.361e+02 3.097e+02, threshold=4.424e+02, percent-clipped=0.0 2024-06-21 22:52:53,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474116.5, ans=0.1 2024-06-21 22:53:10,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=474153.1666666667, ans=0.125 2024-06-21 22:53:11,173 INFO [train.py:1028] (0/2) Epoch 26, batch 5700, loss[loss=0.1487, simple_loss=0.2068, pruned_loss=0.04533, over 13234.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2263, pruned_loss=0.05922, over 2578436.38 frames. ], batch size: 63, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:53:17,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474171.5, ans=0.1 2024-06-21 22:53:37,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=474226.5, ans=0.125 2024-06-21 22:53:44,158 INFO [train.py:1028] (0/2) Epoch 26, batch 5750, loss[loss=0.1934, simple_loss=0.2392, pruned_loss=0.07382, over 12789.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2271, pruned_loss=0.05964, over 2579183.42 frames. ], batch size: 176, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:53:45,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.189e+02 2.312e+02 2.522e+02 2.990e+02, threshold=4.623e+02, percent-clipped=0.0 2024-06-21 22:53:46,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=474244.8333333333, ans=0.125 2024-06-21 22:53:51,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.48 vs. limit=10.0 2024-06-21 22:53:52,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=474263.1666666667, ans=0.2 2024-06-21 22:53:59,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=474281.5, ans=0.2 2024-06-21 22:54:13,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=474318.1666666667, ans=0.05 2024-06-21 22:54:13,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=474318.1666666667, ans=0.125 2024-06-21 22:54:16,934 INFO [train.py:1028] (0/2) Epoch 26, batch 5800, loss[loss=0.1831, simple_loss=0.2299, pruned_loss=0.06813, over 12724.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2286, pruned_loss=0.06053, over 2578479.16 frames. 
], batch size: 176, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:54:17,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.12 vs. limit=15.0 2024-06-21 22:54:26,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474354.8333333333, ans=0.1 2024-06-21 22:54:40,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.38 vs. limit=22.5 2024-06-21 22:54:55,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2024-06-21 22:54:56,004 INFO [train.py:1028] (0/2) Epoch 26, batch 5850, loss[loss=0.198, simple_loss=0.244, pruned_loss=0.07601, over 12528.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2305, pruned_loss=0.06106, over 2576523.79 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:54:57,309 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.292e+02 2.420e+02 2.667e+02 3.986e+02, threshold=4.839e+02, percent-clipped=0.0 2024-06-21 22:54:57,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474428.1666666667, ans=0.1 2024-06-21 22:55:07,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=474446.5, ans=10.0 2024-06-21 22:55:29,110 INFO [train.py:1028] (0/2) Epoch 26, batch 5900, loss[loss=0.1731, simple_loss=0.2245, pruned_loss=0.06083, over 13099.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2321, pruned_loss=0.0616, over 2576475.28 frames. ], batch size: 121, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:55:33,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=474519.8333333333, ans=0.1 2024-06-21 22:55:39,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=474538.1666666667, ans=0.125 2024-06-21 22:55:41,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=474538.1666666667, ans=0.125 2024-06-21 22:55:48,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2024-06-21 22:55:57,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=474593.1666666667, ans=0.0 2024-06-21 22:56:02,422 INFO [train.py:1028] (0/2) Epoch 26, batch 5950, loss[loss=0.184, simple_loss=0.2312, pruned_loss=0.06841, over 13043.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2332, pruned_loss=0.0621, over 2581173.46 frames. ], batch size: 121, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:56:03,760 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.173e+02 2.381e+02 2.579e+02 3.557e+02, threshold=4.763e+02, percent-clipped=0.0 2024-06-21 22:56:04,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=474611.5, ans=0.125 2024-06-21 22:56:08,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. 
limit=12.0 2024-06-21 22:56:14,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=474629.8333333333, ans=0.125 2024-06-21 22:56:22,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=474666.5, ans=0.2 2024-06-21 22:56:38,567 INFO [train.py:1028] (0/2) Epoch 26, batch 6000, loss[loss=0.2289, simple_loss=0.2738, pruned_loss=0.09194, over 12211.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2343, pruned_loss=0.06242, over 2575463.87 frames. ], batch size: 240, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:56:38,568 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-21 22:56:46,498 INFO [train.py:1060] (0/2) Epoch 26, validation: loss=0.1911, simple_loss=0.2517, pruned_loss=0.06525, over 351949.00 frames. 2024-06-21 22:56:46,499 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 22:56:48,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=474703.1666666667, ans=0.5 2024-06-21 22:57:02,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=474721.5, ans=0.125 2024-06-21 22:57:15,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=474758.1666666667, ans=0.2 2024-06-21 22:57:25,191 INFO [train.py:1028] (0/2) Epoch 26, batch 6050, loss[loss=0.1882, simple_loss=0.2477, pruned_loss=0.06437, over 13185.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2351, pruned_loss=0.06243, over 2578090.58 frames. ], batch size: 40, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:57:25,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=474794.8333333333, ans=0.0 2024-06-21 22:57:26,545 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.254e+02 2.402e+02 2.605e+02 3.298e+02, threshold=4.803e+02, percent-clipped=0.0 2024-06-21 22:57:27,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=474794.8333333333, ans=0.125 2024-06-21 22:57:29,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=474794.8333333333, ans=0.125 2024-06-21 22:57:41,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.90 vs. limit=12.0 2024-06-21 22:57:57,879 INFO [train.py:1028] (0/2) Epoch 26, batch 6100, loss[loss=0.194, simple_loss=0.2399, pruned_loss=0.07408, over 13141.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2359, pruned_loss=0.06257, over 2580148.51 frames. ], batch size: 121, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:57:59,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.66 vs. limit=15.0 2024-06-21 22:58:13,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.11 vs. 
limit=10.0 2024-06-21 22:58:15,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=474923.1666666667, ans=0.035 2024-06-21 22:58:17,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=25.50 vs. limit=22.5 2024-06-21 22:58:26,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=474959.8333333333, ans=0.025 2024-06-21 22:58:29,468 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2024-06-21 22:58:32,096 INFO [train.py:1028] (0/2) Epoch 26, batch 6150, loss[loss=0.166, simple_loss=0.2183, pruned_loss=0.05689, over 10827.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2371, pruned_loss=0.06299, over 2578806.24 frames. ], batch size: 304, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:58:33,474 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.248e+02 2.395e+02 2.691e+02 3.822e+02, threshold=4.791e+02, percent-clipped=0.0 2024-06-21 22:58:42,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=474996.5, ans=22.5 2024-06-21 22:58:47,478 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.99 vs. limit=22.5 2024-06-21 22:59:07,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2024-06-21 22:59:09,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=475051.5, ans=0.035 2024-06-21 22:59:11,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475051.5, ans=0.1 2024-06-21 22:59:11,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=475051.5, ans=0.125 2024-06-21 22:59:13,635 INFO [train.py:1028] (0/2) Epoch 26, batch 6200, loss[loss=0.2063, simple_loss=0.2682, pruned_loss=0.07219, over 13273.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2383, pruned_loss=0.0633, over 2575650.30 frames. 
], batch size: 89, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:59:16,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=475069.8333333333, ans=0.2 2024-06-21 22:59:16,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=475069.8333333333, ans=0.125 2024-06-21 22:59:17,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=475069.8333333333, ans=0.125 2024-06-21 22:59:22,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=475088.1666666667, ans=0.025 2024-06-21 22:59:32,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475106.5, ans=0.1 2024-06-21 22:59:42,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=475143.1666666667, ans=0.125 2024-06-21 22:59:44,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=475143.1666666667, ans=0.09899494936611666 2024-06-21 22:59:47,506 INFO [train.py:1028] (0/2) Epoch 26, batch 6250, loss[loss=0.1897, simple_loss=0.2438, pruned_loss=0.0678, over 13222.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2397, pruned_loss=0.06415, over 2567944.60 frames. ], batch size: 83, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 22:59:48,973 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.271e+02 2.465e+02 2.690e+02 3.732e+02, threshold=4.931e+02, percent-clipped=0.0 2024-06-21 22:59:58,329 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=475179.8333333333, ans=0.0 2024-06-21 23:00:03,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=475198.1666666667, ans=0.0 2024-06-21 23:00:21,255 INFO [train.py:1028] (0/2) Epoch 26, batch 6300, loss[loss=0.1892, simple_loss=0.2507, pruned_loss=0.0639, over 11292.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2407, pruned_loss=0.06442, over 2563140.43 frames. ], batch size: 16, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:00:23,158 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.90 vs. limit=15.0 2024-06-21 23:00:28,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2024-06-21 23:00:56,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=475326.5, ans=0.125 2024-06-21 23:00:58,851 INFO [train.py:1028] (0/2) Epoch 26, batch 6350, loss[loss=0.2258, simple_loss=0.2761, pruned_loss=0.08769, over 12538.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2428, pruned_loss=0.06472, over 2572880.46 frames. 
], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:01:00,235 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.394e+02 2.641e+02 2.964e+02 4.205e+02, threshold=5.281e+02, percent-clipped=0.0 2024-06-21 23:01:13,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475363.1666666667, ans=0.125 2024-06-21 23:01:20,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=475381.5, ans=0.2 2024-06-21 23:01:35,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=475436.5, ans=0.025 2024-06-21 23:01:35,889 INFO [train.py:1028] (0/2) Epoch 26, batch 6400, loss[loss=0.1911, simple_loss=0.2476, pruned_loss=0.06727, over 13226.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2447, pruned_loss=0.06544, over 2574580.35 frames. ], batch size: 67, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:01:39,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=475436.5, ans=0.125 2024-06-21 23:01:45,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=475454.8333333333, ans=0.0 2024-06-21 23:01:53,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475473.1666666667, ans=0.125 2024-06-21 23:01:55,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=475491.5, ans=0.0 2024-06-21 23:01:59,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=475491.5, ans=0.1 2024-06-21 23:02:03,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=475509.8333333333, ans=0.125 2024-06-21 23:02:05,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=475509.8333333333, ans=10.0 2024-06-21 23:02:07,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=475509.8333333333, ans=0.0 2024-06-21 23:02:07,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.09 vs. limit=15.0 2024-06-21 23:02:08,594 INFO [train.py:1028] (0/2) Epoch 26, batch 6450, loss[loss=0.2291, simple_loss=0.2841, pruned_loss=0.08705, over 12550.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2464, pruned_loss=0.06607, over 2580210.76 frames. ], batch size: 202, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:02:08,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=475528.1666666667, ans=0.05 2024-06-21 23:02:09,878 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.309e+02 2.453e+02 2.695e+02 3.835e+02, threshold=4.905e+02, percent-clipped=0.0 2024-06-21 23:02:13,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=12.0 2024-06-21 23:02:16,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=475546.5, ans=0.2 2024-06-21 23:02:25,195 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.51 vs. limit=10.0 2024-06-21 23:02:27,629 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:02:28,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=475583.1666666667, ans=0.0 2024-06-21 23:02:41,242 INFO [train.py:1028] (0/2) Epoch 26, batch 6500, loss[loss=0.2173, simple_loss=0.2614, pruned_loss=0.08657, over 10846.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2482, pruned_loss=0.06622, over 2584481.10 frames. ], batch size: 304, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:02:43,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=475619.8333333333, ans=0.025 2024-06-21 23:02:48,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=475638.1666666667, ans=0.125 2024-06-21 23:02:49,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=475638.1666666667, ans=0.0 2024-06-21 23:03:02,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2024-06-21 23:03:05,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=475656.5, ans=0.04949747468305833 2024-06-21 23:03:09,329 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.75 vs. limit=12.0 2024-06-21 23:03:13,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=475693.1666666667, ans=0.2 2024-06-21 23:03:22,915 INFO [train.py:1028] (0/2) Epoch 26, batch 6550, loss[loss=0.181, simple_loss=0.2455, pruned_loss=0.05822, over 12581.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.249, pruned_loss=0.06615, over 2588599.26 frames. ], batch size: 22, lr: 2.22e-03, grad_scale: 32.0 2024-06-21 23:03:24,275 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.416e+02 2.549e+02 2.811e+02 3.833e+02, threshold=5.097e+02, percent-clipped=0.0 2024-06-21 23:03:26,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=475711.5, ans=0.1 2024-06-21 23:03:31,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475729.8333333333, ans=0.1 2024-06-21 23:03:38,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=475748.1666666667, ans=0.125 2024-06-21 23:03:55,281 INFO [train.py:1028] (0/2) Epoch 26, batch 6600, loss[loss=0.2129, simple_loss=0.2701, pruned_loss=0.07788, over 13226.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2499, pruned_loss=0.06661, over 2592552.40 frames. 
], batch size: 72, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:03:56,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=475803.1666666667, ans=0.0 2024-06-21 23:03:58,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=475803.1666666667, ans=0.1 2024-06-21 23:04:05,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=475821.5, ans=0.125 2024-06-21 23:04:09,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=475839.8333333333, ans=0.0 2024-06-21 23:04:22,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=475876.5, ans=0.0 2024-06-21 23:04:22,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2024-06-21 23:04:27,013 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2024-06-21 23:04:28,786 INFO [train.py:1028] (0/2) Epoch 26, batch 6650, loss[loss=0.2295, simple_loss=0.2786, pruned_loss=0.09016, over 12938.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2512, pruned_loss=0.06728, over 2585883.19 frames. ], batch size: 158, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:04:30,187 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.293e+02 2.453e+02 2.711e+02 3.442e+02, threshold=4.906e+02, percent-clipped=0.0 2024-06-21 23:04:31,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=475894.8333333333, ans=0.0 2024-06-21 23:04:33,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=475894.8333333333, ans=0.025 2024-06-21 23:04:41,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=475913.1666666667, ans=15.0 2024-06-21 23:04:48,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=475931.5, ans=0.125 2024-06-21 23:05:02,537 INFO [train.py:1028] (0/2) Epoch 26, batch 6700, loss[loss=0.2196, simple_loss=0.2719, pruned_loss=0.08368, over 12708.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.252, pruned_loss=0.0677, over 2585116.41 frames. ], batch size: 176, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:05:24,930 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.39 vs. limit=15.0 2024-06-21 23:05:28,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476023.1666666667, ans=0.1 2024-06-21 23:05:28,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.06 vs. 
limit=12.0 2024-06-21 23:05:32,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=476041.5, ans=0.2 2024-06-21 23:05:42,902 INFO [train.py:1028] (0/2) Epoch 26, batch 6750, loss[loss=0.2688, simple_loss=0.3083, pruned_loss=0.1146, over 12220.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2527, pruned_loss=0.0682, over 2578351.44 frames. ], batch size: 240, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:05:44,200 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.325e+02 2.494e+02 2.647e+02 3.717e+02, threshold=4.988e+02, percent-clipped=0.0 2024-06-21 23:05:47,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.17 vs. limit=15.0 2024-06-21 23:05:47,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=476078.1666666667, ans=0.04949747468305833 2024-06-21 23:05:49,134 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=10.67 vs. limit=15.0 2024-06-21 23:06:03,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=476133.1666666667, ans=0.0 2024-06-21 23:06:11,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2024-06-21 23:06:15,966 INFO [train.py:1028] (0/2) Epoch 26, batch 6800, loss[loss=0.1801, simple_loss=0.2425, pruned_loss=0.05885, over 13264.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2543, pruned_loss=0.06844, over 2579972.05 frames. ], batch size: 67, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:06:22,749 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2024-06-21 23:06:38,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=476224.8333333333, ans=0.025 2024-06-21 23:06:46,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=476243.1666666667, ans=0.0 2024-06-21 23:06:49,184 INFO [train.py:1028] (0/2) Epoch 26, batch 6850, loss[loss=0.2069, simple_loss=0.2808, pruned_loss=0.0665, over 13276.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2549, pruned_loss=0.06842, over 2583376.71 frames. ], batch size: 63, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:06:50,172 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2024-06-21 23:06:50,498 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.377e+02 2.593e+02 2.979e+02 4.847e+02, threshold=5.186e+02, percent-clipped=0.0 2024-06-21 23:06:57,420 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2024-06-21 23:07:09,165 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.83 vs. 
limit=22.5 2024-06-21 23:07:15,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=476316.5, ans=0.125 2024-06-21 23:07:30,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=476353.1666666667, ans=0.125 2024-06-21 23:07:30,892 INFO [train.py:1028] (0/2) Epoch 26, batch 6900, loss[loss=0.2023, simple_loss=0.2576, pruned_loss=0.07345, over 13239.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2557, pruned_loss=0.06863, over 2585074.30 frames. ], batch size: 49, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:07:46,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476389.8333333333, ans=0.1 2024-06-21 23:07:49,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476389.8333333333, ans=0.1 2024-06-21 23:07:59,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=476426.5, ans=0.2 2024-06-21 23:08:03,773 INFO [train.py:1028] (0/2) Epoch 26, batch 6950, loss[loss=0.1993, simple_loss=0.2539, pruned_loss=0.07241, over 11447.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2562, pruned_loss=0.06856, over 2577522.17 frames. ], batch size: 16, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:08:04,970 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.351e+02 2.512e+02 2.813e+02 3.310e+02, threshold=5.025e+02, percent-clipped=0.0 2024-06-21 23:08:05,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=476444.8333333333, ans=0.125 2024-06-21 23:08:12,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=476463.1666666667, ans=0.0 2024-06-21 23:08:16,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=476481.5, ans=0.0 2024-06-21 23:08:20,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0 2024-06-21 23:08:36,773 INFO [train.py:1028] (0/2) Epoch 26, batch 7000, loss[loss=0.2017, simple_loss=0.2556, pruned_loss=0.07389, over 12985.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2562, pruned_loss=0.06846, over 2574897.25 frames. ], batch size: 158, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:08:39,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476536.5, ans=0.1 2024-06-21 23:08:41,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.30 vs. limit=10.0 2024-06-21 23:08:53,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. 
limit=15.0 2024-06-21 23:08:59,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=476591.5, ans=0.0 2024-06-21 23:09:00,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476591.5, ans=0.1 2024-06-21 23:09:03,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=476591.5, ans=0.0 2024-06-21 23:09:06,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=476609.8333333333, ans=0.0 2024-06-21 23:09:10,721 INFO [train.py:1028] (0/2) Epoch 26, batch 7050, loss[loss=0.211, simple_loss=0.2671, pruned_loss=0.0774, over 12750.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2569, pruned_loss=0.06854, over 2582124.27 frames. ], batch size: 176, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:09:12,046 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.318e+02 2.466e+02 2.634e+02 4.011e+02, threshold=4.932e+02, percent-clipped=0.0 2024-06-21 23:09:13,623 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=476628.1666666667, ans=0.125 2024-06-21 23:09:19,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=476628.1666666667, ans=10.0 2024-06-21 23:09:21,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=476646.5, ans=0.125 2024-06-21 23:09:30,889 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-260000.pt 2024-06-21 23:09:41,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476683.1666666667, ans=0.1 2024-06-21 23:09:43,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=476683.1666666667, ans=0.125 2024-06-21 23:09:47,204 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=476683.1666666667, ans=0.125 2024-06-21 23:09:55,355 INFO [train.py:1028] (0/2) Epoch 26, batch 7100, loss[loss=0.2181, simple_loss=0.2779, pruned_loss=0.07917, over 13132.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2572, pruned_loss=0.06895, over 2574237.67 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:09:59,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=476719.8333333333, ans=0.2 2024-06-21 23:10:01,953 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.46 vs. limit=12.0 2024-06-21 23:10:02,150 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.11 vs. limit=12.0 2024-06-21 23:10:21,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.23 vs. limit=10.0 2024-06-21 23:10:28,803 INFO [train.py:1028] (0/2) Epoch 26, batch 7150, loss[loss=0.2284, simple_loss=0.2769, pruned_loss=0.08993, over 12515.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2581, pruned_loss=0.069, over 2573094.27 frames. 
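Note: the "Saving checkpoint to zipformer/exp/checkpoint-260000.pt" entry above is periodic batch-level checkpointing, with the global batch index in the file name. A minimal sketch of that behaviour, assuming a fixed save interval (the every_n name and value below are illustrative, not taken from train.py):

import torch
from pathlib import Path

def maybe_save_checkpoint(exp_dir, batch_idx, model, optimizer, every_n=4000):
    # every_n=4000 is an assumption; 260000 is merely consistent with it
    if batch_idx % every_n == 0:
        path = Path(exp_dir) / f"checkpoint-{batch_idx}.pt"
        torch.save({"batch_idx": batch_idx,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)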
], batch size: 202, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:10:30,172 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.338e+02 2.540e+02 2.719e+02 4.396e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-21 23:10:33,277 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.22 vs. limit=22.5 2024-06-21 23:10:38,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=15.0 2024-06-21 23:10:38,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=476829.8333333333, ans=0.0 2024-06-21 23:10:42,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=476848.1666666667, ans=10.0 2024-06-21 23:10:45,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=476848.1666666667, ans=0.2 2024-06-21 23:11:00,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=476884.8333333333, ans=0.125 2024-06-21 23:11:02,044 INFO [train.py:1028] (0/2) Epoch 26, batch 7200, loss[loss=0.2619, simple_loss=0.317, pruned_loss=0.1034, over 13169.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2591, pruned_loss=0.06952, over 2577952.06 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:11:06,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.08 vs. limit=6.0 2024-06-21 23:11:08,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=476921.5, ans=0.025 2024-06-21 23:11:19,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=476939.8333333333, ans=0.0 2024-06-21 23:11:33,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=476976.5, ans=0.125 2024-06-21 23:11:44,019 INFO [train.py:1028] (0/2) Epoch 26, batch 7250, loss[loss=0.1968, simple_loss=0.2576, pruned_loss=0.06801, over 12928.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2602, pruned_loss=0.06964, over 2578783.81 frames. 
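Note: in every "Clipping_scale=2.0, grad-norm quartiles ..." warning in this log, the reported threshold equals Clipping_scale times the median quartile (e.g. 2.0 * 2.540e+02 = 5.080e+02 in the warning just above), and percent-clipped=0.0 means no batch in the reporting window exceeded it. A sketch consistent with those numbers, with an assumed history-window size; this is not the actual optim.py implementation:

import torch

def clip_grads_with_stats(params, norm_history, clipping_scale=2.0, window=1000):
    grads = [p.grad for p in params if p.grad is not None]
    tot_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads]))
    norm_history.append(float(tot_norm))
    hist = torch.tensor(norm_history[-window:])  # window size is an assumption
    quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2]    # scale times the median quartile
    clipped = bool(tot_norm > threshold)         # feeds the percent-clipped statistic
    if clipped:
        for g in grads:
            g.mul_(threshold / tot_norm)
    return quartiles, threshold, clipped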
], batch size: 36, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:11:45,230 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.363e+02 2.506e+02 2.774e+02 3.855e+02, threshold=5.012e+02, percent-clipped=0.0 2024-06-21 23:11:54,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477013.1666666667, ans=0.1 2024-06-21 23:12:01,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=477031.5, ans=0.015 2024-06-21 23:12:03,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=477049.8333333333, ans=0.125 2024-06-21 23:12:04,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=477049.8333333333, ans=0.07 2024-06-21 23:12:12,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.92 vs. limit=15.0 2024-06-21 23:12:15,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=477068.1666666667, ans=0.09899494936611666 2024-06-21 23:12:17,241 INFO [train.py:1028] (0/2) Epoch 26, batch 7300, loss[loss=0.1826, simple_loss=0.2429, pruned_loss=0.06111, over 12952.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.261, pruned_loss=0.07001, over 2579117.71 frames. ], batch size: 36, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:12:20,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=477086.5, ans=0.125 2024-06-21 23:12:22,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=477086.5, ans=0.125 2024-06-21 23:12:25,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.45 vs. limit=15.0 2024-06-21 23:12:25,868 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=15.0 2024-06-21 23:12:31,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=477123.1666666667, ans=0.125 2024-06-21 23:12:36,687 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.31 vs. limit=10.0 2024-06-21 23:12:39,055 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2024-06-21 23:12:42,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=477141.5, ans=0.2 2024-06-21 23:12:45,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=477159.8333333333, ans=12.0 2024-06-21 23:12:50,727 INFO [train.py:1028] (0/2) Epoch 26, batch 7350, loss[loss=0.2123, simple_loss=0.2736, pruned_loss=0.07551, over 13301.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2624, pruned_loss=0.07054, over 2579923.17 frames. 
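Note: the many "ScheduledFloat: name=..., batch_count=..., ans=..." entries log hyperparameters whose value is a function of the global batch count; by batch_count ≈ 476k most have settled at their final values (0.125, 0.025, 0.0, ...). A minimal sketch of such a schedule as piecewise-linear interpolation; the breakpoints in the example are invented for illustration:

def scheduled_float(batch_count, points):
    """points: [(batch_count, value), ...], sorted; linear interpolation, clamped."""
    x0, y0 = points[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in points[1:]:
        if batch_count <= x1:
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        x0, y0 = x1, y1
    return y0  # past the last breakpoint, hold the final value

# e.g. a skip rate decaying from 0.5 to 0.0 over the first 20k batches (hypothetical
# breakpoints) evaluates to 0.0 at this stage, as the attention_skip_rate entries show:
scheduled_float(475454.8, [(0.0, 0.5), (20000.0, 0.0)])  # -> 0.0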
], batch size: 46, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:12:52,031 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.395e+02 2.551e+02 2.719e+02 3.986e+02, threshold=5.103e+02, percent-clipped=0.0 2024-06-21 23:13:00,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477196.5, ans=0.1 2024-06-21 23:13:05,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=477214.8333333333, ans=0.2 2024-06-21 23:13:07,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=477214.8333333333, ans=0.0 2024-06-21 23:13:18,493 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.506e+01 2024-06-21 23:13:23,370 INFO [train.py:1028] (0/2) Epoch 26, batch 7400, loss[loss=0.2182, simple_loss=0.2843, pruned_loss=0.07599, over 13328.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2616, pruned_loss=0.07003, over 2585859.17 frames. ], batch size: 63, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:13:33,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0 2024-06-21 23:13:37,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=477306.5, ans=0.125 2024-06-21 23:13:46,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=477324.8333333333, ans=0.2 2024-06-21 23:13:51,572 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.69 vs. limit=22.5 2024-06-21 23:13:57,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=477343.1666666667, ans=0.125 2024-06-21 23:13:59,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=477343.1666666667, ans=0.0 2024-06-21 23:14:03,326 INFO [train.py:1028] (0/2) Epoch 26, batch 7450, loss[loss=0.1886, simple_loss=0.2496, pruned_loss=0.06378, over 12636.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2622, pruned_loss=0.07034, over 2580744.36 frames. ], batch size: 29, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:14:04,749 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.389e+02 2.604e+02 2.860e+02 4.095e+02, threshold=5.207e+02, percent-clipped=0.0 2024-06-21 23:14:06,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=477361.5, ans=0.0 2024-06-21 23:14:11,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=477379.8333333333, ans=0.125 2024-06-21 23:14:11,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=477379.8333333333, ans=0.125 2024-06-21 23:14:15,416 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.46 vs. 
limit=15.0 2024-06-21 23:14:23,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=477416.5, ans=0.125 2024-06-21 23:14:27,454 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.12 vs. limit=15.0 2024-06-21 23:14:31,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=477434.8333333333, ans=0.1 2024-06-21 23:14:36,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=477453.1666666667, ans=0.125 2024-06-21 23:14:37,109 INFO [train.py:1028] (0/2) Epoch 26, batch 7500, loss[loss=0.1881, simple_loss=0.2435, pruned_loss=0.06635, over 10509.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2624, pruned_loss=0.07052, over 2578679.27 frames. ], batch size: 303, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:14:40,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=477453.1666666667, ans=0.125 2024-06-21 23:14:49,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2024-06-21 23:14:59,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=477508.1666666667, ans=0.125 2024-06-21 23:15:01,610 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=12.0 2024-06-21 23:15:02,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.24 vs. limit=15.0 2024-06-21 23:15:04,893 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=477526.5, ans=0.125 2024-06-21 23:15:09,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=477544.8333333333, ans=0.125 2024-06-21 23:15:10,131 INFO [train.py:1028] (0/2) Epoch 26, batch 7550, loss[loss=0.2124, simple_loss=0.2607, pruned_loss=0.08198, over 12971.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.263, pruned_loss=0.07119, over 2579217.30 frames. ], batch size: 158, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:15:10,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477544.8333333333, ans=0.1 2024-06-21 23:15:11,380 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.368e+02 2.531e+02 2.781e+02 3.441e+02, threshold=5.062e+02, percent-clipped=0.0 2024-06-21 23:15:18,864 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.17 vs. limit=22.5 2024-06-21 23:15:20,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=15.0 2024-06-21 23:15:20,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.08 vs. 
limit=15.0 2024-06-21 23:15:30,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=477599.8333333333, ans=0.125 2024-06-21 23:15:42,280 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477618.1666666667, ans=0.0 2024-06-21 23:15:51,113 INFO [train.py:1028] (0/2) Epoch 26, batch 7600, loss[loss=0.1817, simple_loss=0.2456, pruned_loss=0.05885, over 13197.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2633, pruned_loss=0.07119, over 2578901.44 frames. ], batch size: 83, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:15:51,652 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.00 vs. limit=22.5 2024-06-21 23:16:02,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=477654.8333333333, ans=0.125 2024-06-21 23:16:04,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=477673.1666666667, ans=0.125 2024-06-21 23:16:05,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=477673.1666666667, ans=0.2 2024-06-21 23:16:10,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=477691.5, ans=0.0 2024-06-21 23:16:13,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=477691.5, ans=0.125 2024-06-21 23:16:18,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477709.8333333333, ans=0.1 2024-06-21 23:16:19,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.36 vs. limit=22.5 2024-06-21 23:16:24,637 INFO [train.py:1028] (0/2) Epoch 26, batch 7650, loss[loss=0.1825, simple_loss=0.2455, pruned_loss=0.05976, over 12977.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2632, pruned_loss=0.07118, over 2574751.17 frames. ], batch size: 33, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:16:26,043 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.373e+02 2.530e+02 2.713e+02 3.577e+02, threshold=5.061e+02, percent-clipped=0.0 2024-06-21 23:16:28,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=477728.1666666667, ans=0.05 2024-06-21 23:16:34,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=477746.5, ans=0.2 2024-06-21 23:16:36,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=477746.5, ans=0.0 2024-06-21 23:16:44,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=477783.1666666667, ans=0.0 2024-06-21 23:16:45,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.78 vs. 
limit=10.0 2024-06-21 23:16:47,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=477783.1666666667, ans=0.0 2024-06-21 23:16:53,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=477801.5, ans=0.0 2024-06-21 23:16:58,276 INFO [train.py:1028] (0/2) Epoch 26, batch 7700, loss[loss=0.2185, simple_loss=0.2889, pruned_loss=0.07407, over 13242.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2636, pruned_loss=0.07137, over 2570497.49 frames. ], batch size: 63, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:16:58,332 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:16:58,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=477819.8333333333, ans=0.5 2024-06-21 23:17:06,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=477838.1666666667, ans=0.0 2024-06-21 23:17:09,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477838.1666666667, ans=0.1 2024-06-21 23:17:14,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=477856.5, ans=0.125 2024-06-21 23:17:18,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=477874.8333333333, ans=0.125 2024-06-21 23:17:21,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=477874.8333333333, ans=0.125 2024-06-21 23:17:28,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477893.1666666667, ans=0.0 2024-06-21 23:17:36,195 INFO [train.py:1028] (0/2) Epoch 26, batch 7750, loss[loss=0.2, simple_loss=0.2663, pruned_loss=0.06684, over 13267.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2649, pruned_loss=0.07215, over 2574087.44 frames. ], batch size: 72, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:17:36,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=477911.5, ans=0.025 2024-06-21 23:17:37,457 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.427e+02 2.634e+02 2.796e+02 3.850e+02, threshold=5.267e+02, percent-clipped=0.0 2024-06-21 23:17:46,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=477929.8333333333, ans=0.125 2024-06-21 23:17:47,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=477929.8333333333, ans=0.0 2024-06-21 23:17:58,758 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.60 vs. limit=6.0 2024-06-21 23:18:13,398 INFO [train.py:1028] (0/2) Epoch 26, batch 7800, loss[loss=0.2079, simple_loss=0.2649, pruned_loss=0.07542, over 13148.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2654, pruned_loss=0.07211, over 2579658.93 frames. 
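Note: the tot_loss figures are running averages rather than plain totals; the fractional frame counts ("over 2579658.93 frames") point to an exponentially decayed, frame-weighted sum. A sketch under that assumption (the decay constant is invented):

class RunningLoss:
    """Frame-weighted, exponentially decayed average; yields tot_loss-style values."""
    def __init__(self, decay=0.999):  # decay value is an assumption
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss, batch_frames):
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames, self.frames  # value and "over N frames"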
], batch size: 95, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:18:17,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=478003.1666666667, ans=0.1 2024-06-21 23:18:23,437 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.88 vs. limit=15.0 2024-06-21 23:18:23,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=478021.5, ans=0.125 2024-06-21 23:18:29,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=478039.8333333333, ans=0.2 2024-06-21 23:18:34,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=478058.1666666667, ans=0.0 2024-06-21 23:18:46,165 INFO [train.py:1028] (0/2) Epoch 26, batch 7850, loss[loss=0.1853, simple_loss=0.2532, pruned_loss=0.05874, over 11705.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2664, pruned_loss=0.07231, over 2574052.56 frames. ], batch size: 17, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:18:47,333 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.363e+02 2.534e+02 2.788e+02 3.432e+02, threshold=5.068e+02, percent-clipped=0.0 2024-06-21 23:18:55,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=478113.1666666667, ans=0.2 2024-06-21 23:19:05,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=478149.8333333333, ans=0.1 2024-06-21 23:19:06,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=478149.8333333333, ans=0.125 2024-06-21 23:19:18,957 INFO [train.py:1028] (0/2) Epoch 26, batch 7900, loss[loss=0.1963, simple_loss=0.2636, pruned_loss=0.06446, over 13167.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2663, pruned_loss=0.07225, over 2572696.44 frames. 
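Note: the "Whitening: name=..., metric=... vs. limit=..." entries compare a per-module whiteness statistic of the activations against a scheduled limit (the nearby WithLoss lines suggest a penalty attaches when it is exceeded). One statistic with the right behaviour, equal to 1.0 for a perfectly white (isotropic) covariance and growing as the spectrum becomes uneven, is sketched below; this is an illustrative choice, not necessarily what scaling.py computes:

import torch

def whitening_metric(x):
    """x: (num_frames, num_channels); returns >= 1.0, equal to 1.0 iff cov ∝ identity."""
    x = x - x.mean(dim=0)
    c = (x.t() @ x) / x.shape[0]   # channel covariance, shape (d, d)
    d = c.shape[0]
    return d * (c * c).sum() / c.trace() ** 2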
], batch size: 77, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:19:32,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478204.8333333333, ans=0.125 2024-06-21 23:19:37,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=478204.8333333333, ans=0.125 2024-06-21 23:19:41,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=478223.1666666667, ans=0.0 2024-06-21 23:19:42,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=478223.1666666667, ans=0.125 2024-06-21 23:19:42,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=478223.1666666667, ans=0.125 2024-06-21 23:19:52,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=478259.8333333333, ans=0.125 2024-06-21 23:19:53,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=478259.8333333333, ans=0.125 2024-06-21 23:19:58,242 INFO [train.py:1028] (0/2) Epoch 26, batch 7950, loss[loss=0.2277, simple_loss=0.2769, pruned_loss=0.08923, over 10519.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2671, pruned_loss=0.07237, over 2575917.33 frames. ], batch size: 303, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:19:58,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=478278.1666666667, ans=0.0 2024-06-21 23:19:59,683 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.386e+02 2.666e+02 2.885e+02 3.534e+02, threshold=5.332e+02, percent-clipped=0.0 2024-06-21 23:20:15,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=478314.8333333333, ans=0.025 2024-06-21 23:20:15,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=478314.8333333333, ans=0.0 2024-06-21 23:20:22,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=478333.1666666667, ans=0.125 2024-06-21 23:20:26,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=478351.5, ans=0.125 2024-06-21 23:20:32,745 INFO [train.py:1028] (0/2) Epoch 26, batch 8000, loss[loss=0.196, simple_loss=0.263, pruned_loss=0.06448, over 12915.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.268, pruned_loss=0.07271, over 2572269.17 frames. ], batch size: 30, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:20:38,809 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=478388.1666666667, ans=0.025 2024-06-21 23:20:52,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.67 vs. limit=15.0 2024-06-21 23:20:54,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.60 vs. 
limit=5.0 2024-06-21 23:21:06,761 INFO [train.py:1028] (0/2) Epoch 26, batch 8050, loss[loss=0.1968, simple_loss=0.2717, pruned_loss=0.06099, over 13188.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2676, pruned_loss=0.0723, over 2571409.75 frames. ], batch size: 83, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:21:08,082 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.407e+02 2.576e+02 2.935e+02 4.611e+02, threshold=5.153e+02, percent-clipped=0.0 2024-06-21 23:21:10,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=478461.5, ans=0.025 2024-06-21 23:21:16,540 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=478479.8333333333, ans=0.125 2024-06-21 23:21:17,934 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=478479.8333333333, ans=0.0 2024-06-21 23:21:25,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=478516.5, ans=0.125 2024-06-21 23:21:35,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=478534.8333333333, ans=0.125 2024-06-21 23:21:44,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.87 vs. limit=15.0 2024-06-21 23:21:45,169 INFO [train.py:1028] (0/2) Epoch 26, batch 8100, loss[loss=0.224, simple_loss=0.2855, pruned_loss=0.08124, over 13230.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.268, pruned_loss=0.07261, over 2575586.77 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:21:45,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=478553.1666666667, ans=0.035 2024-06-21 23:21:47,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=478553.1666666667, ans=0.125 2024-06-21 23:21:58,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=478589.8333333333, ans=0.0 2024-06-21 23:21:58,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=478589.8333333333, ans=0.2 2024-06-21 23:22:01,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=478589.8333333333, ans=0.0 2024-06-21 23:22:02,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=478589.8333333333, ans=0.125 2024-06-21 23:22:04,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=478608.1666666667, ans=0.1 2024-06-21 23:22:11,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.39 vs. limit=10.0 2024-06-21 23:22:17,519 INFO [train.py:1028] (0/2) Epoch 26, batch 8150, loss[loss=0.2017, simple_loss=0.26, pruned_loss=0.07172, over 13072.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.268, pruned_loss=0.07237, over 2579284.10 frames. 
], batch size: 121, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:22:18,720 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.432e+02 2.535e+02 2.797e+02 3.241e+02, threshold=5.069e+02, percent-clipped=0.0 2024-06-21 23:22:19,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=478644.8333333333, ans=0.2 2024-06-21 23:22:24,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.45 vs. limit=15.0 2024-06-21 23:22:24,787 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=478663.1666666667, ans=0.125 2024-06-21 23:22:37,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=478699.8333333333, ans=15.0 2024-06-21 23:22:39,689 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:22:42,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=478699.8333333333, ans=0.125 2024-06-21 23:22:49,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=478736.5, ans=0.0 2024-06-21 23:22:50,281 INFO [train.py:1028] (0/2) Epoch 26, batch 8200, loss[loss=0.2393, simple_loss=0.2926, pruned_loss=0.093, over 13137.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2675, pruned_loss=0.07217, over 2583822.81 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:22:58,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=478754.8333333333, ans=0.0 2024-06-21 23:23:04,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=478773.1666666667, ans=0.125 2024-06-21 23:23:08,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2024-06-21 23:23:21,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=478809.8333333333, ans=0.125 2024-06-21 23:23:23,622 INFO [train.py:1028] (0/2) Epoch 26, batch 8250, loss[loss=0.2072, simple_loss=0.2737, pruned_loss=0.07038, over 13257.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2682, pruned_loss=0.07236, over 2584516.08 frames. ], batch size: 52, lr: 2.21e-03, grad_scale: 64.0 2024-06-21 23:23:24,881 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.399e+02 2.555e+02 2.806e+02 3.763e+02, threshold=5.109e+02, percent-clipped=0.0 2024-06-21 23:23:37,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=478846.5, ans=0.0 2024-06-21 23:23:40,031 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. 
limit=6.0 2024-06-21 23:23:42,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478864.8333333333, ans=0.1 2024-06-21 23:23:54,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=478883.1666666667, ans=0.035 2024-06-21 23:23:59,542 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.60 vs. limit=10.0 2024-06-21 23:24:02,121 INFO [train.py:1028] (0/2) Epoch 26, batch 8300, loss[loss=0.1868, simple_loss=0.2419, pruned_loss=0.06583, over 13035.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.268, pruned_loss=0.07207, over 2581916.44 frames. ], batch size: 102, lr: 2.21e-03, grad_scale: 16.0 2024-06-21 23:24:04,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=478919.8333333333, ans=0.0 2024-06-21 23:24:08,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478938.1666666667, ans=0.1 2024-06-21 23:24:11,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=478938.1666666667, ans=0.0 2024-06-21 23:24:15,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=478956.5, ans=0.125 2024-06-21 23:24:21,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=478974.8333333333, ans=0.125 2024-06-21 23:24:22,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=478974.8333333333, ans=0.0 2024-06-21 23:24:34,812 INFO [train.py:1028] (0/2) Epoch 26, batch 8350, loss[loss=0.1972, simple_loss=0.2641, pruned_loss=0.06508, over 13228.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2677, pruned_loss=0.07187, over 2581484.44 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 16.0 2024-06-21 23:24:37,557 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.355e+02 2.485e+02 2.698e+02 3.793e+02, threshold=4.971e+02, percent-clipped=0.0 2024-06-21 23:24:42,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479029.8333333333, ans=0.1 2024-06-21 23:24:44,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=479029.8333333333, ans=0.0 2024-06-21 23:24:48,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2024-06-21 23:24:51,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=479048.1666666667, ans=0.0 2024-06-21 23:25:01,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=479084.8333333333, ans=0.125 2024-06-21 23:25:04,467 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=479084.8333333333, ans=0.0 2024-06-21 23:25:07,794 INFO [train.py:1028] (0/2) Epoch 26, batch 8400, loss[loss=0.1969, simple_loss=0.2591, pruned_loss=0.06729, over 12934.00 frames. 
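Note: the grad_scale column is dynamic mixed-precision loss scaling: it doubles after long stretches without overflow (32.0 → 64.0 at batch 7550) and is cut back when inf/nan gradients appear (64.0 → 16.0 around batch 8300, recovering to 32.0 by batch 8400). A sketch of the same mechanism using PyTorch's stock GradScaler; the constructor arguments are illustrative, and the actual training loop may manage the scale itself:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

def training_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # skips the update if inf/nan grads were found
    scaler.update()                # grow after enough good steps, else back off
    return loss.detach()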
], tot_loss[loss=0.2059, simple_loss=0.2678, pruned_loss=0.07198, over 2579141.53 frames. ], batch size: 39, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:25:08,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=479103.1666666667, ans=0.0 2024-06-21 23:25:11,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=479103.1666666667, ans=0.125 2024-06-21 23:25:11,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=479103.1666666667, ans=0.0 2024-06-21 23:25:18,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=479121.5, ans=0.125 2024-06-21 23:25:26,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=479139.8333333333, ans=0.125 2024-06-21 23:25:37,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479158.1666666667, ans=0.1 2024-06-21 23:25:42,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=479176.5, ans=0.0 2024-06-21 23:25:42,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479176.5, ans=0.1 2024-06-21 23:25:47,336 INFO [train.py:1028] (0/2) Epoch 26, batch 8450, loss[loss=0.218, simple_loss=0.2795, pruned_loss=0.0783, over 13136.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2683, pruned_loss=0.07193, over 2580350.54 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:25:49,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.434e+02 2.540e+02 2.817e+02 3.923e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-21 23:25:53,854 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:25:57,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=479213.1666666667, ans=0.125 2024-06-21 23:26:00,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=479231.5, ans=0.125 2024-06-21 23:26:16,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=479268.1666666667, ans=0.5 2024-06-21 23:26:19,897 INFO [train.py:1028] (0/2) Epoch 26, batch 8500, loss[loss=0.1882, simple_loss=0.2529, pruned_loss=0.06172, over 12678.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2695, pruned_loss=0.07267, over 2578064.62 frames. ], batch size: 29, lr: 2.21e-03, grad_scale: 32.0 2024-06-21 23:26:28,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.86 vs. 
limit=15.0
2024-06-21 23:26:29,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=479304.8333333333, ans=0.2
2024-06-21 23:26:30,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=479304.8333333333, ans=0.125
2024-06-21 23:26:31,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=479304.8333333333, ans=0.5
2024-06-21 23:26:32,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=479323.1666666667, ans=0.2
2024-06-21 23:26:53,169 INFO [train.py:1028] (0/2) Epoch 26, batch 8550, loss[loss=0.2034, simple_loss=0.2685, pruned_loss=0.06915, over 12715.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2694, pruned_loss=0.0726, over 2576036.57 frames. ], batch size: 22, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:26:55,810 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.456e+02 2.626e+02 2.937e+02 4.705e+02, threshold=5.251e+02, percent-clipped=0.0
2024-06-21 23:26:58,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=479378.1666666667, ans=0.125
2024-06-21 23:27:00,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=479396.5, ans=0.125
2024-06-21 23:27:00,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=479396.5, ans=0.125
2024-06-21 23:27:11,868 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 23:27:22,706 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 23:27:26,429 INFO [train.py:1028] (0/2) Epoch 26, batch 8600, loss[loss=0.2223, simple_loss=0.2788, pruned_loss=0.08286, over 13137.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2702, pruned_loss=0.07289, over 2573593.63 frames. ], batch size: 112, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:27:42,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=479488.1666666667, ans=0.0
2024-06-21 23:28:00,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=479543.1666666667, ans=0.0
2024-06-21 23:28:04,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=479543.1666666667, ans=0.2
2024-06-21 23:28:06,537 INFO [train.py:1028] (0/2) Epoch 26, batch 8650, loss[loss=0.2048, simple_loss=0.2685, pruned_loss=0.07058, over 13015.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2698, pruned_loss=0.07227, over 2576661.03 frames. ], batch size: 102, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:28:09,198 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.285e+02 2.452e+02 2.622e+02 3.077e+02, threshold=4.904e+02, percent-clipped=0.0
2024-06-21 23:28:18,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=479579.8333333333, ans=0.2
2024-06-21 23:28:22,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479598.1666666667, ans=0.125
2024-06-21 23:28:30,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=479616.5, ans=0.125
2024-06-21 23:28:37,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.21 vs. limit=10.0
2024-06-21 23:28:39,682 INFO [train.py:1028] (0/2) Epoch 26, batch 8700, loss[loss=0.2023, simple_loss=0.274, pruned_loss=0.06532, over 13184.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.27, pruned_loss=0.0725, over 2573400.06 frames. ], batch size: 59, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:28:44,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=479653.1666666667, ans=0.125
2024-06-21 23:28:57,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=479689.8333333333, ans=0.125
2024-06-21 23:29:02,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=479708.1666666667, ans=0.0
2024-06-21 23:29:13,133 INFO [train.py:1028] (0/2) Epoch 26, batch 8750, loss[loss=0.2038, simple_loss=0.2599, pruned_loss=0.07383, over 13100.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2702, pruned_loss=0.07287, over 2569362.52 frames. ], batch size: 121, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:29:16,061 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.430e+02 2.527e+02 2.709e+02 3.613e+02, threshold=5.055e+02, percent-clipped=0.0
2024-06-21 23:29:22,884 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.42 vs. limit=10.0
2024-06-21 23:29:24,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=479763.1666666667, ans=0.125
2024-06-21 23:29:24,613 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0
2024-06-21 23:29:43,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=479799.8333333333, ans=0.125
2024-06-21 23:29:44,318 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=479799.8333333333, ans=0.125
2024-06-21 23:29:53,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479818.1666666667, ans=0.125
2024-06-21 23:29:53,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=479818.1666666667, ans=0.125
2024-06-21 23:29:54,353 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0
2024-06-21 23:29:54,475 INFO [train.py:1028] (0/2) Epoch 26, batch 8800, loss[loss=0.1872, simple_loss=0.2616, pruned_loss=0.05642, over 13258.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2707, pruned_loss=0.07313, over 2574605.23 frames. ], batch size: 72, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:29:56,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=479836.5, ans=0.125
2024-06-21 23:30:00,288 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=479836.5, ans=0.125
2024-06-21 23:30:01,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=479854.8333333333, ans=0.125
2024-06-21 23:30:05,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479854.8333333333, ans=0.1
2024-06-21 23:30:10,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=479873.1666666667, ans=0.025
2024-06-21 23:30:14,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=479891.5, ans=0.125
2024-06-21 23:30:28,935 INFO [train.py:1028] (0/2) Epoch 26, batch 8850, loss[loss=0.2163, simple_loss=0.2765, pruned_loss=0.07811, over 12491.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2708, pruned_loss=0.07331, over 2564077.76 frames. ], batch size: 202, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:30:31,854 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.391e+02 2.542e+02 2.731e+02 3.641e+02, threshold=5.084e+02, percent-clipped=0.0
2024-06-21 23:30:32,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=479928.1666666667, ans=10.0
2024-06-21 23:30:35,591 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.54 vs. limit=22.5
2024-06-21 23:30:55,805 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.22 vs. limit=12.0
2024-06-21 23:30:56,538 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=15.0
2024-06-21 23:30:58,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=480001.5, ans=0.0
2024-06-21 23:31:02,916 INFO [train.py:1028] (0/2) Epoch 26, batch 8900, loss[loss=0.2219, simple_loss=0.2803, pruned_loss=0.08177, over 13012.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2713, pruned_loss=0.0735, over 2561488.55 frames. ], batch size: 33, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:31:03,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=480019.8333333333, ans=0.125
2024-06-21 23:31:12,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=480038.1666666667, ans=0.0
2024-06-21 23:31:15,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.83 vs. limit=15.0
2024-06-21 23:31:20,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480056.5, ans=0.1
2024-06-21 23:31:21,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=480056.5, ans=0.125
2024-06-21 23:31:40,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=480093.1666666667, ans=0.05
2024-06-21 23:31:42,976 INFO [train.py:1028] (0/2) Epoch 26, batch 8950, loss[loss=0.2162, simple_loss=0.2757, pruned_loss=0.07838, over 12511.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2712, pruned_loss=0.07324, over 2561121.59 frames. ], batch size: 202, lr: 2.21e-03, grad_scale: 32.0
2024-06-21 23:31:45,695 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.385e+02 2.524e+02 2.691e+02 4.301e+02, threshold=5.048e+02, percent-clipped=0.0
2024-06-21 23:31:54,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=480129.8333333333, ans=15.0
2024-06-21 23:31:56,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=480148.1666666667, ans=0.125
2024-06-21 23:31:59,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=480148.1666666667, ans=0.2
2024-06-21 23:32:16,646 INFO [train.py:1028] (0/2) Epoch 26, batch 9000, loss[loss=0.2102, simple_loss=0.2769, pruned_loss=0.07173, over 13297.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2719, pruned_loss=0.07336, over 2567007.16 frames. ], batch size: 46, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:32:16,647 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 23:32:22,636 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5978, 4.1353, 3.0954, 4.3645], device='cuda:0')
2024-06-21 23:32:22,810 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.9786, 4.4894, 3.4824, 4.6990], device='cuda:0')
2024-06-21 23:32:24,677 INFO [train.py:1060] (0/2) Epoch 26, validation: loss=0.1911, simple_loss=0.2514, pruned_loss=0.0654, over 351949.00 frames.
2024-06-21 23:32:24,678 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-21 23:32:36,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=480221.5, ans=0.125
2024-06-21 23:32:44,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=480258.1666666667, ans=0.125
2024-06-21 23:32:52,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=480276.5, ans=0.125
2024-06-21 23:32:57,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=480276.5, ans=0.125
2024-06-21 23:32:58,677 INFO [train.py:1028] (0/2) Epoch 26, batch 9050, loss[loss=0.1749, simple_loss=0.2386, pruned_loss=0.05559, over 11877.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2725, pruned_loss=0.07374, over 2567545.49 frames. ], batch size: 17, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:33:01,222 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.428e+02 2.600e+02 2.750e+02 3.494e+02, threshold=5.200e+02, percent-clipped=0.0
2024-06-21 23:33:02,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=480294.8333333333, ans=0.025
2024-06-21 23:33:05,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=480313.1666666667, ans=0.125
2024-06-21 23:33:07,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0
2024-06-21 23:33:08,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=480313.1666666667, ans=0.0
2024-06-21 23:33:09,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=480313.1666666667, ans=10.0
2024-06-21 23:33:11,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=480331.5, ans=0.2
2024-06-21 23:33:14,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=480331.5, ans=0.2
2024-06-21 23:33:15,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480331.5, ans=0.125
2024-06-21 23:33:17,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480349.8333333333, ans=0.125
2024-06-21 23:33:20,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480349.8333333333, ans=0.1
2024-06-21 23:33:30,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480368.1666666667, ans=0.1
2024-06-21 23:33:30,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=480386.5, ans=0.1
2024-06-21 23:33:31,395 INFO [train.py:1028] (0/2) Epoch 26, batch 9100, loss[loss=0.2217, simple_loss=0.2889, pruned_loss=0.0773, over 13265.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2728, pruned_loss=0.0739, over 2568376.75 frames. ], batch size: 72, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:33:33,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.15 vs. limit=15.0
2024-06-21 23:33:43,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.82 vs. limit=15.0
2024-06-21 23:33:55,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=480441.5, ans=0.125
2024-06-21 23:33:57,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=480459.8333333333, ans=0.125
2024-06-21 23:34:02,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=480459.8333333333, ans=0.0
2024-06-21 23:34:02,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480459.8333333333, ans=0.1
2024-06-21 23:34:03,770 INFO [train.py:1028] (0/2) Epoch 26, batch 9150, loss[loss=0.1943, simple_loss=0.2696, pruned_loss=0.05955, over 13172.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2724, pruned_loss=0.07379, over 2569802.98 frames. ], batch size: 77, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:34:04,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=480478.1666666667, ans=0.2
2024-06-21 23:34:06,279 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.386e+02 2.534e+02 2.682e+02 3.276e+02, threshold=5.068e+02, percent-clipped=0.0
2024-06-21 23:34:09,026 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=480478.1666666667, ans=10.0
2024-06-21 23:34:29,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=480551.5, ans=0.125
2024-06-21 23:34:33,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=480551.5, ans=0.125
2024-06-21 23:34:35,975 INFO [train.py:1028] (0/2) Epoch 26, batch 9200, loss[loss=0.2224, simple_loss=0.2853, pruned_loss=0.07976, over 12903.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2724, pruned_loss=0.07348, over 2573112.02 frames. ], batch size: 36, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:34:36,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=480569.8333333333, ans=0.2
2024-06-21 23:34:44,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=480588.1666666667, ans=0.05
2024-06-21 23:34:49,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=480588.1666666667, ans=0.025
2024-06-21 23:34:59,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=480606.5, ans=0.125
2024-06-21 23:34:59,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=480606.5, ans=0.125
2024-06-21 23:35:14,254 INFO [train.py:1028] (0/2) Epoch 26, batch 9250, loss[loss=0.2077, simple_loss=0.2774, pruned_loss=0.06895, over 13225.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2715, pruned_loss=0.07311, over 2574443.04 frames. ], batch size: 67, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:35:16,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480661.5, ans=0.1
2024-06-21 23:35:17,690 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.388e+02 2.505e+02 2.711e+02 3.177e+02, threshold=5.010e+02, percent-clipped=0.0
2024-06-21 23:35:33,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=480716.5, ans=0.0
2024-06-21 23:35:38,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=480716.5, ans=0.0
2024-06-21 23:35:38,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=480716.5, ans=0.0
2024-06-21 23:35:42,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=480734.8333333333, ans=0.125
2024-06-21 23:35:46,966 INFO [train.py:1028] (0/2) Epoch 26, batch 9300, loss[loss=0.1887, simple_loss=0.2538, pruned_loss=0.06181, over 12972.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2718, pruned_loss=0.07329, over 2570778.78 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:35:47,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=480753.1666666667, ans=0.5
2024-06-21 23:36:10,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.83 vs. limit=6.0
2024-06-21 23:36:18,305 INFO [train.py:1028] (0/2) Epoch 26, batch 9350, loss[loss=0.2054, simple_loss=0.2689, pruned_loss=0.07091, over 12542.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2718, pruned_loss=0.07319, over 2567956.91 frames. ], batch size: 22, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:36:18,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=480844.8333333333, ans=0.2
2024-06-21 23:36:21,204 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.425e+02 2.540e+02 2.739e+02 4.025e+02, threshold=5.080e+02, percent-clipped=0.0
2024-06-21 23:36:22,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.88 vs. limit=10.0
2024-06-21 23:36:22,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=480844.8333333333, ans=0.125
2024-06-21 23:36:25,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480863.1666666667, ans=0.1
2024-06-21 23:36:31,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0
2024-06-21 23:36:34,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=480881.5, ans=0.07
2024-06-21 23:36:37,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5
2024-06-21 23:36:39,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=480899.8333333333, ans=0.2
2024-06-21 23:36:49,077 INFO [train.py:1028] (0/2) Epoch 26, batch 9400, loss[loss=0.2092, simple_loss=0.2774, pruned_loss=0.07051, over 13263.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2719, pruned_loss=0.07313, over 2567926.50 frames. ], batch size: 52, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:36:51,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=480936.5, ans=0.1
2024-06-21 23:36:55,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=480954.8333333333, ans=0.0
2024-06-21 23:37:10,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0
2024-06-21 23:37:11,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.08 vs. limit=22.5
2024-06-21 23:37:15,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=481009.8333333333, ans=0.025
2024-06-21 23:37:16,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=481009.8333333333, ans=0.2
2024-06-21 23:37:19,858 INFO [train.py:1028] (0/2) Epoch 26, batch 9450, loss[loss=0.2037, simple_loss=0.2662, pruned_loss=0.07061, over 12768.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2729, pruned_loss=0.07367, over 2568259.94 frames. ], batch size: 22, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:37:23,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.436e+02 2.620e+02 2.964e+02 4.092e+02, threshold=5.241e+02, percent-clipped=0.0
2024-06-21 23:37:25,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0
2024-06-21 23:37:28,133 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=15.0
2024-06-21 23:37:28,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481046.5, ans=0.0
2024-06-21 23:37:28,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=481046.5, ans=0.125
2024-06-21 23:37:29,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481046.5, ans=0.0
2024-06-21 23:37:30,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=481046.5, ans=0.125
2024-06-21 23:37:32,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=481064.8333333333, ans=0.125
2024-06-21 23:37:34,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=481064.8333333333, ans=0.025
2024-06-21 23:37:45,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=481101.5, ans=0.125
2024-06-21 23:37:50,079 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=481119.8333333333, ans=0.125
2024-06-21 23:37:52,857 INFO [train.py:1028] (0/2) Epoch 26, batch 9500, loss[loss=0.1997, simple_loss=0.268, pruned_loss=0.06566, over 13266.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2725, pruned_loss=0.07305, over 2577471.90 frames. ], batch size: 43, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:37:53,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=481119.8333333333, ans=0.125
2024-06-21 23:38:00,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=481119.8333333333, ans=0.125
2024-06-21 23:38:02,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=481138.1666666667, ans=0.05
2024-06-21 23:38:03,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=481138.1666666667, ans=0.0
2024-06-21 23:38:05,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=481138.1666666667, ans=0.125
2024-06-21 23:38:05,354 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.85 vs. limit=10.0
2024-06-21 23:38:13,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=481156.5, ans=0.125
2024-06-21 23:38:16,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=481174.8333333333, ans=0.125
2024-06-21 23:38:23,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=481193.1666666667, ans=15.0
2024-06-21 23:38:26,955 INFO [train.py:1028] (0/2) Epoch 26, batch 9550, loss[loss=0.1868, simple_loss=0.2475, pruned_loss=0.06305, over 13189.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2724, pruned_loss=0.07322, over 2572553.24 frames. ], batch size: 40, lr: 2.20e-03, grad_scale: 16.0
2024-06-21 23:38:30,521 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.407e+02 2.602e+02 2.851e+02 3.547e+02, threshold=5.205e+02, percent-clipped=0.0
2024-06-21 23:38:32,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=481229.8333333333, ans=0.125
2024-06-21 23:38:33,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=481229.8333333333, ans=0.0
2024-06-21 23:38:34,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.49 vs. limit=22.5
2024-06-21 23:38:34,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=481229.8333333333, ans=0.0
2024-06-21 23:38:35,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=481229.8333333333, ans=0.0
2024-06-21 23:38:46,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=481266.5, ans=0.125
2024-06-21 23:38:57,944 INFO [train.py:1028] (0/2) Epoch 26, batch 9600, loss[loss=0.225, simple_loss=0.2809, pruned_loss=0.08454, over 10429.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2715, pruned_loss=0.07293, over 2570761.13 frames. ], batch size: 304, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:39:03,260 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=14.14 vs. limit=15.0
2024-06-21 23:39:12,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=8.0
2024-06-21 23:39:17,032 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.59 vs. limit=22.5
2024-06-21 23:39:19,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=481358.1666666667, ans=0.0
2024-06-21 23:39:20,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=481358.1666666667, ans=0.025
2024-06-21 23:39:27,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0
2024-06-21 23:39:28,323 INFO [train.py:1028] (0/2) Epoch 26, batch 9650, loss[loss=0.2131, simple_loss=0.2735, pruned_loss=0.07628, over 13145.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2715, pruned_loss=0.07348, over 2561423.25 frames. ], batch size: 132, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:39:29,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=481394.8333333333, ans=0.125
2024-06-21 23:39:31,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=481394.8333333333, ans=0.125
2024-06-21 23:39:31,357 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.366e+02 2.510e+02 2.692e+02 3.858e+02, threshold=5.019e+02, percent-clipped=0.0
2024-06-21 23:39:34,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=481413.1666666667, ans=0.02
2024-06-21 23:39:43,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=481431.5, ans=0.05
2024-06-21 23:39:46,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=481449.8333333333, ans=0.125
2024-06-21 23:39:59,051 INFO [train.py:1028] (0/2) Epoch 26, batch 9700, loss[loss=0.2066, simple_loss=0.2599, pruned_loss=0.07665, over 12986.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.271, pruned_loss=0.07337, over 2556035.81 frames. ], batch size: 144, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:39:59,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=481486.5, ans=0.0
2024-06-21 23:40:02,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=481486.5, ans=0.125
2024-06-21 23:40:04,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481504.8333333333, ans=0.1
2024-06-21 23:40:13,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=481523.1666666667, ans=0.125
2024-06-21 23:40:20,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=481541.5, ans=0.125
2024-06-21 23:40:20,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=481541.5, ans=0.125
2024-06-21 23:40:27,221 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5
2024-06-21 23:40:31,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.48 vs. limit=15.0
2024-06-21 23:40:32,253 INFO [train.py:1028] (0/2) Epoch 26, batch 9750, loss[loss=0.2108, simple_loss=0.2634, pruned_loss=0.07914, over 13131.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2701, pruned_loss=0.07282, over 2552893.14 frames. ], batch size: 132, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:40:35,383 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.390e+02 2.551e+02 2.783e+02 3.640e+02, threshold=5.103e+02, percent-clipped=0.0
2024-06-21 23:40:36,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=481578.1666666667, ans=0.0
2024-06-21 23:40:41,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=481596.5, ans=0.125
2024-06-21 23:40:44,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=481614.8333333333, ans=0.0
2024-06-21 23:40:44,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=481614.8333333333, ans=0.125
2024-06-21 23:40:45,663 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0
2024-06-21 23:40:48,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=481614.8333333333, ans=0.1
2024-06-21 23:40:58,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=481651.5, ans=0.125
2024-06-21 23:41:02,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=481651.5, ans=0.025
2024-06-21 23:41:03,614 INFO [train.py:1028] (0/2) Epoch 26, batch 9800, loss[loss=0.1981, simple_loss=0.2648, pruned_loss=0.06574, over 12896.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2692, pruned_loss=0.07214, over 2545155.28 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:41:07,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=481669.8333333333, ans=0.0
2024-06-21 23:41:10,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=481688.1666666667, ans=0.125
2024-06-21 23:41:13,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481688.1666666667, ans=0.1
2024-06-21 23:41:18,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481706.5, ans=0.0
2024-06-21 23:41:31,837 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0
2024-06-21 23:41:32,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481743.1666666667, ans=0.1
2024-06-21 23:41:33,931 INFO [train.py:1028] (0/2) Epoch 26, batch 9850, loss[loss=0.2022, simple_loss=0.2598, pruned_loss=0.07236, over 13190.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2686, pruned_loss=0.07191, over 2538861.30 frames. ], batch size: 103, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:41:37,029 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.400e+02 2.540e+02 2.759e+02 3.633e+02, threshold=5.080e+02, percent-clipped=0.0
2024-06-21 23:41:37,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=481761.5, ans=0.125
2024-06-21 23:41:45,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=481779.8333333333, ans=0.0
2024-06-21 23:42:06,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0
2024-06-21 23:42:07,193 INFO [train.py:1028] (0/2) Epoch 26, batch 9900, loss[loss=0.1671, simple_loss=0.234, pruned_loss=0.05007, over 12945.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2681, pruned_loss=0.07207, over 2530865.95 frames. ], batch size: 39, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:42:07,405 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-21 23:42:13,335 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-21 23:42:13,380 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=481871.5, ans=0.125
2024-06-21 23:42:24,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481889.8333333333, ans=0.1
2024-06-21 23:42:26,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.90 vs. limit=10.0
2024-06-21 23:42:38,514 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=481944.8333333333, ans=0.0
2024-06-21 23:42:39,054 INFO [train.py:1028] (0/2) Epoch 26, batch 9950, loss[loss=0.2119, simple_loss=0.2754, pruned_loss=0.07422, over 12647.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2668, pruned_loss=0.07209, over 2525822.37 frames. ], batch size: 29, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:42:42,166 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.405e+02 2.511e+02 2.748e+02 3.618e+02, threshold=5.023e+02, percent-clipped=0.0
2024-06-21 23:42:44,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481944.8333333333, ans=0.1
2024-06-21 23:42:49,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=481963.1666666667, ans=0.0
2024-06-21 23:42:52,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481981.5, ans=0.1
2024-06-21 23:43:01,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=481999.8333333333, ans=0.0
2024-06-21 23:43:05,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=482018.1666666667, ans=0.07
2024-06-21 23:43:07,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=482018.1666666667, ans=0.125
2024-06-21 23:43:08,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=482018.1666666667, ans=0.0
2024-06-21 23:43:09,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=15.0
2024-06-21 23:43:12,050 INFO [train.py:1028] (0/2) Epoch 26, batch 10000, loss[loss=0.2092, simple_loss=0.2758, pruned_loss=0.07129, over 12503.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2672, pruned_loss=0.0726, over 2487887.01 frames. ], batch size: 22, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:43:16,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=482036.5, ans=0.125
2024-06-21 23:43:29,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=482073.1666666667, ans=0.125
2024-06-21 23:43:29,621 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-21 23:43:39,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482109.8333333333, ans=0.1
2024-06-21 23:43:43,203 INFO [train.py:1028] (0/2) Epoch 26, batch 10050, loss[loss=0.1996, simple_loss=0.2619, pruned_loss=0.06863, over 12674.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2675, pruned_loss=0.07341, over 2445040.37 frames. ], batch size: 22, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:43:44,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=482128.1666666667, ans=0.0
2024-06-21 23:43:46,096 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.426e+02 2.531e+02 2.669e+02 3.402e+02, threshold=5.061e+02, percent-clipped=0.0
2024-06-21 23:43:46,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=482128.1666666667, ans=6.0
2024-06-21 23:43:46,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=482128.1666666667, ans=0.125
2024-06-21 23:43:46,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=482128.1666666667, ans=0.0
2024-06-21 23:43:51,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=482146.5, ans=0.125
2024-06-21 23:43:53,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=482146.5, ans=0.05
2024-06-21 23:44:00,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=482164.8333333333, ans=0.05
2024-06-21 23:44:08,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=482201.5, ans=0.0
2024-06-21 23:44:13,530 INFO [train.py:1028] (0/2) Epoch 26, batch 10100, loss[loss=0.1841, simple_loss=0.2435, pruned_loss=0.06235, over 11448.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2676, pruned_loss=0.07309, over 2426259.24 frames. ], batch size: 17, lr: 2.20e-03, grad_scale: 32.0
2024-06-21 23:44:26,931 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-26.pt
2024-06-21 23:46:22,411 INFO [train.py:1028] (0/2) Epoch 27, batch 0, loss[loss=0.187, simple_loss=0.2528, pruned_loss=0.06062, over 12954.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2528, pruned_loss=0.06062, over 12954.00 frames. ], batch size: 36, lr: 2.16e-03, grad_scale: 32.0
2024-06-21 23:46:22,412 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-21 23:46:29,652 INFO [train.py:1060] (0/2) Epoch 27, validation: loss=0.192, simple_loss=0.2529, pruned_loss=0.06551, over 351949.00 frames.
2024-06-21 23:46:29,652 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-21 23:46:31,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=482251.0, ans=0.025 2024-06-21 23:46:50,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=482306.0, ans=0.0 2024-06-21 23:46:53,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=482306.0, ans=0.125 2024-06-21 23:46:54,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=482306.0, ans=0.125 2024-06-21 23:46:55,712 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.271e+02 2.447e+02 2.740e+02 3.969e+02, threshold=4.894e+02, percent-clipped=0.0 2024-06-21 23:47:03,924 INFO [train.py:1028] (0/2) Epoch 27, batch 50, loss[loss=0.1969, simple_loss=0.2615, pruned_loss=0.06611, over 12674.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2506, pruned_loss=0.06579, over 574755.92 frames. ], batch size: 29, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:47:07,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=482342.6666666667, ans=0.125 2024-06-21 23:47:13,140 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=482361.0, ans=0.125 2024-06-21 23:47:28,917 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:47:38,094 INFO [train.py:1028] (0/2) Epoch 27, batch 100, loss[loss=0.2022, simple_loss=0.2673, pruned_loss=0.0685, over 13258.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2492, pruned_loss=0.06485, over 1018644.39 frames. ], batch size: 46, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:47:38,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=482434.3333333333, ans=0.125 2024-06-21 23:47:39,733 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.07 vs. limit=15.0 2024-06-21 23:47:49,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=482452.6666666667, ans=0.125 2024-06-21 23:47:57,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=482471.0, ans=0.125 2024-06-21 23:48:00,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=482471.0, ans=0.125 2024-06-21 23:48:04,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=482489.3333333333, ans=0.125 2024-06-21 23:48:07,852 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.198e+02 2.353e+02 2.553e+02 3.563e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 23:48:08,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.38 vs. 
limit=22.5 2024-06-21 23:48:11,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=482507.6666666667, ans=0.125 2024-06-21 23:48:15,321 INFO [train.py:1028] (0/2) Epoch 27, batch 150, loss[loss=0.1861, simple_loss=0.2461, pruned_loss=0.06302, over 12632.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2487, pruned_loss=0.06453, over 1366153.38 frames. ], batch size: 29, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:48:16,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=482526.0, ans=0.125 2024-06-21 23:48:24,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=482544.3333333333, ans=0.025 2024-06-21 23:48:34,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=482581.0, ans=0.04949747468305833 2024-06-21 23:48:35,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=482581.0, ans=0.0 2024-06-21 23:48:47,597 INFO [train.py:1028] (0/2) Epoch 27, batch 200, loss[loss=0.2018, simple_loss=0.2604, pruned_loss=0.07157, over 12556.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2485, pruned_loss=0.0644, over 1635743.33 frames. ], batch size: 202, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:48:57,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482636.0, ans=0.1 2024-06-21 23:49:03,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=482654.3333333333, ans=0.05 2024-06-21 23:49:10,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.11 vs. limit=12.0 2024-06-21 23:49:11,155 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.206e+02 2.368e+02 2.540e+02 3.047e+02, threshold=4.736e+02, percent-clipped=0.0 2024-06-21 23:49:19,133 INFO [train.py:1028] (0/2) Epoch 27, batch 250, loss[loss=0.1754, simple_loss=0.2237, pruned_loss=0.06353, over 12961.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2484, pruned_loss=0.06453, over 1846716.33 frames. ], batch size: 144, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:49:24,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=482709.3333333333, ans=0.125 2024-06-21 23:49:24,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. 
limit=15.0 2024-06-21 23:49:26,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=482709.3333333333, ans=0.125 2024-06-21 23:49:34,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=482727.6666666667, ans=0.0 2024-06-21 23:49:37,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=482746.0, ans=0.0 2024-06-21 23:49:52,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=482782.6666666667, ans=0.125 2024-06-21 23:49:59,472 INFO [train.py:1028] (0/2) Epoch 27, batch 300, loss[loss=0.1954, simple_loss=0.2492, pruned_loss=0.07078, over 13213.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2488, pruned_loss=0.06471, over 2009992.13 frames. ], batch size: 112, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:50:04,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=482801.0, ans=0.04949747468305833 2024-06-21 23:50:08,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=482819.3333333333, ans=0.125 2024-06-21 23:50:10,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=482819.3333333333, ans=0.125 2024-06-21 23:50:22,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482856.0, ans=0.1 2024-06-21 23:50:23,404 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.208e+02 2.288e+02 2.435e+02 3.423e+02, threshold=4.577e+02, percent-clipped=0.0 2024-06-21 23:50:23,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=482856.0, ans=0.1 2024-06-21 23:50:30,941 INFO [train.py:1028] (0/2) Epoch 27, batch 350, loss[loss=0.1737, simple_loss=0.2336, pruned_loss=0.05685, over 12952.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2476, pruned_loss=0.06412, over 2139163.76 frames. ], batch size: 33, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:50:35,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=482892.6666666667, ans=0.125 2024-06-21 23:50:35,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482892.6666666667, ans=0.1 2024-06-21 23:50:35,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=482892.6666666667, ans=0.2 2024-06-21 23:50:36,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482892.6666666667, ans=0.1 2024-06-21 23:50:39,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=482911.0, ans=0.0 2024-06-21 23:50:41,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.15 vs. 
limit=15.0 2024-06-21 23:50:50,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=482947.6666666667, ans=0.0 2024-06-21 23:50:59,265 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=482966.0, ans=0.125 2024-06-21 23:51:02,809 INFO [train.py:1028] (0/2) Epoch 27, batch 400, loss[loss=0.1972, simple_loss=0.2516, pruned_loss=0.07145, over 13230.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2483, pruned_loss=0.06414, over 2238964.77 frames. ], batch size: 63, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:51:03,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=482984.3333333333, ans=0.0 2024-06-21 23:51:10,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=483002.6666666667, ans=0.125 2024-06-21 23:51:29,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=483039.3333333333, ans=0.125 2024-06-21 23:51:30,025 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.288e+02 2.459e+02 2.692e+02 3.203e+02, threshold=4.918e+02, percent-clipped=0.0 2024-06-21 23:51:36,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=483057.6666666667, ans=0.125 2024-06-21 23:51:37,900 INFO [train.py:1028] (0/2) Epoch 27, batch 450, loss[loss=0.2023, simple_loss=0.2649, pruned_loss=0.06988, over 13203.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2489, pruned_loss=0.06448, over 2312043.02 frames. ], batch size: 67, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:51:39,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=483076.0, ans=0.04949747468305833 2024-06-21 23:51:57,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=483131.0, ans=0.125 2024-06-21 23:52:05,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=483131.0, ans=0.125 2024-06-21 23:52:14,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=483167.6666666667, ans=0.125 2024-06-21 23:52:14,791 INFO [train.py:1028] (0/2) Epoch 27, batch 500, loss[loss=0.1909, simple_loss=0.2445, pruned_loss=0.06865, over 13083.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2493, pruned_loss=0.06439, over 2374562.88 frames. 
], batch size: 121, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:52:16,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=483167.6666666667, ans=0.125 2024-06-21 23:52:16,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=483167.6666666667, ans=0.025 2024-06-21 23:52:16,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=483167.6666666667, ans=0.125 2024-06-21 23:52:17,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=483167.6666666667, ans=0.2 2024-06-21 23:52:19,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=483167.6666666667, ans=0.05 2024-06-21 23:52:22,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=483186.0, ans=0.125 2024-06-21 23:52:37,712 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:52:38,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=483222.6666666667, ans=0.125 2024-06-21 23:52:39,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.247e+02 2.361e+02 2.623e+02 4.004e+02, threshold=4.722e+02, percent-clipped=0.0 2024-06-21 23:52:42,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=483241.0, ans=0.2 2024-06-21 23:52:44,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2024-06-21 23:52:47,183 INFO [train.py:1028] (0/2) Epoch 27, batch 550, loss[loss=0.1995, simple_loss=0.2506, pruned_loss=0.07417, over 12938.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2495, pruned_loss=0.06455, over 2419335.07 frames. ], batch size: 158, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:52:57,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=483277.6666666667, ans=0.025 2024-06-21 23:53:06,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2024-06-21 23:53:11,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.28 vs. limit=12.0 2024-06-21 23:53:15,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=483332.6666666667, ans=0.0 2024-06-21 23:53:19,547 INFO [train.py:1028] (0/2) Epoch 27, batch 600, loss[loss=0.1848, simple_loss=0.2366, pruned_loss=0.06654, over 13069.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2486, pruned_loss=0.06418, over 2457616.36 frames. ], batch size: 144, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:53:29,729 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.20 vs. 
limit=12.0 2024-06-21 23:53:41,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=483406.0, ans=0.125 2024-06-21 23:53:46,738 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.213e+02 2.306e+02 2.466e+02 3.302e+02, threshold=4.612e+02, percent-clipped=0.0 2024-06-21 23:53:54,402 INFO [train.py:1028] (0/2) Epoch 27, batch 650, loss[loss=0.188, simple_loss=0.2494, pruned_loss=0.06327, over 13199.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2487, pruned_loss=0.06378, over 2488249.55 frames. ], batch size: 59, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:53:55,836 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:53:55,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=483442.6666666667, ans=0.1 2024-06-21 23:53:57,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=483442.6666666667, ans=0.1 2024-06-21 23:54:23,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=483516.0, ans=0.2 2024-06-21 23:54:29,116 INFO [train.py:1028] (0/2) Epoch 27, batch 700, loss[loss=0.1998, simple_loss=0.2587, pruned_loss=0.07052, over 13387.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2477, pruned_loss=0.06364, over 2511453.82 frames. ], batch size: 46, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:54:35,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=483552.6666666667, ans=0.0 2024-06-21 23:54:37,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=483552.6666666667, ans=0.2 2024-06-21 23:54:40,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=483552.6666666667, ans=0.125 2024-06-21 23:54:53,750 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.264e+02 2.360e+02 2.555e+02 3.306e+02, threshold=4.720e+02, percent-clipped=0.0 2024-06-21 23:55:01,304 INFO [train.py:1028] (0/2) Epoch 27, batch 750, loss[loss=0.1779, simple_loss=0.2457, pruned_loss=0.05502, over 13265.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2482, pruned_loss=0.06371, over 2527514.90 frames. ], batch size: 63, lr: 2.16e-03, grad_scale: 32.0 2024-06-21 23:55:02,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=483626.0, ans=0.125 2024-06-21 23:55:16,114 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=483662.6666666667, ans=0.2 2024-06-21 23:55:20,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=483681.0, ans=0.025 2024-06-21 23:55:24,310 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.33 vs. 
limit=15.0 2024-06-21 23:55:26,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=483699.3333333333, ans=0.125 2024-06-21 23:55:29,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=483699.3333333333, ans=0.125 2024-06-21 23:55:33,558 INFO [train.py:1028] (0/2) Epoch 27, batch 800, loss[loss=0.1778, simple_loss=0.241, pruned_loss=0.05731, over 12947.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2487, pruned_loss=0.06388, over 2541014.60 frames. ], batch size: 36, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:55:33,618 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-21 23:55:37,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=483717.6666666667, ans=0.125 2024-06-21 23:55:39,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=483717.6666666667, ans=0.0 2024-06-21 23:55:45,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.67 vs. limit=15.0 2024-06-21 23:55:50,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=483754.3333333333, ans=0.125 2024-06-21 23:55:50,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=483754.3333333333, ans=0.05 2024-06-21 23:55:56,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=483772.6666666667, ans=0.09899494936611666 2024-06-21 23:56:01,174 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.196e+02 2.314e+02 2.476e+02 3.202e+02, threshold=4.628e+02, percent-clipped=0.0 2024-06-21 23:56:02,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=483791.0, ans=0.025 2024-06-21 23:56:02,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=483791.0, ans=0.125 2024-06-21 23:56:06,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=483791.0, ans=0.025 2024-06-21 23:56:09,349 INFO [train.py:1028] (0/2) Epoch 27, batch 850, loss[loss=0.1802, simple_loss=0.2402, pruned_loss=0.06011, over 13090.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2487, pruned_loss=0.06402, over 2552587.99 frames. ], batch size: 95, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:56:26,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=483846.0, ans=0.2 2024-06-21 23:56:30,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483846.0, ans=0.1 2024-06-21 23:56:30,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=483846.0, ans=0.5 2024-06-21 23:56:37,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=15.0 2024-06-21 23:56:42,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=483882.6666666667, ans=0.125 2024-06-21 23:56:44,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=483882.6666666667, ans=0.025 2024-06-21 23:56:46,675 INFO [train.py:1028] (0/2) Epoch 27, batch 900, loss[loss=0.1661, simple_loss=0.2324, pruned_loss=0.04986, over 12872.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2482, pruned_loss=0.06393, over 2556908.61 frames. ], batch size: 36, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:56:53,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=483919.3333333333, ans=0.0 2024-06-21 23:56:57,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=483919.3333333333, ans=0.125 2024-06-21 23:56:58,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2024-06-21 23:57:02,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483937.6666666667, ans=0.1 2024-06-21 23:57:11,687 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.316e+02 2.441e+02 2.622e+02 3.299e+02, threshold=4.882e+02, percent-clipped=0.0 2024-06-21 23:57:13,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=483974.3333333333, ans=0.125 2024-06-21 23:57:17,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=483974.3333333333, ans=0.125 2024-06-21 23:57:19,530 INFO [train.py:1028] (0/2) Epoch 27, batch 950, loss[loss=0.1594, simple_loss=0.2306, pruned_loss=0.04413, over 12944.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.248, pruned_loss=0.0637, over 2559873.64 frames. ], batch size: 39, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:57:21,563 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-264000.pt 2024-06-21 23:57:37,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=484029.3333333333, ans=0.2 2024-06-21 23:57:39,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484029.3333333333, ans=0.1 2024-06-21 23:57:51,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=12.0 2024-06-21 23:57:54,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=484066.0, ans=0.025 2024-06-21 23:57:58,902 INFO [train.py:1028] (0/2) Epoch 27, batch 1000, loss[loss=0.1828, simple_loss=0.2445, pruned_loss=0.06058, over 13311.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2484, pruned_loss=0.06427, over 2562286.92 frames. 
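The `scaling.py:214` records above each report the current value (`ans`) of a `ScheduledFloat` hyperparameter (dropout probabilities, skip rates, balancer bounds and similar), evaluated at the module's current batch count. In icefall's zipformer these behave as piecewise-linear schedules over the adjusted batch count. A minimal sketch of that behaviour; the breakpoints below are illustrative, not this recipe's actual schedules:

```python
import bisect

class ScheduledFloat:
    """Piecewise-linear schedule over batch_count, sketching what the
    'ScheduledFloat: name=..., batch_count=..., ans=...' records report.
    Breakpoints are (batch_count, value) pairs; the value is interpolated
    linearly between them and clamped beyond the ends."""

    def __init__(self, *points, default: float = 0.0):
        self.points = sorted(points)   # e.g. (0.0, 0.3), (20000.0, 0.1)
        self.batch_count = None        # set by the training loop
        self.default = default         # used before batch_count is known

    def __float__(self) -> float:
        if self.batch_count is None:
            return self.default
        xs = [x for x, _ in self.points]
        i = bisect.bisect_right(xs, self.batch_count)
        if i == 0:
            return float(self.points[0][1])
        if i == len(self.points):
            return float(self.points[-1][1])
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (self.batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

# Illustrative: a dropout_p that decays from 0.3 to 0.1 over the first 20k
# adjusted batches, then stays flat -- hence the constant ans=0.1 late in
# training, as in the feed_forward1.out_proj.dropout_p records above.
dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
dropout_p.batch_count = 484029.33
print(float(dropout_p))   # -> 0.1
```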
], batch size: 49, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:58:01,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=484084.3333333333, ans=0.125 2024-06-21 23:58:25,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.272e+02 2.402e+02 2.595e+02 3.426e+02, threshold=4.804e+02, percent-clipped=0.0 2024-06-21 23:58:26,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=484157.6666666667, ans=0.125 2024-06-21 23:58:32,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=484176.0, ans=0.2 2024-06-21 23:58:33,269 INFO [train.py:1028] (0/2) Epoch 27, batch 1050, loss[loss=0.1622, simple_loss=0.2292, pruned_loss=0.04763, over 13200.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2488, pruned_loss=0.06411, over 2564980.52 frames. ], batch size: 77, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:58:41,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=484194.3333333333, ans=0.125 2024-06-21 23:58:42,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=484194.3333333333, ans=0.125 2024-06-21 23:58:57,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=484231.0, ans=0.025 2024-06-21 23:59:05,942 INFO [train.py:1028] (0/2) Epoch 27, batch 1100, loss[loss=0.2061, simple_loss=0.2704, pruned_loss=0.07091, over 13257.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2499, pruned_loss=0.06452, over 2570695.04 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0 2024-06-21 23:59:08,197 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=15.0 2024-06-21 23:59:20,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.11 vs. limit=12.0 2024-06-21 23:59:27,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=484322.6666666667, ans=0.125 2024-06-21 23:59:30,730 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.265e+02 2.353e+02 2.568e+02 3.425e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-21 23:59:32,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=484341.0, ans=0.0 2024-06-21 23:59:42,511 INFO [train.py:1028] (0/2) Epoch 27, batch 1150, loss[loss=0.1834, simple_loss=0.2491, pruned_loss=0.05887, over 13299.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2491, pruned_loss=0.06451, over 2571300.06 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 64.0 2024-06-21 23:59:43,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484359.3333333333, ans=0.1 2024-06-21 23:59:45,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484359.3333333333, ans=0.1 2024-06-21 23:59:45,502 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.35 vs. 
limit=10.0 2024-06-21 23:59:51,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=484377.6666666667, ans=0.125 2024-06-21 23:59:51,903 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=15.0 2024-06-21 23:59:52,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=484377.6666666667, ans=0.0 2024-06-21 23:59:53,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=484377.6666666667, ans=0.0 2024-06-22 00:00:06,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.93 vs. limit=15.0 2024-06-22 00:00:09,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=484414.3333333333, ans=0.125 2024-06-22 00:00:11,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2024-06-22 00:00:18,601 INFO [train.py:1028] (0/2) Epoch 27, batch 1200, loss[loss=0.1818, simple_loss=0.2462, pruned_loss=0.05871, over 13194.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2488, pruned_loss=0.06464, over 2573040.41 frames. ], batch size: 77, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:00:20,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484451.0, ans=0.1 2024-06-22 00:00:22,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=484451.0, ans=0.125 2024-06-22 00:00:24,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=484469.3333333333, ans=0.125 2024-06-22 00:00:36,801 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=484487.6666666667, ans=0.125 2024-06-22 00:00:40,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=484506.0, ans=0.0 2024-06-22 00:00:42,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=484506.0, ans=0.0 2024-06-22 00:00:43,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=484506.0, ans=0.125 2024-06-22 00:00:43,752 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.303e+02 2.410e+02 2.566e+02 3.458e+02, threshold=4.820e+02, percent-clipped=0.0 2024-06-22 00:00:51,476 INFO [train.py:1028] (0/2) Epoch 27, batch 1250, loss[loss=0.1819, simple_loss=0.2435, pruned_loss=0.06012, over 13164.00 frames. ], tot_loss[loss=0.189, simple_loss=0.249, pruned_loss=0.06448, over 2582701.54 frames. 
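The `optim.py:487` WARNING lines report the five-number summary (min, 25%, median, 75%, max) of recently observed gradient norms together with the active clipping threshold. Throughout this log the threshold equals `Clipping_scale` times the reported median, e.g. 2.0 × 2.410e+02 = 4.820e+02 in the record above, so the clipping level tracks the recent norm distribution rather than a fixed constant, and `percent-clipped=0.0` says no step in the window actually exceeded it. A sketch of that adaptive rule; the window size and bookkeeping are assumptions, not icefall's exact implementation:

```python
from collections import deque

import torch

class AdaptiveGradClipper:
    """Clip gradients to clipping_scale * median of recent grad norms,
    mirroring the 'grad-norm quartiles ... threshold=...' records."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent total grad norms

    def clip_(self, params) -> float:
        grads = [p.grad.norm() for p in params if p.grad is not None]
        norm = torch.norm(torch.stack(grads)).item()  # total L2 norm
        self.norms.append(norm)
        hist = torch.tensor(list(self.norms))
        q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()  # scale * median
        if norm > threshold:
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)
        return threshold
```

The logged `percent-clipped` would then be the fraction of recent steps whose norm exceeded the threshold.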
], batch size: 112, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:00:58,275 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=484561.0, ans=0.09899494936611666 2024-06-22 00:01:03,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=484561.0, ans=0.125 2024-06-22 00:01:05,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=484579.3333333333, ans=0.2 2024-06-22 00:01:12,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.05 vs. limit=10.0 2024-06-22 00:01:22,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=484616.0, ans=0.2 2024-06-22 00:01:23,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=484634.3333333333, ans=0.125 2024-06-22 00:01:24,060 INFO [train.py:1028] (0/2) Epoch 27, batch 1300, loss[loss=0.1895, simple_loss=0.242, pruned_loss=0.06855, over 12871.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.25, pruned_loss=0.06513, over 2583068.32 frames. ], batch size: 177, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:01:24,271 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:01:35,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=484652.6666666667, ans=0.125 2024-06-22 00:01:37,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=484671.0, ans=0.2 2024-06-22 00:01:37,959 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=484671.0, ans=0.125 2024-06-22 00:01:38,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.48 vs. limit=10.0 2024-06-22 00:01:41,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=484671.0, ans=10.0 2024-06-22 00:01:49,023 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.317e+02 2.469e+02 2.739e+02 3.690e+02, threshold=4.938e+02, percent-clipped=0.0 2024-06-22 00:01:49,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=484707.6666666667, ans=0.125 2024-06-22 00:02:01,383 INFO [train.py:1028] (0/2) Epoch 27, batch 1350, loss[loss=0.1926, simple_loss=0.2587, pruned_loss=0.06327, over 13155.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2495, pruned_loss=0.06483, over 2584451.23 frames. 
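The fractional `batch_count` values, always multiples of one third, are consistent with the batch index being rescaled to a reference duration before it is fed to the schedules: with this run's `max_duration: 550`, `world_size: 2` and `ref_duration: 600`, each optimizer step advances the adjusted count by 550 × 2 / 600 = 1.8333, and refreshing it only every ten batches or so produces the observed jumps of 18.333 (e.g. 484561.0 → 484579.3333 above). A worked check under that assumption; the formula and refresh interval are inferred from the numbers, not quoted from icefall:

```python
MAX_DURATION = 550.0  # seconds of audio per batch, from the run config
WORLD_SIZE = 2        # number of GPUs
REF_DURATION = 600.0  # reference duration the schedules are tuned for

def adjusted_batch_count(batch_idx_train: int) -> float:
    """Batch index rescaled as if every batch held REF_DURATION seconds."""
    return batch_idx_train * MAX_DURATION * WORLD_SIZE / REF_DURATION

step = adjusted_batch_count(1)
print(step, 10 * step)               # 1.8333..., 18.333... (the jump size)
print(adjusted_batch_count(264306))  # 484561.0, matching the records above
```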
], batch size: 59, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:02:02,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=484726.0, ans=0.0 2024-06-22 00:02:11,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=484744.3333333333, ans=0.025 2024-06-22 00:02:29,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=484781.0, ans=0.2 2024-06-22 00:02:31,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=484799.3333333333, ans=0.2 2024-06-22 00:02:33,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=484799.3333333333, ans=0.0 2024-06-22 00:02:35,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.00 vs. limit=10.0 2024-06-22 00:02:37,399 INFO [train.py:1028] (0/2) Epoch 27, batch 1400, loss[loss=0.194, simple_loss=0.262, pruned_loss=0.06298, over 12367.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2494, pruned_loss=0.06466, over 2586385.02 frames. ], batch size: 25, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:02:37,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=484817.6666666667, ans=0.05 2024-06-22 00:03:02,073 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.301e+02 2.446e+02 2.727e+02 3.466e+02, threshold=4.893e+02, percent-clipped=0.0 2024-06-22 00:03:09,901 INFO [train.py:1028] (0/2) Epoch 27, batch 1450, loss[loss=0.1719, simple_loss=0.2227, pruned_loss=0.06055, over 13116.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.249, pruned_loss=0.06445, over 2586186.87 frames. ], batch size: 121, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:03:13,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=484909.3333333333, ans=0.0 2024-06-22 00:03:22,790 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.21 vs. limit=22.5 2024-06-22 00:03:31,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=484964.3333333333, ans=0.125 2024-06-22 00:03:37,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=484982.6666666667, ans=0.125 2024-06-22 00:03:42,170 INFO [train.py:1028] (0/2) Epoch 27, batch 1500, loss[loss=0.2094, simple_loss=0.2621, pruned_loss=0.07838, over 13238.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2496, pruned_loss=0.06501, over 2588888.94 frames. 
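Each `train.py:1028` record splits the pruned-transducer objective into `simple_loss` (the cheap joiner used to choose the pruning bounds) and `pruned_loss` (the full transducer loss evaluated inside those bounds). The reported totals fit `loss = 0.5 * simple_loss + pruned_loss`, i.e. `simple_loss_scale: 0.5` from the startup config with the pruned-loss warm-up scale long since at 1.0. A numeric check against the batch-1500 record above:

```python
# Values reported by the 'Epoch 27, batch 1500' record above.
simple_loss, pruned_loss, reported = 0.2621, 0.07838, 0.2094

SIMPLE_LOSS_SCALE = 0.5  # from the run config
PRUNED_LOSS_SCALE = 1.0  # assumed: the warm-up ramp ended long before epoch 27

loss = SIMPLE_LOSS_SCALE * simple_loss + PRUNED_LOSS_SCALE * pruned_loss
assert abs(loss - reported) < 1e-3   # 0.20943 ~ 0.2094
```

The same identity holds for the running `tot_loss` fields, e.g. 0.5 × 0.2496 + 0.06501 ≈ 0.1898.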
], batch size: 83, lr: 2.15e-03, grad_scale: 64.0 2024-06-22 00:03:43,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=485001.0, ans=0.0 2024-06-22 00:03:53,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=485019.3333333333, ans=0.125 2024-06-22 00:03:55,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=485019.3333333333, ans=0.125 2024-06-22 00:03:56,598 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=15.0 2024-06-22 00:04:10,955 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.287e+02 2.420e+02 2.674e+02 3.535e+02, threshold=4.841e+02, percent-clipped=0.0 2024-06-22 00:04:19,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=485074.3333333333, ans=0.125 2024-06-22 00:04:22,433 INFO [train.py:1028] (0/2) Epoch 27, batch 1550, loss[loss=0.2066, simple_loss=0.2653, pruned_loss=0.07398, over 13029.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2497, pruned_loss=0.06497, over 2584331.55 frames. ], batch size: 102, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:04:25,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=485092.6666666667, ans=0.0 2024-06-22 00:04:28,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=485111.0, ans=0.125 2024-06-22 00:04:30,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=485111.0, ans=0.0 2024-06-22 00:04:30,157 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2024-06-22 00:04:41,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=485129.3333333333, ans=0.0 2024-06-22 00:04:53,372 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:04:55,285 INFO [train.py:1028] (0/2) Epoch 27, batch 1600, loss[loss=0.1845, simple_loss=0.2419, pruned_loss=0.06355, over 13123.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2503, pruned_loss=0.0653, over 2581298.32 frames. 
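The `grad_scale` field in the batch records is the dynamic fp16 loss scale (`use_fp16: True` in the config): it doubles after a long run of overflow-free steps and halves when an inf/nan gradient is detected, which is why it moves through 64.0, 32.0 and later 16.0 in this stretch (64.0 at batch 1500 above, 32.0 by batch 1550). A minimal sketch with PyTorch's stock scaler; the growth and backoff constants are PyTorch defaults, not necessarily what this recipe uses:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=64.0,      # in the range of the grad_scale values logged here
    growth_factor=2.0,    # double after growth_interval clean steps
    backoff_factor=0.5,   # halve whenever gradients overflow
    growth_interval=2000,
)

def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # silently skipped if the gradients overflowed
    scaler.update()         # adjusts the scale -> the logged grad_scale
    return scaler.get_scale()
```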
], batch size: 77, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:04:58,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485184.3333333333, ans=0.1 2024-06-22 00:05:01,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=485202.6666666667, ans=0.125 2024-06-22 00:05:03,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=485202.6666666667, ans=0.0 2024-06-22 00:05:05,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=485202.6666666667, ans=0.125 2024-06-22 00:05:05,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=485202.6666666667, ans=0.125 2024-06-22 00:05:20,423 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.330e+02 2.446e+02 2.662e+02 3.887e+02, threshold=4.892e+02, percent-clipped=0.0 2024-06-22 00:05:27,711 INFO [train.py:1028] (0/2) Epoch 27, batch 1650, loss[loss=0.1963, simple_loss=0.2536, pruned_loss=0.06945, over 13100.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2503, pruned_loss=0.06538, over 2576464.64 frames. ], batch size: 95, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:05:27,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=485276.0, ans=0.0 2024-06-22 00:05:36,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=6.0 2024-06-22 00:05:37,250 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=485294.3333333333, ans=0.125 2024-06-22 00:05:39,823 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=485312.6666666667, ans=0.125 2024-06-22 00:05:52,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=485331.0, ans=0.125 2024-06-22 00:06:00,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485349.3333333333, ans=0.1 2024-06-22 00:06:02,622 INFO [train.py:1028] (0/2) Epoch 27, batch 1700, loss[loss=0.1976, simple_loss=0.2531, pruned_loss=0.07099, over 12848.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2499, pruned_loss=0.06483, over 2581229.63 frames. ], batch size: 26, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:06:17,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.55 vs. 
limit=15.0 2024-06-22 00:06:21,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=485404.3333333333, ans=0.125 2024-06-22 00:06:21,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=485404.3333333333, ans=0.07 2024-06-22 00:06:23,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=485404.3333333333, ans=0.125 2024-06-22 00:06:31,551 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.228e+02 2.384e+02 2.554e+02 3.363e+02, threshold=4.768e+02, percent-clipped=0.0 2024-06-22 00:06:38,677 INFO [train.py:1028] (0/2) Epoch 27, batch 1750, loss[loss=0.2055, simple_loss=0.2664, pruned_loss=0.07233, over 12499.00 frames. ], tot_loss[loss=0.19, simple_loss=0.25, pruned_loss=0.06496, over 2581640.10 frames. ], batch size: 22, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:06:44,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.12 vs. limit=15.0 2024-06-22 00:07:01,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.68 vs. limit=22.5 2024-06-22 00:07:11,728 INFO [train.py:1028] (0/2) Epoch 27, batch 1800, loss[loss=0.1843, simple_loss=0.2472, pruned_loss=0.0607, over 13259.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2501, pruned_loss=0.06496, over 2581143.35 frames. ], batch size: 67, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:07:24,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=485587.6666666667, ans=0.0 2024-06-22 00:07:28,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.88 vs. limit=15.0 2024-06-22 00:07:35,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=485606.0, ans=0.0 2024-06-22 00:07:37,078 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.269e+02 2.412e+02 2.546e+02 3.108e+02, threshold=4.824e+02, percent-clipped=0.0 2024-06-22 00:07:41,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=485624.3333333333, ans=0.0 2024-06-22 00:07:44,215 INFO [train.py:1028] (0/2) Epoch 27, batch 1850, loss[loss=0.1962, simple_loss=0.2494, pruned_loss=0.07148, over 13201.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2507, pruned_loss=0.06529, over 2582891.38 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:08:11,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=485697.6666666667, ans=10.0 2024-06-22 00:08:20,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=485716.0, ans=15.0 2024-06-22 00:08:22,431 INFO [train.py:1028] (0/2) Epoch 27, batch 1900, loss[loss=0.1956, simple_loss=0.2509, pruned_loss=0.07017, over 13145.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2498, pruned_loss=0.06507, over 2585212.85 frames. 
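The `scaling.py:1023` "Whitening: ... metric=X vs. limit=Y" records fire when a module's activations are measurably less white than allowed; the limit is itself scheduled, as the `...feed_forward2.out_whiten.whitening_limit, ..., ans=15.0` ScheduledFloat record above shows. A hedged reading of the metric: the ratio E[λ²] / E[λ]² over the eigenvalues of the activation covariance, which is 1.0 for perfectly white (isotropic) features and grows as variance concentrates in a few directions. A simplified sketch of that statistic (icefall's version additionally supports channel groups):

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Whiteness of activations x, shape (num_frames, num_channels):
    ~1.0 when the covariance is a multiple of the identity, larger as the
    eigenvalue spectrum becomes lopsided."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = x.t() @ x / x.shape[0]                    # (C, C) covariance
    mean_eig = torch.diagonal(cov).mean()           # E[lambda] = trace/C
    mean_eig_sq = (cov * cov).sum() / cov.shape[0]  # E[lambda^2] = trace(cov^2)/C
    return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).item()

torch.manual_seed(0)
white = torch.randn(10000, 384)
print(whitening_metric(white))     # ~1.04: close to the ideal 1.0
lopsided = torch.randn(10000, 4) @ torch.randn(4, 384)
print(whitening_metric(lopsided))  # >> 15: variance lives in 4 directions
```

When the metric exceeds the limit, a corrective gradient pushes the covariance back toward isotropy, and the log line records the offending metric against its limit.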
], batch size: 95, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:08:23,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=485734.3333333333, ans=0.95 2024-06-22 00:08:37,705 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.56 vs. limit=15.0 2024-06-22 00:08:44,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=485789.3333333333, ans=0.125 2024-06-22 00:08:48,178 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.304e+02 2.481e+02 2.722e+02 3.607e+02, threshold=4.961e+02, percent-clipped=0.0 2024-06-22 00:08:51,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=485807.6666666667, ans=0.1 2024-06-22 00:08:55,051 INFO [train.py:1028] (0/2) Epoch 27, batch 1950, loss[loss=0.2025, simple_loss=0.2678, pruned_loss=0.06866, over 13221.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2493, pruned_loss=0.0651, over 2591475.62 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:09:02,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=485844.3333333333, ans=0.125 2024-06-22 00:09:05,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485844.3333333333, ans=0.1 2024-06-22 00:09:13,442 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.95 vs. limit=15.0 2024-06-22 00:09:15,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485881.0, ans=0.1 2024-06-22 00:09:19,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=485881.0, ans=0.125 2024-06-22 00:09:24,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=485899.3333333333, ans=0.05 2024-06-22 00:09:28,041 INFO [train.py:1028] (0/2) Epoch 27, batch 2000, loss[loss=0.1939, simple_loss=0.2502, pruned_loss=0.06878, over 12600.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2498, pruned_loss=0.06521, over 2587497.46 frames. ], batch size: 22, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:09:28,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=485917.6666666667, ans=0.0 2024-06-22 00:09:30,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=485917.6666666667, ans=0.125 2024-06-22 00:09:33,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=485917.6666666667, ans=0.0 2024-06-22 00:09:35,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.56 vs. limit=15.0 2024-06-22 00:09:44,735 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.13 vs. 
limit=10.0 2024-06-22 00:09:50,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=485972.6666666667, ans=0.125 2024-06-22 00:09:54,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=485972.6666666667, ans=0.125 2024-06-22 00:09:56,683 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.331e+02 2.448e+02 2.583e+02 3.438e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-22 00:10:02,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=485991.0, ans=0.0 2024-06-22 00:10:03,153 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:10:03,661 INFO [train.py:1028] (0/2) Epoch 27, batch 2050, loss[loss=0.1887, simple_loss=0.2502, pruned_loss=0.06357, over 12729.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2496, pruned_loss=0.06521, over 2582136.80 frames. ], batch size: 29, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:10:16,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2024-06-22 00:10:21,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=486046.0, ans=0.125 2024-06-22 00:10:27,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=486064.3333333333, ans=0.0 2024-06-22 00:10:28,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=486064.3333333333, ans=0.125 2024-06-22 00:10:31,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=486082.6666666667, ans=0.125 2024-06-22 00:10:38,330 INFO [train.py:1028] (0/2) Epoch 27, batch 2100, loss[loss=0.1851, simple_loss=0.2492, pruned_loss=0.06043, over 13174.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.25, pruned_loss=0.06489, over 2585243.95 frames. ], batch size: 59, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:10:46,916 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=486119.3333333333, ans=0.2 2024-06-22 00:11:00,774 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.16 vs. limit=22.5 2024-06-22 00:11:03,782 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.303e+02 2.446e+02 2.668e+02 4.274e+02, threshold=4.892e+02, percent-clipped=0.0 2024-06-22 00:11:06,996 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.65 vs. limit=15.0 2024-06-22 00:11:10,979 INFO [train.py:1028] (0/2) Epoch 27, batch 2150, loss[loss=0.1799, simple_loss=0.2462, pruned_loss=0.05682, over 13258.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2494, pruned_loss=0.06447, over 2588372.54 frames. 
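The `scaling.py:1119` "WithLoss: name=...self_attn_weights, loss-sum=0.000e+00" records track an auxiliary penalty attached to the attention weights; `loss-sum=0.0` means the penalty was inactive on that batch. A hedged reading: zipformer regularizes some activations by leaving the forward value untouched while adding the gradient of a penalty such as `penalty * relu(|x| - limit)` in the backward pass, a pattern expressible as a custom autograd function. A sketch under that assumption; the `limit` and `penalty` constants are illustrative:

```python
import torch

class PenalizeAbsValuesGt(torch.autograd.Function):
    """Identity in forward; backward adds the gradient of
    penalty * relu(|x| - limit), nudging out-of-range activations back
    without changing the forward computation."""

    @staticmethod
    def forward(ctx, x, limit: float, penalty: float):
        ctx.save_for_backward(x)
        ctx.limit, ctx.penalty = limit, penalty
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        over = (x.abs() > ctx.limit).to(x.dtype)
        return grad_output + ctx.penalty * torch.sign(x) * over, None, None

x = (30.0 * torch.randn(4, 8)).requires_grad_()
y = PenalizeAbsValuesGt.apply(x, 25.0, 1.0e-04)
aux = 1.0e-04 * torch.relu(x.detach().abs() - 25.0).sum()
print(f"loss-sum={aux.item():.3e}")  # analogue of the logged loss-sum
y.sum().backward()                   # x.grad = 1 + the penalty gradient
```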
], batch size: 52, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:11:18,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486211.0, ans=0.1 2024-06-22 00:11:26,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=486229.3333333333, ans=0.125 2024-06-22 00:11:30,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=486247.6666666667, ans=0.2 2024-06-22 00:11:34,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=486247.6666666667, ans=0.125 2024-06-22 00:11:36,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=486247.6666666667, ans=0.125 2024-06-22 00:11:39,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=486266.0, ans=10.0 2024-06-22 00:11:41,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=486266.0, ans=0.2 2024-06-22 00:11:44,279 INFO [train.py:1028] (0/2) Epoch 27, batch 2200, loss[loss=0.2124, simple_loss=0.2639, pruned_loss=0.08049, over 13280.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2499, pruned_loss=0.0649, over 2588106.42 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:11:49,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=486284.3333333333, ans=0.125 2024-06-22 00:11:55,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2024-06-22 00:11:55,392 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486302.6666666667, ans=0.1 2024-06-22 00:12:12,370 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.238e+02 2.351e+02 2.504e+02 3.721e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 00:12:22,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=486357.6666666667, ans=0.125 2024-06-22 00:12:22,651 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486357.6666666667, ans=0.1 2024-06-22 00:12:24,365 INFO [train.py:1028] (0/2) Epoch 27, batch 2250, loss[loss=0.1857, simple_loss=0.2478, pruned_loss=0.06179, over 13306.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2496, pruned_loss=0.06458, over 2587006.29 frames. ], batch size: 63, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:12:25,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. 
limit=15.0 2024-06-22 00:12:27,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=486376.0, ans=0.125 2024-06-22 00:12:36,579 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:12:41,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=486412.6666666667, ans=0.0 2024-06-22 00:12:49,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486449.3333333333, ans=0.0 2024-06-22 00:12:55,149 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=486449.3333333333, ans=0.125 2024-06-22 00:12:56,171 INFO [train.py:1028] (0/2) Epoch 27, batch 2300, loss[loss=0.1913, simple_loss=0.262, pruned_loss=0.06028, over 12954.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2498, pruned_loss=0.06447, over 2581628.92 frames. ], batch size: 33, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:12:58,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=486467.6666666667, ans=0.125 2024-06-22 00:13:01,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=486467.6666666667, ans=0.1 2024-06-22 00:13:16,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=486522.6666666667, ans=0.0 2024-06-22 00:13:22,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=486541.0, ans=0.125 2024-06-22 00:13:22,666 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.240e+02 2.403e+02 2.568e+02 3.306e+02, threshold=4.805e+02, percent-clipped=0.0 2024-06-22 00:13:24,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=486541.0, ans=0.0 2024-06-22 00:13:28,941 INFO [train.py:1028] (0/2) Epoch 27, batch 2350, loss[loss=0.1947, simple_loss=0.2568, pruned_loss=0.06629, over 13204.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2505, pruned_loss=0.0649, over 2585063.03 frames. ], batch size: 67, lr: 2.15e-03, grad_scale: 16.0 2024-06-22 00:13:34,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=486559.3333333333, ans=0.125 2024-06-22 00:13:44,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=486596.0, ans=0.025 2024-06-22 00:13:51,962 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=12.0 2024-06-22 00:14:01,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=486632.6666666667, ans=0.125 2024-06-22 00:14:05,510 INFO [train.py:1028] (0/2) Epoch 27, batch 2400, loss[loss=0.1908, simple_loss=0.2498, pruned_loss=0.06595, over 13335.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2489, pruned_loss=0.06448, over 2588395.25 frames. 
], batch size: 46, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:14:10,589 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2024-06-22 00:14:22,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=486687.6666666667, ans=10.0 2024-06-22 00:14:23,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486687.6666666667, ans=0.1 2024-06-22 00:14:29,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=486706.0, ans=0.0 2024-06-22 00:14:33,802 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.240e+02 2.342e+02 2.480e+02 2.978e+02, threshold=4.685e+02, percent-clipped=0.0 2024-06-22 00:14:39,990 INFO [train.py:1028] (0/2) Epoch 27, batch 2450, loss[loss=0.1817, simple_loss=0.2456, pruned_loss=0.0589, over 13286.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2486, pruned_loss=0.06493, over 2583996.32 frames. ], batch size: 63, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:14:43,750 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.31 vs. limit=10.0 2024-06-22 00:14:44,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486742.6666666667, ans=0.1 2024-06-22 00:14:48,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=486761.0, ans=0.125 2024-06-22 00:14:57,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=486779.3333333333, ans=0.125 2024-06-22 00:15:06,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=486816.0, ans=0.125 2024-06-22 00:15:10,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486816.0, ans=0.0 2024-06-22 00:15:12,580 INFO [train.py:1028] (0/2) Epoch 27, batch 2500, loss[loss=0.1848, simple_loss=0.2389, pruned_loss=0.06535, over 13243.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2479, pruned_loss=0.06469, over 2586799.50 frames. ], batch size: 83, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:15:18,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486834.3333333333, ans=0.125 2024-06-22 00:15:22,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=486852.6666666667, ans=0.125 2024-06-22 00:15:38,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.49 vs. limit=15.0 2024-06-22 00:15:41,454 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.221e+02 2.326e+02 2.555e+02 3.079e+02, threshold=4.652e+02, percent-clipped=0.0 2024-06-22 00:15:47,943 INFO [train.py:1028] (0/2) Epoch 27, batch 2550, loss[loss=0.2071, simple_loss=0.2673, pruned_loss=0.0735, over 12525.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2466, pruned_loss=0.06421, over 2585806.61 frames. 
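Two housekeeping cadences frame this stretch: `checkpoint.py:75` saved `zipformer/exp/checkpoint-264000.pt` a little earlier (the config has `save_every_n: 4000`, the filename is the global batch index, and 264000 × 550 × 2 / 600 = 484000 matches the adjusted batch_count around that record), and a full validation pass appears a few lines below this one (`valid_interval: 3000`; "Computing validation loss", then "validation: loss=0.1903 ..."). A minimal sketch of that bookkeeping, with a hypothetical helper name and signature:

```python
from pathlib import Path

import torch

SAVE_EVERY_N = 4000    # from the run config: save_every_n
VALID_INTERVAL = 3000  # from the run config: valid_interval
EXP_DIR = Path("zipformer/exp")

def maybe_housekeep(batch_idx_train, model, optimizer, compute_validation):
    """Hypothetical helper: periodic checkpointing and validation, in the
    spirit of the 'Saving checkpoint to .../checkpoint-264000.pt' and
    'Computing validation loss' records in this log."""
    if batch_idx_train % SAVE_EVERY_N == 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
            EXP_DIR / f"checkpoint-{batch_idx_train}.pt",
        )
    if batch_idx_train % VALID_INTERVAL == 0:
        model.eval()
        with torch.no_grad():
            valid_loss = compute_validation(model)
        model.train()
        return valid_loss
    return None
```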
], batch size: 22, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:15:50,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=486926.0, ans=0.125 2024-06-22 00:15:56,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486944.3333333333, ans=0.125 2024-06-22 00:16:16,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=486999.3333333333, ans=0.2 2024-06-22 00:16:17,266 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:16:23,386 INFO [train.py:1028] (0/2) Epoch 27, batch 2600, loss[loss=0.1781, simple_loss=0.2452, pruned_loss=0.0555, over 13257.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2458, pruned_loss=0.06416, over 2585077.00 frames. ], batch size: 52, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:16:24,399 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.29 vs. limit=22.5 2024-06-22 00:16:30,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=487036.0, ans=0.125 2024-06-22 00:16:32,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=487036.0, ans=0.1 2024-06-22 00:16:49,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=487091.0, ans=0.0 2024-06-22 00:16:49,380 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.217e+02 2.364e+02 2.495e+02 3.597e+02, threshold=4.728e+02, percent-clipped=0.0 2024-06-22 00:16:51,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.74 vs. limit=12.0 2024-06-22 00:16:56,101 INFO [train.py:1028] (0/2) Epoch 27, batch 2650, loss[loss=0.1976, simple_loss=0.2442, pruned_loss=0.07547, over 13010.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2445, pruned_loss=0.06371, over 2586928.66 frames. ], batch size: 144, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:17:02,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487127.6666666667, ans=0.1 2024-06-22 00:17:07,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=487127.6666666667, ans=0.0 2024-06-22 00:17:15,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2024-06-22 00:17:15,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.17 vs. 
limit=15.0 2024-06-22 00:17:18,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487164.3333333333, ans=0.1 2024-06-22 00:17:18,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=487164.3333333333, ans=0.125 2024-06-22 00:17:29,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.30 vs. limit=15.0 2024-06-22 00:17:30,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=487201.0, ans=0.0 2024-06-22 00:17:31,282 INFO [train.py:1028] (0/2) Epoch 27, batch 2700, loss[loss=0.1826, simple_loss=0.2374, pruned_loss=0.06389, over 13254.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2432, pruned_loss=0.0637, over 2585271.66 frames. ], batch size: 89, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:17:40,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=487219.3333333333, ans=0.2 2024-06-22 00:17:42,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.50 vs. limit=15.0 2024-06-22 00:17:47,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=487237.6666666667, ans=0.0 2024-06-22 00:17:48,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=487237.6666666667, ans=0.1 2024-06-22 00:17:57,023 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.313e+02 2.445e+02 2.618e+02 3.487e+02, threshold=4.890e+02, percent-clipped=0.0 2024-06-22 00:18:01,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=487274.3333333333, ans=0.2 2024-06-22 00:18:07,204 INFO [train.py:1028] (0/2) Epoch 27, batch 2750, loss[loss=0.2044, simple_loss=0.2543, pruned_loss=0.07721, over 13242.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2421, pruned_loss=0.06312, over 2582619.66 frames. ], batch size: 43, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:18:11,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=487292.6666666667, ans=0.125 2024-06-22 00:18:15,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=487311.0, ans=0.0 2024-06-22 00:18:26,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=487329.3333333333, ans=0.0 2024-06-22 00:18:28,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=487347.6666666667, ans=0.125 2024-06-22 00:18:40,503 INFO [train.py:1028] (0/2) Epoch 27, batch 2800, loss[loss=0.2061, simple_loss=0.2507, pruned_loss=0.0807, over 10918.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2416, pruned_loss=0.06294, over 2579683.35 frames. ], batch size: 304, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:18:58,201 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.70 vs. 
limit=10.0 2024-06-22 00:19:01,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=487439.3333333333, ans=0.125 2024-06-22 00:19:05,977 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.862e+02 2.299e+02 2.434e+02 2.749e+02 3.524e+02, threshold=4.867e+02, percent-clipped=0.0 2024-06-22 00:19:12,258 INFO [train.py:1028] (0/2) Epoch 27, batch 2850, loss[loss=0.1707, simple_loss=0.2359, pruned_loss=0.05273, over 13041.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2402, pruned_loss=0.06273, over 2577833.64 frames. ], batch size: 48, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:19:21,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=487494.3333333333, ans=0.125 2024-06-22 00:19:33,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=487531.0, ans=0.125 2024-06-22 00:19:36,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=487531.0, ans=0.0 2024-06-22 00:19:36,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=487531.0, ans=0.125 2024-06-22 00:19:38,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=487531.0, ans=0.125 2024-06-22 00:19:41,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.54 vs. limit=22.5 2024-06-22 00:19:46,727 INFO [train.py:1028] (0/2) Epoch 27, batch 2900, loss[loss=0.1747, simple_loss=0.2325, pruned_loss=0.05845, over 13121.00 frames. ], tot_loss[loss=0.1812, simple_loss=0.2383, pruned_loss=0.06209, over 2585685.57 frames. ], batch size: 55, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:19:52,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=487567.6666666667, ans=0.0 2024-06-22 00:19:53,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=487567.6666666667, ans=0.0 2024-06-22 00:20:03,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=487604.3333333333, ans=0.2 2024-06-22 00:20:12,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487622.6666666667, ans=0.1 2024-06-22 00:20:15,576 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.234e+02 2.347e+02 2.565e+02 3.663e+02, threshold=4.694e+02, percent-clipped=0.0 2024-06-22 00:20:22,201 INFO [train.py:1028] (0/2) Epoch 27, batch 2950, loss[loss=0.1803, simple_loss=0.2385, pruned_loss=0.06107, over 13162.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2381, pruned_loss=0.06195, over 2579181.90 frames. ], batch size: 43, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:20:24,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.76 vs. 
limit=10.0 2024-06-22 00:20:28,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=487677.6666666667, ans=0.125 2024-06-22 00:20:33,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=487677.6666666667, ans=0.125 2024-06-22 00:20:36,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=487696.0, ans=0.125 2024-06-22 00:20:55,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2024-06-22 00:20:55,988 INFO [train.py:1028] (0/2) Epoch 27, batch 3000, loss[loss=0.1703, simple_loss=0.2297, pruned_loss=0.05538, over 13149.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2371, pruned_loss=0.06156, over 2578639.90 frames. ], batch size: 59, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:20:55,989 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 00:21:00,798 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.8366, 4.4167, 4.8061, 4.4103], device='cuda:0') 2024-06-22 00:21:03,891 INFO [train.py:1060] (0/2) Epoch 27, validation: loss=0.1903, simple_loss=0.2509, pruned_loss=0.06482, over 351949.00 frames. 2024-06-22 00:21:03,891 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 00:21:06,218 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.13 vs. limit=22.5 2024-06-22 00:21:26,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=487787.6666666667, ans=0.0 2024-06-22 00:21:29,010 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.45 vs. limit=10.0 2024-06-22 00:21:32,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=487806.0, ans=0.1 2024-06-22 00:21:34,041 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.298e+02 2.509e+02 2.774e+02 3.609e+02, threshold=5.018e+02, percent-clipped=0.0 2024-06-22 00:21:40,639 INFO [train.py:1028] (0/2) Epoch 27, batch 3050, loss[loss=0.1591, simple_loss=0.2205, pruned_loss=0.04882, over 13299.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2364, pruned_loss=0.06143, over 2577933.12 frames. ], batch size: 46, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:21:42,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=487842.6666666667, ans=0.125 2024-06-22 00:21:45,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=487842.6666666667, ans=0.2 2024-06-22 00:22:00,196 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.66 vs. limit=15.0 2024-06-22 00:22:05,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.26 vs. 
2024-06-22 00:22:05,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.26 vs. limit=12.0 2024-06-22 00:22:16,120 INFO [train.py:1028] (0/2) Epoch 27, batch 3100, loss[loss=0.1691, simple_loss=0.2172, pruned_loss=0.06055, over 13034.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2361, pruned_loss=0.06129, over 2578582.35 frames. ], batch size: 144, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:22:16,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=487934.3333333333, ans=0.2 2024-06-22 00:22:16,690 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.81 vs. limit=15.0 2024-06-22 00:22:17,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0 2024-06-22 00:22:19,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=487934.3333333333, ans=0.025 2024-06-22 00:22:33,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=487971.0, ans=0.125 2024-06-22 00:22:34,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=487971.0, ans=0.125 2024-06-22 00:22:39,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=487989.3333333333, ans=0.2 2024-06-22 00:22:39,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=487989.3333333333, ans=0.125 2024-06-22 00:22:42,654 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.167e+02 2.307e+02 2.523e+02 3.204e+02, threshold=4.615e+02, percent-clipped=0.0 2024-06-22 00:22:42,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=488007.6666666667, ans=0.0 2024-06-22 00:22:49,588 INFO [train.py:1028] (0/2) Epoch 27, batch 3150, loss[loss=0.1741, simple_loss=0.2239, pruned_loss=0.06215, over 12950.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2345, pruned_loss=0.06073, over 2581170.81 frames. ], batch size: 158, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:22:54,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488026.0, ans=0.0 2024-06-22 00:22:55,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.94 vs. limit=15.0
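The scaling.py:1023 Whitening entries compare a per-module statistic against a limit (e.g. metric=6.26 vs. limit=12.0); the metric measures how far the feature covariance within each channel group is from a multiple of the identity, and the whitening penalty only engages when the limit is exceeded. The following is a plausible reconstruction under the assumption that the metric is the ratio mean(eigenvalue^2) / mean(eigenvalue)^2 of the covariance, which is exactly 1.0 for perfectly white features; the function name and shapes are assumptions.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Hedged sketch: whiteness of activations x with shape
    (num_frames, num_channels). Returns mean(eig^2) / mean(eig)^2 of the
    per-group feature covariance, computed via traces so that no explicit
    eigendecomposition is needed; 1.0 means fully white, larger is worse."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)              # zero-mean per channel
    cov = torch.einsum('ngc,ngd->gcd', x, x) / num_frames
    d = cov.shape[-1]
    mean_eig = cov.diagonal(dim1=-2, dim2=-1).sum(-1) / d   # trace(C)/d
    mean_eig_sq = (cov * cov).sum(dim=(-2, -1)) / d         # trace(C^2)/d
    return (mean_eig_sq / mean_eig.pow(2)).mean().item()

# e.g. for a whiten1 entry with num_channels=288 in a single group:
metric = whitening_metric(torch.randn(1000, 288), num_groups=1)
```

Read this way, an entry like metric=22.58 vs. limit=22.5 marks the rare moment a module's activations drift far enough from white for the constraint to bite.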
2024-06-22 00:22:57,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.11 vs. limit=15.0 2024-06-22 00:23:00,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=488044.3333333333, ans=0.125 2024-06-22 00:23:02,335 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=488062.6666666667, ans=0.025 2024-06-22 00:23:06,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488062.6666666667, ans=0.1 2024-06-22 00:23:22,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=488099.3333333333, ans=0.125 2024-06-22 00:23:23,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=488099.3333333333, ans=0.0 2024-06-22 00:23:25,960 INFO [train.py:1028] (0/2) Epoch 27, batch 3200, loss[loss=0.1799, simple_loss=0.2362, pruned_loss=0.06183, over 13133.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2347, pruned_loss=0.06072, over 2581741.55 frames. ], batch size: 55, lr: 2.15e-03, grad_scale: 32.0 2024-06-22 00:23:26,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=488117.6666666667, ans=0.0 2024-06-22 00:23:27,097 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.67 vs. limit=15.0 2024-06-22 00:23:35,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=488136.0, ans=0.125 2024-06-22 00:23:38,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=488154.3333333333, ans=0.0 2024-06-22 00:23:39,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=488154.3333333333, ans=0.0 2024-06-22 00:23:44,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=488172.6666666667, ans=0.07 2024-06-22 00:23:51,741 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5 2024-06-22 00:23:51,832 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.269e+02 2.369e+02 2.509e+02 2.933e+02, threshold=4.738e+02, percent-clipped=0.0 2024-06-22 00:23:56,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.58 vs. limit=22.5 2024-06-22 00:24:01,416 INFO [train.py:1028] (0/2) Epoch 27, batch 3250, loss[loss=0.1737, simple_loss=0.2288, pruned_loss=0.05931, over 13269.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2339, pruned_loss=0.06038, over 2586018.36 frames. ], batch size: 72, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:24:16,353 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
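In the train.py:1028 entries, the bracketed per-batch loss is consistent throughout with the weighted sum 0.5 * simple_loss + pruned_loss, matching this run's simple_loss_scale of 0.5 (e.g. 0.5 * 0.2359 + 0.05273 = 0.1707 for batch 2850, and the same identity holds for the epoch-27 validation line), while tot_loss[... over N frames] aggregates recent batches weighted by their frame counts. A small worked sketch follows; the running-average helper is an assumption about the exact bookkeeping, not icefall's MetricsTracker itself.

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """The per-batch 'loss' printed in this log: e.g. for Epoch 27 batch
    2850, 0.5 * 0.2359 + 0.05273 = 0.1707, exactly as logged."""
    return simple_loss_scale * simple_loss + pruned_loss

def frame_weighted_average(batches):
    """Hedged sketch of 'tot_loss[... over N frames]': average per-batch
    losses weighted by the number of frames each batch contributed."""
    total_frames = sum(frames for _, frames in batches)
    avg = sum(loss * frames for loss, frames in batches) / total_frames
    return avg, total_frames

# Check against the logged numbers for Epoch 27, batch 2850:
assert abs(combined_loss(0.2359, 0.05273) - 0.1707) < 5e-5
```

Frame-weighting explains why tot_loss moves slowly while the single-batch loss jumps around: each new batch of roughly 13k frames is a small fraction of the ~2.58M frames in the window.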
2024-06-22 00:24:21,457 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.17 vs. limit=22.5 2024-06-22 00:24:26,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=488264.3333333333, ans=0.0 2024-06-22 00:24:31,027 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=488282.6666666667, ans=0.125 2024-06-22 00:24:34,898 INFO [train.py:1028] (0/2) Epoch 27, batch 3300, loss[loss=0.193, simple_loss=0.2443, pruned_loss=0.07086, over 12737.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2335, pruned_loss=0.06033, over 2581587.41 frames. ], batch size: 176, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:24:36,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=488301.0, ans=0.95 2024-06-22 00:24:39,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2024-06-22 00:24:49,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=488337.6666666667, ans=0.125 2024-06-22 00:24:58,775 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:24:59,686 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2024-06-22 00:25:00,400 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.224e+02 2.386e+02 2.522e+02 3.680e+02, threshold=4.772e+02, percent-clipped=0.0 2024-06-22 00:25:01,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488374.3333333333, ans=0.1 2024-06-22 00:25:11,745 INFO [train.py:1028] (0/2) Epoch 27, batch 3350, loss[loss=0.1992, simple_loss=0.2429, pruned_loss=0.07773, over 12938.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2337, pruned_loss=0.06103, over 2577086.06 frames. ], batch size: 158, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:25:14,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=488392.6666666667, ans=0.125 2024-06-22 00:25:15,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=488392.6666666667, ans=0.2 2024-06-22 00:25:21,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.71 vs. limit=22.5 2024-06-22 00:25:23,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=488411.0, ans=0.125 2024-06-22 00:25:26,793 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:25:34,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=488447.6666666667, ans=0.0 2024-06-22 00:25:47,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.06 vs. limit=15.0 2024-06-22 00:25:48,462 INFO [train.py:1028] (0/2) Epoch 27, batch 3400, loss[loss=0.1913, simple_loss=0.2575, pruned_loss=0.06256, over 12333.00 frames.
], tot_loss[loss=0.178, simple_loss=0.2336, pruned_loss=0.06118, over 2574170.03 frames. ], batch size: 22, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:25:48,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=488484.3333333333, ans=0.125 2024-06-22 00:25:58,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=488502.6666666667, ans=10.0 2024-06-22 00:26:07,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=488521.0, ans=0.0 2024-06-22 00:26:14,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=488557.6666666667, ans=0.125 2024-06-22 00:26:14,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=488557.6666666667, ans=0.125 2024-06-22 00:26:14,899 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.261e+02 2.571e+02 2.867e+02 3.936e+02, threshold=5.142e+02, percent-clipped=0.0 2024-06-22 00:26:15,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=488557.6666666667, ans=0.025 2024-06-22 00:26:20,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=488557.6666666667, ans=0.1 2024-06-22 00:26:21,663 INFO [train.py:1028] (0/2) Epoch 27, batch 3450, loss[loss=0.1928, simple_loss=0.2452, pruned_loss=0.07019, over 12797.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2328, pruned_loss=0.06071, over 2575125.23 frames. ], batch size: 176, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:26:43,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=488631.0, ans=0.0 2024-06-22 00:26:44,893 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:26:46,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488631.0, ans=0.1 2024-06-22 00:26:50,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488649.3333333333, ans=0.1 2024-06-22 00:26:50,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=488649.3333333333, ans=0.025 2024-06-22 00:26:53,942 INFO [train.py:1028] (0/2) Epoch 27, batch 3500, loss[loss=0.1768, simple_loss=0.238, pruned_loss=0.05781, over 12898.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2331, pruned_loss=0.06055, over 2574859.15 frames. 
], batch size: 33, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:27:02,575 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=488686.0, ans=0.125 2024-06-22 00:27:09,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=488704.3333333333, ans=0.0 2024-06-22 00:27:16,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=488722.6666666667, ans=0.125 2024-06-22 00:27:21,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488722.6666666667, ans=0.1 2024-06-22 00:27:23,140 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.171e+02 2.339e+02 2.506e+02 3.144e+02, threshold=4.678e+02, percent-clipped=0.0 2024-06-22 00:27:28,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=488741.0, ans=0.1 2024-06-22 00:27:29,660 INFO [train.py:1028] (0/2) Epoch 27, batch 3550, loss[loss=0.1717, simple_loss=0.2226, pruned_loss=0.06036, over 13147.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2329, pruned_loss=0.06044, over 2576116.54 frames. ], batch size: 95, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:27:38,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488777.6666666667, ans=0.1 2024-06-22 00:27:59,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=488832.6666666667, ans=0.2 2024-06-22 00:28:06,140 INFO [train.py:1028] (0/2) Epoch 27, batch 3600, loss[loss=0.1735, simple_loss=0.2295, pruned_loss=0.05873, over 13269.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2327, pruned_loss=0.06056, over 2580782.17 frames. ], batch size: 49, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:28:11,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.48 vs. limit=6.0 2024-06-22 00:28:12,026 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. 
limit=15.0 2024-06-22 00:28:13,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=488869.3333333333, ans=0.0 2024-06-22 00:28:20,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=488887.6666666667, ans=0.2 2024-06-22 00:28:23,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=488887.6666666667, ans=0.125 2024-06-22 00:28:31,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=488906.0, ans=0.125 2024-06-22 00:28:32,927 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.161e+02 2.277e+02 2.417e+02 3.501e+02, threshold=4.554e+02, percent-clipped=0.0 2024-06-22 00:28:36,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=488924.3333333333, ans=0.2 2024-06-22 00:28:37,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=488924.3333333333, ans=0.2 2024-06-22 00:28:39,527 INFO [train.py:1028] (0/2) Epoch 27, batch 3650, loss[loss=0.1724, simple_loss=0.2185, pruned_loss=0.06317, over 13029.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2323, pruned_loss=0.06012, over 2580282.31 frames. ], batch size: 102, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:28:39,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488942.6666666667, ans=0.1 2024-06-22 00:28:57,494 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=488979.3333333333, ans=0.125 2024-06-22 00:29:00,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=488997.6666666667, ans=0.125 2024-06-22 00:29:02,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=488997.6666666667, ans=0.5 2024-06-22 00:29:04,469 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:29:11,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=489016.0, ans=0.125 2024-06-22 00:29:13,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=489016.0, ans=0.125 2024-06-22 00:29:15,417 INFO [train.py:1028] (0/2) Epoch 27, batch 3700, loss[loss=0.17, simple_loss=0.2301, pruned_loss=0.05497, over 13262.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2315, pruned_loss=0.05988, over 2584999.41 frames. 
], batch size: 72, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:29:16,181 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=489034.3333333333, ans=0.125 2024-06-22 00:29:16,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=489034.3333333333, ans=0.2 2024-06-22 00:29:20,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=489034.3333333333, ans=0.025 2024-06-22 00:29:23,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=489052.6666666667, ans=0.2 2024-06-22 00:29:31,758 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:29:31,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=489071.0, ans=0.125 2024-06-22 00:29:31,974 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0 2024-06-22 00:29:39,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=489089.3333333333, ans=0.0 2024-06-22 00:29:41,354 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.140e+02 2.266e+02 2.443e+02 3.403e+02, threshold=4.532e+02, percent-clipped=0.0 2024-06-22 00:29:41,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=489107.6666666667, ans=0.0 2024-06-22 00:29:42,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=489107.6666666667, ans=0.0 2024-06-22 00:29:48,071 INFO [train.py:1028] (0/2) Epoch 27, batch 3750, loss[loss=0.1736, simple_loss=0.238, pruned_loss=0.05456, over 12522.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2311, pruned_loss=0.05973, over 2587419.13 frames. ], batch size: 22, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:29:48,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=489126.0, ans=0.025 2024-06-22 00:30:08,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=489162.6666666667, ans=0.125 2024-06-22 00:30:11,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=489181.0, ans=0.125 2024-06-22 00:30:12,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=489181.0, ans=0.09899494936611666 2024-06-22 00:30:18,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=489199.3333333333, ans=0.125 2024-06-22 00:30:25,342 INFO [train.py:1028] (0/2) Epoch 27, batch 3800, loss[loss=0.1816, simple_loss=0.2365, pruned_loss=0.06336, over 13243.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2308, pruned_loss=0.05945, over 2584974.14 frames. 
], batch size: 83, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:30:25,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=489217.6666666667, ans=0.2 2024-06-22 00:30:32,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=489236.0, ans=0.125 2024-06-22 00:30:37,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489236.0, ans=0.1 2024-06-22 00:30:37,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=489236.0, ans=0.125 2024-06-22 00:30:40,193 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=489254.3333333333, ans=0.0 2024-06-22 00:30:51,937 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.54 vs. limit=15.0 2024-06-22 00:30:52,172 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.160e+02 2.283e+02 2.492e+02 3.279e+02, threshold=4.566e+02, percent-clipped=0.0 2024-06-22 00:30:58,974 INFO [train.py:1028] (0/2) Epoch 27, batch 3850, loss[loss=0.1605, simple_loss=0.2147, pruned_loss=0.05315, over 13056.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2301, pruned_loss=0.05896, over 2585617.55 frames. ], batch size: 144, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:31:06,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489327.6666666667, ans=0.1 2024-06-22 00:31:07,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=489327.6666666667, ans=0.125 2024-06-22 00:31:07,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2024-06-22 00:31:07,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=489327.6666666667, ans=0.125 2024-06-22 00:31:08,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=489327.6666666667, ans=0.125 2024-06-22 00:31:10,840 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.58 vs. 
limit=22.5 2024-06-22 00:31:11,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=489327.6666666667, ans=0.0 2024-06-22 00:31:12,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489346.0, ans=0.1 2024-06-22 00:31:13,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=489346.0, ans=0.125 2024-06-22 00:31:13,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=489346.0, ans=0.2 2024-06-22 00:31:14,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=489346.0, ans=0.0 2024-06-22 00:31:25,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=489364.3333333333, ans=0.2 2024-06-22 00:31:26,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=489364.3333333333, ans=0.2 2024-06-22 00:31:26,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. limit=6.0 2024-06-22 00:31:34,785 INFO [train.py:1028] (0/2) Epoch 27, batch 3900, loss[loss=0.1599, simple_loss=0.2146, pruned_loss=0.05266, over 13244.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2303, pruned_loss=0.05914, over 2588621.62 frames. ], batch size: 83, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:31:40,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.00 vs. limit=15.0 2024-06-22 00:31:50,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=489437.6666666667, ans=0.125 2024-06-22 00:31:54,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=489456.0, ans=0.2 2024-06-22 00:31:56,861 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489456.0, ans=0.1 2024-06-22 00:31:58,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=489456.0, ans=0.125 2024-06-22 00:32:01,613 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.195e+02 2.372e+02 2.567e+02 3.418e+02, threshold=4.743e+02, percent-clipped=0.0 2024-06-22 00:32:11,807 INFO [train.py:1028] (0/2) Epoch 27, batch 3950, loss[loss=0.1746, simple_loss=0.226, pruned_loss=0.06163, over 13094.00 frames. ], tot_loss[loss=0.1738, simple_loss=0.2297, pruned_loss=0.05896, over 2591456.29 frames. 
], batch size: 132, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:32:12,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=489492.6666666667, ans=0.125 2024-06-22 00:32:14,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=489492.6666666667, ans=0.125 2024-06-22 00:32:20,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489511.0, ans=0.1 2024-06-22 00:32:24,843 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.34 vs. limit=22.5 2024-06-22 00:32:27,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=489529.3333333333, ans=0.1 2024-06-22 00:32:33,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=489547.6666666667, ans=0.125 2024-06-22 00:32:43,806 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=489566.0, ans=0.125 2024-06-22 00:32:45,631 INFO [train.py:1028] (0/2) Epoch 27, batch 4000, loss[loss=0.1811, simple_loss=0.2422, pruned_loss=0.05999, over 12958.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.229, pruned_loss=0.05871, over 2585887.41 frames. ], batch size: 39, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:32:51,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2024-06-22 00:32:52,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=489602.6666666667, ans=0.2 2024-06-22 00:33:05,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=489639.3333333333, ans=0.125 2024-06-22 00:33:12,086 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.199e+02 2.291e+02 2.486e+02 3.722e+02, threshold=4.582e+02, percent-clipped=0.0 2024-06-22 00:33:21,793 INFO [train.py:1028] (0/2) Epoch 27, batch 4050, loss[loss=0.1698, simple_loss=0.222, pruned_loss=0.05879, over 11028.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.229, pruned_loss=0.05871, over 2584466.49 frames. 
], batch size: 304, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:33:22,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=489676.0, ans=0.1 2024-06-22 00:33:23,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489676.0, ans=0.1 2024-06-22 00:33:39,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=489712.6666666667, ans=0.2 2024-06-22 00:33:47,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=489749.3333333333, ans=0.125 2024-06-22 00:33:49,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=489749.3333333333, ans=0.125 2024-06-22 00:33:51,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.83 vs. limit=15.0 2024-06-22 00:33:55,232 INFO [train.py:1028] (0/2) Epoch 27, batch 4100, loss[loss=0.1869, simple_loss=0.2363, pruned_loss=0.06874, over 13064.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2289, pruned_loss=0.05899, over 2580216.49 frames. ], batch size: 102, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:34:01,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=489767.6666666667, ans=0.025 2024-06-22 00:34:02,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=489767.6666666667, ans=0.125 2024-06-22 00:34:12,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=489804.3333333333, ans=0.0 2024-06-22 00:34:18,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=489822.6666666667, ans=0.125 2024-06-22 00:34:20,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=489822.6666666667, ans=0.2 2024-06-22 00:34:21,718 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2024-06-22 00:34:22,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=489822.6666666667, ans=0.125 2024-06-22 00:34:25,443 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.250e+02 2.436e+02 2.619e+02 3.709e+02, threshold=4.872e+02, percent-clipped=0.0 2024-06-22 00:34:26,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=489841.0, ans=0.125 2024-06-22 00:34:32,523 INFO [train.py:1028] (0/2) Epoch 27, batch 4150, loss[loss=0.1625, simple_loss=0.2191, pruned_loss=0.05291, over 13103.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2285, pruned_loss=0.05866, over 2578555.95 frames. 
], batch size: 55, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:34:39,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=489877.6666666667, ans=0.125 2024-06-22 00:34:41,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=489877.6666666667, ans=0.125 2024-06-22 00:34:42,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=489877.6666666667, ans=0.0 2024-06-22 00:34:45,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=15.0 2024-06-22 00:34:47,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=489896.0, ans=0.5 2024-06-22 00:34:50,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.53 vs. limit=22.5 2024-06-22 00:34:55,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=489914.3333333333, ans=0.0 2024-06-22 00:34:59,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=489932.6666666667, ans=0.125 2024-06-22 00:35:03,369 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:35:04,996 INFO [train.py:1028] (0/2) Epoch 27, batch 4200, loss[loss=0.1774, simple_loss=0.2237, pruned_loss=0.06556, over 12976.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.228, pruned_loss=0.05867, over 2580710.08 frames. ], batch size: 102, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:35:11,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=489969.3333333333, ans=0.125 2024-06-22 00:35:30,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.55 vs. limit=15.0 2024-06-22 00:35:34,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.151e+02 2.269e+02 2.470e+02 3.623e+02, threshold=4.538e+02, percent-clipped=0.0 2024-06-22 00:35:41,289 INFO [train.py:1028] (0/2) Epoch 27, batch 4250, loss[loss=0.1622, simple_loss=0.2212, pruned_loss=0.05162, over 13281.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2279, pruned_loss=0.05845, over 2583487.87 frames. 
], batch size: 46, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:35:44,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=490042.6666666667, ans=0.025 2024-06-22 00:35:46,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=490042.6666666667, ans=0.0 2024-06-22 00:35:50,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=490061.0, ans=0.125 2024-06-22 00:35:50,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=490061.0, ans=0.2 2024-06-22 00:35:51,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=15.0 2024-06-22 00:35:58,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=490079.3333333333, ans=0.025 2024-06-22 00:36:05,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=490097.6666666667, ans=0.2 2024-06-22 00:36:16,800 INFO [train.py:1028] (0/2) Epoch 27, batch 4300, loss[loss=0.156, simple_loss=0.2143, pruned_loss=0.04881, over 13198.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2281, pruned_loss=0.05848, over 2583791.28 frames. ], batch size: 59, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:36:27,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=490152.6666666667, ans=0.05 2024-06-22 00:36:36,473 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2024-06-22 00:36:42,304 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.164e+02 2.277e+02 2.472e+02 3.951e+02, threshold=4.554e+02, percent-clipped=0.0 2024-06-22 00:36:48,716 INFO [train.py:1028] (0/2) Epoch 27, batch 4350, loss[loss=0.1853, simple_loss=0.2442, pruned_loss=0.06323, over 13209.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2276, pruned_loss=0.05865, over 2587797.23 frames. ], batch size: 59, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:36:48,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=490226.0, ans=0.0 2024-06-22 00:36:49,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=490226.0, ans=0.0 2024-06-22 00:36:49,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=490226.0, ans=0.125 2024-06-22 00:36:49,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.03 vs. limit=10.0 2024-06-22 00:36:54,508 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.93 vs. 
limit=15.0 2024-06-22 00:36:57,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=490244.3333333333, ans=0.125 2024-06-22 00:37:01,980 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2024-06-22 00:37:08,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=490281.0, ans=0.0 2024-06-22 00:37:09,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.85 vs. limit=22.5 2024-06-22 00:37:12,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=490281.0, ans=0.0 2024-06-22 00:37:24,934 INFO [train.py:1028] (0/2) Epoch 27, batch 4400, loss[loss=0.1757, simple_loss=0.2185, pruned_loss=0.06645, over 13253.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2274, pruned_loss=0.05876, over 2587382.81 frames. ], batch size: 83, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:37:27,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.81 vs. limit=15.0 2024-06-22 00:37:28,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=490317.6666666667, ans=0.1 2024-06-22 00:37:51,405 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.175e+02 2.331e+02 2.533e+02 3.142e+02, threshold=4.662e+02, percent-clipped=0.0 2024-06-22 00:37:56,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=490391.0, ans=0.125 2024-06-22 00:37:58,038 INFO [train.py:1028] (0/2) Epoch 27, batch 4450, loss[loss=0.1719, simple_loss=0.2297, pruned_loss=0.05711, over 13006.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2279, pruned_loss=0.05911, over 2581724.52 frames. ], batch size: 33, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:38:18,809 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.48 vs. limit=15.0 2024-06-22 00:38:20,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=490446.0, ans=0.0 2024-06-22 00:38:20,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.79 vs. limit=22.5 2024-06-22 00:38:25,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=490464.3333333333, ans=0.1 2024-06-22 00:38:35,099 INFO [train.py:1028] (0/2) Epoch 27, batch 4500, loss[loss=0.1623, simple_loss=0.2188, pruned_loss=0.05294, over 13241.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2271, pruned_loss=0.05865, over 2586490.84 frames. ], batch size: 89, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:38:39,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=490501.0, ans=0.0 2024-06-22 00:38:41,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. 
limit=5.0 2024-06-22 00:38:42,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=490519.3333333333, ans=0.025 2024-06-22 00:38:47,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=490537.6666666667, ans=0.125 2024-06-22 00:38:55,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=490556.0, ans=0.0 2024-06-22 00:38:57,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=490556.0, ans=0.025 2024-06-22 00:39:01,736 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.179e+02 2.263e+02 2.441e+02 3.064e+02, threshold=4.527e+02, percent-clipped=0.0 2024-06-22 00:39:02,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490574.3333333333, ans=0.125 2024-06-22 00:39:04,981 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.95 vs. limit=15.0 2024-06-22 00:39:08,478 INFO [train.py:1028] (0/2) Epoch 27, batch 4550, loss[loss=0.1709, simple_loss=0.2331, pruned_loss=0.0543, over 13239.00 frames. ], tot_loss[loss=0.1723, simple_loss=0.2273, pruned_loss=0.05868, over 2589585.88 frames. ], batch size: 52, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:39:33,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=490647.6666666667, ans=22.5 2024-06-22 00:39:34,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=490647.6666666667, ans=0.125 2024-06-22 00:39:37,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=490647.6666666667, ans=0.125 2024-06-22 00:39:40,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=490666.0, ans=0.1 2024-06-22 00:39:43,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=490666.0, ans=0.125 2024-06-22 00:39:45,647 INFO [train.py:1028] (0/2) Epoch 27, batch 4600, loss[loss=0.1871, simple_loss=0.2338, pruned_loss=0.07014, over 12562.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2273, pruned_loss=0.05873, over 2584777.88 frames. ], batch size: 202, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:39:47,327 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2024-06-22 00:40:00,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=490721.0, ans=0.025 2024-06-22 00:40:14,790 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.179e+02 2.352e+02 2.573e+02 3.426e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 00:40:14,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=490757.6666666667, ans=0.0 2024-06-22 00:40:21,291 INFO [train.py:1028] (0/2) Epoch 27, batch 4650, loss[loss=0.165, simple_loss=0.212, pruned_loss=0.05902, over 13135.00 frames. 
], tot_loss[loss=0.1717, simple_loss=0.2263, pruned_loss=0.05849, over 2588056.73 frames. ], batch size: 132, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:40:21,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=490776.0, ans=0.0 2024-06-22 00:40:26,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=490776.0, ans=0.2 2024-06-22 00:40:31,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=490794.3333333333, ans=0.125 2024-06-22 00:40:33,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=490812.6666666667, ans=0.2 2024-06-22 00:40:38,609 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2024-06-22 00:40:42,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=490831.0, ans=0.125 2024-06-22 00:40:44,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=490831.0, ans=0.125 2024-06-22 00:40:46,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=490831.0, ans=0.125 2024-06-22 00:40:48,530 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2024-06-22 00:40:49,933 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.80 vs. limit=10.0 2024-06-22 00:40:54,461 INFO [train.py:1028] (0/2) Epoch 27, batch 4700, loss[loss=0.1806, simple_loss=0.244, pruned_loss=0.05861, over 12494.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2261, pruned_loss=0.05832, over 2583987.44 frames. ], batch size: 25, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:40:56,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490867.6666666667, ans=0.1 2024-06-22 00:41:05,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490886.0, ans=0.1 2024-06-22 00:41:07,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=490904.3333333333, ans=0.035 2024-06-22 00:41:09,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=490904.3333333333, ans=0.04949747468305833 2024-06-22 00:41:13,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=490922.6666666667, ans=0.0 2024-06-22 00:41:23,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2024-06-22 00:41:26,416 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.130e+02 2.290e+02 2.558e+02 3.458e+02, threshold=4.581e+02, percent-clipped=0.0 2024-06-22 00:41:26,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.23 vs. limit=22.5 2024-06-22 00:41:30,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=490941.0, ans=0.2 2024-06-22 00:41:33,104 INFO [train.py:1028] (0/2) Epoch 27, batch 4750, loss[loss=0.1905, simple_loss=0.2331, pruned_loss=0.07391, over 12583.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2257, pruned_loss=0.05825, over 2581194.05 frames. ], batch size: 202, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:41:38,289 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.55 vs. limit=22.5 2024-06-22 00:41:44,835 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=15.0 2024-06-22 00:41:49,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=490996.0, ans=0.0 2024-06-22 00:42:06,836 INFO [train.py:1028] (0/2) Epoch 27, batch 4800, loss[loss=0.1597, simple_loss=0.2098, pruned_loss=0.05476, over 13272.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2256, pruned_loss=0.05817, over 2578105.87 frames. ], batch size: 63, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:42:09,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=491051.0, ans=0.125 2024-06-22 00:42:16,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=491051.0, ans=0.0 2024-06-22 00:42:27,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.88 vs. limit=5.0 2024-06-22 00:42:29,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=491087.6666666667, ans=0.05 2024-06-22 00:42:31,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=491106.0, ans=0.0 2024-06-22 00:42:31,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=491106.0, ans=0.125 2024-06-22 00:42:31,708 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2024-06-22 00:42:32,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=491106.0, ans=0.0 2024-06-22 00:42:37,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=491124.3333333333, ans=0.5 2024-06-22 00:42:38,261 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.189e+02 2.350e+02 2.488e+02 3.716e+02, threshold=4.699e+02, percent-clipped=0.0 2024-06-22 00:42:44,852 INFO [train.py:1028] (0/2) Epoch 27, batch 4850, loss[loss=0.1601, simple_loss=0.2058, pruned_loss=0.05723, over 13258.00 frames. 
], tot_loss[loss=0.1715, simple_loss=0.2261, pruned_loss=0.05849, over 2576783.16 frames. ], batch size: 89, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:43:04,357 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:43:04,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=491179.3333333333, ans=0.125 2024-06-22 00:43:08,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=491197.6666666667, ans=0.125 2024-06-22 00:43:22,419 INFO [train.py:1028] (0/2) Epoch 27, batch 4900, loss[loss=0.1576, simple_loss=0.2198, pruned_loss=0.04773, over 13189.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.226, pruned_loss=0.05835, over 2576484.02 frames. ], batch size: 59, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:43:32,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=491252.6666666667, ans=0.0 2024-06-22 00:43:39,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=491271.0, ans=0.125 2024-06-22 00:43:48,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.194e+02 2.336e+02 2.481e+02 3.240e+02, threshold=4.673e+02, percent-clipped=0.0 2024-06-22 00:43:49,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=491307.6666666667, ans=0.125 2024-06-22 00:43:55,032 INFO [train.py:1028] (0/2) Epoch 27, batch 4950, loss[loss=0.1777, simple_loss=0.2198, pruned_loss=0.06787, over 11094.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.226, pruned_loss=0.05853, over 2570583.38 frames. ], batch size: 303, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:43:57,076 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-268000.pt 2024-06-22 00:44:06,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=491344.3333333333, ans=0.0 2024-06-22 00:44:08,401 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2024-06-22 00:44:11,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=491344.3333333333, ans=0.0 2024-06-22 00:44:24,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=491381.0, ans=0.125 2024-06-22 00:44:28,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=491381.0, ans=0.125 2024-06-22 00:44:30,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=491399.3333333333, ans=0.0 2024-06-22 00:44:33,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=491399.3333333333, ans=0.0 2024-06-22 00:44:35,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=491399.3333333333, ans=0.125 2024-06-22 00:44:37,604 INFO [train.py:1028] (0/2) Epoch 27, batch 5000, loss[loss=0.1618, simple_loss=0.2077, pruned_loss=0.05791, over 13148.00 frames. 
], tot_loss[loss=0.1713, simple_loss=0.226, pruned_loss=0.05829, over 2574128.63 frames. ], batch size: 95, lr: 2.14e-03, grad_scale: 64.0 2024-06-22 00:44:39,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491417.6666666667, ans=0.125 2024-06-22 00:44:50,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=491436.0, ans=0.125 2024-06-22 00:44:57,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=491472.6666666667, ans=0.125 2024-06-22 00:45:01,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=491472.6666666667, ans=0.0 2024-06-22 00:45:05,559 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.136e+02 2.249e+02 2.409e+02 3.064e+02, threshold=4.499e+02, percent-clipped=0.0 2024-06-22 00:45:06,255 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:45:06,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=491491.0, ans=10.0 2024-06-22 00:45:11,376 INFO [train.py:1028] (0/2) Epoch 27, batch 5050, loss[loss=0.1449, simple_loss=0.2092, pruned_loss=0.04026, over 12943.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2261, pruned_loss=0.05804, over 2574220.22 frames. ], batch size: 36, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:45:12,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.19 vs. limit=12.0 2024-06-22 00:45:13,675 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.84 vs. limit=15.0 2024-06-22 00:45:23,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=491527.6666666667, ans=0.125 2024-06-22 00:45:24,042 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=491546.0, ans=0.025 2024-06-22 00:45:30,816 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2024-06-22 00:45:31,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=491546.0, ans=0.125 2024-06-22 00:45:34,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=491564.3333333333, ans=0.025 2024-06-22 00:45:47,461 INFO [train.py:1028] (0/2) Epoch 27, batch 5100, loss[loss=0.153, simple_loss=0.208, pruned_loss=0.049, over 12902.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.226, pruned_loss=0.05826, over 2569906.73 frames. 
], batch size: 39, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:45:48,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=491601.0, ans=0.025 2024-06-22 00:45:55,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=491619.3333333333, ans=0.2 2024-06-22 00:45:57,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=491619.3333333333, ans=0.125 2024-06-22 00:46:07,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=491656.0, ans=0.0 2024-06-22 00:46:17,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=491674.3333333333, ans=22.5 2024-06-22 00:46:18,175 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.854e+02 2.175e+02 2.347e+02 2.592e+02 3.756e+02, threshold=4.695e+02, percent-clipped=0.0 2024-06-22 00:46:20,867 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=491674.3333333333, ans=0.125 2024-06-22 00:46:22,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=491674.3333333333, ans=0.125 2024-06-22 00:46:23,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=491692.6666666667, ans=0.2 2024-06-22 00:46:23,870 INFO [train.py:1028] (0/2) Epoch 27, batch 5150, loss[loss=0.1572, simple_loss=0.2056, pruned_loss=0.05437, over 13115.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2256, pruned_loss=0.05835, over 2571732.29 frames. ], batch size: 132, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:46:24,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=491692.6666666667, ans=0.2 2024-06-22 00:46:36,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=491729.3333333333, ans=0.0 2024-06-22 00:46:43,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=491747.6666666667, ans=0.0 2024-06-22 00:46:54,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.64 vs. limit=15.0 2024-06-22 00:46:56,767 INFO [train.py:1028] (0/2) Epoch 27, batch 5200, loss[loss=0.1807, simple_loss=0.2336, pruned_loss=0.0639, over 13146.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2255, pruned_loss=0.05825, over 2574750.92 frames. 
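The recurring optim.py:487 warnings print five grad-norm quantiles (apparently min/25%/50%/75%/max) plus a clipping threshold, and in every record here the threshold is almost exactly 2.0 times the middle value, matching the printed Clipping_scale=2.0 (e.g. 2.0 × 2.347e+02 ≈ 4.695e+02). A minimal sketch of median-based clipping in that spirit, assuming a running window of recent norms; the actual ScaledAdam logic in icefall's optim.py is more involved:

```python
# Median-based gradient clipping in the spirit of the warnings above,
# assuming threshold = clipping_scale * running-median of recent
# gradient norms. A simplified stand-in, not the real ScaledAdam code.
import torch

def clip_gradients(grads, recent_norms, clipping_scale=2.0):
    norms = sorted(recent_norms)
    median = norms[len(norms) // 2]           # the middle "quartile"
    threshold = clipping_scale * median       # e.g. 2.0 * 2.347e+02
    total = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total > threshold:
        for g in grads:
            g.mul_(threshold / total)         # scale down, keep direction
    return float(total), threshold
```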
], batch size: 95, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:46:58,345 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=491784.3333333333, ans=0.2 2024-06-22 00:47:27,359 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.136e+02 2.273e+02 2.409e+02 3.380e+02, threshold=4.545e+02, percent-clipped=0.0 2024-06-22 00:47:27,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=491857.6666666667, ans=0.125 2024-06-22 00:47:30,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=491857.6666666667, ans=0.125 2024-06-22 00:47:32,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=491876.0, ans=0.125 2024-06-22 00:47:33,132 INFO [train.py:1028] (0/2) Epoch 27, batch 5250, loss[loss=0.1457, simple_loss=0.2058, pruned_loss=0.04281, over 13224.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2261, pruned_loss=0.05849, over 2570892.88 frames. ], batch size: 52, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:47:39,324 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2024-06-22 00:47:40,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=491894.3333333333, ans=0.0 2024-06-22 00:47:55,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=491931.0, ans=0.05 2024-06-22 00:47:56,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=491931.0, ans=0.125 2024-06-22 00:48:01,354 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0 2024-06-22 00:48:06,297 INFO [train.py:1028] (0/2) Epoch 27, batch 5300, loss[loss=0.1596, simple_loss=0.2081, pruned_loss=0.05552, over 13028.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2252, pruned_loss=0.05792, over 2567504.61 frames. ], batch size: 144, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:48:36,452 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.243e+02 2.365e+02 2.589e+02 3.156e+02, threshold=4.730e+02, percent-clipped=0.0 2024-06-22 00:48:42,840 INFO [train.py:1028] (0/2) Epoch 27, batch 5350, loss[loss=0.1652, simple_loss=0.2283, pruned_loss=0.0511, over 11971.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2249, pruned_loss=0.05805, over 2575481.72 frames. ], batch size: 17, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:48:46,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.66 vs. 
limit=22.5 2024-06-22 00:49:11,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=492132.6666666667, ans=0.125 2024-06-22 00:49:17,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492132.6666666667, ans=0.1 2024-06-22 00:49:18,775 INFO [train.py:1028] (0/2) Epoch 27, batch 5400, loss[loss=0.177, simple_loss=0.2209, pruned_loss=0.06655, over 12174.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2256, pruned_loss=0.05859, over 2566836.85 frames. ], batch size: 240, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:49:23,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=492151.0, ans=0.125 2024-06-22 00:49:24,064 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:49:25,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.23 vs. limit=22.5 2024-06-22 00:49:26,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=492169.3333333333, ans=0.2 2024-06-22 00:49:28,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=492169.3333333333, ans=0.0 2024-06-22 00:49:28,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=492169.3333333333, ans=0.2 2024-06-22 00:49:37,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=492187.6666666667, ans=0.025 2024-06-22 00:49:39,792 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2024-06-22 00:49:45,510 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.142e+02 2.275e+02 2.452e+02 3.433e+02, threshold=4.550e+02, percent-clipped=0.0 2024-06-22 00:49:51,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.84 vs. limit=15.0 2024-06-22 00:49:51,406 INFO [train.py:1028] (0/2) Epoch 27, batch 5450, loss[loss=0.1753, simple_loss=0.2308, pruned_loss=0.0599, over 12356.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2258, pruned_loss=0.05856, over 2569482.23 frames. 
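Note how the per-batch loss records pair very different batch sizes (17 to 304 in this stretch) with total frame counts in a narrow ~11k-13.4k band; a duration-budgeted bucketing sampler behaves exactly this way, packing many short cuts or few long ones into each batch. A quick check on numbers taken from nearby records:

```python
# Frames per utterance from (total frames, batch size) pairs in nearby
# records: long-utterance batches hold few cuts, short-utterance batches
# hold many, while the total frame budget stays roughly constant.
records = [(12174, 240), (12356, 25), (12379, 241), (13258, 43)]
for frames, batch_size in records:
    print(f"{frames / batch_size:7.1f} frames/utt at batch size {batch_size}")
```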
], batch size: 25, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:50:00,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=492261.0, ans=0.0 2024-06-22 00:50:01,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=492261.0, ans=0.125 2024-06-22 00:50:10,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492279.3333333333, ans=0.1 2024-06-22 00:50:19,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=492297.6666666667, ans=0.125 2024-06-22 00:50:20,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=492297.6666666667, ans=0.125 2024-06-22 00:50:28,066 INFO [train.py:1028] (0/2) Epoch 27, batch 5500, loss[loss=0.2099, simple_loss=0.2498, pruned_loss=0.08495, over 12379.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2254, pruned_loss=0.05834, over 2563586.13 frames. ], batch size: 241, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:50:37,436 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:50:43,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. limit=6.0 2024-06-22 00:50:45,071 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.65 vs. limit=15.0 2024-06-22 00:50:46,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.73 vs. limit=22.5 2024-06-22 00:50:48,070 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492389.3333333333, ans=0.1 2024-06-22 00:50:48,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=492389.3333333333, ans=0.0 2024-06-22 00:50:52,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=492389.3333333333, ans=0.2 2024-06-22 00:50:55,046 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.869e+02 2.149e+02 2.269e+02 2.464e+02 3.927e+02, threshold=4.538e+02, percent-clipped=0.0 2024-06-22 00:50:55,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492407.6666666667, ans=0.1 2024-06-22 00:50:55,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=492407.6666666667, ans=0.025 2024-06-22 00:51:01,156 INFO [train.py:1028] (0/2) Epoch 27, batch 5550, loss[loss=0.1833, simple_loss=0.2377, pruned_loss=0.06442, over 13258.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2253, pruned_loss=0.058, over 2568202.60 frames. 
], batch size: 43, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:51:06,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=492426.0, ans=0.125 2024-06-22 00:51:09,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=492444.3333333333, ans=0.0 2024-06-22 00:51:19,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=492462.6666666667, ans=0.09899494936611666 2024-06-22 00:51:24,481 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2024-06-22 00:51:29,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=492481.0, ans=0.035 2024-06-22 00:51:36,805 INFO [train.py:1028] (0/2) Epoch 27, batch 5600, loss[loss=0.1738, simple_loss=0.2305, pruned_loss=0.05851, over 13214.00 frames. ], tot_loss[loss=0.1707, simple_loss=0.2252, pruned_loss=0.05809, over 2571378.54 frames. ], batch size: 89, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:51:38,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=492517.6666666667, ans=0.125 2024-06-22 00:51:52,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492554.3333333333, ans=0.1 2024-06-22 00:51:59,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.76 vs. limit=15.0 2024-06-22 00:52:02,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.70 vs. limit=10.0 2024-06-22 00:52:04,963 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.142e+02 2.278e+02 2.413e+02 3.066e+02, threshold=4.557e+02, percent-clipped=0.0 2024-06-22 00:52:06,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=492591.0, ans=0.0 2024-06-22 00:52:13,969 INFO [train.py:1028] (0/2) Epoch 27, batch 5650, loss[loss=0.1944, simple_loss=0.2453, pruned_loss=0.07171, over 12531.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2256, pruned_loss=0.05799, over 2575621.85 frames. 
], batch size: 202, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:52:16,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=492609.3333333333, ans=0.125 2024-06-22 00:52:28,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=492646.0, ans=10.0 2024-06-22 00:52:34,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=492664.3333333333, ans=0.05 2024-06-22 00:52:39,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=492682.6666666667, ans=0.125 2024-06-22 00:52:41,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=492682.6666666667, ans=0.0 2024-06-22 00:52:41,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=492682.6666666667, ans=0.125 2024-06-22 00:52:46,904 INFO [train.py:1028] (0/2) Epoch 27, batch 5700, loss[loss=0.158, simple_loss=0.2174, pruned_loss=0.04924, over 13303.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2254, pruned_loss=0.05809, over 2579143.18 frames. ], batch size: 63, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:52:47,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.32 vs. limit=15.0 2024-06-22 00:52:49,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=492701.0, ans=0.125 2024-06-22 00:52:49,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=492701.0, ans=0.04949747468305833 2024-06-22 00:53:09,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=492756.0, ans=0.125 2024-06-22 00:53:17,042 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.211e+02 2.385e+02 2.602e+02 3.510e+02, threshold=4.770e+02, percent-clipped=0.0 2024-06-22 00:53:17,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492774.3333333333, ans=0.1 2024-06-22 00:53:22,953 INFO [train.py:1028] (0/2) Epoch 27, batch 5750, loss[loss=0.1873, simple_loss=0.2339, pruned_loss=0.07037, over 12684.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.2265, pruned_loss=0.0586, over 2579304.86 frames. ], batch size: 176, lr: 2.14e-03, grad_scale: 32.0 2024-06-22 00:53:32,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492811.0, ans=0.1 2024-06-22 00:53:36,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=492829.3333333333, ans=0.025 2024-06-22 00:53:39,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=492829.3333333333, ans=0.125 2024-06-22 00:53:55,407 INFO [train.py:1028] (0/2) Epoch 27, batch 5800, loss[loss=0.1984, simple_loss=0.2438, pruned_loss=0.07647, over 12766.00 frames. ], tot_loss[loss=0.1733, simple_loss=0.2278, pruned_loss=0.05935, over 2578787.44 frames. 
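The per-batch learning rate printed in these records ticks from 2.14e-03 down to 2.13e-03 around this point, i.e. it decays slowly with batch and epoch count. A sketch of an Eden-style schedule that reproduces this magnitude; all constants below are illustrative assumptions, not values read from this run:

```python
# Eden-style learning rate, decaying polynomially in both batch and
# epoch count; this reproduces the slow 2.14e-03 -> 2.13e-03 drift seen
# in the records here. All constants are illustrative assumptions.
def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(f"{eden_lr(0.035, 268_000, 26.5):.2e}")  # ~2.1e-03
```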
], batch size: 176, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:54:02,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=492884.3333333333, ans=0.0 2024-06-22 00:54:03,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2024-06-22 00:54:08,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492902.6666666667, ans=0.1 2024-06-22 00:54:23,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=492939.3333333333, ans=0.2 2024-06-22 00:54:25,868 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.240e+02 2.321e+02 2.510e+02 3.563e+02, threshold=4.643e+02, percent-clipped=0.0 2024-06-22 00:54:30,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=492957.6666666667, ans=0.125 2024-06-22 00:54:31,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=492976.0, ans=0.0 2024-06-22 00:54:31,697 INFO [train.py:1028] (0/2) Epoch 27, batch 5850, loss[loss=0.2061, simple_loss=0.2568, pruned_loss=0.07771, over 12544.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2297, pruned_loss=0.06028, over 2577091.42 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:54:49,602 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.73 vs. limit=22.5 2024-06-22 00:55:05,021 INFO [train.py:1028] (0/2) Epoch 27, batch 5900, loss[loss=0.1791, simple_loss=0.2273, pruned_loss=0.06546, over 13063.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2314, pruned_loss=0.06067, over 2577062.25 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:55:05,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=493067.6666666667, ans=0.125 2024-06-22 00:55:32,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493122.6666666667, ans=0.1 2024-06-22 00:55:35,089 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.219e+02 2.360e+02 2.571e+02 3.960e+02, threshold=4.720e+02, percent-clipped=0.0 2024-06-22 00:55:41,082 INFO [train.py:1028] (0/2) Epoch 27, batch 5950, loss[loss=0.1715, simple_loss=0.2199, pruned_loss=0.06158, over 13139.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2325, pruned_loss=0.06106, over 2581542.91 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:56:13,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=493232.6666666667, ans=0.125 2024-06-22 00:56:15,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=493232.6666666667, ans=0.0 2024-06-22 00:56:17,003 INFO [train.py:1028] (0/2) Epoch 27, batch 6000, loss[loss=0.2362, simple_loss=0.2748, pruned_loss=0.09884, over 12332.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2336, pruned_loss=0.06142, over 2575519.24 frames. 
], batch size: 241, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:56:17,004 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 00:56:21,561 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.9428, 2.6840, 2.6416, 2.3172, 2.5152, 2.5381, 2.7270, 2.3271], device='cuda:0') 2024-06-22 00:56:24,909 INFO [train.py:1060] (0/2) Epoch 27, validation: loss=0.192, simple_loss=0.2517, pruned_loss=0.06616, over 351949.00 frames. 2024-06-22 00:56:24,910 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 00:56:25,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493251.0, ans=0.1 2024-06-22 00:56:27,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=493251.0, ans=0.0 2024-06-22 00:56:31,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=493269.3333333333, ans=0.0 2024-06-22 00:56:36,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=493269.3333333333, ans=0.09899494936611666 2024-06-22 00:56:42,015 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 00:56:43,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=493287.6666666667, ans=0.1 2024-06-22 00:56:44,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=493306.0, ans=0.0 2024-06-22 00:56:44,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=493306.0, ans=0.0 2024-06-22 00:56:52,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.247e+02 2.417e+02 2.618e+02 3.248e+02, threshold=4.834e+02, percent-clipped=0.0 2024-06-22 00:56:58,366 INFO [train.py:1028] (0/2) Epoch 27, batch 6050, loss[loss=0.1951, simple_loss=0.2569, pruned_loss=0.06669, over 12942.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2352, pruned_loss=0.06194, over 2579211.32 frames. ], batch size: 39, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:57:19,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2024-06-22 00:57:24,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=493397.6666666667, ans=0.125 2024-06-22 00:57:25,202 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.56 vs. limit=22.5 2024-06-22 00:57:30,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.64 vs. limit=15.0 2024-06-22 00:57:34,690 INFO [train.py:1028] (0/2) Epoch 27, batch 6100, loss[loss=0.179, simple_loss=0.2288, pruned_loss=0.06459, over 13104.00 frames. ], tot_loss[loss=0.1799, simple_loss=0.236, pruned_loss=0.06196, over 2582271.29 frames. 
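At the batch-6000 validation pass above, zipformer.py:1858 dumps attn_weights_entropy, one value per attention head of the named self_attn_weights module. A hedged sketch of that diagnostic, computing the entropy of each head's attention distribution averaged over query positions; shapes and the helper name are assumptions for illustration:

```python
# Per-head attention entropy, averaged over query positions; shapes and
# the helper name are assumptions for illustration.
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_heads, num_queries, num_keys), each row a distribution.
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (heads, queries)
    return ent.mean(dim=-1)                           # one value per head

attn = torch.softmax(torch.randn(8, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # low-single-digit values, as above
```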
], batch size: 121, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:57:37,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493434.3333333333, ans=0.1 2024-06-22 00:57:39,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=493434.3333333333, ans=0.125 2024-06-22 00:57:42,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=493452.6666666667, ans=0.0 2024-06-22 00:57:50,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.17 vs. limit=15.0 2024-06-22 00:57:55,188 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.99 vs. limit=15.0 2024-06-22 00:58:01,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=493507.6666666667, ans=0.125 2024-06-22 00:58:02,084 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.230e+02 2.423e+02 2.675e+02 3.878e+02, threshold=4.846e+02, percent-clipped=0.0 2024-06-22 00:58:04,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=493507.6666666667, ans=0.0 2024-06-22 00:58:07,886 INFO [train.py:1028] (0/2) Epoch 27, batch 6150, loss[loss=0.1912, simple_loss=0.2354, pruned_loss=0.07355, over 11037.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2365, pruned_loss=0.06185, over 2580876.24 frames. ], batch size: 304, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:58:24,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=493562.6666666667, ans=0.04949747468305833 2024-06-22 00:58:34,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493581.0, ans=0.1 2024-06-22 00:58:35,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=493581.0, ans=0.125 2024-06-22 00:58:40,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=493599.3333333333, ans=0.0 2024-06-22 00:58:40,777 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=493599.3333333333, ans=0.2 2024-06-22 00:58:43,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=493599.3333333333, ans=0.2 2024-06-22 00:58:45,065 INFO [train.py:1028] (0/2) Epoch 27, batch 6200, loss[loss=0.2134, simple_loss=0.2723, pruned_loss=0.07729, over 13245.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.238, pruned_loss=0.06232, over 2578653.40 frames. ], batch size: 89, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:58:48,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.91 vs. 
limit=15.0 2024-06-22 00:58:52,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=493636.0, ans=0.125 2024-06-22 00:58:58,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=493654.3333333333, ans=0.125 2024-06-22 00:59:15,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493672.6666666667, ans=0.1 2024-06-22 00:59:17,528 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.292e+02 2.478e+02 2.849e+02 3.917e+02, threshold=4.956e+02, percent-clipped=0.0 2024-06-22 00:59:23,802 INFO [train.py:1028] (0/2) Epoch 27, batch 6250, loss[loss=0.1918, simple_loss=0.2445, pruned_loss=0.06953, over 13194.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2395, pruned_loss=0.06303, over 2572034.91 frames. ], batch size: 83, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 00:59:26,228 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=15.0 2024-06-22 00:59:26,342 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2024-06-22 00:59:32,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=493727.6666666667, ans=0.125 2024-06-22 00:59:35,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=493727.6666666667, ans=0.2 2024-06-22 00:59:55,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=493801.0, ans=0.025 2024-06-22 00:59:56,232 INFO [train.py:1028] (0/2) Epoch 27, batch 6300, loss[loss=0.1691, simple_loss=0.2272, pruned_loss=0.05547, over 11351.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2398, pruned_loss=0.0631, over 2566479.35 frames. ], batch size: 16, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:00:04,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=493819.3333333333, ans=0.125 2024-06-22 01:00:05,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=493819.3333333333, ans=0.125 2024-06-22 01:00:13,824 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:00:16,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.98 vs. 
limit=15.0 2024-06-22 01:00:16,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=493837.6666666667, ans=0.125 2024-06-22 01:00:17,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493837.6666666667, ans=0.1 2024-06-22 01:00:20,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=493856.0, ans=0.2 2024-06-22 01:00:26,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=493874.3333333333, ans=0.125 2024-06-22 01:00:27,216 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.390e+02 2.552e+02 2.867e+02 4.454e+02, threshold=5.104e+02, percent-clipped=0.0 2024-06-22 01:00:32,245 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=493874.3333333333, ans=0.2 2024-06-22 01:00:33,455 INFO [train.py:1028] (0/2) Epoch 27, batch 6350, loss[loss=0.2321, simple_loss=0.2781, pruned_loss=0.09307, over 12586.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2417, pruned_loss=0.06353, over 2576372.99 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:00:38,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2024-06-22 01:00:45,141 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=22.5 2024-06-22 01:00:45,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=493911.0, ans=0.125 2024-06-22 01:00:51,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=493929.3333333333, ans=0.0 2024-06-22 01:00:58,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2024-06-22 01:01:01,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=493966.0, ans=0.125 2024-06-22 01:01:04,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=493966.0, ans=0.0 2024-06-22 01:01:05,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=493966.0, ans=0.125 2024-06-22 01:01:07,298 INFO [train.py:1028] (0/2) Epoch 27, batch 6400, loss[loss=0.1775, simple_loss=0.2453, pruned_loss=0.0548, over 13228.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2434, pruned_loss=0.06419, over 2577469.53 frames. ], batch size: 67, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:01:07,831 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.82 vs. 
limit=22.5 2024-06-22 01:01:09,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=493984.3333333333, ans=0.025 2024-06-22 01:01:14,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=494002.6666666667, ans=0.125 2024-06-22 01:01:27,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=494021.0, ans=0.0 2024-06-22 01:01:39,807 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.299e+02 2.451e+02 2.641e+02 5.215e+02, threshold=4.902e+02, percent-clipped=1.0 2024-06-22 01:01:45,961 INFO [train.py:1028] (0/2) Epoch 27, batch 6450, loss[loss=0.2203, simple_loss=0.2688, pruned_loss=0.08587, over 12513.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.245, pruned_loss=0.06477, over 2583476.25 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:01:46,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=494076.0, ans=0.025 2024-06-22 01:01:50,452 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2024-06-22 01:01:57,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=494094.3333333333, ans=0.2 2024-06-22 01:02:01,041 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.50 vs. limit=12.0 2024-06-22 01:02:05,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=494131.0, ans=0.125 2024-06-22 01:02:12,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=494149.3333333333, ans=0.125 2024-06-22 01:02:19,680 INFO [train.py:1028] (0/2) Epoch 27, batch 6500, loss[loss=0.1867, simple_loss=0.2375, pruned_loss=0.06791, over 10879.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2464, pruned_loss=0.06505, over 2585479.15 frames. ], batch size: 303, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:02:28,451 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.36 vs. limit=15.0 2024-06-22 01:02:32,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=494186.0, ans=0.125 2024-06-22 01:02:50,257 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.298e+02 2.509e+02 2.796e+02 4.179e+02, threshold=5.018e+02, percent-clipped=0.0 2024-06-22 01:02:53,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=494241.0, ans=0.0 2024-06-22 01:02:53,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494241.0, ans=0.1 2024-06-22 01:02:56,230 INFO [train.py:1028] (0/2) Epoch 27, batch 6550, loss[loss=0.1702, simple_loss=0.2334, pruned_loss=0.05348, over 12773.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2471, pruned_loss=0.06494, over 2589913.52 frames. 
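The scaling.py:1023 Whitening records compare a per-module statistic against a limit (e.g. metric=9.92 vs. limit=15.0 above). One plausible whiteness proxy, equal to 1.0 for a perfectly white (isotropic) feature covariance and growing toward num_channels as energy concentrates in fewer directions, is sketched below; this is a stand-in only, not necessarily the exact statistic the Whiten module computes:

```python
# A plausible "whiteness" proxy: 1.0 iff the feature covariance is
# proportional to the identity, approaching num_channels as energy
# concentrates in one direction. A stand-in only -- not necessarily the
# exact statistic the Whiten module computes.
import torch

def whiteness_metric(feats: torch.Tensor) -> float:
    # feats: (num_frames, num_channels)
    x = feats - feats.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    c = cov.shape[0]
    return float(c * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2)

feats = torch.randn(1000, 384)     # near-white features
print(whiteness_metric(feats))     # ~1.4 here: 1.0 plus finite-sample noise
```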
], batch size: 22, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:03:04,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=494277.6666666667, ans=0.2 2024-06-22 01:03:06,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=494277.6666666667, ans=0.2 2024-06-22 01:03:11,189 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2024-06-22 01:03:21,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2024-06-22 01:03:32,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.46 vs. limit=15.0 2024-06-22 01:03:32,609 INFO [train.py:1028] (0/2) Epoch 27, batch 6600, loss[loss=0.1775, simple_loss=0.2361, pruned_loss=0.05944, over 13164.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2472, pruned_loss=0.06491, over 2590740.46 frames. ], batch size: 72, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:03:33,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=494351.0, ans=0.125 2024-06-22 01:03:39,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=494369.3333333333, ans=0.2 2024-06-22 01:03:48,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494387.6666666667, ans=0.1 2024-06-22 01:03:48,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0 2024-06-22 01:03:50,275 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.69 vs. limit=12.0 2024-06-22 01:03:51,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=494387.6666666667, ans=0.0 2024-06-22 01:03:52,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=494387.6666666667, ans=0.0 2024-06-22 01:03:58,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=494406.0, ans=0.1 2024-06-22 01:04:00,570 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.338e+02 2.524e+02 2.736e+02 4.038e+02, threshold=5.047e+02, percent-clipped=0.0 2024-06-22 01:04:00,917 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.08 vs. limit=12.0 2024-06-22 01:04:06,461 INFO [train.py:1028] (0/2) Epoch 27, batch 6650, loss[loss=0.1775, simple_loss=0.2353, pruned_loss=0.05985, over 12905.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2496, pruned_loss=0.06584, over 2585736.35 frames. 
], batch size: 158, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:04:14,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=494461.0, ans=0.0 2024-06-22 01:04:20,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0 2024-06-22 01:04:28,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=494497.6666666667, ans=0.0 2024-06-22 01:04:39,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=494516.0, ans=0.0 2024-06-22 01:04:39,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.88 vs. limit=15.0 2024-06-22 01:04:40,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=494516.0, ans=0.125 2024-06-22 01:04:44,724 INFO [train.py:1028] (0/2) Epoch 27, batch 6700, loss[loss=0.1958, simple_loss=0.2476, pruned_loss=0.07197, over 12732.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2504, pruned_loss=0.06616, over 2584612.92 frames. ], batch size: 176, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:04:44,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=494534.3333333333, ans=0.1 2024-06-22 01:05:01,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=494571.0, ans=0.125 2024-06-22 01:05:04,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=494571.0, ans=0.125 2024-06-22 01:05:06,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=494589.3333333333, ans=0.125 2024-06-22 01:05:09,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=494589.3333333333, ans=0.125 2024-06-22 01:05:10,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=494589.3333333333, ans=0.125 2024-06-22 01:05:11,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=494589.3333333333, ans=0.125 2024-06-22 01:05:12,746 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.336e+02 2.507e+02 2.782e+02 3.537e+02, threshold=5.014e+02, percent-clipped=0.0 2024-06-22 01:05:19,070 INFO [train.py:1028] (0/2) Epoch 27, batch 6750, loss[loss=0.2609, simple_loss=0.304, pruned_loss=0.1089, over 12157.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2514, pruned_loss=0.06657, over 2579341.80 frames. 
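The ubiquitous scaling.py:214 records print the current value (ans) of a named ScheduledFloat at the current batch_count; by this stage (~494k batches) most schedules have flattened onto their final values, which is why constants like 0.125, 0.025 and 0.2 keep recurring. A sketch of a piecewise-linear batch-count schedule in that spirit; the breakpoints below are invented for illustration:

```python
# Piecewise-linear schedule over batch_count, in the spirit of
# ScheduledFloat in zipformer's scaling.py; breakpoints are invented.
import bisect

class PiecewiseLinear:
    def __init__(self, *points):            # points: (batch_count, value)
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]               # past the last breakpoint
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a rate annealed from 0.5 to 0.125 over the first 4000 batches,
# which would keep printing ans=0.125 at batch_count ~494k:
prob = PiecewiseLinear((0.0, 0.5), (4000.0, 0.125))
assert prob(494_500.0) == 0.125
```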
], batch size: 240, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:05:29,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=494644.3333333333, ans=0.2 2024-06-22 01:05:35,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=494662.6666666667, ans=0.125 2024-06-22 01:05:42,762 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:05:55,162 INFO [train.py:1028] (0/2) Epoch 27, batch 6800, loss[loss=0.1779, simple_loss=0.242, pruned_loss=0.05687, over 13231.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2523, pruned_loss=0.06673, over 2580557.87 frames. ], batch size: 67, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:06:08,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494754.3333333333, ans=0.1 2024-06-22 01:06:18,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=494772.6666666667, ans=0.125 2024-06-22 01:06:22,542 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.335e+02 2.508e+02 2.798e+02 3.827e+02, threshold=5.017e+02, percent-clipped=0.0 2024-06-22 01:06:27,004 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.34 vs. limit=15.0 2024-06-22 01:06:27,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=494791.0, ans=0.2 2024-06-22 01:06:28,678 INFO [train.py:1028] (0/2) Epoch 27, batch 6850, loss[loss=0.2129, simple_loss=0.2836, pruned_loss=0.07113, over 13270.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2529, pruned_loss=0.06652, over 2584297.77 frames. ], batch size: 63, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:06:41,006 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:06:49,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.97 vs. limit=22.5 2024-06-22 01:06:50,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=494846.0, ans=0.125 2024-06-22 01:06:51,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.19 vs. limit=15.0 2024-06-22 01:06:57,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=494864.3333333333, ans=0.125 2024-06-22 01:07:00,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=494882.6666666667, ans=0.0 2024-06-22 01:07:00,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=494882.6666666667, ans=0.05 2024-06-22 01:07:06,124 INFO [train.py:1028] (0/2) Epoch 27, batch 6900, loss[loss=0.2056, simple_loss=0.2661, pruned_loss=0.07258, over 13269.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2536, pruned_loss=0.06666, over 2585928.61 frames. 
], batch size: 49, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:07:20,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=494937.6666666667, ans=0.2 2024-06-22 01:07:26,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=494956.0, ans=0.125 2024-06-22 01:07:28,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=494956.0, ans=0.125 2024-06-22 01:07:28,713 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:07:29,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=494956.0, ans=0.125 2024-06-22 01:07:33,226 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.376e+02 2.560e+02 2.867e+02 3.923e+02, threshold=5.120e+02, percent-clipped=0.0 2024-06-22 01:07:41,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=494974.3333333333, ans=0.125 2024-06-22 01:07:43,764 INFO [train.py:1028] (0/2) Epoch 27, batch 6950, loss[loss=0.2067, simple_loss=0.2666, pruned_loss=0.07343, over 11202.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2541, pruned_loss=0.06673, over 2579627.40 frames. ], batch size: 16, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:07:45,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=494992.6666666667, ans=0.07 2024-06-22 01:07:48,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=494992.6666666667, ans=0.125 2024-06-22 01:07:53,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=495011.0, ans=0.0 2024-06-22 01:08:00,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=495029.3333333333, ans=0.1 2024-06-22 01:08:16,540 INFO [train.py:1028] (0/2) Epoch 27, batch 7000, loss[loss=0.198, simple_loss=0.2566, pruned_loss=0.06967, over 12945.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2545, pruned_loss=0.06656, over 2575741.90 frames. 
], batch size: 158, lr: 2.13e-03, grad_scale: 32.0 2024-06-22 01:08:17,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=495084.3333333333, ans=0.2 2024-06-22 01:08:21,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=495084.3333333333, ans=0.125 2024-06-22 01:08:25,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=495102.6666666667, ans=0.0 2024-06-22 01:08:25,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=495102.6666666667, ans=0.0 2024-06-22 01:08:27,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=495102.6666666667, ans=0.0 2024-06-22 01:08:28,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495102.6666666667, ans=0.125 2024-06-22 01:08:32,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=495121.0, ans=0.025 2024-06-22 01:08:32,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=495121.0, ans=0.125 2024-06-22 01:08:35,341 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:08:44,895 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.423e+02 2.589e+02 2.850e+02 4.194e+02, threshold=5.178e+02, percent-clipped=0.0 2024-06-22 01:08:54,087 INFO [train.py:1028] (0/2) Epoch 27, batch 7050, loss[loss=0.2129, simple_loss=0.264, pruned_loss=0.08095, over 12738.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2556, pruned_loss=0.06701, over 2583145.77 frames. ], batch size: 176, lr: 2.13e-03, grad_scale: 64.0 2024-06-22 01:08:57,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=495176.0, ans=0.125 2024-06-22 01:08:58,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.85 vs. limit=12.0 2024-06-22 01:08:59,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=495176.0, ans=0.125 2024-06-22 01:09:05,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=495194.3333333333, ans=10.0 2024-06-22 01:09:09,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=495212.6666666667, ans=0.125 2024-06-22 01:09:14,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=495231.0, ans=0.2 2024-06-22 01:09:18,018 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.68 vs. limit=15.0 2024-06-22 01:09:26,584 INFO [train.py:1028] (0/2) Epoch 27, batch 7100, loss[loss=0.2203, simple_loss=0.2856, pruned_loss=0.0775, over 13209.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2563, pruned_loss=0.06786, over 2575935.67 frames. 
], batch size: 112, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:09:45,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=495304.3333333333, ans=0.0
2024-06-22 01:09:47,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=495304.3333333333, ans=0.125
2024-06-22 01:09:50,374 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:09:52,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=495322.6666666667, ans=0.0
2024-06-22 01:09:53,056 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.22 vs. limit=22.5
2024-06-22 01:09:55,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=495322.6666666667, ans=0.125
2024-06-22 01:09:57,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.355e+02 2.476e+02 2.674e+02 3.483e+02, threshold=4.952e+02, percent-clipped=0.0
2024-06-22 01:10:02,806 INFO [train.py:1028] (0/2) Epoch 27, batch 7150, loss[loss=0.2183, simple_loss=0.27, pruned_loss=0.08336, over 12508.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2569, pruned_loss=0.06791, over 2574722.93 frames. ], batch size: 202, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:10:07,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495359.3333333333, ans=0.1
2024-06-22 01:10:12,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=495377.6666666667, ans=0.04949747468305833
2024-06-22 01:10:18,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=495396.0, ans=0.125
2024-06-22 01:10:28,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=495414.3333333333, ans=0.0
2024-06-22 01:10:34,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=495432.6666666667, ans=0.0
2024-06-22 01:10:34,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=495432.6666666667, ans=0.0
2024-06-22 01:10:35,501 INFO [train.py:1028] (0/2) Epoch 27, batch 7200, loss[loss=0.2177, simple_loss=0.2773, pruned_loss=0.07901, over 13105.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2582, pruned_loss=0.06822, over 2579716.07 frames. ], batch size: 112, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:10:42,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=495469.3333333333, ans=0.125
2024-06-22 01:10:43,000 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=495469.3333333333, ans=0.125
2024-06-22 01:10:43,757 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=495469.3333333333, ans=0.025
2024-06-22 01:10:46,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495469.3333333333, ans=0.1
2024-06-22 01:10:46,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=495469.3333333333, ans=0.0
2024-06-22 01:10:52,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=495487.6666666667, ans=0.0
2024-06-22 01:11:01,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=495506.0, ans=0.125
2024-06-22 01:11:06,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=495524.3333333333, ans=0.125
2024-06-22 01:11:08,517 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.366e+02 2.504e+02 2.631e+02 3.462e+02, threshold=5.009e+02, percent-clipped=0.0
2024-06-22 01:11:10,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=495524.3333333333, ans=0.125
2024-06-22 01:11:14,190 INFO [train.py:1028] (0/2) Epoch 27, batch 7250, loss[loss=0.1878, simple_loss=0.2495, pruned_loss=0.06301, over 13258.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2594, pruned_loss=0.06853, over 2580309.18 frames. ], batch size: 37, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:11:17,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=495542.6666666667, ans=0.0
2024-06-22 01:11:19,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=495542.6666666667, ans=0.025
2024-06-22 01:11:23,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=495561.0, ans=0.025
2024-06-22 01:11:41,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=15.0
2024-06-22 01:11:42,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=495616.0, ans=0.125
2024-06-22 01:11:46,865 INFO [train.py:1028] (0/2) Epoch 27, batch 7300, loss[loss=0.1873, simple_loss=0.251, pruned_loss=0.0618, over 13036.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2602, pruned_loss=0.06868, over 2579760.19 frames. ], batch size: 36, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:12:05,814 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:12:12,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=495689.3333333333, ans=0.125
2024-06-22 01:12:13,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=495689.3333333333, ans=0.125
2024-06-22 01:12:17,850 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:12:18,280 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.453e+02 2.623e+02 2.936e+02 4.004e+02, threshold=5.246e+02, percent-clipped=0.0
2024-06-22 01:12:18,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=495707.6666666667, ans=0.125
2024-06-22 01:12:23,729 INFO [train.py:1028] (0/2) Epoch 27, batch 7350, loss[loss=0.2037, simple_loss=0.2729, pruned_loss=0.06723, over 13375.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2598, pruned_loss=0.06868, over 2581843.57 frames. ], batch size: 46, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:12:25,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=495726.0, ans=0.95
2024-06-22 01:12:26,252 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.12 vs. limit=22.5
2024-06-22 01:12:27,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495726.0, ans=0.1
2024-06-22 01:12:57,209 INFO [train.py:1028] (0/2) Epoch 27, batch 7400, loss[loss=0.214, simple_loss=0.2902, pruned_loss=0.06886, over 13266.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2605, pruned_loss=0.06902, over 2587138.22 frames. ], batch size: 63, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:13:02,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=495817.6666666667, ans=0.0
2024-06-22 01:13:07,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0
2024-06-22 01:13:17,456 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.99 vs. limit=22.5
2024-06-22 01:13:28,981 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.415e+02 2.574e+02 2.907e+02 5.217e+02, threshold=5.147e+02, percent-clipped=0.0
2024-06-22 01:13:33,346 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.06 vs. limit=12.0
2024-06-22 01:13:34,446 INFO [train.py:1028] (0/2) Epoch 27, batch 7450, loss[loss=0.1758, simple_loss=0.2483, pruned_loss=0.05164, over 12644.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2606, pruned_loss=0.06875, over 2580162.91 frames. ], batch size: 29, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:13:38,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.68 vs. limit=12.0
2024-06-22 01:13:39,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=495909.3333333333, ans=0.025
2024-06-22 01:13:45,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=495927.6666666667, ans=0.125
2024-06-22 01:13:49,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=495946.0, ans=0.125
2024-06-22 01:13:53,241 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=495946.0, ans=0.0
2024-06-22 01:13:55,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=495964.3333333333, ans=0.125
2024-06-22 01:13:56,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=495964.3333333333, ans=0.04949747468305833
2024-06-22 01:13:56,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=495964.3333333333, ans=0.125
2024-06-22 01:14:12,156 INFO [train.py:1028] (0/2) Epoch 27, batch 7500, loss[loss=0.2296, simple_loss=0.2754, pruned_loss=0.0919, over 10802.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2612, pruned_loss=0.06941, over 2578069.10 frames. ], batch size: 304, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:14:15,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496001.0, ans=0.1
2024-06-22 01:14:31,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=496056.0, ans=0.0
2024-06-22 01:14:31,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=496056.0, ans=0.2
2024-06-22 01:14:33,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=496056.0, ans=0.2
2024-06-22 01:14:39,244 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.391e+02 2.544e+02 2.733e+02 3.554e+02, threshold=5.088e+02, percent-clipped=0.0
2024-06-22 01:14:44,410 INFO [train.py:1028] (0/2) Epoch 27, batch 7550, loss[loss=0.2113, simple_loss=0.263, pruned_loss=0.07978, over 12957.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2616, pruned_loss=0.06994, over 2577417.18 frames. ], batch size: 158, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:14:45,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=496092.6666666667, ans=0.0
2024-06-22 01:15:01,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=496129.3333333333, ans=0.125
2024-06-22 01:15:03,371 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.39 vs. limit=22.5
2024-06-22 01:15:10,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=496147.6666666667, ans=0.125
2024-06-22 01:15:10,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=496147.6666666667, ans=0.0
2024-06-22 01:15:15,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=496166.0, ans=0.02
2024-06-22 01:15:17,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=496166.0, ans=0.0
2024-06-22 01:15:20,647 INFO [train.py:1028] (0/2) Epoch 27, batch 7600, loss[loss=0.1885, simple_loss=0.2474, pruned_loss=0.06484, over 13191.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2626, pruned_loss=0.07039, over 2576493.59 frames. ], batch size: 83, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:15:27,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=496202.6666666667, ans=0.125
2024-06-22 01:15:34,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=496221.0, ans=0.04949747468305833
2024-06-22 01:15:45,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=496239.3333333333, ans=0.0
2024-06-22 01:15:47,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=496257.6666666667, ans=0.0
2024-06-22 01:15:48,961 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.392e+02 2.581e+02 2.879e+02 3.969e+02, threshold=5.163e+02, percent-clipped=0.0
2024-06-22 01:15:51,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0
2024-06-22 01:15:57,686 INFO [train.py:1028] (0/2) Epoch 27, batch 7650, loss[loss=0.1925, simple_loss=0.2498, pruned_loss=0.06759, over 12849.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2628, pruned_loss=0.07047, over 2573601.27 frames. ], batch size: 33, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:16:03,360 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.237e+00
2024-06-22 01:16:05,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496294.3333333333, ans=0.1
2024-06-22 01:16:13,231 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=496312.6666666667, ans=0.125
2024-06-22 01:16:24,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=496349.3333333333, ans=0.025
2024-06-22 01:16:30,856 INFO [train.py:1028] (0/2) Epoch 27, batch 7700, loss[loss=0.2049, simple_loss=0.2732, pruned_loss=0.06832, over 13254.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2628, pruned_loss=0.07043, over 2570826.46 frames. ], batch size: 63, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:16:34,091 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0
2024-06-22 01:16:40,854 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:16:45,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.51 vs. limit=15.0
2024-06-22 01:16:50,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=496422.6666666667, ans=0.125
2024-06-22 01:16:51,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=496422.6666666667, ans=0.2
2024-06-22 01:17:00,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=496441.0, ans=0.0
2024-06-22 01:17:01,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=496441.0, ans=0.2
2024-06-22 01:17:01,878 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.466e+02 2.691e+02 3.034e+02 3.925e+02, threshold=5.382e+02, percent-clipped=0.0
2024-06-22 01:17:02,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=496441.0, ans=0.0
2024-06-22 01:17:05,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=496441.0, ans=0.0
2024-06-22 01:17:07,026 INFO [train.py:1028] (0/2) Epoch 27, batch 7750, loss[loss=0.1845, simple_loss=0.2502, pruned_loss=0.05938, over 13280.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2635, pruned_loss=0.07079, over 2574878.22 frames. ], batch size: 72, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:17:16,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=496477.6666666667, ans=0.125
2024-06-22 01:17:17,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=496477.6666666667, ans=0.025
2024-06-22 01:17:21,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=496496.0, ans=6.0
2024-06-22 01:17:28,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=496514.3333333333, ans=0.025
2024-06-22 01:17:31,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=496514.3333333333, ans=10.0
2024-06-22 01:17:40,062 INFO [train.py:1028] (0/2) Epoch 27, batch 7800, loss[loss=0.212, simple_loss=0.274, pruned_loss=0.075, over 13145.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2644, pruned_loss=0.07089, over 2579313.58 frames. ], batch size: 95, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:17:40,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=496551.0, ans=0.125
2024-06-22 01:17:48,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496551.0, ans=0.1
2024-06-22 01:17:48,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.48 vs. limit=15.0
2024-06-22 01:18:01,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496587.6666666667, ans=0.1
2024-06-22 01:18:04,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=496606.0, ans=0.1
2024-06-22 01:18:05,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=496606.0, ans=0.0
2024-06-22 01:18:11,583 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.380e+02 2.564e+02 2.778e+02 3.612e+02, threshold=5.127e+02, percent-clipped=0.0
2024-06-22 01:18:12,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.47 vs. limit=15.0
2024-06-22 01:18:13,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=496624.3333333333, ans=0.0
2024-06-22 01:18:16,721 INFO [train.py:1028] (0/2) Epoch 27, batch 7850, loss[loss=0.2098, simple_loss=0.2754, pruned_loss=0.07215, over 11102.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2655, pruned_loss=0.07145, over 2571361.44 frames. ], batch size: 16, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:18:22,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0
2024-06-22 01:18:26,711 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.97 vs. limit=22.5
2024-06-22 01:18:27,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=496661.0, ans=0.025
2024-06-22 01:18:29,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=496679.3333333333, ans=0.0
2024-06-22 01:18:45,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=496716.0, ans=0.125
2024-06-22 01:18:45,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=496716.0, ans=0.125
2024-06-22 01:18:52,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=496716.0, ans=0.125
2024-06-22 01:18:54,665 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.32 vs. limit=22.5
2024-06-22 01:18:54,781 INFO [train.py:1028] (0/2) Epoch 27, batch 7900, loss[loss=0.1964, simple_loss=0.2634, pruned_loss=0.06468, over 13192.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2655, pruned_loss=0.07156, over 2570800.87 frames. ], batch size: 77, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:18:57,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496734.3333333333, ans=0.0
2024-06-22 01:19:02,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496752.6666666667, ans=0.1
2024-06-22 01:19:09,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=496771.0, ans=0.2
2024-06-22 01:19:09,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=496771.0, ans=0.0
2024-06-22 01:19:22,388 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.414e+02 2.634e+02 2.877e+02 3.351e+02, threshold=5.267e+02, percent-clipped=0.0
2024-06-22 01:19:27,337 INFO [train.py:1028] (0/2) Epoch 27, batch 7950, loss[loss=0.2198, simple_loss=0.2735, pruned_loss=0.08301, over 10756.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2658, pruned_loss=0.07152, over 2573903.46 frames. ], batch size: 303, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:19:28,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=496826.0, ans=0.5
2024-06-22 01:19:49,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=496862.6666666667, ans=0.125
2024-06-22 01:19:57,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496899.3333333333, ans=0.1
2024-06-22 01:20:03,998 INFO [train.py:1028] (0/2) Epoch 27, batch 8000, loss[loss=0.1962, simple_loss=0.2599, pruned_loss=0.06628, over 12590.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.266, pruned_loss=0.07123, over 2571540.61 frames. ], batch size: 29, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:20:08,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=496917.6666666667, ans=0.125
2024-06-22 01:20:13,301 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.91 vs. limit=15.0
2024-06-22 01:20:17,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=496954.3333333333, ans=0.2
2024-06-22 01:20:20,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=496954.3333333333, ans=15.0
2024-06-22 01:20:21,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0
2024-06-22 01:20:26,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=496972.6666666667, ans=0.125
2024-06-22 01:20:27,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=496972.6666666667, ans=0.125
2024-06-22 01:20:31,713 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.446e+02 2.616e+02 2.837e+02 3.387e+02, threshold=5.231e+02, percent-clipped=0.0
2024-06-22 01:20:37,306 INFO [train.py:1028] (0/2) Epoch 27, batch 8050, loss[loss=0.2091, simple_loss=0.2685, pruned_loss=0.07487, over 13219.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.265, pruned_loss=0.07067, over 2571375.03 frames. ], batch size: 83, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:20:38,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497009.3333333333, ans=0.1
2024-06-22 01:20:39,441 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5
2024-06-22 01:20:43,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=497027.6666666667, ans=0.125
2024-06-22 01:20:52,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=497046.0, ans=0.125
2024-06-22 01:21:00,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=497064.3333333333, ans=0.125
2024-06-22 01:21:05,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=497064.3333333333, ans=0.125
2024-06-22 01:21:12,475 INFO [train.py:1028] (0/2) Epoch 27, batch 8100, loss[loss=0.2216, simple_loss=0.2796, pruned_loss=0.08179, over 13127.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2652, pruned_loss=0.07088, over 2575720.95 frames. ], batch size: 112, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:21:19,999 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=497119.3333333333, ans=0.2
2024-06-22 01:21:28,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497137.6666666667, ans=0.1
2024-06-22 01:21:40,257 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0
2024-06-22 01:21:40,427 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.375e+02 2.483e+02 2.663e+02 3.373e+02, threshold=4.967e+02, percent-clipped=0.0
2024-06-22 01:21:44,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=497174.3333333333, ans=0.125
2024-06-22 01:21:49,355 INFO [train.py:1028] (0/2) Epoch 27, batch 8150, loss[loss=0.216, simple_loss=0.2793, pruned_loss=0.07633, over 13099.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2657, pruned_loss=0.07068, over 2579265.58 frames. ], batch size: 121, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:22:14,566 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=22.5
2024-06-22 01:22:19,865 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0
2024-06-22 01:22:20,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=497266.0, ans=0.125
2024-06-22 01:22:22,718 INFO [train.py:1028] (0/2) Epoch 27, batch 8200, loss[loss=0.2166, simple_loss=0.2822, pruned_loss=0.07552, over 13143.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2662, pruned_loss=0.07071, over 2583727.63 frames. ], batch size: 112, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:22:26,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=497284.3333333333, ans=0.0
2024-06-22 01:22:29,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=9.31 vs. limit=15.0
2024-06-22 01:22:29,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=497302.6666666667, ans=0.0
2024-06-22 01:22:30,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=497302.6666666667, ans=0.2
2024-06-22 01:22:32,623 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5
2024-06-22 01:22:43,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=497339.3333333333, ans=10.0
2024-06-22 01:22:45,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=497339.3333333333, ans=0.125
2024-06-22 01:22:48,023 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497339.3333333333, ans=0.1
2024-06-22 01:22:49,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=497357.6666666667, ans=0.0
2024-06-22 01:22:51,192 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.462e+02 2.652e+02 2.896e+02 3.986e+02, threshold=5.303e+02, percent-clipped=0.0
2024-06-22 01:22:51,537 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.24 vs. limit=12.0
2024-06-22 01:23:01,682 INFO [train.py:1028] (0/2) Epoch 27, batch 8250, loss[loss=0.2051, simple_loss=0.2751, pruned_loss=0.06752, over 13262.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2663, pruned_loss=0.07084, over 2583846.59 frames. ], batch size: 52, lr: 2.13e-03, grad_scale: 32.0
2024-06-22 01:23:08,366 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=497394.3333333333, ans=0.0
2024-06-22 01:23:16,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=497412.6666666667, ans=0.025
2024-06-22 01:23:32,112 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0
2024-06-22 01:23:34,307 INFO [train.py:1028] (0/2) Epoch 27, batch 8300, loss[loss=0.2158, simple_loss=0.2738, pruned_loss=0.07892, over 13035.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2655, pruned_loss=0.07041, over 2581122.90 frames. ], batch size: 102, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:23:59,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497522.6666666667, ans=0.1
2024-06-22 01:24:03,322 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:24:05,186 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.383e+02 2.532e+02 2.763e+02 3.503e+02, threshold=5.063e+02, percent-clipped=0.0
2024-06-22 01:24:10,265 INFO [train.py:1028] (0/2) Epoch 27, batch 8350, loss[loss=0.211, simple_loss=0.2735, pruned_loss=0.07424, over 13194.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2661, pruned_loss=0.07033, over 2582168.15 frames. ], batch size: 112, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:24:10,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=497559.3333333333, ans=0.0
2024-06-22 01:24:30,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=497614.3333333333, ans=0.0
2024-06-22 01:24:36,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=497632.6666666667, ans=0.125
2024-06-22 01:24:43,724 INFO [train.py:1028] (0/2) Epoch 27, batch 8400, loss[loss=0.2002, simple_loss=0.264, pruned_loss=0.06818, over 12900.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2661, pruned_loss=0.07044, over 2578888.00 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:24:47,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=497651.0, ans=0.125
2024-06-22 01:25:00,889 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0
2024-06-22 01:25:11,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0
2024-06-22 01:25:14,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=497724.3333333333, ans=0.1
2024-06-22 01:25:15,444 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 2.394e+02 2.531e+02 2.698e+02 3.224e+02, threshold=5.061e+02, percent-clipped=0.0
2024-06-22 01:25:20,578 INFO [train.py:1028] (0/2) Epoch 27, batch 8450, loss[loss=0.1938, simple_loss=0.2571, pruned_loss=0.06525, over 13185.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2671, pruned_loss=0.07072, over 2580288.58 frames. ], batch size: 112, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:25:23,328 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=497742.6666666667, ans=0.025
2024-06-22 01:25:24,279 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.47 vs. limit=15.0
2024-06-22 01:25:39,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=497779.3333333333, ans=0.125
2024-06-22 01:25:45,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=497797.6666666667, ans=0.125
2024-06-22 01:25:49,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=497797.6666666667, ans=0.0
2024-06-22 01:25:57,697 INFO [train.py:1028] (0/2) Epoch 27, batch 8500, loss[loss=0.1993, simple_loss=0.2643, pruned_loss=0.06712, over 12717.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2679, pruned_loss=0.07096, over 2578878.57 frames. ], batch size: 29, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:25:59,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=497834.3333333333, ans=0.2
2024-06-22 01:25:59,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=497834.3333333333, ans=0.04949747468305833
2024-06-22 01:25:59,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=497834.3333333333, ans=0.125
2024-06-22 01:26:03,987 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.07 vs. limit=6.0
2024-06-22 01:26:09,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=497852.6666666667, ans=0.1
2024-06-22 01:26:09,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=497852.6666666667, ans=0.125
2024-06-22 01:26:11,962 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:26:14,087 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=497871.0, ans=0.125
2024-06-22 01:26:14,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=497871.0, ans=0.0
2024-06-22 01:26:16,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=497871.0, ans=0.0
2024-06-22 01:26:20,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=497889.3333333333, ans=0.09899494936611666
2024-06-22 01:26:26,139 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.403e+02 2.536e+02 2.770e+02 3.755e+02, threshold=5.072e+02, percent-clipped=0.0
2024-06-22 01:26:31,341 INFO [train.py:1028] (0/2) Epoch 27, batch 8550, loss[loss=0.2372, simple_loss=0.3091, pruned_loss=0.08265, over 12519.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2677, pruned_loss=0.07084, over 2575810.82 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:26:48,862 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0
2024-06-22 01:27:01,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=497999.3333333333, ans=0.125
2024-06-22 01:27:08,271 INFO [train.py:1028] (0/2) Epoch 27, batch 8600, loss[loss=0.2071, simple_loss=0.2651, pruned_loss=0.07455, over 13128.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2688, pruned_loss=0.07125, over 2573038.62 frames. ], batch size: 112, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:27:34,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=498091.0, ans=0.2
2024-06-22 01:27:35,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=498091.0, ans=0.07
2024-06-22 01:27:36,312 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.397e+02 2.555e+02 2.719e+02 3.362e+02, threshold=5.111e+02, percent-clipped=0.0
2024-06-22 01:27:38,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498091.0, ans=0.1
2024-06-22 01:27:41,800 INFO [train.py:1028] (0/2) Epoch 27, batch 8650, loss[loss=0.2108, simple_loss=0.2696, pruned_loss=0.07601, over 13015.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.269, pruned_loss=0.07116, over 2575891.67 frames. ], batch size: 102, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:27:42,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=498109.3333333333, ans=0.125
2024-06-22 01:27:43,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=498109.3333333333, ans=0.125
2024-06-22 01:27:53,480 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=498127.6666666667, ans=0.125
2024-06-22 01:28:10,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=498164.3333333333, ans=0.125
2024-06-22 01:28:13,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=498182.6666666667, ans=0.125
2024-06-22 01:28:18,515 INFO [train.py:1028] (0/2) Epoch 27, batch 8700, loss[loss=0.1846, simple_loss=0.2443, pruned_loss=0.06238, over 13209.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2693, pruned_loss=0.07171, over 2572090.80 frames. ], batch size: 59, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:28:19,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=498201.0, ans=0.07
2024-06-22 01:28:19,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=498201.0, ans=0.125
2024-06-22 01:28:21,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=498201.0, ans=0.125
2024-06-22 01:28:27,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=498219.3333333333, ans=0.125
2024-06-22 01:28:35,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498237.6666666667, ans=0.1
2024-06-22 01:28:45,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=498274.3333333333, ans=0.0
2024-06-22 01:28:46,884 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.454e+02 2.639e+02 2.856e+02 4.545e+02, threshold=5.278e+02, percent-clipped=0.0
2024-06-22 01:28:50,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=498274.3333333333, ans=0.0
2024-06-22 01:28:52,242 INFO [train.py:1028] (0/2) Epoch 27, batch 8750, loss[loss=0.222, simple_loss=0.2746, pruned_loss=0.08474, over 13051.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2693, pruned_loss=0.07204, over 2567267.60 frames. ], batch size: 121, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:28:57,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=498292.6666666667, ans=0.0
2024-06-22 01:29:05,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498311.0, ans=0.0
2024-06-22 01:29:05,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=498311.0, ans=0.125
2024-06-22 01:29:09,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=498329.3333333333, ans=0.95
2024-06-22 01:29:09,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=498329.3333333333, ans=0.125
2024-06-22 01:29:22,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=498366.0, ans=0.125
2024-06-22 01:29:29,214 INFO [train.py:1028] (0/2) Epoch 27, batch 8800, loss[loss=0.1941, simple_loss=0.2673, pruned_loss=0.06047, over 13241.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2697, pruned_loss=0.07214, over 2572488.91 frames. ], batch size: 72, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:29:30,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=498384.3333333333, ans=0.0
2024-06-22 01:29:31,232 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=498384.3333333333, ans=0.0
2024-06-22 01:29:34,373 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=15.0
2024-06-22 01:29:36,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498402.6666666667, ans=0.1
2024-06-22 01:29:46,422 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:29:57,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=498439.3333333333, ans=0.0
2024-06-22 01:30:00,931 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.370e+02 2.522e+02 2.746e+02 3.463e+02, threshold=5.043e+02, percent-clipped=0.0
2024-06-22 01:30:06,518 INFO [train.py:1028] (0/2) Epoch 27, batch 8850, loss[loss=0.2324, simple_loss=0.2905, pruned_loss=0.08716, over 12495.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2699, pruned_loss=0.0725, over 2563859.50 frames. ], batch size: 202, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:30:06,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498476.0, ans=0.1
2024-06-22 01:30:10,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498476.0, ans=0.0
2024-06-22 01:30:24,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=498512.6666666667, ans=0.2
2024-06-22 01:30:27,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.56 vs. limit=15.0
2024-06-22 01:30:28,351 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:30:28,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=498531.0, ans=0.5
2024-06-22 01:30:31,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=498531.0, ans=0.0
2024-06-22 01:30:31,715 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=498531.0, ans=0.125
2024-06-22 01:30:31,744 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:30:32,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=498531.0, ans=0.0
2024-06-22 01:30:35,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=498549.3333333333, ans=0.04949747468305833
2024-06-22 01:30:40,977 INFO [train.py:1028] (0/2) Epoch 27, batch 8900, loss[loss=0.2216, simple_loss=0.2776, pruned_loss=0.08276, over 12845.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2704, pruned_loss=0.0729, over 2561347.53 frames. ], batch size: 33, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:31:03,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=498604.3333333333, ans=0.125
2024-06-22 01:31:07,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=498622.6666666667, ans=0.125
2024-06-22 01:31:14,463 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.525e+02 2.729e+02 2.955e+02 3.918e+02, threshold=5.458e+02, percent-clipped=0.0
2024-06-22 01:31:19,993 INFO [train.py:1028] (0/2) Epoch 27, batch 8950, loss[loss=0.2253, simple_loss=0.2821, pruned_loss=0.08424, over 12550.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2702, pruned_loss=0.07228, over 2562893.21 frames. ], batch size: 202, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:31:22,099 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-272000.pt
2024-06-22 01:31:28,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=498659.3333333333, ans=0.125
2024-06-22 01:31:28,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=498659.3333333333, ans=0.0
2024-06-22 01:31:40,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.45 vs. limit=15.0
2024-06-22 01:31:56,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=498732.6666666667, ans=0.2
2024-06-22 01:32:02,442 INFO [train.py:1028] (0/2) Epoch 27, batch 9000, loss[loss=0.2094, simple_loss=0.2714, pruned_loss=0.07365, over 13337.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2702, pruned_loss=0.07189, over 2569749.49 frames. ], batch size: 46, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:32:02,443 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-22 01:32:10,391 INFO [train.py:1060] (0/2) Epoch 27, validation: loss=0.1926, simple_loss=0.2522, pruned_loss=0.06649, over 351949.00 frames.
2024-06-22 01:32:10,392 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-22 01:32:17,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=12.35 vs. limit=15.0
2024-06-22 01:32:19,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=498769.3333333333, ans=0.2
2024-06-22 01:32:26,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=498787.6666666667, ans=0.125
2024-06-22 01:32:27,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=498787.6666666667, ans=0.125
2024-06-22 01:32:36,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=498806.0, ans=15.0
2024-06-22 01:32:37,889 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=498824.3333333333, ans=0.1
2024-06-22 01:32:38,296 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.536e+02 2.681e+02 2.945e+02 4.161e+02, threshold=5.362e+02, percent-clipped=0.0
2024-06-22 01:32:38,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=498824.3333333333, ans=0.025
2024-06-22 01:32:39,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=498824.3333333333, ans=0.125
2024-06-22 01:32:42,578 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.72 vs. limit=15.0
2024-06-22 01:32:43,653 INFO [train.py:1028] (0/2) Epoch 27, batch 9050, loss[loss=0.1778, simple_loss=0.2413, pruned_loss=0.05711, over 11420.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.271, pruned_loss=0.07224, over 2568053.70 frames. ], batch size: 17, lr: 2.12e-03, grad_scale: 32.0
2024-06-22 01:32:45,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=498842.6666666667, ans=10.0
2024-06-22 01:32:58,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498879.3333333333, ans=0.125
2024-06-22 01:32:58,165 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.14 vs. limit=10.0
2024-06-22 01:33:01,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=498879.3333333333, ans=0.0
2024-06-22 01:33:02,340 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:33:07,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498897.6666666667, ans=0.1
2024-06-22 01:33:12,885 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=498916.0, ans=0.125
2024-06-22 01:33:16,546 INFO [train.py:1028] (0/2) Epoch 27, batch 9100, loss[loss=0.1879, simple_loss=0.2575, pruned_loss=0.05919, over 13268.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2703, pruned_loss=0.07186, over 2569369.99 frames. ], batch size: 72, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:33:25,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=498952.6666666667, ans=0.2
2024-06-22 01:33:26,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=498952.6666666667, ans=0.125
2024-06-22 01:33:31,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498971.0, ans=0.1
2024-06-22 01:33:35,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0
2024-06-22 01:33:43,280 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.400e+02 2.503e+02 2.699e+02 4.162e+02, threshold=5.007e+02, percent-clipped=0.0
2024-06-22 01:33:48,101 INFO [train.py:1028] (0/2) Epoch 27, batch 9150, loss[loss=0.2123, simple_loss=0.2797, pruned_loss=0.0724, over 13204.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2701, pruned_loss=0.07173, over 2569186.17 frames. ], batch size: 77, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:33:48,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=499026.0, ans=0.0
2024-06-22 01:34:12,747 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.42 vs. limit=10.0
2024-06-22 01:34:17,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=499099.3333333333, ans=0.025
2024-06-22 01:34:23,459 INFO [train.py:1028] (0/2) Epoch 27, batch 9200, loss[loss=0.1887, simple_loss=0.2571, pruned_loss=0.0602, over 13020.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2698, pruned_loss=0.07138, over 2572541.47 frames. ], batch size: 36, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:34:32,024 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.15 vs. limit=15.0
2024-06-22 01:34:32,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=499136.0, ans=0.04949747468305833
2024-06-22 01:34:43,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=499172.6666666667, ans=10.0
2024-06-22 01:34:44,049 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=499172.6666666667, ans=0.125
2024-06-22 01:34:44,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=499172.6666666667, ans=0.04949747468305833
2024-06-22 01:34:48,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=499191.0, ans=0.0
2024-06-22 01:34:49,618 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.443e+02 2.548e+02 2.707e+02 3.723e+02, threshold=5.097e+02, percent-clipped=0.0
2024-06-22 01:34:54,567 INFO [train.py:1028] (0/2) Epoch 27, batch 9250, loss[loss=0.2192, simple_loss=0.2843, pruned_loss=0.07711, over 13217.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.27, pruned_loss=0.07123, over 2574427.64 frames. ], batch size: 67, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:34:59,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=499209.3333333333, ans=0.125
2024-06-22 01:35:03,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499227.6666666667, ans=0.1
2024-06-22 01:35:16,426 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=499264.3333333333, ans=0.125
2024-06-22 01:35:17,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=499264.3333333333, ans=0.0
2024-06-22 01:35:26,850 INFO [train.py:1028] (0/2) Epoch 27, batch 9300, loss[loss=0.195, simple_loss=0.252, pruned_loss=0.06901, over 13257.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.27, pruned_loss=0.0712, over 2571619.39 frames. ], batch size: 40, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:35:34,042 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0
2024-06-22 01:35:36,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=499319.3333333333, ans=0.025
2024-06-22 01:35:41,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.74 vs. limit=10.0
2024-06-22 01:35:55,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=499374.3333333333, ans=0.125
2024-06-22 01:35:55,751 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 2.423e+02 2.579e+02 2.808e+02 3.719e+02, threshold=5.158e+02, percent-clipped=0.0
2024-06-22 01:35:58,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=499374.3333333333, ans=0.2
2024-06-22 01:36:00,799 INFO [train.py:1028] (0/2) Epoch 27, batch 9350, loss[loss=0.2133, simple_loss=0.2787, pruned_loss=0.07395, over 12562.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.27, pruned_loss=0.07126, over 2568860.36 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:36:01,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.50 vs. limit=15.0
2024-06-22 01:36:02,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=499392.6666666667, ans=0.2
2024-06-22 01:36:03,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=499392.6666666667, ans=0.125
2024-06-22 01:36:10,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=499411.0, ans=0.125
2024-06-22 01:36:21,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=499447.6666666667, ans=0.0
2024-06-22 01:36:23,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499447.6666666667, ans=0.1
2024-06-22 01:36:30,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499466.0, ans=0.1
2024-06-22 01:36:31,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=499484.3333333333, ans=0.125
2024-06-22 01:36:32,037 INFO [train.py:1028] (0/2) Epoch 27, batch 9400, loss[loss=0.2056, simple_loss=0.2711, pruned_loss=0.07007, over 13184.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2703, pruned_loss=0.07159, over 2568457.32 frames. ], batch size: 52, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:36:36,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=499484.3333333333, ans=0.125
2024-06-22 01:36:38,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=499502.6666666667, ans=0.0
2024-06-22 01:36:38,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.50 vs. limit=22.5
2024-06-22 01:36:45,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=499521.0, ans=0.0
2024-06-22 01:36:55,931 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.79 vs. limit=22.5
2024-06-22 01:36:56,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.77 vs. limit=15.0
2024-06-22 01:36:57,402 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.388e+02 2.549e+02 2.800e+02 3.883e+02, threshold=5.097e+02, percent-clipped=0.0
2024-06-22 01:37:02,414 INFO [train.py:1028] (0/2) Epoch 27, batch 9450, loss[loss=0.2104, simple_loss=0.27, pruned_loss=0.07536, over 12490.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2714, pruned_loss=0.07225, over 2567736.62 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:37:14,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=499612.6666666667, ans=0.5
2024-06-22 01:37:17,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=499612.6666666667, ans=0.0
2024-06-22 01:37:28,738 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:37:33,797 INFO [train.py:1028] (0/2) Epoch 27, batch 9500, loss[loss=0.1979, simple_loss=0.2621, pruned_loss=0.0668, over 13204.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2705, pruned_loss=0.07144, over 2577091.76 frames. ], batch size: 43, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:37:37,693 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0
2024-06-22 01:37:41,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=499667.6666666667, ans=0.2
2024-06-22 01:37:42,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=499686.0, ans=0.0
2024-06-22 01:37:47,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=499686.0, ans=0.2
2024-06-22 01:37:56,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=499722.6666666667, ans=15.0
2024-06-22 01:37:57,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499722.6666666667, ans=0.1
2024-06-22 01:38:00,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=499722.6666666667, ans=0.1
2024-06-22 01:38:03,256 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.417e+02 2.583e+02 2.820e+02 3.896e+02, threshold=5.166e+02, percent-clipped=0.0
2024-06-22 01:38:03,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=499741.0, ans=0.0
2024-06-22 01:38:08,406 INFO [train.py:1028] (0/2) Epoch 27, batch 9550, loss[loss=0.1956, simple_loss=0.2575, pruned_loss=0.06681, over 13185.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2699, pruned_loss=0.07139, over 2571691.02 frames. ], batch size: 40, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:38:10,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=499759.3333333333, ans=0.025
2024-06-22 01:38:10,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=499759.3333333333, ans=0.0
2024-06-22 01:38:12,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=499759.3333333333, ans=0.125
2024-06-22 01:38:23,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=499796.0, ans=0.125
2024-06-22 01:38:28,496 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499814.3333333333, ans=0.1
2024-06-22 01:38:29,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.12 vs. limit=22.5
2024-06-22 01:38:32,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=499814.3333333333, ans=0.0
2024-06-22 01:38:33,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=499814.3333333333, ans=0.0
2024-06-22 01:38:35,577 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0
2024-06-22 01:38:41,230 INFO [train.py:1028] (0/2) Epoch 27, batch 9600, loss[loss=0.2338, simple_loss=0.2829, pruned_loss=0.09234, over 10558.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2698, pruned_loss=0.07137, over 2569773.22 frames. ], batch size: 303, lr: 2.12e-03, grad_scale: 64.0
2024-06-22 01:38:42,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=499851.0, ans=0.125
2024-06-22 01:38:52,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=499887.6666666667, ans=0.125
2024-06-22 01:38:54,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=499887.6666666667, ans=0.125
2024-06-22 01:38:59,585 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=15.0
2024-06-22 01:39:02,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=499906.0, ans=0.125
2024-06-22 01:39:03,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=499906.0, ans=0.025
2024-06-22 01:39:04,331 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 01:39:04,561 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=28.24 vs.
limit=22.5 2024-06-22 01:39:06,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=499924.3333333333, ans=0.125 2024-06-22 01:39:07,328 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.414e+02 2.589e+02 2.883e+02 3.902e+02, threshold=5.179e+02, percent-clipped=0.0 2024-06-22 01:39:12,522 INFO [train.py:1028] (0/2) Epoch 27, batch 9650, loss[loss=0.2114, simple_loss=0.2659, pruned_loss=0.07849, over 13071.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2699, pruned_loss=0.07185, over 2561408.41 frames. ], batch size: 132, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:39:17,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=499942.6666666667, ans=10.0 2024-06-22 01:39:21,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2024-06-22 01:39:34,358 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=15.0 2024-06-22 01:39:37,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=500016.0, ans=0.2 2024-06-22 01:39:40,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=500016.0, ans=0.125 2024-06-22 01:39:42,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500016.0, ans=0.1 2024-06-22 01:39:43,641 INFO [train.py:1028] (0/2) Epoch 27, batch 9700, loss[loss=0.2197, simple_loss=0.2817, pruned_loss=0.07881, over 13042.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2698, pruned_loss=0.07177, over 2557787.24 frames. ], batch size: 144, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:39:50,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=500052.6666666667, ans=0.125 2024-06-22 01:40:01,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=500071.0, ans=10.0 2024-06-22 01:40:01,662 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.04 vs. limit=6.0 2024-06-22 01:40:02,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=500071.0, ans=0.2 2024-06-22 01:40:02,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2024-06-22 01:40:04,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.86 vs. 
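limit=15.0

The scaling.py:214 entries that dominate this log track ScheduledFloat hyperparameters: values such as dropout_p, skip rates and balancer probabilities that are piecewise-linear functions of batch_count rather than constants, with ans being the value in effect at that point. By batch_count around 5e5 nearly all of them have passed their final breakpoints, which is why the same constants (0.1, 0.125, 0.0, 0.2, ...) repeat. A minimal sketch of the interpolation, with assumed breakpoint semantics (the real class in icefall's scaling.py also handles defaults and some randomization):

```python
class ScheduledFloatSketch:
    """A float that interpolates linearly between (batch_count, value) points."""

    def __init__(self, *points):
        self.points = sorted(points)  # e.g. (0.0, 0.5), (20000.0, 0.0)

    def value(self, batch_count: float) -> float:
        x0, y0 = self.points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in self.points[1:]:
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # past the last breakpoint, hold the final value


# A skip rate decaying from 0.5 to 0.0 over the first 20k batches would
# long since sit at 0.0 at the batch counts logged here:
skip_rate = ScheduledFloatSketch((0.0, 0.5), (20000.0, 0.0))
print(skip_rate.value(500376.0))  # 0.0
```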
2024-06-22 01:40:11,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=500107.6666666667, ans=0.125 2024-06-22 01:40:11,503 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.436e+02 2.633e+02 2.894e+02 3.808e+02, threshold=5.266e+02, percent-clipped=0.0 2024-06-22 01:40:12,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500107.6666666667, ans=0.125 2024-06-22 01:40:14,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=500107.6666666667, ans=0.125 2024-06-22 01:40:16,561 INFO [train.py:1028] (0/2) Epoch 27, batch 9750, loss[loss=0.1991, simple_loss=0.2596, pruned_loss=0.06931, over 13071.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2686, pruned_loss=0.07111, over 2554281.51 frames. ], batch size: 132, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:40:26,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=500144.3333333333, ans=0.09899494936611666 2024-06-22 01:40:29,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.48 vs. limit=15.0 2024-06-22 01:40:39,251 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:40:41,899 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.42 vs. limit=15.0 2024-06-22 01:40:42,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=500199.3333333333, ans=0.125 2024-06-22 01:40:48,944 INFO [train.py:1028] (0/2) Epoch 27, batch 9800, loss[loss=0.193, simple_loss=0.2597, pruned_loss=0.06319, over 12948.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2676, pruned_loss=0.07033, over 2547075.62 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:41:02,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500254.3333333333, ans=0.1 2024-06-22 01:41:06,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=500254.3333333333, ans=0.125 2024-06-22 01:41:07,122 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2024-06-22 01:41:07,236 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.72 vs. limit=6.0 2024-06-22 01:41:14,728 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.409e+02 2.575e+02 2.745e+02 3.684e+02, threshold=5.150e+02, percent-clipped=0.0 2024-06-22 01:41:19,630 INFO [train.py:1028] (0/2) Epoch 27, batch 9850, loss[loss=0.1905, simple_loss=0.251, pruned_loss=0.06503, over 13042.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2676, pruned_loss=0.07016, over 2538540.22 frames.
], batch size: 102, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:41:19,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=500309.3333333333, ans=0.2 2024-06-22 01:41:32,092 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=500346.0, ans=0.0 2024-06-22 01:41:32,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=500346.0, ans=0.125 2024-06-22 01:41:35,303 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.20 vs. limit=10.0 2024-06-22 01:41:41,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=500364.3333333333, ans=0.09899494936611666 2024-06-22 01:41:44,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=500364.3333333333, ans=0.125 2024-06-22 01:41:47,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=500382.6666666667, ans=0.0 2024-06-22 01:41:48,413 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2024-06-22 01:41:50,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=500382.6666666667, ans=0.125 2024-06-22 01:41:52,856 INFO [train.py:1028] (0/2) Epoch 27, batch 9900, loss[loss=0.1727, simple_loss=0.2387, pruned_loss=0.05337, over 12927.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.267, pruned_loss=0.07027, over 2531348.27 frames. ], batch size: 39, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:41:57,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500401.0, ans=0.125 2024-06-22 01:42:03,793 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.94 vs. limit=12.0 2024-06-22 01:42:11,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500456.0, ans=0.125 2024-06-22 01:42:11,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=500456.0, ans=0.125 2024-06-22 01:42:18,932 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.432e+02 2.597e+02 2.830e+02 3.579e+02, threshold=5.194e+02, percent-clipped=0.0 2024-06-22 01:42:23,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=500492.6666666667, ans=0.125 2024-06-22 01:42:23,866 INFO [train.py:1028] (0/2) Epoch 27, batch 9950, loss[loss=0.229, simple_loss=0.2897, pruned_loss=0.08412, over 12787.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2663, pruned_loss=0.07062, over 2526806.07 frames. ], batch size: 29, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:42:24,274 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.96 vs. 
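limit=10.0

The lr: 2.12e-03 reported in the train.py summaries follows the Eden schedule from icefall's optim.py, which decays the configured base_lr (0.035 for this run) by both optimizer-step count and epoch, with lr_batches=7500 and lr_epochs=3.5 as configured. The exact step count is not printed, but roughly 263k steps by this point (an estimate from the batches per epoch visible in the log) reproduces the logged value; the early warmup factor is omitted below since it is long past. The drop to 2.08e-03 at epoch 28 later in this section comes almost entirely from the epoch factor:

```python
def eden_lr(base_lr: float, step: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Eden learning rate (warmup factor omitted; it is 1.0 by now)."""
    step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * step_factor * epoch_factor


print(f"{eden_lr(0.035, 263_000, 27):.2e}")  # 2.12e-03, matching the log
print(f"{eden_lr(0.035, 263_500, 28):.2e}")  # ~2.08e-03, as at epoch 28
```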
2024-06-22 01:42:32,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=500511.0, ans=0.05 2024-06-22 01:42:43,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=500547.6666666667, ans=0.125 2024-06-22 01:42:45,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=500547.6666666667, ans=0.0 2024-06-22 01:42:55,481 INFO [train.py:1028] (0/2) Epoch 27, batch 10000, loss[loss=0.2204, simple_loss=0.2973, pruned_loss=0.07179, over 12625.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2668, pruned_loss=0.07124, over 2487738.60 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:43:02,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=500602.6666666667, ans=0.2 2024-06-22 01:43:02,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=500602.6666666667, ans=0.04949747468305833 2024-06-22 01:43:04,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=500602.6666666667, ans=15.0 2024-06-22 01:43:07,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=500602.6666666667, ans=0.2 2024-06-22 01:43:11,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=500621.0, ans=0.0 2024-06-22 01:43:21,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=500639.3333333333, ans=10.0 2024-06-22 01:43:23,688 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.457e+02 2.572e+02 2.687e+02 3.633e+02, threshold=5.144e+02, percent-clipped=0.0 2024-06-22 01:43:25,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=500657.6666666667, ans=0.125 2024-06-22 01:43:25,212 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0 2024-06-22 01:43:27,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=500676.0, ans=0.2 2024-06-22 01:43:28,365 INFO [train.py:1028] (0/2) Epoch 27, batch 10050, loss[loss=0.2108, simple_loss=0.2708, pruned_loss=0.07538, over 12573.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2668, pruned_loss=0.07176, over 2445014.34 frames. ], batch size: 22, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:43:29,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=500676.0, ans=0.035 2024-06-22 01:43:41,495 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:43:44,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=500712.6666666667, ans=0.0 2024-06-22 01:43:45,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.52 vs.
limit=6.0 2024-06-22 01:43:50,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=500731.0, ans=0.2 2024-06-22 01:43:56,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=500749.3333333333, ans=0.125 2024-06-22 01:43:57,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.53 vs. limit=15.0 2024-06-22 01:43:58,459 INFO [train.py:1028] (0/2) Epoch 27, batch 10100, loss[loss=0.1814, simple_loss=0.2413, pruned_loss=0.06072, over 10805.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2662, pruned_loss=0.07136, over 2423686.46 frames. ], batch size: 16, lr: 2.12e-03, grad_scale: 64.0 2024-06-22 01:44:02,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=500767.6666666667, ans=0.0 2024-06-22 01:44:02,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=500767.6666666667, ans=0.0 2024-06-22 01:44:06,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=500786.0, ans=0.125 2024-06-22 01:44:11,687 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-27.pt 2024-06-22 01:46:10,913 INFO [train.py:1028] (0/2) Epoch 28, batch 0, loss[loss=0.1845, simple_loss=0.2455, pruned_loss=0.06173, over 12975.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2455, pruned_loss=0.06173, over 12975.00 frames. ], batch size: 36, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:46:10,914 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 01:46:17,952 INFO [train.py:1060] (0/2) Epoch 28, validation: loss=0.1929, simple_loss=0.2534, pruned_loss=0.06623, over 351949.00 frames. 2024-06-22 01:46:17,953 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 01:46:29,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=500817.1666666667, ans=0.125 2024-06-22 01:46:34,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500835.5, ans=0.1 2024-06-22 01:46:35,114 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.334e+02 2.545e+02 2.791e+02 4.268e+02, threshold=5.090e+02, percent-clipped=0.0 2024-06-22 01:46:40,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-22 01:46:45,493 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=500872.1666666667, ans=0.2 2024-06-22 01:46:51,509 INFO [train.py:1028] (0/2) Epoch 28, batch 50, loss[loss=0.194, simple_loss=0.2584, pruned_loss=0.0648, over 12589.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2494, pruned_loss=0.06427, over 574597.59 frames. 
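], batch size: 29, lr: 2.08e-03, grad_scale: 64.0

The checkpoint.py entry above marks the epoch boundary: the full training state is written to zipformer/exp/epoch-27.pt (the roughly two-minute gap before the next entry is the save itself), and epoch 28 then opens by recomputing validation loss (0.1929 over 351949.00 dev frames, with peak memory now 18096MB). A quick way to inspect such a checkpoint offline; the key names are only typical of icefall checkpoints and should be treated as assumptions:

```python
import torch

ckpt = torch.load("zipformer/exp/epoch-27.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # expect entries like 'model', 'optimizer', ...
n_params = sum(v.numel() for v in ckpt["model"].values())
print(f"{n_params:,} parameters in the saved model state")
```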
2024-06-22 01:46:54,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=500890.5, ans=0.125 2024-06-22 01:47:09,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=12.56 vs. limit=15.0 2024-06-22 01:47:12,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500945.5, ans=0.1 2024-06-22 01:47:20,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=500963.8333333333, ans=0.2 2024-06-22 01:47:26,116 INFO [train.py:1028] (0/2) Epoch 28, batch 100, loss[loss=0.1812, simple_loss=0.2443, pruned_loss=0.0591, over 13319.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2472, pruned_loss=0.0639, over 1017525.27 frames. ], batch size: 46, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:47:28,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=500982.1666666667, ans=0.0 2024-06-22 01:47:32,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=501000.5, ans=0.0 2024-06-22 01:47:37,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=501018.8333333333, ans=0.125 2024-06-22 01:47:39,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2024-06-22 01:47:41,456 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.236e+02 2.322e+02 2.548e+02 3.879e+02, threshold=4.645e+02, percent-clipped=0.0 2024-06-22 01:47:49,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=501037.1666666667, ans=10.0 2024-06-22 01:47:49,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=501037.1666666667, ans=0.125 2024-06-22 01:47:50,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=501037.1666666667, ans=0.0 2024-06-22 01:47:58,866 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.48 vs. limit=10.0 2024-06-22 01:48:00,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=501073.8333333333, ans=0.125 2024-06-22 01:48:00,478 INFO [train.py:1028] (0/2) Epoch 28, batch 150, loss[loss=0.1949, simple_loss=0.2573, pruned_loss=0.06628, over 12583.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2466, pruned_loss=0.06285, over 1365603.14 frames. ], batch size: 29, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:48:04,500 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:48:10,519 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.64 vs.
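limit=15.0

The batch size reported in the train.py summaries swings widely (from 16 to 300+ in this section) because the DynamicBucketingSampler logged at startup packs cuts by total duration, not by count: each batch holds up to max_duration=550 seconds of audio, so a batch of short utterances contains far more cuts than a batch of long ones. A hedged sketch of that sampler setup; the manifest path is an assumption:

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # assumed path
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550.0,  # seconds of audio per batch, as configured here
    num_buckets=30,      # bucket similar-length cuts to reduce padding
    shuffle=True,
    drop_last=True,
)
for batch in sampler:
    print(len(batch))  # number of cuts, i.e. the logged "batch size"
    break
```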
2024-06-22 01:48:17,941 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501110.5, ans=0.1 2024-06-22 01:48:21,288 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.70 vs. limit=15.0 2024-06-22 01:48:27,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501147.1666666667, ans=0.1 2024-06-22 01:48:32,549 INFO [train.py:1028] (0/2) Epoch 28, batch 200, loss[loss=0.1983, simple_loss=0.2572, pruned_loss=0.06968, over 12526.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2469, pruned_loss=0.06313, over 1635756.84 frames. ], batch size: 202, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:48:33,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=501165.5, ans=0.125 2024-06-22 01:48:48,625 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.301e+02 2.414e+02 2.577e+02 3.322e+02, threshold=4.828e+02, percent-clipped=0.0 2024-06-22 01:48:54,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=501220.5, ans=0.125 2024-06-22 01:49:04,860 INFO [train.py:1028] (0/2) Epoch 28, batch 250, loss[loss=0.1825, simple_loss=0.2271, pruned_loss=0.06894, over 13046.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2473, pruned_loss=0.06345, over 1846898.65 frames. ], batch size: 144, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:49:06,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=501257.1666666667, ans=0.025 2024-06-22 01:49:11,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=501275.5, ans=0.125 2024-06-22 01:49:13,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=501275.5, ans=0.0 2024-06-22 01:49:14,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.81 vs. limit=15.0 2024-06-22 01:49:16,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=501275.5, ans=0.1 2024-06-22 01:49:22,443 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:49:27,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=501312.1666666667, ans=0.125 2024-06-22 01:49:32,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=501312.1666666667, ans=0.125 2024-06-22 01:49:34,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501330.5, ans=0.1 2024-06-22 01:49:39,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=501348.8333333333, ans=0.0 2024-06-22 01:49:39,741 INFO [train.py:1028] (0/2) Epoch 28, batch 300, loss[loss=0.1918, simple_loss=0.2499, pruned_loss=0.06685, over 13226.00 frames.
], tot_loss[loss=0.1877, simple_loss=0.248, pruned_loss=0.06372, over 2010183.59 frames. ], batch size: 112, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:49:45,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=501348.8333333333, ans=0.125 2024-06-22 01:49:46,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=501348.8333333333, ans=0.0 2024-06-22 01:49:58,958 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.370e+02 2.513e+02 2.810e+02 3.407e+02, threshold=5.026e+02, percent-clipped=0.0 2024-06-22 01:50:01,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=501403.8333333333, ans=0.025 2024-06-22 01:50:11,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=501422.1666666667, ans=0.2 2024-06-22 01:50:14,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=501440.5, ans=0.2 2024-06-22 01:50:15,040 INFO [train.py:1028] (0/2) Epoch 28, batch 350, loss[loss=0.1803, simple_loss=0.2462, pruned_loss=0.05717, over 12991.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2481, pruned_loss=0.06376, over 2139055.16 frames. ], batch size: 33, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:50:20,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=501458.8333333333, ans=0.025 2024-06-22 01:50:23,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=501458.8333333333, ans=0.0 2024-06-22 01:50:30,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0 2024-06-22 01:50:31,235 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=501477.1666666667, ans=0.025 2024-06-22 01:50:37,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=501495.5, ans=0.125 2024-06-22 01:50:41,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=501513.8333333333, ans=0.95 2024-06-22 01:50:45,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=501513.8333333333, ans=0.125 2024-06-22 01:50:47,071 INFO [train.py:1028] (0/2) Epoch 28, batch 400, loss[loss=0.1729, simple_loss=0.2391, pruned_loss=0.05331, over 13294.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2481, pruned_loss=0.06356, over 2240540.23 frames. 
], batch size: 63, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:50:53,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=501550.5, ans=0.125 2024-06-22 01:50:54,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=501550.5, ans=0.125 2024-06-22 01:50:58,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=501550.5, ans=0.125 2024-06-22 01:51:02,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=501568.8333333333, ans=0.125 2024-06-22 01:51:03,076 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.283e+02 2.420e+02 2.702e+02 3.554e+02, threshold=4.840e+02, percent-clipped=0.0 2024-06-22 01:51:13,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501605.5, ans=0.1 2024-06-22 01:51:18,654 INFO [train.py:1028] (0/2) Epoch 28, batch 450, loss[loss=0.1826, simple_loss=0.2611, pruned_loss=0.05202, over 13248.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2479, pruned_loss=0.06344, over 2314650.62 frames. ], batch size: 67, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:51:21,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=501623.8333333333, ans=0.125 2024-06-22 01:51:22,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=501623.8333333333, ans=0.2 2024-06-22 01:51:24,048 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.87 vs. limit=22.5 2024-06-22 01:51:31,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=501642.1666666667, ans=0.125 2024-06-22 01:51:37,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.87 vs. limit=12.0 2024-06-22 01:51:39,020 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2024-06-22 01:51:43,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=501678.8333333333, ans=0.125 2024-06-22 01:51:51,688 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.96 vs. limit=22.5 2024-06-22 01:51:56,879 INFO [train.py:1028] (0/2) Epoch 28, batch 500, loss[loss=0.1849, simple_loss=0.2469, pruned_loss=0.06147, over 13081.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2489, pruned_loss=0.06356, over 2377887.91 frames. 
], batch size: 121, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:51:58,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=501715.5, ans=0.125 2024-06-22 01:52:00,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=501715.5, ans=0.09899494936611666 2024-06-22 01:52:02,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.44 vs. limit=15.0 2024-06-22 01:52:06,238 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.85 vs. limit=15.0 2024-06-22 01:52:10,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2024-06-22 01:52:12,274 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.337e+02 2.480e+02 2.653e+02 4.092e+02, threshold=4.960e+02, percent-clipped=0.0 2024-06-22 01:52:20,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=501770.5, ans=0.0 2024-06-22 01:52:28,110 INFO [train.py:1028] (0/2) Epoch 28, batch 550, loss[loss=0.2043, simple_loss=0.2526, pruned_loss=0.07804, over 12929.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2484, pruned_loss=0.06335, over 2422522.16 frames. ], batch size: 158, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:52:34,116 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=501825.5, ans=0.125 2024-06-22 01:52:45,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=501843.8333333333, ans=0.0 2024-06-22 01:52:50,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=501862.1666666667, ans=0.125 2024-06-22 01:52:55,165 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=501880.5, ans=0.035 2024-06-22 01:52:55,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=501880.5, ans=0.0 2024-06-22 01:52:57,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=501880.5, ans=0.125 2024-06-22 01:52:59,357 INFO [train.py:1028] (0/2) Epoch 28, batch 600, loss[loss=0.1798, simple_loss=0.235, pruned_loss=0.06227, over 13015.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2482, pruned_loss=0.06342, over 2460933.66 frames. 
], batch size: 144, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:53:14,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=501935.5, ans=0.2 2024-06-22 01:53:15,041 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.289e+02 2.442e+02 2.671e+02 3.387e+02, threshold=4.883e+02, percent-clipped=0.0 2024-06-22 01:53:21,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=501953.8333333333, ans=15.0 2024-06-22 01:53:25,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=501972.1666666667, ans=0.95 2024-06-22 01:53:26,626 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.88 vs. limit=10.0 2024-06-22 01:53:34,306 INFO [train.py:1028] (0/2) Epoch 28, batch 650, loss[loss=0.187, simple_loss=0.2544, pruned_loss=0.05978, over 13200.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2476, pruned_loss=0.06271, over 2492239.32 frames. ], batch size: 59, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:53:45,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502008.8333333333, ans=0.125 2024-06-22 01:54:03,587 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=502063.8333333333, ans=0.125 2024-06-22 01:54:09,613 INFO [train.py:1028] (0/2) Epoch 28, batch 700, loss[loss=0.2064, simple_loss=0.2695, pruned_loss=0.07169, over 13300.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2479, pruned_loss=0.06319, over 2515148.20 frames. ], batch size: 46, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:54:14,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502082.1666666667, ans=0.1 2024-06-22 01:54:17,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=502100.5, ans=0.125 2024-06-22 01:54:22,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=502118.8333333333, ans=0.0 2024-06-22 01:54:24,864 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.304e+02 2.413e+02 2.613e+02 3.461e+02, threshold=4.827e+02, percent-clipped=0.0 2024-06-22 01:54:26,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=502118.8333333333, ans=0.125 2024-06-22 01:54:29,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.62 vs. limit=15.0 2024-06-22 01:54:32,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.17 vs. limit=10.0 2024-06-22 01:54:40,865 INFO [train.py:1028] (0/2) Epoch 28, batch 750, loss[loss=0.1796, simple_loss=0.2441, pruned_loss=0.05756, over 13241.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2478, pruned_loss=0.06289, over 2529045.39 frames. 
], batch size: 63, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:54:47,323 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=502192.1666666667, ans=0.2 2024-06-22 01:54:54,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502210.5, ans=0.1 2024-06-22 01:55:10,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=502247.1666666667, ans=0.0 2024-06-22 01:55:11,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=502247.1666666667, ans=0.0 2024-06-22 01:55:12,581 INFO [train.py:1028] (0/2) Epoch 28, batch 800, loss[loss=0.1883, simple_loss=0.2501, pruned_loss=0.06326, over 12970.00 frames. ], tot_loss[loss=0.187, simple_loss=0.2481, pruned_loss=0.06296, over 2541395.89 frames. ], batch size: 36, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:55:12,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=502265.5, ans=0.5 2024-06-22 01:55:19,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=502283.8333333333, ans=0.1 2024-06-22 01:55:28,490 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.289e+02 2.382e+02 2.523e+02 3.078e+02, threshold=4.765e+02, percent-clipped=0.0 2024-06-22 01:55:38,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.23 vs. limit=15.0 2024-06-22 01:55:47,882 INFO [train.py:1028] (0/2) Epoch 28, batch 850, loss[loss=0.1778, simple_loss=0.2308, pruned_loss=0.06242, over 13158.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2474, pruned_loss=0.06271, over 2550580.96 frames. ], batch size: 95, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:56:02,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=502393.8333333333, ans=0.125 2024-06-22 01:56:02,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=502393.8333333333, ans=0.125 2024-06-22 01:56:10,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.65 vs. limit=22.5 2024-06-22 01:56:22,636 INFO [train.py:1028] (0/2) Epoch 28, batch 900, loss[loss=0.172, simple_loss=0.2353, pruned_loss=0.05429, over 12985.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2473, pruned_loss=0.06294, over 2555319.37 frames. 
], batch size: 36, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:56:24,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=502448.8333333333, ans=0.0 2024-06-22 01:56:38,444 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.258e+02 2.367e+02 2.508e+02 3.108e+02, threshold=4.733e+02, percent-clipped=0.0 2024-06-22 01:56:41,331 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:56:44,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502503.8333333333, ans=0.1 2024-06-22 01:56:45,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=502503.8333333333, ans=0.125 2024-06-22 01:56:45,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=502503.8333333333, ans=0.09899494936611666 2024-06-22 01:56:52,621 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:56:52,898 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.55 vs. limit=15.0 2024-06-22 01:56:54,523 INFO [train.py:1028] (0/2) Epoch 28, batch 950, loss[loss=0.1833, simple_loss=0.2492, pruned_loss=0.05871, over 13004.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.247, pruned_loss=0.06274, over 2559378.17 frames. ], batch size: 39, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:56:56,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=502540.5, ans=0.0 2024-06-22 01:56:59,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=502540.5, ans=0.125 2024-06-22 01:57:01,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=502558.8333333333, ans=0.125 2024-06-22 01:57:05,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=502558.8333333333, ans=0.125 2024-06-22 01:57:08,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=15.0 2024-06-22 01:57:09,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2024-06-22 01:57:14,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=502595.5, ans=0.0 2024-06-22 01:57:23,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=502613.8333333333, ans=10.0 2024-06-22 01:57:25,644 INFO [train.py:1028] (0/2) Epoch 28, batch 1000, loss[loss=0.1871, simple_loss=0.2582, pruned_loss=0.05804, over 13309.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2467, pruned_loss=0.06285, over 2563065.70 frames. 
], batch size: 49, lr: 2.08e-03, grad_scale: 64.0 2024-06-22 01:57:26,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=502632.1666666667, ans=0.125 2024-06-22 01:57:27,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=502632.1666666667, ans=0.0 2024-06-22 01:57:44,546 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 01:57:44,997 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.323e+02 2.464e+02 2.669e+02 3.275e+02, threshold=4.928e+02, percent-clipped=0.0 2024-06-22 01:57:48,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=502687.1666666667, ans=0.125 2024-06-22 01:58:03,749 INFO [train.py:1028] (0/2) Epoch 28, batch 1050, loss[loss=0.1872, simple_loss=0.2551, pruned_loss=0.05968, over 13188.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2476, pruned_loss=0.06281, over 2566094.68 frames. ], batch size: 77, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:58:03,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=502723.8333333333, ans=0.125 2024-06-22 01:58:05,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2024-06-22 01:58:07,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=502723.8333333333, ans=0.2 2024-06-22 01:58:07,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.10 vs. limit=10.0 2024-06-22 01:58:20,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=502760.5, ans=0.2 2024-06-22 01:58:20,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=502760.5, ans=0.125 2024-06-22 01:58:31,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=502797.1666666667, ans=0.0 2024-06-22 01:58:36,256 INFO [train.py:1028] (0/2) Epoch 28, batch 1100, loss[loss=0.1923, simple_loss=0.2487, pruned_loss=0.06794, over 13318.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2482, pruned_loss=0.0631, over 2571079.58 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:58:37,761 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2024-06-22 01:58:46,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.19 vs. 
limit=15.0 2024-06-22 01:58:49,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=502852.1666666667, ans=10.0 2024-06-22 01:58:51,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=502852.1666666667, ans=0.125 2024-06-22 01:58:52,795 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.297e+02 2.410e+02 2.550e+02 3.088e+02, threshold=4.820e+02, percent-clipped=0.0 2024-06-22 01:58:58,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.58 vs. limit=10.0 2024-06-22 01:59:08,389 INFO [train.py:1028] (0/2) Epoch 28, batch 1150, loss[loss=0.1775, simple_loss=0.2382, pruned_loss=0.05838, over 13303.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2481, pruned_loss=0.06336, over 2572506.64 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:59:09,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=502907.1666666667, ans=0.125 2024-06-22 01:59:21,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=502943.8333333333, ans=0.125 2024-06-22 01:59:24,206 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.26 vs. limit=22.5 2024-06-22 01:59:40,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502980.5, ans=0.1 2024-06-22 01:59:42,462 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=502998.8333333333, ans=0.0 2024-06-22 01:59:42,957 INFO [train.py:1028] (0/2) Epoch 28, batch 1200, loss[loss=0.1655, simple_loss=0.2327, pruned_loss=0.0492, over 13120.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2484, pruned_loss=0.06361, over 2574098.40 frames. 
], batch size: 77, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 01:59:43,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=502998.8333333333, ans=0.0 2024-06-22 01:59:43,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=502998.8333333333, ans=0.5 2024-06-22 01:59:50,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502998.8333333333, ans=0.1 2024-06-22 02:00:03,227 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.299e+02 2.448e+02 2.591e+02 3.320e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-22 02:00:03,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=503035.5, ans=0.1 2024-06-22 02:00:14,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=503072.1666666667, ans=0.125 2024-06-22 02:00:17,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=503072.1666666667, ans=0.2 2024-06-22 02:00:19,065 INFO [train.py:1028] (0/2) Epoch 28, batch 1250, loss[loss=0.1728, simple_loss=0.2314, pruned_loss=0.05707, over 13123.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2478, pruned_loss=0.06349, over 2583921.26 frames. ], batch size: 112, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:00:21,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.00 vs. limit=15.0 2024-06-22 02:00:47,887 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2024-06-22 02:00:51,923 INFO [train.py:1028] (0/2) Epoch 28, batch 1300, loss[loss=0.2042, simple_loss=0.259, pruned_loss=0.07465, over 12729.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2482, pruned_loss=0.06377, over 2584386.61 frames. ], batch size: 176, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:00:53,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=503182.1666666667, ans=0.05 2024-06-22 02:00:54,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=503182.1666666667, ans=0.2 2024-06-22 02:01:00,421 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.72 vs. 
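limit=15.0

Each loss[...] triple in these summaries decomposes the pruned-RNNT objective: the logged numbers are consistent with the combined value being 0.5 times the simple (trivial-joiner) loss plus the pruned loss, i.e. a simple-loss weight of 0.5 after the warm-up ramp. Checking the batch 1250 entry above:

```python
# loss = simple_loss_scale * simple_loss + pruned_loss, with scale 0.5:
simple_loss, pruned_loss = 0.2314, 0.05707
print(f"{0.5 * simple_loss + pruned_loss:.4f}")  # 0.1728, as logged
```

tot_loss reports the same quantities, but as a running average over recent batches rather than a single batch, which is why it moves slowly.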
2024-06-22 02:01:02,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=503200.5, ans=0.125 2024-06-22 02:01:02,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=503200.5, ans=0.125 2024-06-22 02:01:08,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=503218.8333333333, ans=0.2 2024-06-22 02:01:09,367 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.263e+02 2.471e+02 2.669e+02 3.773e+02, threshold=4.941e+02, percent-clipped=0.0 2024-06-22 02:01:12,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=503237.1666666667, ans=0.025 2024-06-22 02:01:17,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=503237.1666666667, ans=0.125 2024-06-22 02:01:23,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=503255.5, ans=0.0 2024-06-22 02:01:23,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=503255.5, ans=0.125 2024-06-22 02:01:25,645 INFO [train.py:1028] (0/2) Epoch 28, batch 1350, loss[loss=0.1887, simple_loss=0.2499, pruned_loss=0.06375, over 13245.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2482, pruned_loss=0.06349, over 2586487.89 frames. ], batch size: 59, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:01:25,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=503273.8333333333, ans=0.2 2024-06-22 02:01:36,266 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:01:36,834 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=503292.1666666667, ans=0.125 2024-06-22 02:01:39,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=503310.5, ans=0.1 2024-06-22 02:02:03,182 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2024-06-22 02:02:03,515 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=503347.1666666667, ans=0.0 2024-06-22 02:02:04,740 INFO [train.py:1028] (0/2) Epoch 28, batch 1400, loss[loss=0.1945, simple_loss=0.2613, pruned_loss=0.06379, over 12796.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2484, pruned_loss=0.06368, over 2588330.22 frames.
], batch size: 26, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:02:06,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=503365.5, ans=0.0 2024-06-22 02:02:10,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=503383.8333333333, ans=0.0 2024-06-22 02:02:19,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=503402.1666666667, ans=0.1 2024-06-22 02:02:21,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=503402.1666666667, ans=0.125 2024-06-22 02:02:21,471 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.278e+02 2.378e+02 2.560e+02 3.214e+02, threshold=4.757e+02, percent-clipped=0.0 2024-06-22 02:02:28,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=503420.5, ans=0.09899494936611666 2024-06-22 02:02:37,789 INFO [train.py:1028] (0/2) Epoch 28, batch 1450, loss[loss=0.1889, simple_loss=0.2461, pruned_loss=0.06592, over 13089.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2478, pruned_loss=0.06355, over 2587858.92 frames. ], batch size: 121, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:02:48,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.56 vs. limit=15.0 2024-06-22 02:02:53,707 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=503493.8333333333, ans=0.125 2024-06-22 02:03:10,275 INFO [train.py:1028] (0/2) Epoch 28, batch 1500, loss[loss=0.1867, simple_loss=0.2446, pruned_loss=0.0644, over 13214.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2475, pruned_loss=0.06372, over 2589286.81 frames. ], batch size: 83, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:03:13,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=503548.8333333333, ans=0.025 2024-06-22 02:03:14,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=503548.8333333333, ans=0.125 2024-06-22 02:03:15,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=503548.8333333333, ans=0.125 2024-06-22 02:03:27,353 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.300e+02 2.420e+02 2.576e+02 3.434e+02, threshold=4.840e+02, percent-clipped=0.0 2024-06-22 02:03:30,028 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=503603.8333333333, ans=0.125 2024-06-22 02:03:47,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.51 vs. limit=15.0 2024-06-22 02:03:49,374 INFO [train.py:1028] (0/2) Epoch 28, batch 1550, loss[loss=0.2019, simple_loss=0.2539, pruned_loss=0.07493, over 12998.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2476, pruned_loss=0.06397, over 2583723.82 frames. 
], batch size: 102, lr: 2.07e-03, grad_scale: 64.0 2024-06-22 02:04:04,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503677.1666666667, ans=0.1 2024-06-22 02:04:11,256 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2024-06-22 02:04:14,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.55 vs. limit=22.5 2024-06-22 02:04:22,260 INFO [train.py:1028] (0/2) Epoch 28, batch 1600, loss[loss=0.1826, simple_loss=0.246, pruned_loss=0.05959, over 13173.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2475, pruned_loss=0.0636, over 2580266.62 frames. ], batch size: 77, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:04:24,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=503732.1666666667, ans=0.0 2024-06-22 02:04:26,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=503732.1666666667, ans=0.125 2024-06-22 02:04:26,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=503732.1666666667, ans=0.04949747468305833 2024-06-22 02:04:37,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=503768.8333333333, ans=0.125 2024-06-22 02:04:38,058 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=503768.8333333333, ans=0.125 2024-06-22 02:04:39,230 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.294e+02 2.484e+02 2.746e+02 3.743e+02, threshold=4.968e+02, percent-clipped=0.0 2024-06-22 02:04:51,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=503805.5, ans=0.0 2024-06-22 02:04:54,317 INFO [train.py:1028] (0/2) Epoch 28, batch 1650, loss[loss=0.1869, simple_loss=0.2425, pruned_loss=0.06565, over 13186.00 frames. ], tot_loss[loss=0.1875, simple_loss=0.2474, pruned_loss=0.0638, over 2577191.02 frames. ], batch size: 95, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:04:56,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=22.5 2024-06-22 02:05:00,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=503842.1666666667, ans=0.125 2024-06-22 02:05:02,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=503842.1666666667, ans=0.2 2024-06-22 02:05:11,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=503860.5, ans=0.025 2024-06-22 02:05:17,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503878.8333333333, ans=0.1 2024-06-22 02:05:19,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=503878.8333333333, ans=0.125 2024-06-22 02:05:21,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=503897.1666666667, ans=0.05 2024-06-22 02:05:27,721 INFO [train.py:1028] (0/2) Epoch 28, batch 1700, loss[loss=0.1738, simple_loss=0.2467, pruned_loss=0.05039, over 12376.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2477, pruned_loss=0.06374, over 2581273.08 frames. ], batch size: 25, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:05:29,130 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=503915.5, ans=0.0 2024-06-22 02:05:34,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=503933.8333333333, ans=0.125 2024-06-22 02:05:38,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=503933.8333333333, ans=0.1 2024-06-22 02:05:51,529 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.223e+02 2.351e+02 2.494e+02 4.179e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 02:05:51,737 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:05:54,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=503970.5, ans=0.125 2024-06-22 02:05:56,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.56 vs. limit=15.0 2024-06-22 02:06:06,752 INFO [train.py:1028] (0/2) Epoch 28, batch 1750, loss[loss=0.1864, simple_loss=0.2607, pruned_loss=0.05606, over 12500.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2483, pruned_loss=0.06387, over 2581631.60 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:06:17,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=504025.5, ans=0.2 2024-06-22 02:06:23,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504043.8333333333, ans=0.125 2024-06-22 02:06:34,909 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=504080.5, ans=0.0 2024-06-22 02:06:39,324 INFO [train.py:1028] (0/2) Epoch 28, batch 1800, loss[loss=0.1814, simple_loss=0.2423, pruned_loss=0.06021, over 13217.00 frames. 
], tot_loss[loss=0.1884, simple_loss=0.2486, pruned_loss=0.06409, over 2582851.97 frames. ], batch size: 67, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:06:48,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=504117.1666666667, ans=0.2 2024-06-22 02:06:57,301 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.300e+02 2.459e+02 2.635e+02 3.621e+02, threshold=4.917e+02, percent-clipped=0.0 2024-06-22 02:07:07,727 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504172.1666666667, ans=0.1 2024-06-22 02:07:12,740 INFO [train.py:1028] (0/2) Epoch 28, batch 1850, loss[loss=0.1955, simple_loss=0.2515, pruned_loss=0.06969, over 13187.00 frames. ], tot_loss[loss=0.189, simple_loss=0.2493, pruned_loss=0.06436, over 2584819.62 frames. ], batch size: 83, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:07:17,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=504190.5, ans=0.125 2024-06-22 02:07:22,876 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2024-06-22 02:07:35,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=504245.5, ans=0.0 2024-06-22 02:07:41,226 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:07:41,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=504263.8333333333, ans=0.125 2024-06-22 02:07:42,955 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.73 vs. limit=10.0 2024-06-22 02:07:47,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.07 vs. limit=22.5 2024-06-22 02:07:48,301 INFO [train.py:1028] (0/2) Epoch 28, batch 1900, loss[loss=0.1883, simple_loss=0.2426, pruned_loss=0.06698, over 13185.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.249, pruned_loss=0.06425, over 2587325.31 frames. 
], batch size: 95, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:07:48,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=504282.1666666667, ans=0.0 2024-06-22 02:07:51,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=504282.1666666667, ans=0.125 2024-06-22 02:08:03,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=504318.8333333333, ans=0.0 2024-06-22 02:08:07,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=504318.8333333333, ans=0.0 2024-06-22 02:08:09,063 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.310e+02 2.489e+02 2.635e+02 3.817e+02, threshold=4.977e+02, percent-clipped=0.0 2024-06-22 02:08:14,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504337.1666666667, ans=0.125 2024-06-22 02:08:16,716 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=504337.1666666667, ans=0.125 2024-06-22 02:08:17,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=504355.5, ans=0.025 2024-06-22 02:08:18,874 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2024-06-22 02:08:21,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=504355.5, ans=0.05 2024-06-22 02:08:24,253 INFO [train.py:1028] (0/2) Epoch 28, batch 1950, loss[loss=0.1735, simple_loss=0.2369, pruned_loss=0.05505, over 13263.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.248, pruned_loss=0.06414, over 2593142.10 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:08:27,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=504373.8333333333, ans=0.125 2024-06-22 02:08:28,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=504373.8333333333, ans=0.2 2024-06-22 02:08:29,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=504373.8333333333, ans=0.04949747468305833 2024-06-22 02:08:30,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=504392.1666666667, ans=0.05 2024-06-22 02:08:33,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=504392.1666666667, ans=0.0 2024-06-22 02:08:39,235 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.92 vs. limit=15.0 2024-06-22 02:08:40,504 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.63 vs. 
limit=15.0 2024-06-22 02:08:41,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=504410.5, ans=0.0 2024-06-22 02:08:47,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=504428.8333333333, ans=0.125 2024-06-22 02:08:56,826 INFO [train.py:1028] (0/2) Epoch 28, batch 2000, loss[loss=0.1784, simple_loss=0.2451, pruned_loss=0.05588, over 12583.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2479, pruned_loss=0.06412, over 2588942.85 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:08:56,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504465.5, ans=0.1 2024-06-22 02:09:02,479 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2024-06-22 02:09:02,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504483.8333333333, ans=0.1 2024-06-22 02:09:09,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504502.1666666667, ans=0.1 2024-06-22 02:09:10,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=504502.1666666667, ans=0.125 2024-06-22 02:09:12,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=504502.1666666667, ans=0.2 2024-06-22 02:09:15,330 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.316e+02 2.431e+02 2.600e+02 4.030e+02, threshold=4.862e+02, percent-clipped=0.0 2024-06-22 02:09:18,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504520.5, ans=0.125 2024-06-22 02:09:18,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=504520.5, ans=0.125 2024-06-22 02:09:22,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=504520.5, ans=0.125 2024-06-22 02:09:24,073 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=504538.8333333333, ans=0.125 2024-06-22 02:09:28,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=504538.8333333333, ans=0.025 2024-06-22 02:09:30,253 INFO [train.py:1028] (0/2) Epoch 28, batch 2050, loss[loss=0.1907, simple_loss=0.2576, pruned_loss=0.06188, over 12705.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2486, pruned_loss=0.06449, over 2584211.77 frames. ], batch size: 29, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:09:32,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=504557.1666666667, ans=0.0 2024-06-22 02:09:35,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.27 vs. 
limit=6.0 2024-06-22 02:09:37,950 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2024-06-22 02:09:41,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=12.0 2024-06-22 02:09:46,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=504593.8333333333, ans=0.95 2024-06-22 02:10:08,346 INFO [train.py:1028] (0/2) Epoch 28, batch 2100, loss[loss=0.1879, simple_loss=0.2525, pruned_loss=0.06166, over 13191.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2486, pruned_loss=0.064, over 2587044.39 frames. ], batch size: 59, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:10:08,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=504648.8333333333, ans=0.125 2024-06-22 02:10:10,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=504648.8333333333, ans=0.2 2024-06-22 02:10:16,863 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=504667.1666666667, ans=0.025 2024-06-22 02:10:26,387 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.319e+02 2.485e+02 2.702e+02 3.516e+02, threshold=4.970e+02, percent-clipped=0.0 2024-06-22 02:10:38,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504722.1666666667, ans=0.1 2024-06-22 02:10:39,245 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.33 vs. limit=6.0 2024-06-22 02:10:40,725 INFO [train.py:1028] (0/2) Epoch 28, batch 2150, loss[loss=0.1704, simple_loss=0.2381, pruned_loss=0.05132, over 13288.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2486, pruned_loss=0.06382, over 2589614.73 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:10:43,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys.whitening_limit, batch_count=504740.5, ans=6.0 2024-06-22 02:10:46,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=504758.8333333333, ans=0.0 2024-06-22 02:10:50,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=504758.8333333333, ans=0.04949747468305833 2024-06-22 02:10:56,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=504777.1666666667, ans=0.2 2024-06-22 02:11:11,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504813.8333333333, ans=0.1 2024-06-22 02:11:12,857 INFO [train.py:1028] (0/2) Epoch 28, batch 2200, loss[loss=0.1951, simple_loss=0.2465, pruned_loss=0.07187, over 13188.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2486, pruned_loss=0.0639, over 2589601.01 frames. 
], batch size: 83, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:11:17,160 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:11:24,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=504850.5, ans=0.125 2024-06-22 02:11:25,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=504868.8333333333, ans=0.125 2024-06-22 02:11:26,686 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:11:30,825 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.292e+02 2.464e+02 2.683e+02 3.210e+02, threshold=4.929e+02, percent-clipped=0.0 2024-06-22 02:11:30,979 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=504868.8333333333, ans=0.2 2024-06-22 02:11:33,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=504887.1666666667, ans=0.0 2024-06-22 02:11:44,671 INFO [train.py:1028] (0/2) Epoch 28, batch 2250, loss[loss=0.1942, simple_loss=0.2592, pruned_loss=0.06459, over 13272.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2486, pruned_loss=0.06384, over 2587065.96 frames. ], batch size: 63, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:12:00,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=504942.1666666667, ans=0.025 2024-06-22 02:12:02,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=504942.1666666667, ans=0.0 2024-06-22 02:12:16,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=504978.8333333333, ans=0.04949747468305833 2024-06-22 02:12:20,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=504997.1666666667, ans=0.5 2024-06-22 02:12:23,866 INFO [train.py:1028] (0/2) Epoch 28, batch 2300, loss[loss=0.1874, simple_loss=0.2498, pruned_loss=0.06252, over 12948.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2487, pruned_loss=0.06353, over 2581465.53 frames. ], batch size: 33, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:12:24,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=505015.5, ans=0.0 2024-06-22 02:12:26,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=505015.5, ans=0.1 2024-06-22 02:12:36,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=505052.1666666667, ans=0.2 2024-06-22 02:12:41,988 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.301e+02 2.440e+02 2.653e+02 3.442e+02, threshold=4.879e+02, percent-clipped=0.0 2024-06-22 02:12:47,263 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.48 vs. 
limit=15.0 2024-06-22 02:12:47,953 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=505070.5, ans=0.1 2024-06-22 02:12:56,740 INFO [train.py:1028] (0/2) Epoch 28, batch 2350, loss[loss=0.1978, simple_loss=0.2618, pruned_loss=0.06689, over 13272.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2495, pruned_loss=0.06409, over 2585968.38 frames. ], batch size: 67, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:12:58,730 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=505107.1666666667, ans=0.125 2024-06-22 02:13:00,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=505107.1666666667, ans=0.125 2024-06-22 02:13:03,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505125.5, ans=0.125 2024-06-22 02:13:11,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505143.8333333333, ans=0.0 2024-06-22 02:13:13,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=505143.8333333333, ans=0.125 2024-06-22 02:13:18,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=505162.1666666667, ans=0.125 2024-06-22 02:13:24,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.46 vs. limit=12.0 2024-06-22 02:13:28,899 INFO [train.py:1028] (0/2) Epoch 28, batch 2400, loss[loss=0.1921, simple_loss=0.2535, pruned_loss=0.06535, over 13315.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2489, pruned_loss=0.06403, over 2588562.83 frames. ], batch size: 46, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:13:29,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505198.8333333333, ans=0.1 2024-06-22 02:13:29,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.20 vs. 
limit=10.0 2024-06-22 02:13:52,536 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.297e+02 2.459e+02 2.575e+02 3.877e+02, threshold=4.919e+02, percent-clipped=0.0 2024-06-22 02:13:55,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=505253.8333333333, ans=0.0 2024-06-22 02:13:59,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505253.8333333333, ans=0.1 2024-06-22 02:14:01,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=505272.1666666667, ans=0.0 2024-06-22 02:14:02,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=505272.1666666667, ans=0.125 2024-06-22 02:14:03,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=505272.1666666667, ans=0.95 2024-06-22 02:14:03,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=505272.1666666667, ans=0.1 2024-06-22 02:14:06,856 INFO [train.py:1028] (0/2) Epoch 28, batch 2450, loss[loss=0.1562, simple_loss=0.2229, pruned_loss=0.0448, over 13334.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2482, pruned_loss=0.06429, over 2584643.69 frames. ], batch size: 63, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:14:13,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=505308.8333333333, ans=0.0 2024-06-22 02:14:18,126 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2024-06-22 02:14:20,180 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.59 vs. limit=12.0 2024-06-22 02:14:27,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=505345.5, ans=0.0 2024-06-22 02:14:27,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=505345.5, ans=0.5 2024-06-22 02:14:39,663 INFO [train.py:1028] (0/2) Epoch 28, batch 2500, loss[loss=0.1812, simple_loss=0.2357, pruned_loss=0.06334, over 13226.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2471, pruned_loss=0.06425, over 2587563.42 frames. ], batch size: 83, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:14:49,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.32 vs. 
limit=22.5 2024-06-22 02:14:51,570 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:14:55,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=505418.8333333333, ans=0.0 2024-06-22 02:14:57,643 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.291e+02 2.416e+02 2.656e+02 3.694e+02, threshold=4.831e+02, percent-clipped=0.0 2024-06-22 02:15:02,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=505437.1666666667, ans=0.0 2024-06-22 02:15:04,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=505437.1666666667, ans=0.125 2024-06-22 02:15:08,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=505455.5, ans=0.2 2024-06-22 02:15:12,189 INFO [train.py:1028] (0/2) Epoch 28, batch 2550, loss[loss=0.1805, simple_loss=0.2423, pruned_loss=0.0594, over 12385.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.246, pruned_loss=0.06392, over 2586761.22 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:15:35,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=505528.8333333333, ans=0.0 2024-06-22 02:15:42,221 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=505528.8333333333, ans=0.125 2024-06-22 02:15:46,463 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2024-06-22 02:15:49,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.38 vs. limit=15.0 2024-06-22 02:15:53,805 INFO [train.py:1028] (0/2) Epoch 28, batch 2600, loss[loss=0.1778, simple_loss=0.241, pruned_loss=0.05731, over 13247.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2444, pruned_loss=0.0633, over 2585831.18 frames. ], batch size: 52, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:16:08,424 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2024-06-22 02:16:11,744 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.253e+02 2.365e+02 2.575e+02 3.401e+02, threshold=4.730e+02, percent-clipped=0.0 2024-06-22 02:16:14,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.06 vs. limit=15.0 2024-06-22 02:16:15,940 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505620.5, ans=0.1 2024-06-22 02:16:16,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.98 vs. limit=15.0 2024-06-22 02:16:16,920 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.93 vs. 
limit=15.0 2024-06-22 02:16:26,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=505657.1666666667, ans=0.2 2024-06-22 02:16:26,840 INFO [train.py:1028] (0/2) Epoch 28, batch 2650, loss[loss=0.1858, simple_loss=0.2394, pruned_loss=0.06607, over 13031.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2429, pruned_loss=0.06293, over 2586154.71 frames. ], batch size: 144, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:16:35,654 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.29 vs. limit=12.0 2024-06-22 02:16:37,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=505675.5, ans=0.125 2024-06-22 02:16:59,667 INFO [train.py:1028] (0/2) Epoch 28, batch 2700, loss[loss=0.1908, simple_loss=0.2433, pruned_loss=0.06915, over 13252.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2417, pruned_loss=0.06277, over 2584713.73 frames. ], batch size: 89, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:17:03,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=505748.8333333333, ans=0.09899494936611666 2024-06-22 02:17:08,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=505767.1666666667, ans=0.125 2024-06-22 02:17:12,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=505785.5, ans=0.0 2024-06-22 02:17:18,142 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.255e+02 2.367e+02 2.561e+02 3.307e+02, threshold=4.735e+02, percent-clipped=0.0 2024-06-22 02:17:19,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=505803.8333333333, ans=0.0 2024-06-22 02:17:21,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=22.5 2024-06-22 02:17:32,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=505822.1666666667, ans=0.125 2024-06-22 02:17:38,563 INFO [train.py:1028] (0/2) Epoch 28, batch 2750, loss[loss=0.1853, simple_loss=0.2488, pruned_loss=0.06091, over 13150.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2412, pruned_loss=0.06221, over 2581578.91 frames. 
], batch size: 43, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:17:38,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=505840.5, ans=0.125 2024-06-22 02:17:42,393 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505840.5, ans=0.1 2024-06-22 02:17:46,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=505858.8333333333, ans=0.2 2024-06-22 02:17:53,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=505877.1666666667, ans=0.125 2024-06-22 02:18:11,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=505932.1666666667, ans=0.0 2024-06-22 02:18:11,807 INFO [train.py:1028] (0/2) Epoch 28, batch 2800, loss[loss=0.2036, simple_loss=0.2484, pruned_loss=0.07936, over 10728.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2406, pruned_loss=0.06205, over 2578620.45 frames. ], batch size: 303, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:18:21,167 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=505950.5, ans=0.0 2024-06-22 02:18:25,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=505968.8333333333, ans=0.5 2024-06-22 02:18:30,270 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.237e+02 2.352e+02 2.551e+02 3.481e+02, threshold=4.703e+02, percent-clipped=0.0 2024-06-22 02:18:35,743 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-276000.pt 2024-06-22 02:18:42,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=505987.1666666667, ans=0.125 2024-06-22 02:18:49,743 INFO [train.py:1028] (0/2) Epoch 28, batch 2850, loss[loss=0.1897, simple_loss=0.2447, pruned_loss=0.06738, over 13317.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2399, pruned_loss=0.06193, over 2577099.66 frames. ], batch size: 49, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:18:51,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506023.8333333333, ans=0.125 2024-06-22 02:18:52,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.16 vs. limit=22.5 2024-06-22 02:18:53,438 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2024-06-22 02:19:05,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=506060.5, ans=0.0 2024-06-22 02:19:10,100 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2024-06-22 02:19:19,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=506097.1666666667, ans=0.025 2024-06-22 02:19:25,601 INFO [train.py:1028] (0/2) Epoch 28, batch 2900, loss[loss=0.1899, simple_loss=0.2478, pruned_loss=0.06604, over 13078.00 frames. 
], tot_loss[loss=0.1809, simple_loss=0.2386, pruned_loss=0.06162, over 2584875.32 frames. ], batch size: 55, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:19:29,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=506115.5, ans=0.0 2024-06-22 02:19:36,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506133.8333333333, ans=0.1 2024-06-22 02:19:36,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=506133.8333333333, ans=0.0 2024-06-22 02:19:44,036 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:19:47,876 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.300e+02 2.460e+02 2.673e+02 3.769e+02, threshold=4.919e+02, percent-clipped=0.0 2024-06-22 02:19:56,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=506188.8333333333, ans=0.125 2024-06-22 02:20:02,329 INFO [train.py:1028] (0/2) Epoch 28, batch 2950, loss[loss=0.1777, simple_loss=0.2368, pruned_loss=0.05932, over 13277.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2382, pruned_loss=0.06146, over 2579089.61 frames. ], batch size: 43, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:20:06,766 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=12.0 2024-06-22 02:20:07,482 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.63 vs. limit=15.0 2024-06-22 02:20:24,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506262.1666666667, ans=0.1 2024-06-22 02:20:26,037 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=506262.1666666667, ans=0.125 2024-06-22 02:20:35,774 INFO [train.py:1028] (0/2) Epoch 28, batch 3000, loss[loss=0.1696, simple_loss=0.237, pruned_loss=0.0511, over 13200.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2368, pruned_loss=0.06069, over 2577436.35 frames. ], batch size: 59, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:20:35,775 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 02:20:43,600 INFO [train.py:1060] (0/2) Epoch 28, validation: loss=0.1921, simple_loss=0.2514, pruned_loss=0.06643, over 351949.00 frames. 
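The "loss=... over N frames" figures in these train.py entries, including the validation line just above, are frame-weighted averages rather than plain per-batch means. The sketch below shows that accumulation in its simplest form; the class and method names are illustrative assumptions, not icefall's actual MetricsTracker API, and the real tot_loss in train.py may additionally decay older batches, which would be consistent with the roughly constant ~2.58e6 frame counts seen in the tot_loss entries above.

```python
# Illustrative sketch only: frame-weighted loss averaging of the kind
# reported as "loss=... over N frames" in the entries above. Names are
# assumptions, not icefall's actual MetricsTracker API.
class FrameWeightedLoss:
    def __init__(self) -> None:
        self.loss_sum = 0.0  # sum over batches of (mean loss * num frames)
        self.frames = 0      # total acoustic frames contributing

    def update(self, loss: float, num_frames: int) -> None:
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1)


tracker = FrameWeightedLoss()
tracker.update(0.1921, 351949)  # e.g. the validation entry logged above
print(f"validation: loss={tracker.value:.4f}, over {tracker.frames} frames")
```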
2024-06-22 02:20:43,601 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 02:20:45,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=506298.8333333333, ans=0.0 2024-06-22 02:20:49,299 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=506298.8333333333, ans=0.0 2024-06-22 02:20:52,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=506317.1666666667, ans=0.0 2024-06-22 02:20:55,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506317.1666666667, ans=0.1 2024-06-22 02:21:02,219 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.264e+02 2.400e+02 2.674e+02 3.342e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-22 02:21:07,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=506353.8333333333, ans=0.0 2024-06-22 02:21:09,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=506372.1666666667, ans=0.125 2024-06-22 02:21:10,779 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2024-06-22 02:21:16,952 INFO [train.py:1028] (0/2) Epoch 28, batch 3050, loss[loss=0.1656, simple_loss=0.2202, pruned_loss=0.05554, over 13314.00 frames. ], tot_loss[loss=0.1796, simple_loss=0.2367, pruned_loss=0.06125, over 2578591.96 frames. ], batch size: 46, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:21:17,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=506390.5, ans=0.0 2024-06-22 02:21:43,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=12.0 2024-06-22 02:21:45,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=506445.5, ans=0.125 2024-06-22 02:21:56,410 INFO [train.py:1028] (0/2) Epoch 28, batch 3100, loss[loss=0.1818, simple_loss=0.235, pruned_loss=0.06428, over 13022.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2358, pruned_loss=0.06067, over 2578851.84 frames. ], batch size: 144, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:22:01,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=506482.1666666667, ans=0.125 2024-06-22 02:22:03,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=506500.5, ans=0.125 2024-06-22 02:22:14,704 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.312e+02 2.436e+02 2.687e+02 3.466e+02, threshold=4.872e+02, percent-clipped=0.0 2024-06-22 02:22:17,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=506537.1666666667, ans=0.125 2024-06-22 02:22:23,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=506555.5, ans=0.0 2024-06-22 02:22:29,381 INFO [train.py:1028] (0/2) Epoch 28, batch 3150, loss[loss=0.1749, simple_loss=0.2266, pruned_loss=0.06159, over 12949.00 frames. 
], tot_loss[loss=0.1775, simple_loss=0.2346, pruned_loss=0.06019, over 2580581.20 frames. ], batch size: 158, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:22:29,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=506573.8333333333, ans=0.5 2024-06-22 02:22:35,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=506573.8333333333, ans=0.0 2024-06-22 02:22:41,113 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=506592.1666666667, ans=0.0 2024-06-22 02:22:45,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506610.5, ans=0.125 2024-06-22 02:22:52,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2024-06-22 02:23:00,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=506647.1666666667, ans=0.0 2024-06-22 02:23:03,000 INFO [train.py:1028] (0/2) Epoch 28, batch 3200, loss[loss=0.1695, simple_loss=0.23, pruned_loss=0.05446, over 13110.00 frames. ], tot_loss[loss=0.1767, simple_loss=0.234, pruned_loss=0.05971, over 2581347.18 frames. ], batch size: 55, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:23:09,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=506683.8333333333, ans=0.125 2024-06-22 02:23:10,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=506683.8333333333, ans=0.125 2024-06-22 02:23:17,939 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:23:20,908 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.220e+02 2.369e+02 2.514e+02 3.098e+02, threshold=4.738e+02, percent-clipped=0.0 2024-06-22 02:23:28,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=506720.5, ans=0.125 2024-06-22 02:23:35,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=506738.8333333333, ans=0.07 2024-06-22 02:23:35,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. limit=15.0 2024-06-22 02:23:36,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=506738.8333333333, ans=0.1 2024-06-22 02:23:37,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=506738.8333333333, ans=0.2 2024-06-22 02:23:41,215 INFO [train.py:1028] (0/2) Epoch 28, batch 3250, loss[loss=0.1653, simple_loss=0.2274, pruned_loss=0.05154, over 13267.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2335, pruned_loss=0.05979, over 2585384.95 frames. ], batch size: 72, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:23:41,595 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.33 vs. 
limit=15.0 2024-06-22 02:23:49,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=506775.5, ans=0.125 2024-06-22 02:24:00,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506812.1666666667, ans=0.1 2024-06-22 02:24:03,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506812.1666666667, ans=0.1 2024-06-22 02:24:05,909 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:24:14,398 INFO [train.py:1028] (0/2) Epoch 28, batch 3300, loss[loss=0.1844, simple_loss=0.2273, pruned_loss=0.0707, over 12657.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2329, pruned_loss=0.05943, over 2581793.24 frames. ], batch size: 176, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:24:30,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=506885.5, ans=0.0 2024-06-22 02:24:30,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=506885.5, ans=0.125 2024-06-22 02:24:32,095 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.270e+02 2.420e+02 2.594e+02 3.768e+02, threshold=4.839e+02, percent-clipped=0.0 2024-06-22 02:24:33,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506903.8333333333, ans=0.1 2024-06-22 02:24:38,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=506903.8333333333, ans=0.125 2024-06-22 02:24:46,167 INFO [train.py:1028] (0/2) Epoch 28, batch 3350, loss[loss=0.1721, simple_loss=0.2241, pruned_loss=0.06, over 12933.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2322, pruned_loss=0.0596, over 2576043.95 frames. ], batch size: 158, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:24:57,215 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.93 vs. limit=22.5 2024-06-22 02:25:02,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=26.40 vs. limit=22.5 2024-06-22 02:25:08,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=506995.5, ans=0.0 2024-06-22 02:25:21,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2024-06-22 02:25:27,202 INFO [train.py:1028] (0/2) Epoch 28, batch 3400, loss[loss=0.2108, simple_loss=0.2671, pruned_loss=0.07726, over 12629.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2323, pruned_loss=0.05978, over 2575593.92 frames. ], batch size: 22, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:25:30,225 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=15.0 2024-06-22 02:25:32,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=507032.1666666667, ans=0.2 2024-06-22 02:25:43,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=507068.8333333333, ans=0.0 2024-06-22 02:25:45,524 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.242e+02 2.401e+02 2.630e+02 3.658e+02, threshold=4.802e+02, percent-clipped=0.0 2024-06-22 02:25:51,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=507087.1666666667, ans=0.07 2024-06-22 02:25:53,454 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.14 vs. limit=15.0 2024-06-22 02:25:55,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=507105.5, ans=0.0 2024-06-22 02:26:00,375 INFO [train.py:1028] (0/2) Epoch 28, batch 3450, loss[loss=0.1801, simple_loss=0.2361, pruned_loss=0.062, over 12749.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2319, pruned_loss=0.05969, over 2577035.60 frames. ], batch size: 176, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:26:04,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=507123.8333333333, ans=0.125 2024-06-22 02:26:11,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=507142.1666666667, ans=0.125 2024-06-22 02:26:26,427 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.03 vs. limit=22.5 2024-06-22 02:26:32,604 INFO [train.py:1028] (0/2) Epoch 28, batch 3500, loss[loss=0.1723, simple_loss=0.2235, pruned_loss=0.0605, over 12882.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2314, pruned_loss=0.05941, over 2575399.32 frames. ], batch size: 33, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:26:50,779 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.178e+02 2.309e+02 2.461e+02 3.259e+02, threshold=4.618e+02, percent-clipped=0.0 2024-06-22 02:26:52,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=507270.5, ans=0.0 2024-06-22 02:26:57,531 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.74 vs. limit=12.0 2024-06-22 02:27:00,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=507288.8333333333, ans=0.1 2024-06-22 02:27:01,649 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=15.05 vs. limit=15.0 2024-06-22 02:27:03,290 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507288.8333333333, ans=0.1 2024-06-22 02:27:05,046 INFO [train.py:1028] (0/2) Epoch 28, batch 3550, loss[loss=0.1751, simple_loss=0.227, pruned_loss=0.06159, over 13173.00 frames. ], tot_loss[loss=0.1747, simple_loss=0.2315, pruned_loss=0.05901, over 2577714.23 frames. 
], batch size: 95, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:27:08,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=507307.1666666667, ans=0.05 2024-06-22 02:27:12,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=507325.5, ans=0.0 2024-06-22 02:27:16,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=507325.5, ans=0.0 2024-06-22 02:27:28,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=507343.8333333333, ans=0.125 2024-06-22 02:27:28,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=507343.8333333333, ans=0.125 2024-06-22 02:27:30,886 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=507362.1666666667, ans=0.125 2024-06-22 02:27:41,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2024-06-22 02:27:43,115 INFO [train.py:1028] (0/2) Epoch 28, batch 3600, loss[loss=0.1664, simple_loss=0.2303, pruned_loss=0.05123, over 13277.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2307, pruned_loss=0.05883, over 2581708.89 frames. ], batch size: 49, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:27:44,533 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=507398.8333333333, ans=0.125 2024-06-22 02:27:45,357 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=507398.8333333333, ans=0.125 2024-06-22 02:27:52,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=507417.1666666667, ans=0.0 2024-06-22 02:27:53,444 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.71 vs. limit=15.0 2024-06-22 02:27:57,773 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=507435.5, ans=0.5 2024-06-22 02:28:01,595 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.281e+02 2.430e+02 2.623e+02 3.597e+02, threshold=4.860e+02, percent-clipped=0.0 2024-06-22 02:28:07,075 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=507453.8333333333, ans=0.2 2024-06-22 02:28:07,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=507453.8333333333, ans=0.025 2024-06-22 02:28:16,032 INFO [train.py:1028] (0/2) Epoch 28, batch 3650, loss[loss=0.1659, simple_loss=0.221, pruned_loss=0.05537, over 13063.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2308, pruned_loss=0.05882, over 2579283.79 frames. 
], batch size: 102, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:28:18,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=507490.5, ans=0.125 2024-06-22 02:28:20,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=507490.5, ans=0.125 2024-06-22 02:28:23,681 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.26 vs. limit=15.0 2024-06-22 02:28:27,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507508.8333333333, ans=0.1 2024-06-22 02:28:29,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=507527.1666666667, ans=0.125 2024-06-22 02:28:37,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507545.5, ans=0.1 2024-06-22 02:28:38,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507545.5, ans=0.1 2024-06-22 02:28:40,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2024-06-22 02:28:49,312 INFO [train.py:1028] (0/2) Epoch 28, batch 3700, loss[loss=0.1728, simple_loss=0.2314, pruned_loss=0.05716, over 13256.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2299, pruned_loss=0.05848, over 2585199.36 frames. ], batch size: 72, lr: 2.07e-03, grad_scale: 32.0 2024-06-22 02:28:50,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.69 vs. limit=15.0 2024-06-22 02:28:51,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=507582.1666666667, ans=0.125 2024-06-22 02:28:59,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0 2024-06-22 02:29:04,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=507618.8333333333, ans=0.125 2024-06-22 02:29:07,497 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.190e+02 2.378e+02 2.549e+02 3.471e+02, threshold=4.756e+02, percent-clipped=0.0 2024-06-22 02:29:12,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=507637.1666666667, ans=0.025 2024-06-22 02:29:12,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=507637.1666666667, ans=0.125 2024-06-22 02:29:15,757 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.94 vs. limit=10.0 2024-06-22 02:29:28,150 INFO [train.py:1028] (0/2) Epoch 28, batch 3750, loss[loss=0.1831, simple_loss=0.2392, pruned_loss=0.06347, over 12374.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2295, pruned_loss=0.05832, over 2586863.35 frames. 
], batch size: 22, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:29:33,133 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=507673.8333333333, ans=0.125 2024-06-22 02:29:40,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=507710.5, ans=0.125 2024-06-22 02:29:41,581 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=507710.5, ans=0.07 2024-06-22 02:29:46,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=507710.5, ans=0.125 2024-06-22 02:29:59,703 INFO [train.py:1028] (0/2) Epoch 28, batch 3800, loss[loss=0.1722, simple_loss=0.2345, pruned_loss=0.05494, over 13180.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2295, pruned_loss=0.05841, over 2584773.26 frames. ], batch size: 83, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:30:16,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=507802.1666666667, ans=0.1 2024-06-22 02:30:18,669 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.200e+02 2.344e+02 2.553e+02 3.068e+02, threshold=4.689e+02, percent-clipped=0.0 2024-06-22 02:30:20,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=507820.5, ans=0.05 2024-06-22 02:30:25,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=507820.5, ans=0.0 2024-06-22 02:30:34,309 INFO [train.py:1028] (0/2) Epoch 28, batch 3850, loss[loss=0.1675, simple_loss=0.2175, pruned_loss=0.05871, over 13021.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2288, pruned_loss=0.05795, over 2584340.95 frames. ], batch size: 144, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:30:42,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.57 vs. limit=15.0 2024-06-22 02:30:54,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507912.1666666667, ans=0.1 2024-06-22 02:30:58,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507912.1666666667, ans=0.1 2024-06-22 02:31:06,959 INFO [train.py:1028] (0/2) Epoch 28, batch 3900, loss[loss=0.1801, simple_loss=0.2306, pruned_loss=0.06482, over 13160.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.2286, pruned_loss=0.05787, over 2587173.97 frames. 
], batch size: 83, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:31:08,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=507948.8333333333, ans=0.1 2024-06-22 02:31:15,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=507967.1666666667, ans=0.125 2024-06-22 02:31:18,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=507967.1666666667, ans=0.025 2024-06-22 02:31:20,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=507985.5, ans=0.125 2024-06-22 02:31:28,425 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.168e+02 2.320e+02 2.490e+02 3.133e+02, threshold=4.639e+02, percent-clipped=0.0 2024-06-22 02:31:33,780 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.33 vs. limit=22.5 2024-06-22 02:31:37,790 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=508003.8333333333, ans=0.125 2024-06-22 02:31:42,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=508022.1666666667, ans=10.0 2024-06-22 02:31:46,303 INFO [train.py:1028] (0/2) Epoch 28, batch 3950, loss[loss=0.1672, simple_loss=0.2175, pruned_loss=0.0585, over 13099.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2281, pruned_loss=0.05764, over 2587739.34 frames. ], batch size: 132, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:31:49,004 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:32:00,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=508077.1666666667, ans=0.035 2024-06-22 02:32:03,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.05 vs. limit=15.0 2024-06-22 02:32:05,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=508095.5, ans=0.95 2024-06-22 02:32:06,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.57 vs. limit=22.5 2024-06-22 02:32:12,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=508113.8333333333, ans=10.0 2024-06-22 02:32:13,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=508113.8333333333, ans=0.0 2024-06-22 02:32:14,200 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=508113.8333333333, ans=0.125 2024-06-22 02:32:14,417 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2024-06-22 02:32:18,993 INFO [train.py:1028] (0/2) Epoch 28, batch 4000, loss[loss=0.1798, simple_loss=0.2464, pruned_loss=0.0566, over 13055.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.228, pruned_loss=0.05768, over 2582504.18 frames. 
], batch size: 39, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:32:34,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=508168.8333333333, ans=0.025 2024-06-22 02:32:37,261 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.251e+02 2.352e+02 2.537e+02 3.336e+02, threshold=4.704e+02, percent-clipped=0.0 2024-06-22 02:32:40,175 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2024-06-22 02:32:49,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=508205.5, ans=0.125 2024-06-22 02:32:51,946 INFO [train.py:1028] (0/2) Epoch 28, batch 4050, loss[loss=0.1872, simple_loss=0.2341, pruned_loss=0.07013, over 11022.00 frames. ], tot_loss[loss=0.1716, simple_loss=0.2278, pruned_loss=0.05775, over 2580883.32 frames. ], batch size: 304, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:32:52,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=508223.8333333333, ans=0.0 2024-06-22 02:32:52,798 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=508223.8333333333, ans=0.0 2024-06-22 02:33:03,444 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2024-06-22 02:33:03,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=508242.1666666667, ans=0.0 2024-06-22 02:33:04,526 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=508260.5, ans=0.025 2024-06-22 02:33:05,847 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508260.5, ans=0.1 2024-06-22 02:33:13,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=508278.8333333333, ans=0.025 2024-06-22 02:33:14,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=508278.8333333333, ans=0.0 2024-06-22 02:33:28,160 INFO [train.py:1028] (0/2) Epoch 28, batch 4100, loss[loss=0.174, simple_loss=0.2221, pruned_loss=0.063, over 13002.00 frames. ], tot_loss[loss=0.1718, simple_loss=0.2277, pruned_loss=0.05793, over 2577387.58 frames. 
], batch size: 102, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:33:28,267 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=508315.5, ans=10.0 2024-06-22 02:33:31,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=508315.5, ans=0.125 2024-06-22 02:33:40,767 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=508333.8333333333, ans=0.0 2024-06-22 02:33:49,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=508352.1666666667, ans=0.2 2024-06-22 02:33:49,597 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.240e+02 2.367e+02 2.585e+02 3.602e+02, threshold=4.734e+02, percent-clipped=0.0 2024-06-22 02:33:53,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=508370.5, ans=0.0 2024-06-22 02:33:56,302 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=508370.5, ans=0.1 2024-06-22 02:34:01,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=508388.8333333333, ans=0.125 2024-06-22 02:34:04,023 INFO [train.py:1028] (0/2) Epoch 28, batch 4150, loss[loss=0.1652, simple_loss=0.2185, pruned_loss=0.0559, over 13106.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2271, pruned_loss=0.05766, over 2574947.94 frames. ], batch size: 55, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:34:06,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=508407.1666666667, ans=0.0 2024-06-22 02:34:14,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508425.5, ans=0.1 2024-06-22 02:34:16,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=508443.8333333333, ans=0.0 2024-06-22 02:34:24,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0 2024-06-22 02:34:36,572 INFO [train.py:1028] (0/2) Epoch 28, batch 4200, loss[loss=0.1742, simple_loss=0.2328, pruned_loss=0.05778, over 12997.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2267, pruned_loss=0.05742, over 2577950.61 frames. ], batch size: 102, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:34:41,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=508498.8333333333, ans=0.0 2024-06-22 02:34:43,811 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=508517.1666666667, ans=0.1 2024-06-22 02:34:45,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=508517.1666666667, ans=0.0 2024-06-22 02:34:45,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=508517.1666666667, ans=0.025 2024-06-22 02:34:47,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.52 vs. 
limit=15.0 2024-06-22 02:34:47,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.68 vs. limit=10.0 2024-06-22 02:34:50,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=508535.5, ans=0.0 2024-06-22 02:34:54,419 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.163e+02 2.299e+02 2.410e+02 3.221e+02, threshold=4.599e+02, percent-clipped=0.0 2024-06-22 02:34:59,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=508553.8333333333, ans=0.125 2024-06-22 02:35:09,289 INFO [train.py:1028] (0/2) Epoch 28, batch 4250, loss[loss=0.1607, simple_loss=0.2308, pruned_loss=0.04531, over 13296.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2265, pruned_loss=0.05712, over 2579820.26 frames. ], batch size: 46, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:35:09,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=508590.5, ans=0.025 2024-06-22 02:35:13,970 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:35:19,461 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2024-06-22 02:35:23,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=508627.1666666667, ans=0.2 2024-06-22 02:35:48,036 INFO [train.py:1028] (0/2) Epoch 28, batch 4300, loss[loss=0.1789, simple_loss=0.2326, pruned_loss=0.0626, over 13216.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2265, pruned_loss=0.05719, over 2580329.82 frames. ], batch size: 59, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:36:05,691 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 2.244e+02 2.349e+02 2.533e+02 3.198e+02, threshold=4.698e+02, percent-clipped=0.0 2024-06-22 02:36:10,888 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=508737.1666666667, ans=0.125 2024-06-22 02:36:11,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=508737.1666666667, ans=0.1 2024-06-22 02:36:14,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=508755.5, ans=0.125 2024-06-22 02:36:19,835 INFO [train.py:1028] (0/2) Epoch 28, batch 4350, loss[loss=0.1719, simple_loss=0.2341, pruned_loss=0.05485, over 13172.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2263, pruned_loss=0.05733, over 2584857.25 frames. 
], batch size: 59, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:36:26,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=508792.1666666667, ans=0.0 2024-06-22 02:36:36,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=508810.5, ans=0.0 2024-06-22 02:36:47,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=508847.1666666667, ans=0.1 2024-06-22 02:36:52,555 INFO [train.py:1028] (0/2) Epoch 28, batch 4400, loss[loss=0.1687, simple_loss=0.2209, pruned_loss=0.05828, over 13169.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2263, pruned_loss=0.05742, over 2585287.93 frames. ], batch size: 83, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:37:02,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=508883.8333333333, ans=0.125 2024-06-22 02:37:04,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508883.8333333333, ans=0.125 2024-06-22 02:37:04,629 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=508883.8333333333, ans=0.125 2024-06-22 02:37:07,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=508902.1666666667, ans=0.0 2024-06-22 02:37:10,725 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.259e+02 2.411e+02 2.583e+02 3.312e+02, threshold=4.822e+02, percent-clipped=0.0 2024-06-22 02:37:28,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=508938.8333333333, ans=0.125 2024-06-22 02:37:30,233 INFO [train.py:1028] (0/2) Epoch 28, batch 4450, loss[loss=0.1515, simple_loss=0.209, pruned_loss=0.04698, over 13009.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2267, pruned_loss=0.05783, over 2580910.23 frames. ], batch size: 33, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:37:32,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508957.1666666667, ans=0.125 2024-06-22 02:37:36,568 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.62 vs. limit=12.0 2024-06-22 02:37:37,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=508957.1666666667, ans=0.125 2024-06-22 02:37:39,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=508975.5, ans=0.0 2024-06-22 02:37:42,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=508975.5, ans=0.125 2024-06-22 02:37:45,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=508993.8333333333, ans=0.125 2024-06-22 02:38:05,338 INFO [train.py:1028] (0/2) Epoch 28, batch 4500, loss[loss=0.1646, simple_loss=0.219, pruned_loss=0.05512, over 13236.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2267, pruned_loss=0.05784, over 2586245.21 frames. 
], batch size: 89, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:38:11,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=509067.1666666667, ans=0.0 2024-06-22 02:38:14,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509067.1666666667, ans=0.1 2024-06-22 02:38:15,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509067.1666666667, ans=0.1 2024-06-22 02:38:16,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=509067.1666666667, ans=0.1 2024-06-22 02:38:23,538 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.192e+02 2.313e+02 2.537e+02 3.154e+02, threshold=4.626e+02, percent-clipped=0.0 2024-06-22 02:38:36,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=509122.1666666667, ans=0.2 2024-06-22 02:38:38,138 INFO [train.py:1028] (0/2) Epoch 28, batch 4550, loss[loss=0.1653, simple_loss=0.2259, pruned_loss=0.05231, over 13213.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2263, pruned_loss=0.05748, over 2589643.29 frames. ], batch size: 52, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:38:40,183 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=509140.5, ans=0.2 2024-06-22 02:38:43,928 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=509158.8333333333, ans=0.0 2024-06-22 02:38:45,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=509158.8333333333, ans=0.125 2024-06-22 02:39:10,655 INFO [train.py:1028] (0/2) Epoch 28, batch 4600, loss[loss=0.1705, simple_loss=0.2219, pruned_loss=0.05956, over 12548.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2265, pruned_loss=0.05741, over 2585452.67 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:39:32,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.200e+02 2.308e+02 2.530e+02 2.971e+02, threshold=4.617e+02, percent-clipped=0.0 2024-06-22 02:39:33,114 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=15.0 2024-06-22 02:39:38,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=509287.1666666667, ans=0.2 2024-06-22 02:39:39,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=509287.1666666667, ans=0.015 2024-06-22 02:39:46,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=509305.5, ans=0.125 2024-06-22 02:39:49,384 INFO [train.py:1028] (0/2) Epoch 28, batch 4650, loss[loss=0.1727, simple_loss=0.2259, pruned_loss=0.05972, over 13066.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2264, pruned_loss=0.05759, over 2588080.20 frames. 
], batch size: 132, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:39:56,870 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=509342.1666666667, ans=0.125 2024-06-22 02:40:06,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=509360.5, ans=0.0 2024-06-22 02:40:11,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=509378.8333333333, ans=0.0 2024-06-22 02:40:22,988 INFO [train.py:1028] (0/2) Epoch 28, batch 4700, loss[loss=0.1597, simple_loss=0.2114, pruned_loss=0.05401, over 12263.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2264, pruned_loss=0.05744, over 2582324.20 frames. ], batch size: 25, lr: 2.06e-03, grad_scale: 64.0 2024-06-22 02:40:25,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=509415.5, ans=0.2 2024-06-22 02:40:29,736 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.01 vs. limit=12.0 2024-06-22 02:40:35,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509433.8333333333, ans=0.1 2024-06-22 02:40:41,378 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.189e+02 2.305e+02 2.464e+02 3.540e+02, threshold=4.609e+02, percent-clipped=0.0 2024-06-22 02:40:45,393 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.75 vs. limit=10.0 2024-06-22 02:40:46,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=509470.5, ans=0.025 2024-06-22 02:40:53,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=509488.8333333333, ans=0.125 2024-06-22 02:40:56,365 INFO [train.py:1028] (0/2) Epoch 28, batch 4750, loss[loss=0.1928, simple_loss=0.2451, pruned_loss=0.07029, over 12597.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2257, pruned_loss=0.05744, over 2579372.91 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:41:06,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509525.5, ans=0.1 2024-06-22 02:41:07,902 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.10 vs. 
limit=15.0 2024-06-22 02:41:10,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=509543.8333333333, ans=15.0 2024-06-22 02:41:12,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=509543.8333333333, ans=0.5 2024-06-22 02:41:18,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=509562.1666666667, ans=0.125 2024-06-22 02:41:19,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509562.1666666667, ans=0.1 2024-06-22 02:41:31,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=509580.5, ans=0.125 2024-06-22 02:41:35,857 INFO [train.py:1028] (0/2) Epoch 28, batch 4800, loss[loss=0.1627, simple_loss=0.2199, pruned_loss=0.05273, over 13268.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2253, pruned_loss=0.05726, over 2575449.66 frames. ], batch size: 63, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:41:55,000 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.245e+02 2.353e+02 2.511e+02 3.167e+02, threshold=4.706e+02, percent-clipped=0.0 2024-06-22 02:42:08,786 INFO [train.py:1028] (0/2) Epoch 28, batch 4850, loss[loss=0.1701, simple_loss=0.2232, pruned_loss=0.05853, over 13312.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2253, pruned_loss=0.05721, over 2575238.38 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:42:21,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509727.1666666667, ans=0.1 2024-06-22 02:42:21,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509727.1666666667, ans=0.1 2024-06-22 02:42:24,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=509727.1666666667, ans=0.05 2024-06-22 02:42:28,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=509745.5, ans=0.125 2024-06-22 02:42:33,621 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=509745.5, ans=0.125 2024-06-22 02:42:41,992 INFO [train.py:1028] (0/2) Epoch 28, batch 4900, loss[loss=0.1561, simple_loss=0.2193, pruned_loss=0.04644, over 13211.00 frames. ], tot_loss[loss=0.17, simple_loss=0.2255, pruned_loss=0.05722, over 2576958.65 frames. ], batch size: 59, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:42:44,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=509782.1666666667, ans=0.125 2024-06-22 02:42:46,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.93 vs. 
limit=15.0 2024-06-22 02:42:53,004 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 02:43:00,990 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.177e+02 2.287e+02 2.460e+02 3.421e+02, threshold=4.575e+02, percent-clipped=0.0 2024-06-22 02:43:01,484 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.88 vs. limit=15.0 2024-06-22 02:43:17,547 INFO [train.py:1028] (0/2) Epoch 28, batch 4950, loss[loss=0.1844, simple_loss=0.2323, pruned_loss=0.06826, over 11238.00 frames. ], tot_loss[loss=0.17, simple_loss=0.2253, pruned_loss=0.05733, over 2569780.85 frames. ], batch size: 304, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:43:17,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=509873.8333333333, ans=0.125 2024-06-22 02:43:18,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=509873.8333333333, ans=0.0 2024-06-22 02:43:20,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=509873.8333333333, ans=0.0 2024-06-22 02:43:23,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=509892.1666666667, ans=0.2 2024-06-22 02:43:26,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=509892.1666666667, ans=0.0 2024-06-22 02:43:35,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=509910.5, ans=0.125 2024-06-22 02:43:47,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=509947.1666666667, ans=0.125 2024-06-22 02:43:52,614 INFO [train.py:1028] (0/2) Epoch 28, batch 5000, loss[loss=0.1649, simple_loss=0.2152, pruned_loss=0.05732, over 13172.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2243, pruned_loss=0.05664, over 2573647.68 frames. ], batch size: 95, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:43:53,009 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.25 vs. limit=15.0 2024-06-22 02:44:03,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. 
limit=15.0 2024-06-22 02:44:08,717 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=510002.1666666667, ans=0.125 2024-06-22 02:44:09,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=510002.1666666667, ans=0.125 2024-06-22 02:44:11,855 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.148e+02 2.250e+02 2.438e+02 3.117e+02, threshold=4.500e+02, percent-clipped=0.0 2024-06-22 02:44:16,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=510020.5, ans=0.125 2024-06-22 02:44:24,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=510038.8333333333, ans=0.0 2024-06-22 02:44:26,230 INFO [train.py:1028] (0/2) Epoch 28, batch 5050, loss[loss=0.1665, simple_loss=0.222, pruned_loss=0.05545, over 12960.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2242, pruned_loss=0.05643, over 2571516.17 frames. ], batch size: 36, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:44:26,526 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2024-06-22 02:44:27,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=510057.1666666667, ans=0.05 2024-06-22 02:44:45,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=510093.8333333333, ans=0.125 2024-06-22 02:44:54,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=510130.5, ans=0.2 2024-06-22 02:44:55,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=510130.5, ans=0.125 2024-06-22 02:44:59,381 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.86 vs. limit=22.5 2024-06-22 02:45:00,239 INFO [train.py:1028] (0/2) Epoch 28, batch 5100, loss[loss=0.1619, simple_loss=0.2234, pruned_loss=0.0502, over 12895.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2238, pruned_loss=0.05665, over 2568828.69 frames. ], batch size: 39, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:45:01,336 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.58 vs. limit=6.0 2024-06-22 02:45:01,748 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=510148.8333333333, ans=0.0 2024-06-22 02:45:23,316 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.204e+02 2.349e+02 2.512e+02 3.432e+02, threshold=4.698e+02, percent-clipped=0.0 2024-06-22 02:45:38,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.80 vs. limit=22.5 2024-06-22 02:45:40,268 INFO [train.py:1028] (0/2) Epoch 28, batch 5150, loss[loss=0.161, simple_loss=0.2046, pruned_loss=0.05863, over 13125.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.224, pruned_loss=0.05708, over 2571043.06 frames. 
], batch size: 132, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:45:45,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=510240.5, ans=0.125 2024-06-22 02:45:47,441 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=510258.8333333333, ans=0.0 2024-06-22 02:45:48,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2024-06-22 02:45:55,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=510277.1666666667, ans=0.2 2024-06-22 02:45:56,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=510277.1666666667, ans=10.0 2024-06-22 02:46:03,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=510295.5, ans=0.0 2024-06-22 02:46:13,010 INFO [train.py:1028] (0/2) Epoch 28, batch 5200, loss[loss=0.1643, simple_loss=0.2166, pruned_loss=0.05603, over 13154.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.224, pruned_loss=0.05705, over 2573955.07 frames. ], batch size: 95, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:46:15,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=510332.1666666667, ans=0.2 2024-06-22 02:46:17,261 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0 2024-06-22 02:46:19,519 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=510350.5, ans=0.1 2024-06-22 02:46:32,107 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.187e+02 2.335e+02 2.531e+02 4.037e+02, threshold=4.669e+02, percent-clipped=0.0 2024-06-22 02:46:37,317 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=6.0 2024-06-22 02:46:39,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=510405.5, ans=0.125 2024-06-22 02:46:45,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=510423.8333333333, ans=0.0 2024-06-22 02:46:46,178 INFO [train.py:1028] (0/2) Epoch 28, batch 5250, loss[loss=0.1599, simple_loss=0.2225, pruned_loss=0.04867, over 13283.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2243, pruned_loss=0.05713, over 2570562.23 frames. 
], batch size: 52, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:46:49,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=510423.8333333333, ans=0.2 2024-06-22 02:46:49,630 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=510423.8333333333, ans=0.0 2024-06-22 02:47:00,452 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=510460.5, ans=0.1 2024-06-22 02:47:04,166 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2024-06-22 02:47:07,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510478.8333333333, ans=0.1 2024-06-22 02:47:13,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=510478.8333333333, ans=0.125 2024-06-22 02:47:14,439 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=510478.8333333333, ans=0.1 2024-06-22 02:47:15,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510497.1666666667, ans=0.1 2024-06-22 02:47:22,704 INFO [train.py:1028] (0/2) Epoch 28, batch 5300, loss[loss=0.1667, simple_loss=0.2213, pruned_loss=0.05611, over 13072.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2248, pruned_loss=0.0573, over 2567710.19 frames. ], batch size: 144, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:47:24,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=510515.5, ans=0.125 2024-06-22 02:47:26,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=510515.5, ans=0.125 2024-06-22 02:47:32,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=510533.8333333333, ans=0.0 2024-06-22 02:47:34,766 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=510533.8333333333, ans=0.0 2024-06-22 02:47:44,522 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.20 vs. limit=15.0 2024-06-22 02:47:45,334 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.211e+02 2.309e+02 2.467e+02 2.967e+02, threshold=4.619e+02, percent-clipped=0.0 2024-06-22 02:47:46,985 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2024-06-22 02:47:49,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=510570.5, ans=0.1 2024-06-22 02:47:52,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=510588.8333333333, ans=0.0 2024-06-22 02:47:59,067 INFO [train.py:1028] (0/2) Epoch 28, batch 5350, loss[loss=0.1762, simple_loss=0.2368, pruned_loss=0.05781, over 11145.00 frames. 
], tot_loss[loss=0.1697, simple_loss=0.2247, pruned_loss=0.05733, over 2573672.87 frames. ], batch size: 16, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:48:00,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=510607.1666666667, ans=0.05 2024-06-22 02:48:00,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=510607.1666666667, ans=0.07 2024-06-22 02:48:03,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=510607.1666666667, ans=0.125 2024-06-22 02:48:12,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=510643.8333333333, ans=0.2 2024-06-22 02:48:12,615 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=510643.8333333333, ans=0.0 2024-06-22 02:48:18,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=510662.1666666667, ans=0.125 2024-06-22 02:48:26,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=510680.5, ans=0.125 2024-06-22 02:48:28,612 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.45 vs. limit=6.0 2024-06-22 02:48:28,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=510680.5, ans=0.04949747468305833 2024-06-22 02:48:31,213 INFO [train.py:1028] (0/2) Epoch 28, batch 5400, loss[loss=0.1851, simple_loss=0.231, pruned_loss=0.06956, over 12143.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2249, pruned_loss=0.05759, over 2566563.07 frames. ], batch size: 240, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:48:35,267 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2024-06-22 02:48:41,942 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=510717.1666666667, ans=0.125 2024-06-22 02:48:48,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2024-06-22 02:48:50,556 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.205e+02 2.308e+02 2.460e+02 3.091e+02, threshold=4.617e+02, percent-clipped=0.0 2024-06-22 02:48:52,679 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=510753.8333333333, ans=0.0 2024-06-22 02:48:54,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=510753.8333333333, ans=0.2 2024-06-22 02:48:58,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=510772.1666666667, ans=0.125 2024-06-22 02:48:59,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=510772.1666666667, ans=0.125 2024-06-22 02:49:04,608 INFO [train.py:1028] (0/2) Epoch 28, batch 5450, loss[loss=0.1914, simple_loss=0.2442, pruned_loss=0.06935, over 12940.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2252, pruned_loss=0.05757, over 2571515.36 frames. ], batch size: 26, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:49:05,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=510790.5, ans=0.125 2024-06-22 02:49:15,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.67 vs. limit=6.0 2024-06-22 02:49:16,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=510808.8333333333, ans=0.125 2024-06-22 02:49:18,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=510808.8333333333, ans=0.125 2024-06-22 02:49:27,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510827.1666666667, ans=0.1 2024-06-22 02:49:37,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=510863.8333333333, ans=0.2 2024-06-22 02:49:44,178 INFO [train.py:1028] (0/2) Epoch 28, batch 5500, loss[loss=0.184, simple_loss=0.2344, pruned_loss=0.06684, over 12263.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2252, pruned_loss=0.05753, over 2564469.72 frames. ], batch size: 240, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:49:46,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=510882.1666666667, ans=0.2 2024-06-22 02:49:56,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=510900.5, ans=0.125 2024-06-22 02:50:01,043 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=26.99 vs. limit=22.5 2024-06-22 02:50:04,657 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.249e+02 2.374e+02 2.590e+02 3.505e+02, threshold=4.748e+02, percent-clipped=0.0 2024-06-22 02:50:07,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=510937.1666666667, ans=0.025 2024-06-22 02:50:10,953 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. 
limit=15.0 2024-06-22 02:50:18,207 INFO [train.py:1028] (0/2) Epoch 28, batch 5550, loss[loss=0.1736, simple_loss=0.2335, pruned_loss=0.05689, over 13248.00 frames. ], tot_loss[loss=0.1692, simple_loss=0.2244, pruned_loss=0.05701, over 2568498.60 frames. ], batch size: 43, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:50:35,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=511010.5, ans=0.035 2024-06-22 02:50:37,102 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=511010.5, ans=0.125 2024-06-22 02:50:55,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.43 vs. limit=15.0 2024-06-22 02:50:59,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511065.5, ans=0.1 2024-06-22 02:50:59,815 INFO [train.py:1028] (0/2) Epoch 28, batch 5600, loss[loss=0.1564, simple_loss=0.2016, pruned_loss=0.05557, over 13215.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2241, pruned_loss=0.05701, over 2569964.24 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:51:18,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=511102.1666666667, ans=0.0 2024-06-22 02:51:23,103 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.201e+02 2.313e+02 2.495e+02 3.105e+02, threshold=4.625e+02, percent-clipped=0.0 2024-06-22 02:51:29,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.16 vs. limit=22.5 2024-06-22 02:51:38,747 INFO [train.py:1028] (0/2) Epoch 28, batch 5650, loss[loss=0.177, simple_loss=0.2267, pruned_loss=0.0636, over 12452.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2239, pruned_loss=0.05666, over 2573400.51 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 32.0 2024-06-22 02:51:47,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=511157.1666666667, ans=0.125 2024-06-22 02:51:54,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=511175.5, ans=0.125 2024-06-22 02:51:54,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=511175.5, ans=0.125 2024-06-22 02:51:59,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511193.8333333333, ans=0.125 2024-06-22 02:52:06,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=511212.1666666667, ans=0.125 2024-06-22 02:52:09,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.71 vs. limit=15.0 2024-06-22 02:52:17,410 INFO [train.py:1028] (0/2) Epoch 28, batch 5700, loss[loss=0.1509, simple_loss=0.2113, pruned_loss=0.04529, over 13306.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2238, pruned_loss=0.05677, over 2577112.77 frames. 
], batch size: 63, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:52:17,848 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.82 vs. limit=15.0
2024-06-22 02:52:19,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=511248.8333333333, ans=0.1
2024-06-22 02:52:26,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=511267.1666666667, ans=0.025
2024-06-22 02:52:28,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511267.1666666667, ans=0.1
2024-06-22 02:52:28,947 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 02:52:35,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=511285.5, ans=0.125
2024-06-22 02:52:36,019 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.222e+02 2.355e+02 2.470e+02 3.464e+02, threshold=4.710e+02, percent-clipped=0.0
2024-06-22 02:52:42,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=511303.8333333333, ans=0.125
2024-06-22 02:52:50,063 INFO [train.py:1028] (0/2) Epoch 28, batch 5750, loss[loss=0.1751, simple_loss=0.2283, pruned_loss=0.0609, over 12796.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2242, pruned_loss=0.05673, over 2577273.59 frames. ], batch size: 176, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:52:53,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=511340.5, ans=0.125
2024-06-22 02:52:56,874 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 02:52:57,014 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0
2024-06-22 02:52:57,407 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=511358.8333333333, ans=0.1
2024-06-22 02:52:59,601 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.73 vs. limit=6.0
2024-06-22 02:53:09,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=12.0
2024-06-22 02:53:11,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=511395.5, ans=0.125
2024-06-22 02:53:25,863 INFO [train.py:1028] (0/2) Epoch 28, batch 5800, loss[loss=0.1947, simple_loss=0.2462, pruned_loss=0.07166, over 12747.00 frames. ], tot_loss[loss=0.1709, simple_loss=0.226, pruned_loss=0.05786, over 2577115.84 frames. ], batch size: 176, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:53:27,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=511432.1666666667, ans=0.0
2024-06-22 02:53:27,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=511432.1666666667, ans=0.125
2024-06-22 02:53:48,154 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.308e+02 2.448e+02 2.639e+02 3.828e+02, threshold=4.896e+02, percent-clipped=0.0
2024-06-22 02:53:53,264 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=511487.1666666667, ans=0.07
2024-06-22 02:54:00,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=511505.5, ans=0.2
2024-06-22 02:54:02,233 INFO [train.py:1028] (0/2) Epoch 28, batch 5850, loss[loss=0.1832, simple_loss=0.2301, pruned_loss=0.0682, over 12480.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2279, pruned_loss=0.0586, over 2575922.46 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:54:02,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=511523.8333333333, ans=0.5
2024-06-22 02:54:08,633 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=15.0
2024-06-22 02:54:10,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=511542.1666666667, ans=0.125
2024-06-22 02:54:10,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511542.1666666667, ans=0.1
2024-06-22 02:54:13,388 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=511542.1666666667, ans=0.125
2024-06-22 02:54:15,702 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.88 vs. limit=15.0
2024-06-22 02:54:19,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=511560.5, ans=0.125
2024-06-22 02:54:31,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=12.78 vs. limit=15.0
2024-06-22 02:54:34,871 INFO [train.py:1028] (0/2) Epoch 28, batch 5900, loss[loss=0.1617, simple_loss=0.2119, pruned_loss=0.05578, over 13044.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2287, pruned_loss=0.05864, over 2576647.23 frames. ], batch size: 121, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:54:35,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=511615.5, ans=0.07
2024-06-22 02:54:53,804 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.207e+02 2.363e+02 2.599e+02 3.698e+02, threshold=4.726e+02, percent-clipped=0.0
2024-06-22 02:54:59,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=511670.5, ans=0.125
2024-06-22 02:55:07,356 INFO [train.py:1028] (0/2) Epoch 28, batch 5950, loss[loss=0.1692, simple_loss=0.222, pruned_loss=0.05823, over 13118.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.23, pruned_loss=0.05895, over 2582003.52 frames. ], batch size: 121, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:55:21,816 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=12.0
2024-06-22 02:55:28,405 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=511743.8333333333, ans=0.0
2024-06-22 02:55:29,276 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0
2024-06-22 02:55:41,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511780.5, ans=0.1
2024-06-22 02:55:44,796 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0
2024-06-22 02:55:48,344 INFO [train.py:1028] (0/2) Epoch 28, batch 6000, loss[loss=0.2164, simple_loss=0.2638, pruned_loss=0.08453, over 12112.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2316, pruned_loss=0.05955, over 2573967.67 frames. ], batch size: 240, lr: 2.06e-03, grad_scale: 32.0
2024-06-22 02:55:48,345 INFO [train.py:1051] (0/2) Computing validation loss
2024-06-22 02:55:56,204 INFO [train.py:1060] (0/2) Epoch 28, validation: loss=0.193, simple_loss=0.2523, pruned_loss=0.06681, over 351949.00 frames.
2024-06-22 02:55:56,205 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB
2024-06-22 02:55:57,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=511798.8333333333, ans=0.2
2024-06-22 02:56:01,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=511798.8333333333, ans=0.0
2024-06-22 02:56:16,039 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.308e+02 2.445e+02 2.736e+02 3.381e+02, threshold=4.889e+02, percent-clipped=0.0
2024-06-22 02:56:17,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=511853.8333333333, ans=0.2
2024-06-22 02:56:27,296 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.60 vs. limit=15.0
2024-06-22 02:56:27,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=511872.1666666667, ans=0.125
2024-06-22 02:56:29,433 INFO [train.py:1028] (0/2) Epoch 28, batch 6050, loss[loss=0.1732, simple_loss=0.2318, pruned_loss=0.05727, over 13265.00 frames. ], tot_loss[loss=0.1764, simple_loss=0.2329, pruned_loss=0.05995, over 2577776.42 frames. ], batch size: 40, lr: 2.06e-03, grad_scale: 16.0
2024-06-22 02:56:30,618 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.71 vs. limit=15.0
2024-06-22 02:56:33,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=511890.5, ans=0.025
2024-06-22 02:56:39,063 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.37 vs. limit=15.0
2024-06-22 02:56:51,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511945.5, ans=0.1
2024-06-22 02:57:02,474 INFO [train.py:1028] (0/2) Epoch 28, batch 6100, loss[loss=0.1727, simple_loss=0.2255, pruned_loss=0.05991, over 13043.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2341, pruned_loss=0.06023, over 2579154.02 frames. ], batch size: 121, lr: 2.06e-03, grad_scale: 8.0
2024-06-22 02:57:02,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=511982.1666666667, ans=0.2
2024-06-22 02:57:28,702 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.261e+02 2.383e+02 2.685e+02 4.094e+02, threshold=4.767e+02, percent-clipped=0.0
2024-06-22 02:57:34,485 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=512055.5, ans=0.0
2024-06-22 02:57:38,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=512055.5, ans=0.0
2024-06-22 02:57:41,585 INFO [train.py:1028] (0/2) Epoch 28, batch 6150, loss[loss=0.1965, simple_loss=0.2465, pruned_loss=0.0732, over 10869.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2353, pruned_loss=0.06068, over 2578480.59 frames. ], batch size: 304, lr: 2.06e-03, grad_scale: 8.0
2024-06-22 02:57:42,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=512073.8333333333, ans=0.2
2024-06-22 02:57:48,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=512073.8333333333, ans=0.125
2024-06-22 02:57:49,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.41 vs. limit=22.5
2024-06-22 02:57:53,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=512092.1666666667, ans=0.125
2024-06-22 02:57:58,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.38 vs. limit=15.0
2024-06-22 02:57:59,202 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=512110.5, ans=0.0
2024-06-22 02:58:09,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=512128.8333333333, ans=0.125
2024-06-22 02:58:13,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=512147.1666666667, ans=0.0
2024-06-22 02:58:16,873 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=512147.1666666667, ans=0.125
2024-06-22 02:58:18,579 INFO [train.py:1028] (0/2) Epoch 28, batch 6200, loss[loss=0.2112, simple_loss=0.2654, pruned_loss=0.07855, over 13273.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2368, pruned_loss=0.06104, over 2574519.74 frames. ], batch size: 89, lr: 2.06e-03, grad_scale: 8.0
2024-06-22 02:58:20,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0
2024-06-22 02:58:22,676 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0
2024-06-22 02:58:22,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=28.35 vs. limit=22.5
2024-06-22 02:58:31,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=512202.1666666667, ans=10.0
2024-06-22 02:58:39,568 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.324e+02 2.521e+02 2.855e+02 4.564e+02, threshold=5.041e+02, percent-clipped=0.0
2024-06-22 02:58:52,400 INFO [train.py:1028] (0/2) Epoch 28, batch 6250, loss[loss=0.1817, simple_loss=0.2333, pruned_loss=0.06498, over 13184.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2385, pruned_loss=0.06167, over 2568023.91 frames. ], batch size: 83, lr: 2.06e-03, grad_scale: 8.0
2024-06-22 02:58:56,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=512257.1666666667, ans=0.125
2024-06-22 02:58:56,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=512257.1666666667, ans=0.125
2024-06-22 02:59:02,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.26 vs. limit=6.0
2024-06-22 02:59:08,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=512293.8333333333, ans=0.125
2024-06-22 02:59:10,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=512293.8333333333, ans=0.04949747468305833
2024-06-22 02:59:11,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=512312.1666666667, ans=0.0
2024-06-22 02:59:11,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=512312.1666666667, ans=0.2
2024-06-22 02:59:25,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=512330.5, ans=0.125
2024-06-22 02:59:25,688 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=512330.5, ans=10.0
2024-06-22 02:59:28,071 INFO [train.py:1028] (0/2) Epoch 28, batch 6300, loss[loss=0.1815, simple_loss=0.2434, pruned_loss=0.05976, over 11112.00 frames. ], tot_loss[loss=0.182, simple_loss=0.24, pruned_loss=0.06201, over 2563235.64 frames. ], batch size: 16, lr: 2.06e-03, grad_scale: 8.0
2024-06-22 02:59:28,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=512348.8333333333, ans=0.0
2024-06-22 02:59:35,137 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.75 vs. limit=15.0
2024-06-22 02:59:38,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0
2024-06-22 02:59:50,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=512403.8333333333, ans=0.125
2024-06-22 02:59:51,426 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.395e+02 2.577e+02 2.834e+02 4.218e+02, threshold=5.155e+02, percent-clipped=0.0
2024-06-22 03:00:03,717 INFO [train.py:1028] (0/2) Epoch 28, batch 6350, loss[loss=0.2267, simple_loss=0.278, pruned_loss=0.08766, over 12580.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2414, pruned_loss=0.06238, over 2572495.97 frames. ], batch size: 202, lr: 2.06e-03, grad_scale: 8.0
2024-06-22 03:00:08,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.56 vs. limit=15.0
2024-06-22 03:00:10,258 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.88 vs. limit=12.0
2024-06-22 03:00:31,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512513.8333333333, ans=0.1
2024-06-22 03:00:38,841 INFO [train.py:1028] (0/2) Epoch 28, batch 6400, loss[loss=0.1558, simple_loss=0.2233, pruned_loss=0.04415, over 13237.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2428, pruned_loss=0.06269, over 2574809.90 frames. ], batch size: 67, lr: 2.06e-03, grad_scale: 16.0
2024-06-22 03:00:39,378 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.95 vs. limit=10.0
2024-06-22 03:00:48,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=512550.5, ans=0.0
2024-06-22 03:00:51,564 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=512568.8333333333, ans=0.2
2024-06-22 03:00:59,681 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.882e+02 2.347e+02 2.471e+02 2.800e+02 4.021e+02, threshold=4.942e+02, percent-clipped=0.0
2024-06-22 03:01:08,484 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=512605.5, ans=0.025
2024-06-22 03:01:11,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=512623.8333333333, ans=0.125
2024-06-22 03:01:12,264 INFO [train.py:1028] (0/2) Epoch 28, batch 6450, loss[loss=0.2057, simple_loss=0.2601, pruned_loss=0.07562, over 12472.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2444, pruned_loss=0.06354, over 2580592.53 frames. ], batch size: 202, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:01:39,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=512678.8333333333, ans=0.125
2024-06-22 03:01:45,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=512697.1666666667, ans=0.0
2024-06-22 03:01:50,663 INFO [train.py:1028] (0/2) Epoch 28, batch 6500, loss[loss=0.2041, simple_loss=0.2566, pruned_loss=0.07579, over 10725.00 frames. ], tot_loss[loss=0.1867, simple_loss=0.2459, pruned_loss=0.06377, over 2583672.80 frames. ], batch size: 304, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:01:54,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=512715.5, ans=0.0
2024-06-22 03:01:54,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=512715.5, ans=0.2
2024-06-22 03:02:00,427 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.45 vs. limit=22.5
2024-06-22 03:02:02,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=512733.8333333333, ans=0.125
2024-06-22 03:02:02,908 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.94 vs. limit=15.0
2024-06-22 03:02:07,073 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0
2024-06-22 03:02:14,784 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.063e+02 2.332e+02 2.481e+02 2.683e+02 3.389e+02, threshold=4.963e+02, percent-clipped=0.0
2024-06-22 03:02:27,183 INFO [train.py:1028] (0/2) Epoch 28, batch 6550, loss[loss=0.17, simple_loss=0.2372, pruned_loss=0.05135, over 12578.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2463, pruned_loss=0.06364, over 2588088.54 frames. ], batch size: 22, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:02:27,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=512807.1666666667, ans=0.2
2024-06-22 03:02:28,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=512807.1666666667, ans=0.1
2024-06-22 03:02:33,440 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=512825.5, ans=0.125
2024-06-22 03:02:35,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=512825.5, ans=0.125
2024-06-22 03:02:50,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512862.1666666667, ans=0.1
2024-06-22 03:02:50,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=512862.1666666667, ans=0.0
2024-06-22 03:02:58,158 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:03:00,778 INFO [train.py:1028] (0/2) Epoch 28, batch 6600, loss[loss=0.1727, simple_loss=0.2448, pruned_loss=0.05029, over 13236.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2471, pruned_loss=0.0637, over 2590101.32 frames. ], batch size: 72, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:03:07,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=512917.1666666667, ans=0.2
2024-06-22 03:03:21,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=512953.8333333333, ans=0.1
2024-06-22 03:03:22,217 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.448e+02 2.635e+02 2.897e+02 3.626e+02, threshold=5.270e+02, percent-clipped=0.0
2024-06-22 03:03:32,095 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.37 vs. limit=15.0
2024-06-22 03:03:35,011 INFO [train.py:1028] (0/2) Epoch 28, batch 6650, loss[loss=0.2005, simple_loss=0.2605, pruned_loss=0.07024, over 12931.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2487, pruned_loss=0.06439, over 2585381.03 frames. ], batch size: 158, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:03:45,106 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=15.0
2024-06-22 03:03:57,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=513027.1666666667, ans=0.0
2024-06-22 03:04:13,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=513063.8333333333, ans=0.07
2024-06-22 03:04:15,931 INFO [train.py:1028] (0/2) Epoch 28, batch 6700, loss[loss=0.1981, simple_loss=0.2567, pruned_loss=0.0698, over 12748.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2495, pruned_loss=0.06485, over 2584050.43 frames. ], batch size: 176, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:04:15,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=513082.1666666667, ans=0.125
2024-06-22 03:04:18,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=513082.1666666667, ans=0.035
2024-06-22 03:04:20,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=513082.1666666667, ans=0.125
2024-06-22 03:04:23,047 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0
2024-06-22 03:04:32,151 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513118.8333333333, ans=0.1
2024-06-22 03:04:36,390 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.416e+02 2.555e+02 2.742e+02 4.045e+02, threshold=5.109e+02, percent-clipped=0.0
2024-06-22 03:04:47,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513155.5, ans=0.125
2024-06-22 03:04:48,802 INFO [train.py:1028] (0/2) Epoch 28, batch 6750, loss[loss=0.2259, simple_loss=0.2809, pruned_loss=0.08549, over 12213.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2505, pruned_loss=0.06529, over 2576432.33 frames. ], batch size: 241, lr: 2.05e-03, grad_scale: 16.0
2024-06-22 03:04:52,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=513173.8333333333, ans=0.2
2024-06-22 03:04:52,674 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513173.8333333333, ans=0.125
2024-06-22 03:04:53,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=513173.8333333333, ans=0.125
2024-06-22 03:04:54,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=513192.1666666667, ans=0.09899494936611666
2024-06-22 03:05:01,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=513210.5, ans=0.125
2024-06-22 03:05:10,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=513228.8333333333, ans=0.125
2024-06-22 03:05:12,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=513228.8333333333, ans=0.125
2024-06-22 03:05:15,223 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.63 vs. limit=15.0
2024-06-22 03:05:21,390 INFO [train.py:1028] (0/2) Epoch 28, batch 6800, loss[loss=0.1652, simple_loss=0.2274, pruned_loss=0.05156, over 13230.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2522, pruned_loss=0.06574, over 2578455.38 frames. ], batch size: 67, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:05:34,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=513302.1666666667, ans=0.125
2024-06-22 03:05:38,658 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=513302.1666666667, ans=0.0
2024-06-22 03:05:41,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.376e+02 2.528e+02 2.856e+02 4.508e+02, threshold=5.057e+02, percent-clipped=0.0
2024-06-22 03:05:48,412 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0
2024-06-22 03:05:48,771 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-280000.pt
2024-06-22 03:06:01,384 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0
2024-06-22 03:06:04,450 INFO [train.py:1028] (0/2) Epoch 28, batch 6850, loss[loss=0.1933, simple_loss=0.259, pruned_loss=0.0638, over 13221.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2521, pruned_loss=0.0657, over 2581733.83 frames. ], batch size: 63, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:06:05,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=513357.1666666667, ans=0.125
2024-06-22 03:06:09,920 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=513357.1666666667, ans=0.125
2024-06-22 03:06:11,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=513375.5, ans=0.125
2024-06-22 03:06:23,605 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0
2024-06-22 03:06:32,698 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.29 vs. limit=12.0
2024-06-22 03:06:34,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.33 vs. limit=12.0
2024-06-22 03:06:41,278 INFO [train.py:1028] (0/2) Epoch 28, batch 6900, loss[loss=0.1907, simple_loss=0.2509, pruned_loss=0.06522, over 13337.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2523, pruned_loss=0.06574, over 2584326.33 frames. ], batch size: 49, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:06:41,437 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=513448.8333333333, ans=0.0
2024-06-22 03:06:53,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=513467.1666666667, ans=0.0
2024-06-22 03:06:56,153 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=513485.5, ans=0.0
2024-06-22 03:07:02,067 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.506e+02 2.734e+02 3.126e+02 4.309e+02, threshold=5.468e+02, percent-clipped=0.0
2024-06-22 03:07:03,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=513503.8333333333, ans=0.0
2024-06-22 03:07:03,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513503.8333333333, ans=0.1
2024-06-22 03:07:07,883 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.73 vs. limit=15.0
2024-06-22 03:07:15,228 INFO [train.py:1028] (0/2) Epoch 28, batch 6950, loss[loss=0.1933, simple_loss=0.2562, pruned_loss=0.06522, over 11019.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2529, pruned_loss=0.06564, over 2579235.94 frames. ], batch size: 16, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:07:15,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=513540.5, ans=0.025
2024-06-22 03:07:15,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=513540.5, ans=0.0
2024-06-22 03:07:16,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513540.5, ans=0.1
2024-06-22 03:07:30,872 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=513577.1666666667, ans=0.125
2024-06-22 03:07:30,921 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513577.1666666667, ans=0.125
2024-06-22 03:07:41,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513613.8333333333, ans=0.1
2024-06-22 03:07:47,749 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513613.8333333333, ans=0.125
2024-06-22 03:07:48,752 INFO [train.py:1028] (0/2) Epoch 28, batch 7000, loss[loss=0.1995, simple_loss=0.256, pruned_loss=0.07144, over 12937.00 frames. ], tot_loss[loss=0.192, simple_loss=0.253, pruned_loss=0.06546, over 2575600.21 frames. ], batch size: 158, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:07:51,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.90 vs. limit=15.0
2024-06-22 03:08:03,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=513650.5, ans=0.2
2024-06-22 03:08:04,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=513650.5, ans=0.125
2024-06-22 03:08:12,974 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.440e+02 2.563e+02 2.722e+02 3.382e+02, threshold=5.126e+02, percent-clipped=0.0
2024-06-22 03:08:13,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=513687.1666666667, ans=0.2
2024-06-22 03:08:13,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=513687.1666666667, ans=0.0
2024-06-22 03:08:18,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0
2024-06-22 03:08:24,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=513705.5, ans=0.0
2024-06-22 03:08:26,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=513705.5, ans=0.0
2024-06-22 03:08:29,909 INFO [train.py:1028] (0/2) Epoch 28, batch 7050, loss[loss=0.1956, simple_loss=0.2532, pruned_loss=0.06893, over 12748.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2538, pruned_loss=0.06559, over 2583483.66 frames. ], batch size: 176, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:08:44,305 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.65 vs. limit=22.5
2024-06-22 03:08:58,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513797.1666666667, ans=0.125
2024-06-22 03:08:59,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=513797.1666666667, ans=0.2
2024-06-22 03:09:02,806 INFO [train.py:1028] (0/2) Epoch 28, batch 7100, loss[loss=0.2156, simple_loss=0.2795, pruned_loss=0.07583, over 13184.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2543, pruned_loss=0.06602, over 2575855.95 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:09:03,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0
2024-06-22 03:09:07,333 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.34 vs. limit=6.0
2024-06-22 03:09:11,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=513833.8333333333, ans=0.125
2024-06-22 03:09:16,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513852.1666666667, ans=0.1
2024-06-22 03:09:19,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.50 vs. limit=22.5
2024-06-22 03:09:23,352 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.404e+02 2.579e+02 2.811e+02 3.907e+02, threshold=5.158e+02, percent-clipped=0.0
2024-06-22 03:09:29,439 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5
2024-06-22 03:09:36,350 INFO [train.py:1028] (0/2) Epoch 28, batch 7150, loss[loss=0.2194, simple_loss=0.2761, pruned_loss=0.08139, over 12555.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2555, pruned_loss=0.0661, over 2573262.21 frames. ], batch size: 202, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:09:37,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=513907.1666666667, ans=0.125
2024-06-22 03:09:49,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=513943.8333333333, ans=0.125
2024-06-22 03:09:49,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=513943.8333333333, ans=0.125
2024-06-22 03:09:50,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=513943.8333333333, ans=0.04949747468305833
2024-06-22 03:09:56,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=513962.1666666667, ans=0.0
2024-06-22 03:10:03,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=513962.1666666667, ans=0.125
2024-06-22 03:10:04,177 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0
2024-06-22 03:10:12,660 INFO [train.py:1028] (0/2) Epoch 28, batch 7200, loss[loss=0.2015, simple_loss=0.2622, pruned_loss=0.07045, over 13144.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2569, pruned_loss=0.06659, over 2578482.80 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:10:12,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=513998.8333333333, ans=0.0
2024-06-22 03:10:20,142 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5
2024-06-22 03:10:25,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.44 vs. limit=12.0
2024-06-22 03:10:25,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=514035.5, ans=0.125
2024-06-22 03:10:27,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=514035.5, ans=0.1
2024-06-22 03:10:35,908 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.414e+02 2.571e+02 2.823e+02 4.177e+02, threshold=5.142e+02, percent-clipped=0.0
2024-06-22 03:10:38,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=12.0
2024-06-22 03:10:39,450 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=514053.8333333333, ans=0.09899494936611666
2024-06-22 03:10:42,260 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=514072.1666666667, ans=0.125
2024-06-22 03:10:48,643 INFO [train.py:1028] (0/2) Epoch 28, batch 7250, loss[loss=0.1641, simple_loss=0.2271, pruned_loss=0.05054, over 13014.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2578, pruned_loss=0.06679, over 2580172.63 frames. ], batch size: 36, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:10:53,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=514090.5, ans=0.125
2024-06-22 03:10:58,363 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:11:06,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=514127.1666666667, ans=0.125
2024-06-22 03:11:11,270 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=514145.5, ans=0.125
2024-06-22 03:11:18,952 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=514163.8333333333, ans=0.0
2024-06-22 03:11:21,314 INFO [train.py:1028] (0/2) Epoch 28, batch 7300, loss[loss=0.1987, simple_loss=0.2618, pruned_loss=0.0678, over 12932.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2592, pruned_loss=0.06752, over 2579903.14 frames. ], batch size: 36, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:11:39,815 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=514218.8333333333, ans=0.125
2024-06-22 03:11:40,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=514237.1666666667, ans=0.0
2024-06-22 03:11:41,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=514237.1666666667, ans=0.0
2024-06-22 03:11:41,495 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.425e+02 2.624e+02 2.897e+02 4.391e+02, threshold=5.249e+02, percent-clipped=0.0
2024-06-22 03:11:49,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=514255.5, ans=0.125
2024-06-22 03:11:52,894 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.74 vs. limit=22.5
2024-06-22 03:11:54,006 INFO [train.py:1028] (0/2) Epoch 28, batch 7350, loss[loss=0.2344, simple_loss=0.2918, pruned_loss=0.0885, over 13342.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2598, pruned_loss=0.06805, over 2582047.33 frames. ], batch size: 46, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:11:56,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=514273.8333333333, ans=0.125
2024-06-22 03:12:14,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=514310.5, ans=0.125
2024-06-22 03:12:21,842 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:12:23,793 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=514347.1666666667, ans=0.0
2024-06-22 03:12:30,944 INFO [train.py:1028] (0/2) Epoch 28, batch 7400, loss[loss=0.2176, simple_loss=0.2842, pruned_loss=0.07547, over 13254.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2598, pruned_loss=0.06777, over 2587631.64 frames. ], batch size: 63, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:12:31,166 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:12:37,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0
2024-06-22 03:12:50,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514402.1666666667, ans=0.1
2024-06-22 03:12:51,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0
2024-06-22 03:12:51,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=514402.1666666667, ans=0.125
2024-06-22 03:12:55,362 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.432e+02 2.586e+02 2.828e+02 3.873e+02, threshold=5.173e+02, percent-clipped=0.0
2024-06-22 03:13:08,567 INFO [train.py:1028] (0/2) Epoch 28, batch 7450, loss[loss=0.1862, simple_loss=0.2503, pruned_loss=0.0611, over 12744.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2595, pruned_loss=0.06764, over 2581020.09 frames. ], batch size: 29, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:13:08,970 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.36 vs. limit=15.0
2024-06-22 03:13:16,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=514475.5, ans=0.125
2024-06-22 03:13:18,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0
2024-06-22 03:13:19,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=514475.5, ans=0.1
2024-06-22 03:13:19,965 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=514475.5, ans=0.025
2024-06-22 03:13:23,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=514493.8333333333, ans=0.0
2024-06-22 03:13:28,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=514512.1666666667, ans=0.2
2024-06-22 03:13:31,266 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514512.1666666667, ans=0.1
2024-06-22 03:13:31,880 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=514512.1666666667, ans=0.125
2024-06-22 03:13:32,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=514512.1666666667, ans=0.0
2024-06-22 03:13:42,419 INFO [train.py:1028] (0/2) Epoch 28, batch 7500, loss[loss=0.2058, simple_loss=0.2586, pruned_loss=0.07649, over 10528.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2605, pruned_loss=0.06826, over 2578252.45 frames. ], batch size: 303, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:14:02,522 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.451e+02 2.630e+02 2.857e+02 4.051e+02, threshold=5.260e+02, percent-clipped=0.0
2024-06-22 03:14:04,180 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=514603.8333333333, ans=0.07
2024-06-22 03:14:04,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=514603.8333333333, ans=0.0
2024-06-22 03:14:12,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=514622.1666666667, ans=0.125
2024-06-22 03:14:17,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=514640.5, ans=0.125
2024-06-22 03:14:17,970 INFO [train.py:1028] (0/2) Epoch 28, batch 7550, loss[loss=0.1946, simple_loss=0.2475, pruned_loss=0.07086, over 12865.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.261, pruned_loss=0.06888, over 2577757.94 frames. ], batch size: 158, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:14:18,150 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=514640.5, ans=0.1
2024-06-22 03:14:23,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=514658.8333333333, ans=0.0
2024-06-22 03:14:23,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=514658.8333333333, ans=0.0
2024-06-22 03:14:25,025 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514658.8333333333, ans=0.1
2024-06-22 03:14:44,191 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5
2024-06-22 03:14:45,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. limit=6.0
2024-06-22 03:14:47,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=514713.8333333333, ans=0.95
2024-06-22 03:14:53,828 INFO [train.py:1028] (0/2) Epoch 28, batch 7600, loss[loss=0.2084, simple_loss=0.2657, pruned_loss=0.07552, over 13196.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2621, pruned_loss=0.06952, over 2576523.42 frames. ], batch size: 83, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:14:54,686 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=514732.1666666667, ans=0.125
2024-06-22 03:15:06,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514750.5, ans=0.0
2024-06-22 03:15:14,857 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.464e+02 2.688e+02 2.984e+02 3.842e+02, threshold=5.375e+02, percent-clipped=0.0
2024-06-22 03:15:25,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=514805.5, ans=0.015
2024-06-22 03:15:27,648 INFO [train.py:1028] (0/2) Epoch 28, batch 7650, loss[loss=0.2021, simple_loss=0.2625, pruned_loss=0.07087, over 12977.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2621, pruned_loss=0.06963, over 2573351.58 frames. ], batch size: 33, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:15:30,856 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.13 vs. limit=6.0
2024-06-22 03:15:32,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0
2024-06-22 03:15:38,745 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=514842.1666666667, ans=0.025
2024-06-22 03:15:38,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.32 vs. limit=10.0
2024-06-22 03:15:44,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=514860.5, ans=0.125
2024-06-22 03:15:45,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=514860.5, ans=0.125
2024-06-22 03:15:56,179 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.77 vs. limit=15.0
2024-06-22 03:16:05,095 INFO [train.py:1028] (0/2) Epoch 28, batch 7700, loss[loss=0.2202, simple_loss=0.2843, pruned_loss=0.07807, over 13295.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2625, pruned_loss=0.06945, over 2569068.42 frames. ], batch size: 63, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:16:06,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=514915.5, ans=0.125
2024-06-22 03:16:10,326 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=514915.5, ans=0.2
2024-06-22 03:16:25,047 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.510e+02 2.776e+02 2.982e+02 3.995e+02, threshold=5.552e+02, percent-clipped=0.0
2024-06-22 03:16:25,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514970.5, ans=0.1
2024-06-22 03:16:37,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=514988.8333333333, ans=0.2
2024-06-22 03:16:41,339 INFO [train.py:1028] (0/2) Epoch 28, batch 7750, loss[loss=0.1867, simple_loss=0.2595, pruned_loss=0.05688, over 13087.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2635, pruned_loss=0.07014, over 2573576.40 frames. ], batch size: 71, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:16:48,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=515025.5, ans=0.025
2024-06-22 03:16:52,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=515025.5, ans=0.025
2024-06-22 03:17:04,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=515062.1666666667, ans=0.2
2024-06-22 03:17:15,134 INFO [train.py:1028] (0/2) Epoch 28, batch 7800, loss[loss=0.2146, simple_loss=0.2746, pruned_loss=0.07724, over 13149.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2638, pruned_loss=0.06994, over 2578947.05 frames. ], batch size: 95, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:17:27,195 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=515117.1666666667, ans=0.125
2024-06-22 03:17:31,495 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=515135.5, ans=0.125
2024-06-22 03:17:36,160 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.475e+02 2.626e+02 2.878e+02 4.182e+02, threshold=5.252e+02, percent-clipped=0.0
2024-06-22 03:17:38,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=515153.8333333333, ans=10.0
2024-06-22 03:17:49,190 INFO [train.py:1028] (0/2) Epoch 28, batch 7850, loss[loss=0.1608, simple_loss=0.2213, pruned_loss=0.05014, over 11696.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2645, pruned_loss=0.07011, over 2573842.57 frames. ], batch size: 17, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:18:02,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=515208.8333333333, ans=0.0
2024-06-22 03:18:06,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=515227.1666666667, ans=0.0
2024-06-22 03:18:09,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=515227.1666666667, ans=0.0
2024-06-22 03:18:16,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=515245.5, ans=0.0
2024-06-22 03:18:28,700 INFO [train.py:1028] (0/2) Epoch 28, batch 7900, loss[loss=0.1825, simple_loss=0.2529, pruned_loss=0.05603, over 13130.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2642, pruned_loss=0.06998, over 2572955.73 frames. ], batch size: 77, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:18:30,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=515282.1666666667, ans=0.125
2024-06-22 03:18:38,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=515300.5, ans=0.125
2024-06-22 03:18:43,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.61 vs. limit=22.5
2024-06-22 03:18:46,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.49 vs. limit=22.5
2024-06-22 03:18:48,239 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:18:49,292 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.459e+02 2.575e+02 2.929e+02 4.217e+02, threshold=5.150e+02, percent-clipped=0.0
2024-06-22 03:19:01,849 INFO [train.py:1028] (0/2) Epoch 28, batch 7950, loss[loss=0.2116, simple_loss=0.2642, pruned_loss=0.07953, over 10628.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2637, pruned_loss=0.06959, over 2576199.00 frames. ], batch size: 304, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:19:11,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=515392.1666666667, ans=0.125
2024-06-22 03:19:20,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515410.5, ans=0.1
2024-06-22 03:19:24,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=515428.8333333333, ans=0.125
2024-06-22 03:19:34,997 INFO [train.py:1028] (0/2) Epoch 28, batch 8000, loss[loss=0.1915, simple_loss=0.2616, pruned_loss=0.06069, over 12729.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2645, pruned_loss=0.0698, over 2573518.90 frames. ], batch size: 29, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:19:35,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=515465.5, ans=0.125
2024-06-22 03:19:42,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=515483.8333333333, ans=0.07
2024-06-22 03:19:58,477 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.417e+02 2.578e+02 2.776e+02 3.520e+02, threshold=5.156e+02, percent-clipped=0.0
2024-06-22 03:20:05,160 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.70 vs. limit=5.0
2024-06-22 03:20:05,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=515538.8333333333, ans=0.0
2024-06-22 03:20:05,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5
2024-06-22 03:20:06,803 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=515538.8333333333, ans=0.0
2024-06-22 03:20:10,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=515538.8333333333, ans=0.125
2024-06-22 03:20:11,248 INFO [train.py:1028] (0/2) Epoch 28, batch 8050, loss[loss=0.2188, simple_loss=0.277, pruned_loss=0.08028, over 13249.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.264, pruned_loss=0.06969, over 2573485.31 frames. ], batch size: 83, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:20:15,555 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.12 vs. limit=10.0
2024-06-22 03:20:17,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=515575.5, ans=0.125
2024-06-22 03:20:36,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=515612.1666666667, ans=0.2
2024-06-22 03:20:39,114 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:20:43,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=515630.5, ans=0.125
2024-06-22 03:20:47,689 INFO [train.py:1028] (0/2) Epoch 28, batch 8100, loss[loss=0.2052, simple_loss=0.264, pruned_loss=0.0732, over 13163.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2647, pruned_loss=0.06993, over 2576917.83 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 64.0
2024-06-22 03:20:49,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=515648.8333333333, ans=0.0
2024-06-22 03:20:53,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=515667.1666666667, ans=0.2
2024-06-22 03:21:01,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=515685.5, ans=0.0
2024-06-22 03:21:08,249 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.373e+02 2.514e+02 2.726e+02 3.558e+02, threshold=5.028e+02, percent-clipped=0.0
2024-06-22 03:21:19,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515722.1666666667, ans=0.1
2024-06-22 03:21:21,064 INFO [train.py:1028] (0/2) Epoch 28, batch 8150, loss[loss=0.2026, simple_loss=0.2637, pruned_loss=0.07073, over 13096.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.265, pruned_loss=0.06963, over 2580106.89 frames. ], batch size: 121, lr: 2.05e-03, grad_scale: 64.0
2024-06-22 03:21:22,682 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.51 vs. limit=10.0
2024-06-22 03:21:23,343 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=515740.5, ans=0.125
2024-06-22 03:21:25,479 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=515740.5, ans=0.0
2024-06-22 03:21:30,932 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=515758.8333333333, ans=15.0
2024-06-22 03:21:33,076 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=515758.8333333333, ans=0.1
2024-06-22 03:21:36,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=515777.1666666667, ans=0.025
2024-06-22 03:21:38,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=515777.1666666667, ans=0.1
2024-06-22 03:21:40,785 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=515795.5, ans=0.125
2024-06-22 03:21:53,954 INFO [train.py:1028] (0/2) Epoch 28, batch 8200, loss[loss=0.2141, simple_loss=0.2748, pruned_loss=0.07666, over 13118.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2656, pruned_loss=0.06977, over 2583003.05 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 64.0
2024-06-22 03:21:54,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515832.1666666667, ans=0.1
2024-06-22 03:21:57,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=515832.1666666667, ans=0.125
2024-06-22 03:22:04,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=515850.5, ans=0.125
2024-06-22 03:22:09,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=515850.5, ans=0.0
2024-06-22 03:22:10,567 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=515868.8333333333, ans=0.125
2024-06-22 03:22:18,217 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.538e+02 2.667e+02 2.950e+02 3.581e+02, threshold=5.334e+02, percent-clipped=0.0
2024-06-22 03:22:27,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=515905.5, ans=0.5
2024-06-22 03:22:29,753 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0
2024-06-22 03:22:34,366 INFO [train.py:1028] (0/2) Epoch 28, batch 8250, loss[loss=0.2007, simple_loss=0.2675, pruned_loss=0.06693, over 13251.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2654, pruned_loss=0.06965, over 2582944.90 frames. ], batch size: 52, lr: 2.05e-03, grad_scale: 64.0
2024-06-22 03:22:37,332 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.87 vs. limit=15.0
2024-06-22 03:22:40,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=515942.1666666667, ans=0.125
2024-06-22 03:22:52,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=515960.5, ans=0.0
2024-06-22 03:22:53,637 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 03:22:57,258 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=515978.8333333333, ans=0.0
2024-06-22 03:23:06,030 INFO [train.py:1028] (0/2) Epoch 28, batch 8300, loss[loss=0.2033, simple_loss=0.2606, pruned_loss=0.07298, over 13010.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.265, pruned_loss=0.06961, over 2579818.25 frames. ], batch size: 102, lr: 2.05e-03, grad_scale: 64.0
2024-06-22 03:23:09,484 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0
2024-06-22 03:23:10,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=516015.5, ans=0.125
2024-06-22 03:23:14,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516033.8333333333, ans=0.1
2024-06-22 03:23:26,199 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.524e+02 2.659e+02 2.850e+02 3.606e+02, threshold=5.319e+02, percent-clipped=0.0
2024-06-22 03:23:27,467 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.49 vs. limit=15.0
2024-06-22 03:23:37,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=516088.8333333333, ans=0.2
2024-06-22 03:23:38,876 INFO [train.py:1028] (0/2) Epoch 28, batch 8350, loss[loss=0.1964, simple_loss=0.2632, pruned_loss=0.06486, over 13171.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2652, pruned_loss=0.06943, over 2580857.33 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:23:53,155 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=516143.8333333333, ans=0.0
2024-06-22 03:24:05,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=516162.1666666667, ans=0.1
2024-06-22 03:24:07,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=516162.1666666667, ans=0.025
2024-06-22 03:24:08,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=516180.5, ans=0.0
2024-06-22 03:24:15,671 INFO [train.py:1028] (0/2) Epoch 28, batch 8400, loss[loss=0.1843, simple_loss=0.2463, pruned_loss=0.0611, over 12957.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2655, pruned_loss=0.06986, over 2576704.62 frames. ], batch size: 39, lr: 2.05e-03, grad_scale: 32.0
2024-06-22 03:24:18,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=516198.8333333333, ans=0.125
2024-06-22 03:24:23,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=516217.1666666667, ans=0.025
2024-06-22 03:24:35,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=516235.5, ans=10.0
2024-06-22 03:24:40,127 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.546e+02 2.779e+02 3.098e+02 4.044e+02, threshold=5.557e+02, percent-clipped=0.0
2024-06-22 03:24:42,462 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.05 vs. limit=15.0
2024-06-22 03:24:42,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=516253.8333333333, ans=0.025
2024-06-22 03:24:48,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.01 vs.
limit=15.0 2024-06-22 03:24:51,568 INFO [train.py:1028] (0/2) Epoch 28, batch 8450, loss[loss=0.1931, simple_loss=0.259, pruned_loss=0.06364, over 13151.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2665, pruned_loss=0.0702, over 2578862.28 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:25:00,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=516308.8333333333, ans=0.0 2024-06-22 03:25:06,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2024-06-22 03:25:09,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=516327.1666666667, ans=0.125 2024-06-22 03:25:16,566 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=516345.5, ans=0.0 2024-06-22 03:25:24,260 INFO [train.py:1028] (0/2) Epoch 28, batch 8500, loss[loss=0.2114, simple_loss=0.2728, pruned_loss=0.07498, over 12601.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2673, pruned_loss=0.0703, over 2578473.42 frames. ], batch size: 29, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:25:30,404 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.01 vs. limit=15.0 2024-06-22 03:25:40,404 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=516418.8333333333, ans=0.04949747468305833 2024-06-22 03:25:41,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=516418.8333333333, ans=0.1 2024-06-22 03:25:41,863 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:25:42,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=516418.8333333333, ans=0.0 2024-06-22 03:25:44,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=516437.1666666667, ans=0.125 2024-06-22 03:25:45,676 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.443e+02 2.667e+02 2.913e+02 4.137e+02, threshold=5.335e+02, percent-clipped=0.0 2024-06-22 03:25:53,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=516455.5, ans=0.125 2024-06-22 03:25:57,655 INFO [train.py:1028] (0/2) Epoch 28, batch 8550, loss[loss=0.1789, simple_loss=0.2442, pruned_loss=0.05678, over 12570.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2668, pruned_loss=0.0701, over 2575967.49 frames. 
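[Editor's note: grad_scale in the batch lines is the fp16 loss-scaling factor, and it moves between 64, 32 and 16 across this stretch. Dynamic loss scaling halves the factor when a scaled gradient overflows and grows it again after a run of clean steps, so the oscillation is normal for a run trained with use_fp16. The standard PyTorch pattern looks like the following (torch.cuda.amp is the real API; icefall wraps it with its own grow/backoff settings, so the constants below are illustrative):

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,      # comparable to the grad_scale values logged here
    growth_factor=2.0,    # doubled after `growth_interval` clean steps
    backoff_factor=0.5,   # halved whenever inf/nan gradients appear
)

def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales; skips the step on overflow
    scaler.update()                 # adjusts the scale for the next step
    return loss.detach(), scaler.get_scale()
]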
], batch size: 22, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:26:17,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516510.5, ans=0.1 2024-06-22 03:26:19,138 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=516510.5, ans=0.125 2024-06-22 03:26:30,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=516547.1666666667, ans=0.2 2024-06-22 03:26:37,507 INFO [train.py:1028] (0/2) Epoch 28, batch 8600, loss[loss=0.1924, simple_loss=0.2486, pruned_loss=0.06806, over 13123.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.267, pruned_loss=0.06996, over 2573670.56 frames. ], batch size: 112, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:26:40,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=516565.5, ans=0.125 2024-06-22 03:26:59,206 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.447e+02 2.628e+02 2.794e+02 3.705e+02, threshold=5.257e+02, percent-clipped=0.0 2024-06-22 03:27:05,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=516638.8333333333, ans=0.125 2024-06-22 03:27:11,353 INFO [train.py:1028] (0/2) Epoch 28, batch 8650, loss[loss=0.2064, simple_loss=0.2667, pruned_loss=0.07303, over 13070.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2673, pruned_loss=0.06993, over 2576484.33 frames. ], batch size: 102, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:27:11,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=516657.1666666667, ans=0.125 2024-06-22 03:27:15,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=516657.1666666667, ans=0.0 2024-06-22 03:27:16,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=516657.1666666667, ans=0.125 2024-06-22 03:27:19,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=516675.5, ans=0.0 2024-06-22 03:27:41,210 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.28 vs. limit=15.0 2024-06-22 03:27:44,270 INFO [train.py:1028] (0/2) Epoch 28, batch 8700, loss[loss=0.21, simple_loss=0.2712, pruned_loss=0.07441, over 13231.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2681, pruned_loss=0.07032, over 2573647.55 frames. ], batch size: 59, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:27:49,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=516748.8333333333, ans=0.0 2024-06-22 03:27:53,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=516767.1666666667, ans=0.125 2024-06-22 03:27:53,549 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2024-06-22 03:28:01,089 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. 
limit=22.5 2024-06-22 03:28:01,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=516785.5, ans=0.125 2024-06-22 03:28:09,968 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 2.413e+02 2.623e+02 2.842e+02 4.444e+02, threshold=5.246e+02, percent-clipped=0.0 2024-06-22 03:28:10,760 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=516803.8333333333, ans=0.2 2024-06-22 03:28:18,751 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:28:21,689 INFO [train.py:1028] (0/2) Epoch 28, batch 8750, loss[loss=0.204, simple_loss=0.2634, pruned_loss=0.07227, over 13098.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2682, pruned_loss=0.07051, over 2569892.03 frames. ], batch size: 121, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:28:22,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=516840.5, ans=0.2 2024-06-22 03:28:32,478 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=516858.8333333333, ans=0.0 2024-06-22 03:28:33,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=516858.8333333333, ans=0.125 2024-06-22 03:28:48,600 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.90 vs. limit=15.0 2024-06-22 03:28:56,196 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.64 vs. limit=15.0 2024-06-22 03:28:57,626 INFO [train.py:1028] (0/2) Epoch 28, batch 8800, loss[loss=0.2016, simple_loss=0.2716, pruned_loss=0.06576, over 13247.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2688, pruned_loss=0.07102, over 2574661.03 frames. ], batch size: 72, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:29:13,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=516968.8333333333, ans=0.125 2024-06-22 03:29:14,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=516968.8333333333, ans=0.125 2024-06-22 03:29:19,591 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.598e+02 2.774e+02 3.033e+02 4.031e+02, threshold=5.548e+02, percent-clipped=0.0 2024-06-22 03:29:21,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=516987.1666666667, ans=0.2 2024-06-22 03:29:31,720 INFO [train.py:1028] (0/2) Epoch 28, batch 8850, loss[loss=0.2367, simple_loss=0.295, pruned_loss=0.08927, over 12572.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2688, pruned_loss=0.07095, over 2563234.67 frames. 
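[Editor's note: the scaling.py:214 ScheduledFloat lines show module hyperparameters (dropout probabilities, skip rates, balancer bounds) being annealed as a function of batch_count rather than held fixed; `ans` is the current value. The schedule is piecewise linear between (batch_count, value) knots. A minimal reimplementation of that idea; the real ScheduledFloat in icefall's scaling.py carries more machinery, this sketch only shows the interpolation.

class PiecewiseLinearSchedule:
    """Value that ramps linearly between (batch_count, value) knots."""

    def __init__(self, *knots: tuple):
        # e.g. PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1)):
        # starts at 0.3, decays to 0.1 by batch 20000, then stays there.
        self.knots = sorted(knots)

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.knots[0][0]:
            return self.knots[0][1]
        for (x0, y0), (x1, y1) in zip(self.knots, self.knots[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return self.knots[-1][1]

dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
assert dropout_p(520000.0) == 0.1  # long past the last knot, as in this log
]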
], batch size: 202, lr: 2.05e-03, grad_scale: 32.0 2024-06-22 03:29:36,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=517023.8333333333, ans=0.125 2024-06-22 03:29:52,728 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=517060.5, ans=0.2 2024-06-22 03:30:00,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=12.0 2024-06-22 03:30:10,767 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.74 vs. limit=10.0 2024-06-22 03:30:17,934 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.20 vs. limit=22.5 2024-06-22 03:30:18,806 INFO [train.py:1028] (0/2) Epoch 28, batch 8900, loss[loss=0.2095, simple_loss=0.2676, pruned_loss=0.07571, over 12980.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.269, pruned_loss=0.07106, over 2561925.30 frames. ], batch size: 33, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:30:18,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=517115.5, ans=0.125 2024-06-22 03:30:22,365 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.13 vs. limit=15.0 2024-06-22 03:30:29,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=517133.8333333333, ans=0.125 2024-06-22 03:30:30,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517133.8333333333, ans=0.1 2024-06-22 03:30:45,173 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.543e+02 2.773e+02 3.005e+02 4.376e+02, threshold=5.546e+02, percent-clipped=0.0 2024-06-22 03:30:50,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=517188.8333333333, ans=0.09899494936611666 2024-06-22 03:30:56,993 INFO [train.py:1028] (0/2) Epoch 28, batch 8950, loss[loss=0.2167, simple_loss=0.2783, pruned_loss=0.07756, over 12518.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.269, pruned_loss=0.07086, over 2561211.48 frames. ], batch size: 202, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:30:58,878 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=8.0 2024-06-22 03:31:01,594 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.62 vs. 
limit=10.0 2024-06-22 03:31:11,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=517243.8333333333, ans=0.125 2024-06-22 03:31:19,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=517262.1666666667, ans=0.125 2024-06-22 03:31:24,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=517280.5, ans=0.125 2024-06-22 03:31:25,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517280.5, ans=0.125 2024-06-22 03:31:26,489 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=517280.5, ans=0.0 2024-06-22 03:31:26,839 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2024-06-22 03:31:29,779 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=517280.5, ans=0.02 2024-06-22 03:31:31,449 INFO [train.py:1028] (0/2) Epoch 28, batch 9000, loss[loss=0.2197, simple_loss=0.28, pruned_loss=0.0797, over 13304.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2693, pruned_loss=0.07071, over 2566666.74 frames. ], batch size: 46, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:31:31,450 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 03:31:38,550 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([4.8765, 4.9709, 4.9513, 4.9643], device='cuda:0') 2024-06-22 03:31:39,689 INFO [train.py:1060] (0/2) Epoch 28, validation: loss=0.194, simple_loss=0.2527, pruned_loss=0.06771, over 351949.00 frames. 2024-06-22 03:31:39,690 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 03:31:41,824 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=517298.8333333333, ans=0.2 2024-06-22 03:31:45,518 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.02 vs. limit=22.5 2024-06-22 03:32:01,820 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.478e+02 2.643e+02 2.834e+02 3.341e+02, threshold=5.287e+02, percent-clipped=0.0 2024-06-22 03:32:06,547 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2024-06-22 03:32:09,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=517372.1666666667, ans=0.125 2024-06-22 03:32:13,468 INFO [train.py:1028] (0/2) Epoch 28, batch 9050, loss[loss=0.1842, simple_loss=0.2505, pruned_loss=0.05899, over 12073.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.27, pruned_loss=0.07121, over 2566901.49 frames. 
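[Editor's note: every 3000 batches the loop pauses to compute a validation loss over the dev set (here 0.194 against a train-side tot_loss of about 0.205), and zipformer.py:1858 dumps a diagnostic alongside it: the mean entropy of each attention head's weight distribution. Four values appear because this encoder stack runs 4 heads, and values near 4.9 nats mean each head spreads mass over roughly e^4.9 ~ 130 frames rather than collapsing onto one position. A sketch of that diagnostic, assuming attn_weights shaped (num_heads, batch, query, key) and already softmaxed; the real dump lives in icefall's zipformer.py.

import torch

def attn_weights_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    # attn_weights: (num_heads, batch, tgt_len, src_len), rows sum to 1.
    # Returns one mean entropy (in nats) per head.
    eps = 1.0e-20
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return entropy.mean(dim=(1, 2))  # average over batch and query positions

w = torch.softmax(torch.zeros(4, 2, 100, 134), dim=-1)  # uniform weights
print(attn_weights_entropy(w))  # ln(134) ~ 4.9 nats per head, as logged above
]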
], batch size: 18, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:32:17,810 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=517390.5, ans=0.125 2024-06-22 03:32:20,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517408.8333333333, ans=0.125 2024-06-22 03:32:35,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517445.5, ans=0.125 2024-06-22 03:32:39,910 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517463.8333333333, ans=0.125 2024-06-22 03:32:41,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=517463.8333333333, ans=0.025 2024-06-22 03:32:44,524 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.40 vs. limit=6.0 2024-06-22 03:32:46,114 INFO [train.py:1028] (0/2) Epoch 28, batch 9100, loss[loss=0.2114, simple_loss=0.2799, pruned_loss=0.07143, over 13281.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2696, pruned_loss=0.07081, over 2567720.48 frames. ], batch size: 72, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:33:02,120 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2024-06-22 03:33:11,889 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.492e+02 2.616e+02 2.811e+02 3.924e+02, threshold=5.232e+02, percent-clipped=0.0 2024-06-22 03:33:22,722 INFO [train.py:1028] (0/2) Epoch 28, batch 9150, loss[loss=0.2088, simple_loss=0.2827, pruned_loss=0.0674, over 13217.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2693, pruned_loss=0.07079, over 2569296.29 frames. ], batch size: 77, lr: 2.05e-03, grad_scale: 16.0 2024-06-22 03:33:23,118 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.16 vs. limit=6.0 2024-06-22 03:33:44,995 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:33:54,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517647.1666666667, ans=0.1 2024-06-22 03:33:55,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=517647.1666666667, ans=0.0 2024-06-22 03:33:57,778 INFO [train.py:1028] (0/2) Epoch 28, batch 9200, loss[loss=0.197, simple_loss=0.2715, pruned_loss=0.06119, over 12918.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2691, pruned_loss=0.07025, over 2572772.91 frames. ], batch size: 36, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:34:00,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=517665.5, ans=0.05 2024-06-22 03:34:01,264 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.12 vs. 
limit=15.0 2024-06-22 03:34:02,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517665.5, ans=0.125 2024-06-22 03:34:06,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=517683.8333333333, ans=0.125 2024-06-22 03:34:08,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=517683.8333333333, ans=0.125 2024-06-22 03:34:17,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2024-06-22 03:34:19,550 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.453e+02 2.562e+02 2.702e+02 3.598e+02, threshold=5.123e+02, percent-clipped=0.0 2024-06-22 03:34:23,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=517720.5, ans=0.125 2024-06-22 03:34:30,521 INFO [train.py:1028] (0/2) Epoch 28, batch 9250, loss[loss=0.1947, simple_loss=0.2625, pruned_loss=0.06347, over 13262.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.269, pruned_loss=0.07031, over 2575703.72 frames. ], batch size: 67, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:34:36,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=517775.5, ans=0.125 2024-06-22 03:35:02,485 INFO [train.py:1028] (0/2) Epoch 28, batch 9300, loss[loss=0.1803, simple_loss=0.246, pruned_loss=0.05729, over 12987.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2687, pruned_loss=0.0699, over 2571700.35 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:35:12,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517867.1666666667, ans=0.1 2024-06-22 03:35:18,422 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.05 vs. limit=22.5 2024-06-22 03:35:21,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=517903.8333333333, ans=0.2 2024-06-22 03:35:23,081 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 2.455e+02 2.627e+02 2.790e+02 3.283e+02, threshold=5.254e+02, percent-clipped=0.0 2024-06-22 03:35:34,117 INFO [train.py:1028] (0/2) Epoch 28, batch 9350, loss[loss=0.2029, simple_loss=0.2676, pruned_loss=0.06913, over 12485.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2691, pruned_loss=0.07028, over 2568866.57 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:35:48,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=517977.1666666667, ans=0.0 2024-06-22 03:35:52,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=517995.5, ans=0.025 2024-06-22 03:36:00,982 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=518013.8333333333, ans=0.125 2024-06-22 03:36:04,558 INFO [train.py:1028] (0/2) Epoch 28, batch 9400, loss[loss=0.2106, simple_loss=0.2734, pruned_loss=0.07392, over 13275.00 frames. 
], tot_loss[loss=0.206, simple_loss=0.2699, pruned_loss=0.07107, over 2568098.14 frames. ], batch size: 52, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:36:11,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=518032.1666666667, ans=0.05 2024-06-22 03:36:23,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=518068.8333333333, ans=0.025 2024-06-22 03:36:28,627 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.467e+02 2.619e+02 2.846e+02 3.317e+02, threshold=5.239e+02, percent-clipped=0.0 2024-06-22 03:36:35,319 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2024-06-22 03:36:39,609 INFO [train.py:1028] (0/2) Epoch 28, batch 9450, loss[loss=0.184, simple_loss=0.2593, pruned_loss=0.0543, over 12563.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2706, pruned_loss=0.07124, over 2566965.12 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:36:40,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518123.8333333333, ans=0.1 2024-06-22 03:36:40,997 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=518123.8333333333, ans=0.0 2024-06-22 03:36:51,097 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2024-06-22 03:36:54,370 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=518160.5, ans=0.0 2024-06-22 03:37:12,071 INFO [train.py:1028] (0/2) Epoch 28, batch 9500, loss[loss=0.21, simple_loss=0.2725, pruned_loss=0.07377, over 13235.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2703, pruned_loss=0.071, over 2576950.21 frames. ], batch size: 43, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:37:14,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=518215.5, ans=0.125 2024-06-22 03:37:30,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=518270.5, ans=0.015 2024-06-22 03:37:32,787 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.479e+02 2.685e+02 2.929e+02 4.298e+02, threshold=5.369e+02, percent-clipped=0.0 2024-06-22 03:37:38,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=518288.8333333333, ans=0.125 2024-06-22 03:37:42,678 INFO [train.py:1028] (0/2) Epoch 28, batch 9550, loss[loss=0.2104, simple_loss=0.2736, pruned_loss=0.07358, over 13025.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2696, pruned_loss=0.07074, over 2573469.31 frames. 
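[Editor's note: the scaling.py:1023 Whitening lines report a feature-decorrelation diagnostic. `metric` measures how far the covariance of a module's activations is from being proportional to the identity (1.0 = perfectly "white"), and the module only applies its corrective gradient when the metric exceeds `limit`, hence the constant "metric=X vs. limit=Y" comparisons; the limit itself is sometimes a ScheduledFloat, as the whitening_limit entries elsewhere in this log show. A rough reconstruction of such a metric, assuming it is the eigenvalue-spread ratio E[lambda^2]/E[lambda]^2 of the per-group covariance (the exact formula is in icefall's scaling.py):

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """1.0 when the per-group feature covariance is isotropic ('white');
    grows with the eigenvalue spread of the covariance."""
    x = x.reshape(-1, x.shape[-1])                    # (frames, channels)
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x.transpose(0, 1)                             # (groups, frames, c/g)
    cov = x.transpose(1, 2) @ x / num_frames          # (groups, c/g, c/g)
    mean_eig = torch.diagonal(cov, dim1=1, dim2=2).mean()           # E[lambda]
    mean_eig_sq = torch.diagonal(cov @ cov, dim1=1, dim2=2).mean()  # E[lambda^2]
    return mean_eig_sq / (mean_eig ** 2 + 1.0e-20)

x = torch.randn(1000, 384)   # well-conditioned activations
print(whitening_metric(x))   # ~1.4 for random Gaussian features
x[:, 0] *= 20.0              # one dominant channel
print(whitening_metric(x))   # metric blows up past the limits seen here
]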
], batch size: 39, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:37:46,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=518307.1666666667, ans=0.125 2024-06-22 03:37:57,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=518343.8333333333, ans=0.125 2024-06-22 03:38:02,528 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=518362.1666666667, ans=0.125 2024-06-22 03:38:06,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2024-06-22 03:38:11,040 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=518380.5, ans=0.0 2024-06-22 03:38:13,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=518398.8333333333, ans=0.05 2024-06-22 03:38:13,570 INFO [train.py:1028] (0/2) Epoch 28, batch 9600, loss[loss=0.2248, simple_loss=0.2694, pruned_loss=0.09015, over 10392.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2692, pruned_loss=0.07045, over 2572386.61 frames. ], batch size: 303, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:38:15,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=518398.8333333333, ans=0.0 2024-06-22 03:38:22,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=518417.1666666667, ans=0.125 2024-06-22 03:38:26,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=518435.5, ans=0.0 2024-06-22 03:38:31,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=518453.8333333333, ans=0.125 2024-06-22 03:38:34,112 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 2.512e+02 2.670e+02 3.033e+02 4.551e+02, threshold=5.340e+02, percent-clipped=0.0 2024-06-22 03:38:36,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=518453.8333333333, ans=0.0 2024-06-22 03:38:44,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=518472.1666666667, ans=0.0 2024-06-22 03:38:45,988 INFO [train.py:1028] (0/2) Epoch 28, batch 9650, loss[loss=0.1914, simple_loss=0.2511, pruned_loss=0.06587, over 13124.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2688, pruned_loss=0.07075, over 2561817.54 frames. ], batch size: 132, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:38:47,608 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2024-06-22 03:38:57,065 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2024-06-22 03:38:57,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=518508.8333333333, ans=0.125 2024-06-22 03:38:57,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2024-06-22 03:39:17,863 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.28 vs. limit=12.0 2024-06-22 03:39:18,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=518582.1666666667, ans=0.0 2024-06-22 03:39:18,801 INFO [train.py:1028] (0/2) Epoch 28, batch 9700, loss[loss=0.1877, simple_loss=0.2467, pruned_loss=0.06431, over 13021.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2684, pruned_loss=0.07053, over 2556088.41 frames. ], batch size: 144, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:39:20,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=518582.1666666667, ans=0.125 2024-06-22 03:39:21,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518582.1666666667, ans=0.1 2024-06-22 03:39:26,048 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=518600.5, ans=0.2 2024-06-22 03:39:26,693 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=518600.5, ans=0.125 2024-06-22 03:39:39,445 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.535e+02 2.677e+02 2.927e+02 4.534e+02, threshold=5.353e+02, percent-clipped=0.0 2024-06-22 03:39:49,011 INFO [train.py:1028] (0/2) Epoch 28, batch 9750, loss[loss=0.1966, simple_loss=0.2566, pruned_loss=0.06832, over 13092.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2669, pruned_loss=0.06998, over 2552756.35 frames. ], batch size: 132, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:39:49,185 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:39:49,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=518673.8333333333, ans=0.0 2024-06-22 03:40:03,530 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=518710.5, ans=0.0 2024-06-22 03:40:06,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=518710.5, ans=0.2 2024-06-22 03:40:06,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=518710.5, ans=0.2 2024-06-22 03:40:08,253 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.39 vs. limit=22.5 2024-06-22 03:40:09,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2024-06-22 03:40:10,750 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.83 vs. limit=22.5 2024-06-22 03:40:12,844 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=518728.8333333333, ans=0.125 2024-06-22 03:40:16,904 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.69 vs. 
limit=10.0 2024-06-22 03:40:20,118 INFO [train.py:1028] (0/2) Epoch 28, batch 9800, loss[loss=0.1963, simple_loss=0.2574, pruned_loss=0.06757, over 12901.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2667, pruned_loss=0.06964, over 2545120.38 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:40:22,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=518765.5, ans=0.0 2024-06-22 03:40:42,491 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.463e+02 2.598e+02 2.789e+02 3.504e+02, threshold=5.197e+02, percent-clipped=0.0 2024-06-22 03:40:51,978 INFO [train.py:1028] (0/2) Epoch 28, batch 9850, loss[loss=0.2005, simple_loss=0.2624, pruned_loss=0.06928, over 12996.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2653, pruned_loss=0.06886, over 2537970.14 frames. ], batch size: 102, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:41:11,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=518912.1666666667, ans=0.2 2024-06-22 03:41:14,217 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=518912.1666666667, ans=0.125 2024-06-22 03:41:22,831 INFO [train.py:1028] (0/2) Epoch 28, batch 9900, loss[loss=0.1889, simple_loss=0.2509, pruned_loss=0.06348, over 12960.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.265, pruned_loss=0.06916, over 2530552.20 frames. ], batch size: 39, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:41:27,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=518948.8333333333, ans=0.125 2024-06-22 03:41:31,145 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:41:45,949 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 2.493e+02 2.584e+02 2.775e+02 3.456e+02, threshold=5.167e+02, percent-clipped=0.0 2024-06-22 03:41:48,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=519022.1666666667, ans=15.0 2024-06-22 03:41:50,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=519022.1666666667, ans=0.0 2024-06-22 03:41:55,326 INFO [train.py:1028] (0/2) Epoch 28, batch 9950, loss[loss=0.2052, simple_loss=0.2732, pruned_loss=0.06864, over 12804.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2641, pruned_loss=0.06939, over 2525645.05 frames. 
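[Editor's note: batch sizes in these lines swing from under 20 to over 300 utterances while each batch still covers roughly 10000-13300 frames, because the sampler packs batches by total audio duration rather than by count: buckets of short utterances fit many more cuts. That is lhotse's DynamicBucketingSampler with max_duration=550, as configured for this run; a representative construction follows (the cuts path is an assumption for illustration; the real setup is in this recipe's asr_datamodule.py).

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# Assumed manifest path, for illustration only.
cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=550,   # seconds of audio per batch, as in this run
    num_buckets=30,     # group cuts of similar duration together
    shuffle=True,
    drop_last=True,
)
for batch_cuts in sampler:
    # Short-utterance buckets yield hundreds of cuts per batch,
    # long-utterance buckets only a couple of dozen.
    pass
]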
], batch size: 29, lr: 2.04e-03, grad_scale: 16.0 2024-06-22 03:41:57,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=519040.5, ans=0.025 2024-06-22 03:41:58,516 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=519040.5, ans=0.125 2024-06-22 03:42:02,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=519058.8333333333, ans=0.125 2024-06-22 03:42:09,712 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=519077.1666666667, ans=0.125 2024-06-22 03:42:12,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=519077.1666666667, ans=0.125 2024-06-22 03:42:17,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=519095.5, ans=0.125 2024-06-22 03:42:23,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=519113.8333333333, ans=0.0 2024-06-22 03:42:27,451 INFO [train.py:1028] (0/2) Epoch 28, batch 10000, loss[loss=0.2128, simple_loss=0.272, pruned_loss=0.07677, over 12711.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2643, pruned_loss=0.06985, over 2487584.09 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:42:28,339 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.87 vs. limit=15.0 2024-06-22 03:42:34,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=519150.5, ans=0.125 2024-06-22 03:42:39,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.16 vs. limit=22.5 2024-06-22 03:42:40,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=519168.8333333333, ans=0.1 2024-06-22 03:42:43,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=519168.8333333333, ans=0.125 2024-06-22 03:42:49,950 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.442e+02 2.601e+02 2.839e+02 4.168e+02, threshold=5.203e+02, percent-clipped=0.0 2024-06-22 03:42:59,660 INFO [train.py:1028] (0/2) Epoch 28, batch 10050, loss[loss=0.1998, simple_loss=0.2598, pruned_loss=0.06989, over 12615.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2642, pruned_loss=0.07037, over 2445349.71 frames. ], batch size: 22, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:43:13,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=30.44 vs. limit=22.5 2024-06-22 03:43:16,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=519260.5, ans=0.125 2024-06-22 03:43:18,897 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.66 vs. 
limit=22.5 2024-06-22 03:43:29,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519297.1666666667, ans=0.1 2024-06-22 03:43:30,419 INFO [train.py:1028] (0/2) Epoch 28, batch 10100, loss[loss=0.1878, simple_loss=0.2433, pruned_loss=0.06615, over 11350.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2635, pruned_loss=0.06944, over 2426992.68 frames. ], batch size: 17, lr: 2.04e-03, grad_scale: 32.0 2024-06-22 03:43:31,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=519315.5, ans=0.125 2024-06-22 03:43:32,970 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=519315.5, ans=0.0 2024-06-22 03:43:44,143 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-28.pt 2024-06-22 03:45:46,421 INFO [train.py:1028] (0/2) Epoch 29, batch 0, loss[loss=0.1766, simple_loss=0.2398, pruned_loss=0.05671, over 13028.00 frames. ], tot_loss[loss=0.1766, simple_loss=0.2398, pruned_loss=0.05671, over 13028.00 frames. ], batch size: 36, lr: 2.01e-03, grad_scale: 32.0 2024-06-22 03:45:46,422 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 03:45:53,880 INFO [train.py:1060] (0/2) Epoch 29, validation: loss=0.1942, simple_loss=0.2536, pruned_loss=0.06743, over 351949.00 frames. 2024-06-22 03:45:53,880 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 03:46:04,576 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519365.0, ans=0.1 2024-06-22 03:46:08,842 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.442e+02 2.601e+02 2.796e+02 3.819e+02, threshold=5.201e+02, percent-clipped=0.0 2024-06-22 03:46:30,583 INFO [train.py:1028] (0/2) Epoch 29, batch 50, loss[loss=0.1887, simple_loss=0.2552, pruned_loss=0.06113, over 12745.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.248, pruned_loss=0.06418, over 573511.45 frames. ], batch size: 29, lr: 2.01e-03, grad_scale: 32.0 2024-06-22 03:46:32,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=519438.3333333333, ans=0.5 2024-06-22 03:46:36,881 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.34 vs. limit=22.5 2024-06-22 03:46:45,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=519475.0, ans=0.2 2024-06-22 03:46:46,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.19 vs. limit=12.0 2024-06-22 03:46:55,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=519493.3333333333, ans=0.125 2024-06-22 03:47:04,430 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=519530.0, ans=0.2 2024-06-22 03:47:04,968 INFO [train.py:1028] (0/2) Epoch 29, batch 100, loss[loss=0.1555, simple_loss=0.2245, pruned_loss=0.04322, over 13292.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2466, pruned_loss=0.06315, over 1016290.70 frames. 
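[Editor's note: the checkpoint.py:75 line marks the end-of-epoch save; writing epoch-28.pt accounts for the two-minute gap in the timestamps (03:43:44 to 03:45:46) before epoch 29 begins with its own validation pass. With many trailing epoch checkpoints kept on disk, decoding typically averages the parameters of several of them rather than using the last one alone. A minimal sketch of that averaging; icefall ships its own averaging helpers, this is just the core idea, and the epoch range below is illustrative.

import torch

def average_checkpoints(paths):
    """Uniformly average the model weights stored in several .pt files."""
    avg = None
    for path in paths:
        # icefall checkpoints store the weights under the "model" key.
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

weights = average_checkpoints(
    [f"zipformer/exp/epoch-{i}.pt" for i in range(24, 29)]  # last 5 epochs
)
]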
], batch size: 46, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:47:08,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2024-06-22 03:47:15,801 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.327e+02 2.406e+02 2.611e+02 3.499e+02, threshold=4.812e+02, percent-clipped=0.0 2024-06-22 03:47:20,138 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:47:21,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=519566.6666666667, ans=0.0 2024-06-22 03:47:28,181 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2024-06-22 03:47:35,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=519603.3333333333, ans=0.2 2024-06-22 03:47:36,222 INFO [train.py:1028] (0/2) Epoch 29, batch 150, loss[loss=0.1751, simple_loss=0.244, pruned_loss=0.05308, over 12970.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2466, pruned_loss=0.06241, over 1364297.75 frames. ], batch size: 30, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:48:11,113 INFO [train.py:1028] (0/2) Epoch 29, batch 200, loss[loss=0.1972, simple_loss=0.257, pruned_loss=0.06871, over 12511.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2463, pruned_loss=0.0622, over 1634749.48 frames. ], batch size: 202, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:48:14,011 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.98 vs. limit=15.0 2024-06-22 03:48:22,495 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.329e+02 2.457e+02 2.674e+02 3.354e+02, threshold=4.915e+02, percent-clipped=0.0 2024-06-22 03:48:27,392 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2024-06-22 03:48:30,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=519768.3333333333, ans=0.0 2024-06-22 03:48:30,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2024-06-22 03:48:30,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=519768.3333333333, ans=0.125 2024-06-22 03:48:40,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519786.6666666667, ans=0.1 2024-06-22 03:48:42,716 INFO [train.py:1028] (0/2) Epoch 29, batch 250, loss[loss=0.1772, simple_loss=0.2247, pruned_loss=0.06485, over 13051.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2455, pruned_loss=0.06216, over 1846197.39 frames. 
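[Editor's note: the tot_loss frame counters make the aggregation visible. At the start of epoch 29 they climb (573511 -> 1016290 -> 1364297 -> 1634749 -> 1846197 frames) and then level off around 2.5M, which is what an exponentially decayed running sum with decay (1 - 1/reset_interval), reset_interval=200, converges to at ~13k frames per batch. So tot_loss is a smoothed per-frame loss over roughly the last 200 batches, reset at each epoch boundary. A sketch of such a tracker; the decay form is an assumption consistent with the counters above, simplified from how icefall's train.py accumulates its MetricsTracker.

class RunningLoss:
    """Exponentially decayed running sums of loss and frame counts."""

    def __init__(self, reset_interval: int = 200):
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames

    @property
    def per_frame(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
for _ in range(10000):
    tracker.update(0.2 * 13000, 13000)
print(tracker.frames)  # converges to ~200 * 13000 = 2.6e6, as in the log
]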
], batch size: 144, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:48:43,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=519805.0, ans=0.125 2024-06-22 03:48:48,338 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.87 vs. limit=15.0 2024-06-22 03:48:48,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=519823.3333333333, ans=0.125 2024-06-22 03:48:49,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=519823.3333333333, ans=10.0 2024-06-22 03:48:50,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=519823.3333333333, ans=0.0 2024-06-22 03:49:05,211 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519860.0, ans=0.1 2024-06-22 03:49:14,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=519878.3333333333, ans=0.125 2024-06-22 03:49:18,428 INFO [train.py:1028] (0/2) Epoch 29, batch 300, loss[loss=0.1773, simple_loss=0.2342, pruned_loss=0.06018, over 13190.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2459, pruned_loss=0.06228, over 2008650.52 frames. ], batch size: 112, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:49:22,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=519896.6666666667, ans=0.0 2024-06-22 03:49:22,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=519896.6666666667, ans=0.025 2024-06-22 03:49:23,003 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=519896.6666666667, ans=10.0 2024-06-22 03:49:30,242 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.335e+02 2.446e+02 2.676e+02 3.689e+02, threshold=4.893e+02, percent-clipped=0.0 2024-06-22 03:49:46,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=519970.0, ans=0.125 2024-06-22 03:49:46,684 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2024-06-22 03:49:50,864 INFO [train.py:1028] (0/2) Epoch 29, batch 350, loss[loss=0.1742, simple_loss=0.2415, pruned_loss=0.05349, over 12841.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2457, pruned_loss=0.0622, over 2138096.42 frames. 
], batch size: 33, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:49:55,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519988.3333333333, ans=0.1 2024-06-22 03:49:59,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=520006.6666666667, ans=0.0 2024-06-22 03:49:59,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=520006.6666666667, ans=0.125 2024-06-22 03:50:08,244 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2024-06-22 03:50:09,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=520025.0, ans=0.125 2024-06-22 03:50:13,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=520043.3333333333, ans=0.125 2024-06-22 03:50:17,681 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:50:25,091 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=520061.6666666667, ans=0.125 2024-06-22 03:50:26,165 INFO [train.py:1028] (0/2) Epoch 29, batch 400, loss[loss=0.1766, simple_loss=0.2462, pruned_loss=0.05351, over 13267.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2465, pruned_loss=0.0622, over 2239371.35 frames. ], batch size: 63, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:50:30,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=520080.0, ans=0.125 2024-06-22 03:50:37,471 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.286e+02 2.426e+02 2.676e+02 3.372e+02, threshold=4.852e+02, percent-clipped=0.0 2024-06-22 03:50:39,483 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=520116.6666666667, ans=0.125 2024-06-22 03:50:42,829 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2024-06-22 03:50:51,545 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.50 vs. limit=22.5 2024-06-22 03:51:00,649 INFO [train.py:1028] (0/2) Epoch 29, batch 450, loss[loss=0.1687, simple_loss=0.2288, pruned_loss=0.05433, over 13205.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2462, pruned_loss=0.06212, over 2314260.23 frames. 
], batch size: 67, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:51:06,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=520171.6666666667, ans=15.0 2024-06-22 03:51:09,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=520190.0, ans=0.125 2024-06-22 03:51:12,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=520190.0, ans=0.1 2024-06-22 03:51:18,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=520208.3333333333, ans=0.125 2024-06-22 03:51:20,971 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.09 vs. limit=12.0 2024-06-22 03:51:32,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=520263.3333333333, ans=0.125 2024-06-22 03:51:32,704 INFO [train.py:1028] (0/2) Epoch 29, batch 500, loss[loss=0.1768, simple_loss=0.2317, pruned_loss=0.0609, over 13100.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2467, pruned_loss=0.06207, over 2376970.93 frames. ], batch size: 121, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:51:33,506 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=520263.3333333333, ans=0.0 2024-06-22 03:51:34,086 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=520263.3333333333, ans=0.2 2024-06-22 03:51:44,382 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.281e+02 2.398e+02 2.609e+02 3.237e+02, threshold=4.796e+02, percent-clipped=0.0 2024-06-22 03:51:51,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=520318.3333333333, ans=0.2 2024-06-22 03:52:06,514 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2024-06-22 03:52:07,344 INFO [train.py:1028] (0/2) Epoch 29, batch 550, loss[loss=0.1823, simple_loss=0.2341, pruned_loss=0.06521, over 12929.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2466, pruned_loss=0.062, over 2421892.51 frames. ], batch size: 158, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:52:07,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=520355.0, ans=0.0 2024-06-22 03:52:10,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=520355.0, ans=0.125 2024-06-22 03:52:22,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=520391.6666666667, ans=0.0 2024-06-22 03:52:35,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=520428.3333333333, ans=15.0 2024-06-22 03:52:39,898 INFO [train.py:1028] (0/2) Epoch 29, batch 600, loss[loss=0.1701, simple_loss=0.2241, pruned_loss=0.05805, over 13008.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2464, pruned_loss=0.06187, over 2459924.03 frames. 
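
The recurring `WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=0.0` lines summarize recent gradient norms. The five numbers read as min/q1/median/q3/max, and throughout this stretch the logged threshold equals `Clipping_scale` times the median (e.g. 2.0 × 2.398e+02 = 4.796e+02 in the warning above), with `percent-clipped=0.0` meaning no step exceeded it. The sketch below reproduces that relationship; it is an illustration of the statistic, not the optimizer's actual clipping code.

```python
# Sketch of deriving a clipping threshold from recent grad-norm statistics:
# the logged threshold is consistently Clipping_scale x the median norm.
import torch

def clip_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = float(clipping_scale * q[2])  # scale times the median
    pct = float((grad_norms > threshold).float().mean() * 100.0)
    return q, threshold, pct

norms = torch.tensor([212.5, 235.6, 244.8, 261.6, 316.7])
q, thr, pct = clip_stats(norms)
print(f"quartiles {q.tolist()} threshold={thr:.1f} percent-clipped={pct:.1f}")
```
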
], batch size: 144, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:52:44,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=520446.6666666667, ans=0.2 2024-06-22 03:52:49,919 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2024-06-22 03:52:52,182 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.356e+02 2.448e+02 2.616e+02 3.167e+02, threshold=4.896e+02, percent-clipped=0.0 2024-06-22 03:53:08,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=520520.0, ans=0.125 2024-06-22 03:53:13,309 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=520520.0, ans=0.1 2024-06-22 03:53:15,123 INFO [train.py:1028] (0/2) Epoch 29, batch 650, loss[loss=0.1931, simple_loss=0.253, pruned_loss=0.06659, over 13180.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2466, pruned_loss=0.06163, over 2490349.12 frames. ], batch size: 59, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:53:17,785 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 03:53:17,902 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=520538.3333333333, ans=0.0 2024-06-22 03:53:18,824 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2024-06-22 03:53:20,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=520538.3333333333, ans=0.0 2024-06-22 03:53:23,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=520556.6666666667, ans=10.0 2024-06-22 03:53:32,798 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.79 vs. limit=15.0 2024-06-22 03:53:47,489 INFO [train.py:1028] (0/2) Epoch 29, batch 700, loss[loss=0.1952, simple_loss=0.262, pruned_loss=0.06418, over 13339.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2467, pruned_loss=0.06221, over 2513551.72 frames. 
], batch size: 46, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:53:49,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=520630.0, ans=0.05 2024-06-22 03:53:59,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=520648.3333333333, ans=0.2 2024-06-22 03:53:59,762 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.301e+02 2.442e+02 2.617e+02 3.752e+02, threshold=4.885e+02, percent-clipped=0.0 2024-06-22 03:53:59,943 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-284000.pt 2024-06-22 03:54:17,132 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=520685.0, ans=0.125 2024-06-22 03:54:19,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=520685.0, ans=0.0 2024-06-22 03:54:28,240 INFO [train.py:1028] (0/2) Epoch 29, batch 750, loss[loss=0.1767, simple_loss=0.2372, pruned_loss=0.0581, over 13283.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2467, pruned_loss=0.0617, over 2528875.25 frames. ], batch size: 63, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:54:34,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=520740.0, ans=0.125 2024-06-22 03:54:35,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=520740.0, ans=0.125 2024-06-22 03:54:40,063 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=520758.3333333333, ans=0.125 2024-06-22 03:54:44,359 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=520758.3333333333, ans=0.0 2024-06-22 03:54:47,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=520776.6666666667, ans=0.025 2024-06-22 03:54:52,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=520776.6666666667, ans=0.1 2024-06-22 03:55:00,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=520813.3333333333, ans=0.2 2024-06-22 03:55:00,742 INFO [train.py:1028] (0/2) Epoch 29, batch 800, loss[loss=0.1806, simple_loss=0.2513, pruned_loss=0.05499, over 12957.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2464, pruned_loss=0.06158, over 2542887.77 frames. ], batch size: 36, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:55:03,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.14 vs. 
limit=15.0 2024-06-22 03:55:06,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=520813.3333333333, ans=0.125 2024-06-22 03:55:16,439 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.350e+02 2.476e+02 2.735e+02 3.510e+02, threshold=4.951e+02, percent-clipped=0.0 2024-06-22 03:55:17,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=520850.0, ans=0.2 2024-06-22 03:55:36,967 INFO [train.py:1028] (0/2) Epoch 29, batch 850, loss[loss=0.1654, simple_loss=0.2222, pruned_loss=0.05427, over 13102.00 frames. ], tot_loss[loss=0.184, simple_loss=0.2458, pruned_loss=0.06106, over 2552636.52 frames. ], batch size: 95, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:55:41,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=520905.0, ans=0.125 2024-06-22 03:55:43,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=520923.3333333333, ans=0.0 2024-06-22 03:55:45,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=520923.3333333333, ans=0.125 2024-06-22 03:56:01,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=520978.3333333333, ans=0.2 2024-06-22 03:56:05,607 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2024-06-22 03:56:11,809 INFO [train.py:1028] (0/2) Epoch 29, batch 900, loss[loss=0.1728, simple_loss=0.2394, pruned_loss=0.05309, over 12911.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2456, pruned_loss=0.0615, over 2556785.19 frames. ], batch size: 36, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:56:18,094 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.07 vs. limit=15.0 2024-06-22 03:56:23,818 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=521015.0, ans=0.125 2024-06-22 03:56:25,022 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.377e+02 2.562e+02 2.835e+02 3.581e+02, threshold=5.124e+02, percent-clipped=0.0 2024-06-22 03:56:36,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521051.6666666667, ans=0.1 2024-06-22 03:56:40,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=521070.0, ans=0.025 2024-06-22 03:56:42,755 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=521070.0, ans=0.125 2024-06-22 03:56:44,654 INFO [train.py:1028] (0/2) Epoch 29, batch 950, loss[loss=0.1972, simple_loss=0.2715, pruned_loss=0.06147, over 13166.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2457, pruned_loss=0.06126, over 2560819.88 frames. 
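
The `Saving checkpoint to zipformer/exp/checkpoint-284000.pt` line above shows checkpoints named by the global batch counter rather than by epoch. A hedged sketch of that pattern follows; the `save_every_n` default and the directory layout are assumptions for illustration, not a claim about the recipe's exact code.

```python
# Sketch of batch-count-indexed checkpointing, as in the
# "checkpoint-284000.pt" save above. save_every_n is an assumed default.
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path = Path("zipformer/exp"),
                          save_every_n: int = 4000):
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return None
    exp_dir.mkdir(parents=True, exist_ok=True)
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "batch_idx_train": batch_idx_train},
        path,
    )
    return path
```
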
], batch size: 40, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:56:49,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=521088.3333333333, ans=0.0 2024-06-22 03:56:51,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=521106.6666666667, ans=0.125 2024-06-22 03:56:57,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=521125.0, ans=0.125 2024-06-22 03:57:12,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.11 vs. limit=15.0 2024-06-22 03:57:14,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521161.6666666667, ans=0.1 2024-06-22 03:57:19,496 INFO [train.py:1028] (0/2) Epoch 29, batch 1000, loss[loss=0.1888, simple_loss=0.2496, pruned_loss=0.06395, over 13229.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2461, pruned_loss=0.06181, over 2563078.73 frames. ], batch size: 49, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:57:20,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=521180.0, ans=0.125 2024-06-22 03:57:29,997 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.13 vs. limit=22.5 2024-06-22 03:57:32,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=521216.6666666667, ans=0.125 2024-06-22 03:57:32,725 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.347e+02 2.496e+02 2.727e+02 3.623e+02, threshold=4.992e+02, percent-clipped=0.0 2024-06-22 03:57:38,313 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.41 vs. limit=10.0 2024-06-22 03:57:47,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521253.3333333333, ans=0.1 2024-06-22 03:57:50,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=521253.3333333333, ans=0.125 2024-06-22 03:57:52,123 INFO [train.py:1028] (0/2) Epoch 29, batch 1050, loss[loss=0.1802, simple_loss=0.2473, pruned_loss=0.05652, over 13160.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2463, pruned_loss=0.06187, over 2566287.24 frames. 
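
Each batch summary reports two loss groups: `loss[... over N frames]` for the current batch and `tot_loss[... over M frames]` aggregated over a steadily growing frame total (2008650 → 2138096 → ... above), which is why `tot_loss` moves slowly while the per-batch loss fluctuates. Below is a frame-weighted running average capturing that behaviour; icefall's metrics tracking is more elaborate, so treat this as a simplification.

```python
# Frame-weighted running average implied by the tot_loss reports: each
# batch's loss contributes in proportion to its number of frames.
class RunningLoss:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
tracker.update(0.1766, 13267.0)  # numbers from the batch 400 summary above
print(f"{tracker.tot_loss:.4f}")
```
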
], batch size: 77, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:57:56,103 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=521271.6666666667, ans=0.0 2024-06-22 03:57:56,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=521271.6666666667, ans=0.2 2024-06-22 03:58:01,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=521290.0, ans=0.125 2024-06-22 03:58:12,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=521308.3333333333, ans=0.07 2024-06-22 03:58:12,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=521308.3333333333, ans=0.2 2024-06-22 03:58:18,402 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=521326.6666666667, ans=0.125 2024-06-22 03:58:21,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=521345.0, ans=0.125 2024-06-22 03:58:26,095 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=521345.0, ans=0.125 2024-06-22 03:58:27,290 INFO [train.py:1028] (0/2) Epoch 29, batch 1100, loss[loss=0.1895, simple_loss=0.2529, pruned_loss=0.06308, over 13277.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2467, pruned_loss=0.06192, over 2570837.84 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:58:34,697 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. limit=10.0 2024-06-22 03:58:40,232 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.277e+02 2.425e+02 2.541e+02 3.423e+02, threshold=4.850e+02, percent-clipped=0.0 2024-06-22 03:58:50,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=521418.3333333333, ans=0.0 2024-06-22 03:58:59,803 INFO [train.py:1028] (0/2) Epoch 29, batch 1150, loss[loss=0.1812, simple_loss=0.2489, pruned_loss=0.05676, over 13267.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2465, pruned_loss=0.06198, over 2571901.23 frames. 
], batch size: 52, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 03:59:09,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=521473.3333333333, ans=0.125 2024-06-22 03:59:18,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=521491.6666666667, ans=0.0 2024-06-22 03:59:26,277 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=521510.0, ans=0.0 2024-06-22 03:59:28,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=521528.3333333333, ans=0.125 2024-06-22 03:59:30,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=521528.3333333333, ans=0.2 2024-06-22 03:59:32,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=521528.3333333333, ans=0.125 2024-06-22 03:59:35,168 INFO [train.py:1028] (0/2) Epoch 29, batch 1200, loss[loss=0.1642, simple_loss=0.2265, pruned_loss=0.05092, over 13193.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2464, pruned_loss=0.06214, over 2573828.17 frames. ], batch size: 77, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 03:59:47,596 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.416e+02 2.587e+02 2.851e+02 3.693e+02, threshold=5.175e+02, percent-clipped=0.0 2024-06-22 03:59:55,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=521601.6666666667, ans=0.125 2024-06-22 04:00:09,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=12.0 2024-06-22 04:00:09,337 INFO [train.py:1028] (0/2) Epoch 29, batch 1250, loss[loss=0.1796, simple_loss=0.2372, pruned_loss=0.06105, over 13177.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2458, pruned_loss=0.06196, over 2582666.26 frames. ], batch size: 112, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:00:14,094 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=521638.3333333333, ans=0.025 2024-06-22 04:00:30,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=521693.3333333333, ans=0.025 2024-06-22 04:00:40,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=521711.6666666667, ans=0.0 2024-06-22 04:00:42,877 INFO [train.py:1028] (0/2) Epoch 29, batch 1300, loss[loss=0.2106, simple_loss=0.2626, pruned_loss=0.07931, over 12718.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2468, pruned_loss=0.06242, over 2583772.84 frames. 
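
The three numbers in each loss report are internally consistent with `loss = 0.5 * simple_loss + pruned_loss`, the usual post-warm-up weighting of the pruned RNN-T objective. A quick check against the batch 1150 totals above:

```python
# Checking loss = 0.5 * simple_loss + pruned_loss against the batch 1150
# tot_loss summary (loss=0.1852, simple_loss=0.2465, pruned_loss=0.06198).
simple_loss, pruned_loss = 0.2465, 0.06198
loss = 0.5 * simple_loss + pruned_loss
print(round(loss, 4))  # 0.1852, matching the logged tot_loss
```

The same identity holds for the other summaries in this section, which is a convenient sanity check that the per-component losses are being accumulated consistently.
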
], batch size: 176, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:00:43,713 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=521730.0, ans=0.0 2024-06-22 04:00:46,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=521730.0, ans=0.0 2024-06-22 04:00:48,050 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:00:50,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=521748.3333333333, ans=0.125 2024-06-22 04:00:56,017 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.307e+02 2.473e+02 2.693e+02 3.722e+02, threshold=4.947e+02, percent-clipped=0.0 2024-06-22 04:01:00,256 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=521766.6666666667, ans=0.1 2024-06-22 04:01:02,283 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=521785.0, ans=0.125 2024-06-22 04:01:05,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=521785.0, ans=0.2 2024-06-22 04:01:06,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5 2024-06-22 04:01:19,524 INFO [train.py:1028] (0/2) Epoch 29, batch 1350, loss[loss=0.1956, simple_loss=0.2577, pruned_loss=0.06677, over 13203.00 frames. ], tot_loss[loss=0.186, simple_loss=0.247, pruned_loss=0.06245, over 2585195.30 frames. ], batch size: 59, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:01:43,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.94 vs. limit=15.0 2024-06-22 04:01:45,461 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:01:53,136 INFO [train.py:1028] (0/2) Epoch 29, batch 1400, loss[loss=0.1677, simple_loss=0.2362, pruned_loss=0.04957, over 12786.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2467, pruned_loss=0.06216, over 2587562.13 frames. ], batch size: 26, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:02:05,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=521950.0, ans=0.125 2024-06-22 04:02:08,667 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.312e+02 2.451e+02 2.701e+02 3.676e+02, threshold=4.903e+02, percent-clipped=0.0 2024-06-22 04:02:16,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=521968.3333333333, ans=0.0 2024-06-22 04:02:17,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521968.3333333333, ans=0.1 2024-06-22 04:02:28,472 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.56 vs. limit=10.0 2024-06-22 04:02:28,589 INFO [train.py:1028] (0/2) Epoch 29, batch 1450, loss[loss=0.1819, simple_loss=0.2355, pruned_loss=0.06412, over 13075.00 frames. 
], tot_loss[loss=0.1856, simple_loss=0.2466, pruned_loss=0.06233, over 2588369.30 frames. ], batch size: 121, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:02:30,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=522005.0, ans=0.0 2024-06-22 04:02:37,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.82 vs. limit=10.0 2024-06-22 04:02:38,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=522023.3333333333, ans=0.125 2024-06-22 04:02:39,523 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2024-06-22 04:02:51,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.72 vs. limit=10.0 2024-06-22 04:02:52,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=522060.0, ans=0.0 2024-06-22 04:03:00,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=522096.6666666667, ans=0.2 2024-06-22 04:03:00,859 INFO [train.py:1028] (0/2) Epoch 29, batch 1500, loss[loss=0.1736, simple_loss=0.2328, pruned_loss=0.05719, over 13219.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2467, pruned_loss=0.06229, over 2590170.52 frames. ], batch size: 83, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:03:16,639 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.331e+02 2.514e+02 2.730e+02 3.846e+02, threshold=5.028e+02, percent-clipped=0.0 2024-06-22 04:03:17,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=522133.3333333333, ans=0.125 2024-06-22 04:03:17,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=522133.3333333333, ans=0.125 2024-06-22 04:03:22,153 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.71 vs. limit=22.5 2024-06-22 04:03:27,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2024-06-22 04:03:29,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=522170.0, ans=0.0 2024-06-22 04:03:35,857 INFO [train.py:1028] (0/2) Epoch 29, batch 1550, loss[loss=0.1906, simple_loss=0.2464, pruned_loss=0.06737, over 13010.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2466, pruned_loss=0.06227, over 2584634.24 frames. ], batch size: 102, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:03:47,942 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.75 vs. limit=10.0 2024-06-22 04:03:54,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=522225.0, ans=0.125 2024-06-22 04:03:54,302 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.00 vs. 
limit=15.0 2024-06-22 04:03:58,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=522243.3333333333, ans=0.2 2024-06-22 04:04:10,433 INFO [train.py:1028] (0/2) Epoch 29, batch 1600, loss[loss=0.1939, simple_loss=0.264, pruned_loss=0.06191, over 13111.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2467, pruned_loss=0.06207, over 2579796.97 frames. ], batch size: 77, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:04:13,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2024-06-22 04:04:22,844 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.325e+02 2.452e+02 2.626e+02 3.155e+02, threshold=4.904e+02, percent-clipped=0.0 2024-06-22 04:04:24,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=522316.6666666667, ans=0.1 2024-06-22 04:04:39,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=522353.3333333333, ans=0.0 2024-06-22 04:04:42,041 INFO [train.py:1028] (0/2) Epoch 29, batch 1650, loss[loss=0.1973, simple_loss=0.2509, pruned_loss=0.07185, over 13152.00 frames. ], tot_loss[loss=0.186, simple_loss=0.247, pruned_loss=0.06247, over 2576488.23 frames. ], batch size: 95, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:04:46,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.85 vs. limit=10.0 2024-06-22 04:05:10,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=522445.0, ans=0.0 2024-06-22 04:05:17,264 INFO [train.py:1028] (0/2) Epoch 29, batch 1700, loss[loss=0.2002, simple_loss=0.2652, pruned_loss=0.06758, over 12833.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2473, pruned_loss=0.06224, over 2581185.10 frames. ], batch size: 26, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:05:20,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.46 vs. limit=15.0 2024-06-22 04:05:28,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=522481.6666666667, ans=0.0 2024-06-22 04:05:30,137 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.349e+02 2.492e+02 2.692e+02 3.299e+02, threshold=4.984e+02, percent-clipped=0.0 2024-06-22 04:05:32,762 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.18 vs. limit=10.0 2024-06-22 04:05:35,771 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.33 vs. 
limit=15.0 2024-06-22 04:05:38,056 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=522518.3333333333, ans=0.07 2024-06-22 04:05:39,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=522518.3333333333, ans=0.0 2024-06-22 04:05:49,853 INFO [train.py:1028] (0/2) Epoch 29, batch 1750, loss[loss=0.2036, simple_loss=0.2723, pruned_loss=0.06744, over 12528.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2479, pruned_loss=0.06249, over 2582279.73 frames. ], batch size: 22, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:05:55,649 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:05:58,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=522573.3333333333, ans=0.125 2024-06-22 04:06:07,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=522591.6666666667, ans=0.0 2024-06-22 04:06:10,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=522591.6666666667, ans=0.2 2024-06-22 04:06:22,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=522628.3333333333, ans=0.125 2024-06-22 04:06:24,350 INFO [train.py:1028] (0/2) Epoch 29, batch 1800, loss[loss=0.1823, simple_loss=0.2425, pruned_loss=0.06106, over 13255.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2475, pruned_loss=0.06248, over 2582477.44 frames. ], batch size: 67, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:06:28,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=522646.6666666667, ans=0.0 2024-06-22 04:06:32,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522665.0, ans=0.1 2024-06-22 04:06:34,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0 2024-06-22 04:06:36,973 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.301e+02 2.462e+02 2.610e+02 3.115e+02, threshold=4.924e+02, percent-clipped=0.0 2024-06-22 04:06:39,911 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.97 vs. limit=10.0 2024-06-22 04:06:40,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=522683.3333333333, ans=0.125 2024-06-22 04:06:43,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=522701.6666666667, ans=0.015 2024-06-22 04:06:56,342 INFO [train.py:1028] (0/2) Epoch 29, batch 1850, loss[loss=0.1867, simple_loss=0.2418, pruned_loss=0.06578, over 13210.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2476, pruned_loss=0.06241, over 2582927.59 frames. 
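
The `Whitening: name=..., metric=... vs. limit=...` lines report how far a module's channel covariance is from a multiple of the identity; a metric above the limit triggers the whitening penalty. The sketch below computes a metric of that shape (1.0 iff all covariance eigenvalues are equal); the exact definition in icefall's `scaling.py` differs in detail, so this only captures the idea.

```python
# Toy whitening metric: ratio that equals 1.0 for perfectly "white"
# activations and grows as channel variances become unequal or correlated.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return float(eigs.pow(2).mean() / (eigs.mean().pow(2) + 1e-20))

x = torch.randn(1000, 192) * torch.linspace(0.5, 2.0, 192)
print(whitening_metric(x))  # > 1.0: channels have unequal variance
```
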
], batch size: 83, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:06:57,159 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=522738.3333333333, ans=0.0 2024-06-22 04:07:02,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=522756.6666666667, ans=0.95 2024-06-22 04:07:13,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.54 vs. limit=5.0 2024-06-22 04:07:14,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=522775.0, ans=0.125 2024-06-22 04:07:31,348 INFO [train.py:1028] (0/2) Epoch 29, batch 1900, loss[loss=0.17, simple_loss=0.2333, pruned_loss=0.05338, over 13169.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2468, pruned_loss=0.0623, over 2585135.15 frames. ], batch size: 95, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:07:32,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=522830.0, ans=0.1 2024-06-22 04:07:38,554 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2024-06-22 04:07:40,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=522848.3333333333, ans=0.125 2024-06-22 04:07:42,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=522848.3333333333, ans=0.04949747468305833 2024-06-22 04:07:44,745 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.375e+02 2.458e+02 2.563e+02 3.310e+02, threshold=4.916e+02, percent-clipped=0.0 2024-06-22 04:07:48,182 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=522866.6666666667, ans=0.0 2024-06-22 04:08:06,695 INFO [train.py:1028] (0/2) Epoch 29, batch 1950, loss[loss=0.1901, simple_loss=0.2512, pruned_loss=0.0645, over 13240.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2463, pruned_loss=0.06257, over 2590765.73 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:08:09,624 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2024-06-22 04:08:12,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522921.6666666667, ans=0.1 2024-06-22 04:08:13,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=522940.0, ans=0.125 2024-06-22 04:08:13,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.58 vs. 
limit=15.0 2024-06-22 04:08:16,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=522940.0, ans=0.2 2024-06-22 04:08:18,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=522940.0, ans=0.0 2024-06-22 04:08:23,249 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2024-06-22 04:08:24,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=522958.3333333333, ans=0.125 2024-06-22 04:08:38,835 INFO [train.py:1028] (0/2) Epoch 29, batch 2000, loss[loss=0.1876, simple_loss=0.2539, pruned_loss=0.06068, over 12694.00 frames. ], tot_loss[loss=0.1861, simple_loss=0.2466, pruned_loss=0.06278, over 2586758.66 frames. ], batch size: 22, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:08:39,781 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=523013.3333333333, ans=12.0 2024-06-22 04:08:49,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=523031.6666666667, ans=0.07 2024-06-22 04:08:50,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=523031.6666666667, ans=0.0 2024-06-22 04:08:51,713 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.422e+02 2.535e+02 2.811e+02 3.422e+02, threshold=5.070e+02, percent-clipped=0.0 2024-06-22 04:08:54,021 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=523050.0, ans=0.125 2024-06-22 04:09:00,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=523050.0, ans=0.0 2024-06-22 04:09:00,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=523050.0, ans=0.0 2024-06-22 04:09:04,758 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=523068.3333333333, ans=0.025 2024-06-22 04:09:10,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.74 vs. limit=10.0 2024-06-22 04:09:12,988 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=523086.6666666667, ans=0.0 2024-06-22 04:09:13,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=523086.6666666667, ans=0.125 2024-06-22 04:09:14,776 INFO [train.py:1028] (0/2) Epoch 29, batch 2050, loss[loss=0.1842, simple_loss=0.2485, pruned_loss=0.05992, over 12794.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2468, pruned_loss=0.06316, over 2582774.37 frames. 
], batch size: 29, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:09:16,954 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=523105.0, ans=0.1 2024-06-22 04:09:27,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=523141.6666666667, ans=0.125 2024-06-22 04:09:39,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=523178.3333333333, ans=0.125 2024-06-22 04:09:42,845 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=523178.3333333333, ans=0.025 2024-06-22 04:09:44,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=523178.3333333333, ans=0.2 2024-06-22 04:09:46,407 INFO [train.py:1028] (0/2) Epoch 29, batch 2100, loss[loss=0.1901, simple_loss=0.2558, pruned_loss=0.0622, over 13223.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2462, pruned_loss=0.06221, over 2585634.57 frames. ], batch size: 59, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:09:53,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=523215.0, ans=0.0 2024-06-22 04:10:02,412 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.355e+02 2.530e+02 2.727e+02 3.451e+02, threshold=5.060e+02, percent-clipped=0.0 2024-06-22 04:10:13,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=523251.6666666667, ans=0.125 2024-06-22 04:10:21,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=523288.3333333333, ans=0.0 2024-06-22 04:10:22,031 INFO [train.py:1028] (0/2) Epoch 29, batch 2150, loss[loss=0.1864, simple_loss=0.2486, pruned_loss=0.06212, over 13242.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2463, pruned_loss=0.06199, over 2588586.22 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:10:32,755 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:10:34,433 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.15 vs. limit=22.5 2024-06-22 04:10:40,000 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2024-06-22 04:10:53,321 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=523361.6666666667, ans=0.125 2024-06-22 04:10:53,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=523361.6666666667, ans=0.04949747468305833 2024-06-22 04:10:54,537 INFO [train.py:1028] (0/2) Epoch 29, batch 2200, loss[loss=0.186, simple_loss=0.2404, pruned_loss=0.0658, over 13206.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2469, pruned_loss=0.0624, over 2588317.02 frames. 
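
The `grad_scale` field moving between 16.0 and 32.0 across these batch summaries is characteristic of dynamic loss scaling for fp16 training: the scale is halved when non-finite gradients appear and doubled again after a long run of clean steps. A toy version of that update rule is below; PyTorch's `torch.cuda.amp.GradScaler` implements the real thing, and the growth interval here is an assumed value.

```python
# Toy dynamic loss scaler: halve on overflow, double after a run of
# finite-gradient steps. growth_interval is an illustrative assumption.
class ToyLossScaler:
    def __init__(self, scale: float = 32.0, growth_interval: int = 2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:            # overflow: halve and restart the counter
            self.scale = max(self.scale / 2.0, 1.0)
            self._good_steps = 0
        else:                    # long run of finite grads: double
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= 2.0
                self._good_steps = 0

scaler = ToyLossScaler()
scaler.update(found_inf=True)
print(scaler.scale)  # 16.0
```
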
], batch size: 83, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 04:10:59,416 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:11:06,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=523398.3333333333, ans=0.0 2024-06-22 04:11:08,338 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.342e+02 2.490e+02 2.648e+02 3.564e+02, threshold=4.979e+02, percent-clipped=0.0 2024-06-22 04:11:09,110 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:11:17,659 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=523435.0, ans=0.025 2024-06-22 04:11:21,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=523435.0, ans=0.0 2024-06-22 04:11:25,466 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=523453.3333333333, ans=0.1 2024-06-22 04:11:29,941 INFO [train.py:1028] (0/2) Epoch 29, batch 2250, loss[loss=0.1637, simple_loss=0.2334, pruned_loss=0.04694, over 13236.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2469, pruned_loss=0.06212, over 2587153.75 frames. ], batch size: 63, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 04:11:31,726 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.32 vs. limit=10.0 2024-06-22 04:11:39,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=523490.0, ans=0.0 2024-06-22 04:11:44,231 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=22.5 2024-06-22 04:11:45,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=523508.3333333333, ans=0.125 2024-06-22 04:11:46,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523508.3333333333, ans=0.1 2024-06-22 04:12:06,090 INFO [train.py:1028] (0/2) Epoch 29, batch 2300, loss[loss=0.1622, simple_loss=0.2258, pruned_loss=0.04923, over 12918.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2473, pruned_loss=0.062, over 2580739.04 frames. ], batch size: 33, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 04:12:06,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=523563.3333333333, ans=0.0 2024-06-22 04:12:14,627 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.68 vs. limit=15.0 2024-06-22 04:12:14,967 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=523581.6666666667, ans=0.2 2024-06-22 04:12:20,320 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.336e+02 2.451e+02 2.626e+02 3.376e+02, threshold=4.901e+02, percent-clipped=0.0 2024-06-22 04:12:32,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.75 vs. 
limit=15.0 2024-06-22 04:12:34,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=523636.6666666667, ans=0.125 2024-06-22 04:12:39,134 INFO [train.py:1028] (0/2) Epoch 29, batch 2350, loss[loss=0.1845, simple_loss=0.2471, pruned_loss=0.06101, over 13277.00 frames. ], tot_loss[loss=0.1858, simple_loss=0.2472, pruned_loss=0.06216, over 2584901.84 frames. ], batch size: 67, lr: 2.00e-03, grad_scale: 16.0 2024-06-22 04:12:46,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=523673.3333333333, ans=0.09899494936611666 2024-06-22 04:12:54,020 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523691.6666666667, ans=0.1 2024-06-22 04:13:14,988 INFO [train.py:1028] (0/2) Epoch 29, batch 2400, loss[loss=0.174, simple_loss=0.2434, pruned_loss=0.05227, over 13302.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2462, pruned_loss=0.06202, over 2588426.44 frames. ], batch size: 46, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:13:29,350 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.367e+02 2.501e+02 2.666e+02 3.627e+02, threshold=5.001e+02, percent-clipped=0.0 2024-06-22 04:13:32,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=523783.3333333333, ans=0.125 2024-06-22 04:13:41,320 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.31 vs. limit=15.0 2024-06-22 04:13:48,057 INFO [train.py:1028] (0/2) Epoch 29, batch 2450, loss[loss=0.1609, simple_loss=0.2242, pruned_loss=0.04882, over 13263.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2459, pruned_loss=0.06243, over 2584247.93 frames. ], batch size: 63, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:13:59,689 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2024-06-22 04:14:06,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=523875.0, ans=0.025 2024-06-22 04:14:19,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=523911.6666666667, ans=0.0 2024-06-22 04:14:23,295 INFO [train.py:1028] (0/2) Epoch 29, batch 2500, loss[loss=0.1795, simple_loss=0.2343, pruned_loss=0.0623, over 13184.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2446, pruned_loss=0.06201, over 2586727.45 frames. 
], batch size: 83, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:14:23,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=523930.0, ans=0.0 2024-06-22 04:14:32,582 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=523948.3333333333, ans=0.125 2024-06-22 04:14:37,029 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.318e+02 2.443e+02 2.663e+02 3.793e+02, threshold=4.886e+02, percent-clipped=0.0 2024-06-22 04:14:40,805 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=523966.6666666667, ans=0.0 2024-06-22 04:14:44,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.73 vs. limit=6.0 2024-06-22 04:14:54,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=524003.3333333333, ans=0.125 2024-06-22 04:14:55,379 INFO [train.py:1028] (0/2) Epoch 29, batch 2550, loss[loss=0.1883, simple_loss=0.2542, pruned_loss=0.06117, over 12550.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2435, pruned_loss=0.06166, over 2587587.81 frames. ], batch size: 22, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:14:58,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=524021.6666666667, ans=0.125 2024-06-22 04:15:03,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524021.6666666667, ans=0.1 2024-06-22 04:15:07,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=524040.0, ans=0.125 2024-06-22 04:15:14,977 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=524058.3333333333, ans=0.2 2024-06-22 04:15:15,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=524058.3333333333, ans=15.0 2024-06-22 04:15:26,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=524095.0, ans=0.1 2024-06-22 04:15:30,418 INFO [train.py:1028] (0/2) Epoch 29, batch 2600, loss[loss=0.1826, simple_loss=0.2444, pruned_loss=0.06039, over 13240.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.2434, pruned_loss=0.06184, over 2587266.94 frames. ], batch size: 52, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:15:35,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=524113.3333333333, ans=0.025 2024-06-22 04:15:48,665 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.354e+02 2.512e+02 2.696e+02 3.404e+02, threshold=5.024e+02, percent-clipped=0.0 2024-06-22 04:15:51,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=524150.0, ans=0.0 2024-06-22 04:16:07,408 INFO [train.py:1028] (0/2) Epoch 29, batch 2650, loss[loss=0.1818, simple_loss=0.2339, pruned_loss=0.06486, over 13056.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2425, pruned_loss=0.06172, over 2587382.16 frames. 
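
Many of the scheduled names above are balancer parameters (`min_positive`, `max_positive`, `min_abs`, `max_abs`, `prob`): constraints on per-channel activation statistics. The toy check below only measures violations of such bounds; the actual balancer modules also reshape gradients to push the statistics back inside the bounds, so this is a rough illustration of what the parameters mean rather than how they act.

```python
# Measure which channels violate balancer-style bounds on the fraction of
# positive values and on the mean absolute value. Illustration only.
import torch

def balancer_violations(x: torch.Tensor,
                        min_positive: float = 0.05,
                        max_abs: float = 10.0) -> dict:
    # x: (num_frames, num_channels)
    frac_positive = (x > 0).float().mean(dim=0)
    abs_mean = x.abs().mean(dim=0)
    return {
        "too_negative_channels": int((frac_positive < min_positive).sum()),
        "too_large_channels": int((abs_mean > max_abs).sum()),
    }

print(balancer_violations(torch.randn(1000, 256)))
```
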
], batch size: 144, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:16:09,867 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2024-06-22 04:16:18,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.17 vs. limit=15.0 2024-06-22 04:16:39,633 INFO [train.py:1028] (0/2) Epoch 29, batch 2700, loss[loss=0.1841, simple_loss=0.2367, pruned_loss=0.06578, over 13265.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2415, pruned_loss=0.06161, over 2586486.19 frames. ], batch size: 89, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:16:39,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=524296.6666666666, ans=0.0 2024-06-22 04:16:51,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=524315.0, ans=0.035 2024-06-22 04:16:53,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=524333.3333333334, ans=0.125 2024-06-22 04:16:53,793 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.297e+02 2.443e+02 2.651e+02 3.296e+02, threshold=4.885e+02, percent-clipped=0.0 2024-06-22 04:17:08,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=524370.0, ans=0.125 2024-06-22 04:17:11,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=524370.0, ans=0.125 2024-06-22 04:17:13,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.99 vs. limit=15.0 2024-06-22 04:17:15,677 INFO [train.py:1028] (0/2) Epoch 29, batch 2750, loss[loss=0.1864, simple_loss=0.247, pruned_loss=0.06287, over 13255.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2409, pruned_loss=0.06109, over 2584064.66 frames. ], batch size: 43, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:17:25,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=524406.6666666666, ans=0.0 2024-06-22 04:17:28,585 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=524425.0, ans=0.125 2024-06-22 04:17:44,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524443.3333333334, ans=0.1 2024-06-22 04:17:46,852 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=524461.6666666666, ans=0.2 2024-06-22 04:17:49,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524461.6666666666, ans=0.1 2024-06-22 04:17:51,919 INFO [train.py:1028] (0/2) Epoch 29, batch 2800, loss[loss=0.179, simple_loss=0.2243, pruned_loss=0.06682, over 10803.00 frames. ], tot_loss[loss=0.1812, simple_loss=0.2403, pruned_loss=0.061, over 2581308.80 frames. 
], batch size: 305, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:18:02,432 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=524498.3333333334, ans=0.125 2024-06-22 04:18:05,408 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.332e+02 2.483e+02 2.805e+02 3.871e+02, threshold=4.966e+02, percent-clipped=0.0 2024-06-22 04:18:06,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=524516.6666666666, ans=0.125 2024-06-22 04:18:24,248 INFO [train.py:1028] (0/2) Epoch 29, batch 2850, loss[loss=0.1794, simple_loss=0.2351, pruned_loss=0.06179, over 13271.00 frames. ], tot_loss[loss=0.1802, simple_loss=0.2391, pruned_loss=0.06063, over 2578896.12 frames. ], batch size: 49, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:18:32,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=524590.0, ans=0.5 2024-06-22 04:18:40,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=524608.3333333334, ans=0.2 2024-06-22 04:18:41,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=524608.3333333334, ans=0.0 2024-06-22 04:18:47,355 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524626.6666666666, ans=0.1 2024-06-22 04:18:53,104 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=524645.0, ans=0.0 2024-06-22 04:18:56,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.85 vs. limit=15.0 2024-06-22 04:18:57,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=524645.0, ans=0.2 2024-06-22 04:18:59,217 INFO [train.py:1028] (0/2) Epoch 29, batch 2900, loss[loss=0.1656, simple_loss=0.2269, pruned_loss=0.05217, over 13152.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.237, pruned_loss=0.05981, over 2586558.41 frames. ], batch size: 55, lr: 2.00e-03, grad_scale: 32.0 2024-06-22 04:19:00,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=524663.3333333334, ans=0.125 2024-06-22 04:19:09,719 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.69 vs. limit=15.0 2024-06-22 04:19:12,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=524700.0, ans=0.125 2024-06-22 04:19:13,244 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.363e+02 2.581e+02 2.810e+02 3.242e+02, threshold=5.161e+02, percent-clipped=0.0 2024-06-22 04:19:14,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=524700.0, ans=0.125 2024-06-22 04:19:35,660 INFO [train.py:1028] (0/2) Epoch 29, batch 2950, loss[loss=0.1796, simple_loss=0.2331, pruned_loss=0.06303, over 13290.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2368, pruned_loss=0.05981, over 2578736.22 frames. 
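
The wide spread of `batch size` values in these summaries (from 22 up to 305 cuts, the 305-cut batch covering only 10803 frames) is what a duration-capped bucketing sampler produces: each batch packs utterances until a total-duration budget is reached, so batches of short utterances contain many more cuts. A back-of-envelope check, with the 550 s budget taken as an assumption:

```python
# Rough cuts-per-batch estimate under a duration budget. The 550 s budget
# is an assumed value for illustration.
def cuts_per_batch(avg_cut_seconds: float, max_duration: float = 550.0) -> int:
    return int(max_duration // avg_cut_seconds)

print(cuts_per_batch(2.0))   # ~275 short cuts fit in one batch
print(cuts_per_batch(12.0))  # ~45 long cuts
```
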
], batch size: 43, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:19:43,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=524773.3333333334, ans=0.125 2024-06-22 04:19:45,849 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=524773.3333333334, ans=0.2 2024-06-22 04:19:54,464 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=524791.6666666666, ans=0.0 2024-06-22 04:19:56,146 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.62 vs. limit=6.0 2024-06-22 04:20:08,923 INFO [train.py:1028] (0/2) Epoch 29, batch 3000, loss[loss=0.1808, simple_loss=0.2395, pruned_loss=0.06102, over 13204.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2361, pruned_loss=0.05946, over 2578539.08 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:20:08,924 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 04:20:16,977 INFO [train.py:1060] (0/2) Epoch 29, validation: loss=0.1934, simple_loss=0.2521, pruned_loss=0.06738, over 351949.00 frames. 2024-06-22 04:20:16,977 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 04:20:17,074 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=524846.6666666666, ans=0.025 2024-06-22 04:20:18,116 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2024-06-22 04:20:21,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=524846.6666666666, ans=0.025 2024-06-22 04:20:30,760 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.302e+02 2.422e+02 2.566e+02 3.144e+02, threshold=4.843e+02, percent-clipped=0.0 2024-06-22 04:20:32,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=524883.3333333334, ans=0.125 2024-06-22 04:20:33,922 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2024-06-22 04:20:38,355 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.04 vs. limit=15.0 2024-06-22 04:20:44,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=524920.0, ans=0.125 2024-06-22 04:20:52,751 INFO [train.py:1028] (0/2) Epoch 29, batch 3050, loss[loss=0.1715, simple_loss=0.2323, pruned_loss=0.05535, over 13322.00 frames. ], tot_loss[loss=0.177, simple_loss=0.235, pruned_loss=0.05948, over 2578461.48 frames. ], batch size: 46, lr: 1.99e-03, grad_scale: 32.0
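
NOTE: the entries above show training pausing at batch 3000 of epoch 29 to compute a validation loss over 351949 frames before resuming. A minimal sketch of that pattern, assuming a model that returns a summed loss plus a frame count; the function name and loader interface are illustrative, not icefall's actual API:

    import torch

    def compute_validation_loss(model, valid_loader, device):
        # Accumulate a frame-weighted loss over the whole dev set,
        # mirroring the "validation: loss=..., over N frames" entries.
        model.eval()
        tot_loss, tot_frames = 0.0, 0
        with torch.no_grad():
            for batch in valid_loader:
                loss_sum, num_frames = model(batch.to(device))  # assumed interface
                tot_loss += loss_sum.item()
                tot_frames += num_frames
        model.train()  # training resumes right after, as in the log
        return tot_loss / max(tot_frames, 1)
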
2024-06-22 04:20:58,555 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=524956.6666666666, ans=0.0 2024-06-22 04:21:17,956 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=525011.6666666666, ans=0.125 2024-06-22 04:21:20,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=525011.6666666666, ans=0.0 2024-06-22 04:21:23,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=525011.6666666666, ans=0.0 2024-06-22 04:21:25,045 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=22.5 2024-06-22 04:21:25,168 INFO [train.py:1028] (0/2) Epoch 29, batch 3100, loss[loss=0.1627, simple_loss=0.2175, pruned_loss=0.05401, over 13123.00 frames. ], tot_loss[loss=0.1767, simple_loss=0.2348, pruned_loss=0.05933, over 2579307.95 frames. ], batch size: 145, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:21:33,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=525030.0, ans=0.0 2024-06-22 04:21:41,983 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.877e+02 2.305e+02 2.452e+02 2.604e+02 3.487e+02, threshold=4.903e+02, percent-clipped=0.0 2024-06-22 04:21:44,854 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.30 vs. limit=10.0 2024-06-22 04:21:52,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=525085.0, ans=0.125 2024-06-22 04:21:54,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=525103.3333333334, ans=0.125 2024-06-22 04:22:00,799 INFO [train.py:1028] (0/2) Epoch 29, batch 3150, loss[loss=0.1727, simple_loss=0.2242, pruned_loss=0.06061, over 12913.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2338, pruned_loss=0.05902, over 2581043.49 frames. ], batch size: 158, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:22:10,505 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=525140.0, ans=0.125 2024-06-22 04:22:17,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525158.3333333334, ans=0.1 2024-06-22 04:22:20,199 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.09 vs. limit=12.0 2024-06-22 04:22:26,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=525195.0, ans=0.125 2024-06-22 04:22:33,269 INFO [train.py:1028] (0/2) Epoch 29, batch 3200, loss[loss=0.1701, simple_loss=0.2327, pruned_loss=0.0537, over 13165.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.233, pruned_loss=0.05844, over 2580764.51 frames. ], batch size: 55, lr: 1.99e-03, grad_scale: 32.0
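
NOTE: each WARNING from optim.py reports the recent distribution of gradient norms (min, 25%, median, 75%, max) and the clipping threshold in force; in the entry above, threshold=4.903e+02 is twice the 2.452e+02 median, matching Clipping_scale=2.0, and the same 2x-median relation holds in the other warnings. A sketch of that kind of adaptive clipping, assuming a sliding window of recent norms (the class and window size are invented for illustration; the real optim.py logic may differ):

    from collections import deque
    import torch

    class QuartileGradClipper:
        def __init__(self, clipping_scale=2.0, window=1000):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # recent gradient norms

        def clip_(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params]))
            self.norms.append(norm.item())
            hist = torch.tensor(list(self.norms))
            quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * quartiles[2].item()  # scale x median
            clipped = norm.item() > threshold
            if clipped:
                for p in params:
                    p.grad.mul_(threshold / norm)
            return quartiles, threshold, clipped  # the quantities the WARNING prints
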
2024-06-22 04:22:36,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=525213.3333333334, ans=0.125 2024-06-22 04:22:36,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525213.3333333334, ans=0.1 2024-06-22 04:22:38,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525213.3333333334, ans=0.1 2024-06-22 04:22:39,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=525231.6666666666, ans=0.125 2024-06-22 04:22:40,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=525231.6666666666, ans=0.125 2024-06-22 04:22:41,053 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=525231.6666666666, ans=0.125 2024-06-22 04:22:49,809 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.234e+02 2.367e+02 2.507e+02 2.902e+02, threshold=4.735e+02, percent-clipped=0.0 2024-06-22 04:22:50,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=525250.0, ans=0.0 2024-06-22 04:22:57,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=525268.3333333334, ans=0.125 2024-06-22 04:23:09,061 INFO [train.py:1028] (0/2) Epoch 29, batch 3250, loss[loss=0.169, simple_loss=0.2349, pruned_loss=0.05152, over 13223.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2329, pruned_loss=0.05862, over 2585517.34 frames. ], batch size: 72, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:23:09,126 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525305.0, ans=0.1 2024-06-22 04:23:09,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=525305.0, ans=0.125 2024-06-22 04:23:28,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=525341.6666666666, ans=0.0 2024-06-22 04:23:32,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=525360.0, ans=0.2 2024-06-22 04:23:42,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=525378.3333333334, ans=0.125 2024-06-22 04:23:44,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=525378.3333333334, ans=0.2 2024-06-22 04:23:45,882 INFO [train.py:1028] (0/2) Epoch 29, batch 3300, loss[loss=0.1549, simple_loss=0.2109, pruned_loss=0.04943, over 12695.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2324, pruned_loss=0.05816, over 2582261.91 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 32.0
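
NOTE: the ScheduledFloat entries print the current value (ans=...) of per-module hyperparameters such as skip rates, dropout probabilities, and balancer limits as a function of the global batch count; by this point in training most skip rates read 0.0. One plausible reading is a piecewise-linear schedule over batch_count, sketched below (this class and its breakpoints are stand-ins, not the scaling.py implementation):

    class ScheduledFloat:
        """A float that is a piecewise-linear function of the batch count."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0, 0.1), (20000, 0.0)
            self.points = sorted(points)

        def value(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # A skip rate that anneals to zero early in training stays 0.0 at
    # batch_count ~ 5.25e5, consistent with the "ans=0.0" entries above:
    ff2_skip_rate = ScheduledFloat((0.0, 0.1), (20000.0, 0.0))
    print(ff2_skip_rate.value(525250.0))  # -> 0.0
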
2024-06-22 04:23:51,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=525396.6666666666, ans=0.0 2024-06-22 04:23:51,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=525415.0, ans=0.125 2024-06-22 04:24:00,184 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 2.296e+02 2.474e+02 2.646e+02 3.545e+02, threshold=4.948e+02, percent-clipped=0.0 2024-06-22 04:24:00,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=525433.3333333334, ans=0.09899494936611666 2024-06-22 04:24:04,719 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:24:08,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=525451.6666666666, ans=0.125 2024-06-22 04:24:18,969 INFO [train.py:1028] (0/2) Epoch 29, batch 3350, loss[loss=0.1826, simple_loss=0.2316, pruned_loss=0.06682, over 12988.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.2315, pruned_loss=0.05816, over 2576445.24 frames. ], batch size: 158, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:24:29,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=525506.6666666666, ans=0.0 2024-06-22 04:24:34,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=525525.0, ans=0.05 2024-06-22 04:24:42,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=525543.3333333334, ans=0.5 2024-06-22 04:24:54,979 INFO [train.py:1028] (0/2) Epoch 29, batch 3400, loss[loss=0.1898, simple_loss=0.2495, pruned_loss=0.06511, over 12771.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.2317, pruned_loss=0.05877, over 2575599.10 frames. ], batch size: 22, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:24:59,680 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=525580.0, ans=0.025 2024-06-22 04:25:05,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=525598.3333333334, ans=0.0 2024-06-22 04:25:08,940 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.331e+02 2.488e+02 2.682e+02 3.920e+02, threshold=4.975e+02, percent-clipped=0.0 2024-06-22 04:25:31,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=525671.6666666666, ans=0.125 2024-06-22 04:25:31,612 INFO [train.py:1028] (0/2) Epoch 29, batch 3450, loss[loss=0.1727, simple_loss=0.2316, pruned_loss=0.0569, over 12789.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2307, pruned_loss=0.05832, over 2577537.03 frames.
], batch size: 176, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:25:33,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525671.6666666666, ans=0.1 2024-06-22 04:25:37,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=525690.0, ans=0.2 2024-06-22 04:25:46,187 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=525708.3333333334, ans=0.0 2024-06-22 04:26:04,641 INFO [train.py:1028] (0/2) Epoch 29, batch 3500, loss[loss=0.1707, simple_loss=0.2296, pruned_loss=0.05588, over 12936.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2305, pruned_loss=0.05801, over 2576027.11 frames. ], batch size: 33, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:26:09,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=525763.3333333334, ans=0.2 2024-06-22 04:26:09,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=525763.3333333334, ans=0.0 2024-06-22 04:26:15,823 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.45 vs. limit=22.5 2024-06-22 04:26:18,798 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.237e+02 2.354e+02 2.539e+02 3.070e+02, threshold=4.709e+02, percent-clipped=0.0 2024-06-22 04:26:23,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525800.0, ans=0.1 2024-06-22 04:26:33,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=525836.6666666666, ans=0.125 2024-06-22 04:26:36,511 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=525836.6666666666, ans=0.2 2024-06-22 04:26:40,999 INFO [train.py:1028] (0/2) Epoch 29, batch 3550, loss[loss=0.1527, simple_loss=0.2092, pruned_loss=0.0481, over 13124.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.23, pruned_loss=0.05775, over 2577189.54 frames. ], batch size: 95, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:26:48,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=525873.3333333334, ans=0.0 2024-06-22 04:26:49,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=525873.3333333334, ans=0.0 2024-06-22 04:27:09,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=525928.3333333334, ans=0.0 2024-06-22 04:27:10,397 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2024-06-22 04:27:10,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2024-06-22 04:27:13,304 INFO [train.py:1028] (0/2) Epoch 29, batch 3600, loss[loss=0.1897, simple_loss=0.2407, pruned_loss=0.06932, over 13306.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.23, pruned_loss=0.05793, over 2580863.75 frames. 
], batch size: 49, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:27:15,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=525946.6666666666, ans=0.025 2024-06-22 04:27:21,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=525946.6666666666, ans=0.125 2024-06-22 04:27:22,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=525965.0, ans=0.05 2024-06-22 04:27:30,480 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.342e+02 2.497e+02 2.754e+02 4.411e+02, threshold=4.994e+02, percent-clipped=0.0 2024-06-22 04:27:32,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=525983.3333333334, ans=0.0 2024-06-22 04:27:48,349 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.40 vs. limit=15.0 2024-06-22 04:27:49,284 INFO [train.py:1028] (0/2) Epoch 29, batch 3650, loss[loss=0.168, simple_loss=0.2196, pruned_loss=0.05822, over 13086.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2298, pruned_loss=0.05768, over 2580293.16 frames. ], batch size: 103, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:27:50,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=526038.3333333334, ans=0.5 2024-06-22 04:27:51,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=526038.3333333334, ans=0.0 2024-06-22 04:28:03,366 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.94 vs. limit=15.0 2024-06-22 04:28:05,080 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=526075.0, ans=0.1 2024-06-22 04:28:06,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=526075.0, ans=0.025 2024-06-22 04:28:08,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=526093.3333333334, ans=0.2 2024-06-22 04:28:22,106 INFO [train.py:1028] (0/2) Epoch 29, batch 3700, loss[loss=0.1903, simple_loss=0.2465, pruned_loss=0.06706, over 13311.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2291, pruned_loss=0.05759, over 2585698.81 frames. ], batch size: 72, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:28:27,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. 
limit=15.0 2024-06-22 04:28:36,040 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.281e+02 2.384e+02 2.618e+02 3.123e+02, threshold=4.768e+02, percent-clipped=0.0 2024-06-22 04:28:48,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=526185.0, ans=0.125 2024-06-22 04:28:48,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=526185.0, ans=0.02 2024-06-22 04:28:51,477 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=526203.3333333334, ans=0.0 2024-06-22 04:28:53,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.50 vs. limit=15.0 2024-06-22 04:28:57,541 INFO [train.py:1028] (0/2) Epoch 29, batch 3750, loss[loss=0.1628, simple_loss=0.2244, pruned_loss=0.05061, over 12488.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2285, pruned_loss=0.05709, over 2587263.59 frames. ], batch size: 22, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:28:59,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.85 vs. limit=8.0 2024-06-22 04:28:59,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.67 vs. limit=22.5 2024-06-22 04:29:07,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=526240.0, ans=0.125 2024-06-22 04:29:08,615 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.02 vs. limit=22.5 2024-06-22 04:29:11,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.34 vs. limit=15.0 2024-06-22 04:29:18,709 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=526276.6666666666, ans=0.125 2024-06-22 04:29:18,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=526276.6666666666, ans=0.125 2024-06-22 04:29:26,694 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2024-06-22 04:29:32,856 INFO [train.py:1028] (0/2) Epoch 29, batch 3800, loss[loss=0.1749, simple_loss=0.2368, pruned_loss=0.05652, over 13234.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2285, pruned_loss=0.05712, over 2584761.93 frames. 
], batch size: 83, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:29:36,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=526313.3333333334, ans=0.0 2024-06-22 04:29:38,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=526313.3333333334, ans=0.125 2024-06-22 04:29:40,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=526331.6666666666, ans=0.0 2024-06-22 04:29:44,284 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.54 vs. limit=22.5 2024-06-22 04:29:46,330 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.276e+02 2.400e+02 2.630e+02 3.441e+02, threshold=4.801e+02, percent-clipped=0.0 2024-06-22 04:29:49,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=526350.0, ans=10.0 2024-06-22 04:29:52,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=526368.3333333334, ans=0.1 2024-06-22 04:29:56,622 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2024-06-22 04:29:59,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526386.6666666666, ans=0.1 2024-06-22 04:30:05,613 INFO [train.py:1028] (0/2) Epoch 29, batch 3850, loss[loss=0.1679, simple_loss=0.2195, pruned_loss=0.05817, over 13028.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2276, pruned_loss=0.05665, over 2584385.69 frames. ], batch size: 144, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:30:07,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=526405.0, ans=0.125 2024-06-22 04:30:10,347 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=526405.0, ans=0.125 2024-06-22 04:30:13,295 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.75 vs. limit=15.0 2024-06-22 04:30:16,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=526423.3333333334, ans=0.125 2024-06-22 04:30:33,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=526478.3333333334, ans=0.125 2024-06-22 04:30:37,640 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=526478.3333333334, ans=0.1 2024-06-22 04:30:39,076 INFO [train.py:1028] (0/2) Epoch 29, batch 3900, loss[loss=0.1623, simple_loss=0.2174, pruned_loss=0.05354, over 13193.00 frames. ], tot_loss[loss=0.1701, simple_loss=0.2271, pruned_loss=0.05657, over 2587362.14 frames. 
], batch size: 83, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:30:44,725 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=526496.6666666666, ans=0.0 2024-06-22 04:30:46,297 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2024-06-22 04:30:46,800 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=526515.0, ans=0.125 2024-06-22 04:30:56,312 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.223e+02 2.363e+02 2.548e+02 3.559e+02, threshold=4.726e+02, percent-clipped=0.0 2024-06-22 04:31:00,545 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=526533.3333333334, ans=0.125 2024-06-22 04:31:14,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.72 vs. limit=15.0 2024-06-22 04:31:16,329 INFO [train.py:1028] (0/2) Epoch 29, batch 3950, loss[loss=0.1584, simple_loss=0.2106, pruned_loss=0.05308, over 13148.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2261, pruned_loss=0.05584, over 2588279.86 frames. ], batch size: 132, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:31:20,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=526588.3333333334, ans=0.0 2024-06-22 04:31:36,362 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=526625.0, ans=0.0 2024-06-22 04:31:42,710 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=526643.3333333334, ans=0.2 2024-06-22 04:31:45,937 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=526661.6666666666, ans=0.125 2024-06-22 04:31:46,682 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=526661.6666666666, ans=0.0 2024-06-22 04:31:53,051 INFO [train.py:1028] (0/2) Epoch 29, batch 4000, loss[loss=0.1621, simple_loss=0.2166, pruned_loss=0.05379, over 12904.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2262, pruned_loss=0.05624, over 2582938.75 frames. 
], batch size: 39, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:31:59,754 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526698.3333333334, ans=0.1 2024-06-22 04:32:03,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=526698.3333333334, ans=0.125 2024-06-22 04:32:06,885 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.311e+02 2.432e+02 2.634e+02 3.544e+02, threshold=4.864e+02, percent-clipped=0.0 2024-06-22 04:32:10,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=526716.6666666666, ans=0.0 2024-06-22 04:32:21,998 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=526753.3333333334, ans=0.0 2024-06-22 04:32:25,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=526753.3333333334, ans=15.0 2024-06-22 04:32:26,351 INFO [train.py:1028] (0/2) Epoch 29, batch 4050, loss[loss=0.1774, simple_loss=0.2275, pruned_loss=0.06369, over 10885.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2263, pruned_loss=0.05644, over 2580619.63 frames. ], batch size: 303, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:32:30,590 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=526771.6666666666, ans=0.0 2024-06-22 04:32:46,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.06 vs. limit=15.0 2024-06-22 04:32:48,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=526826.6666666666, ans=0.2 2024-06-22 04:32:56,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=526845.0, ans=0.125 2024-06-22 04:33:03,536 INFO [train.py:1028] (0/2) Epoch 29, batch 4100, loss[loss=0.1775, simple_loss=0.2255, pruned_loss=0.0648, over 13067.00 frames. ], tot_loss[loss=0.1694, simple_loss=0.226, pruned_loss=0.05646, over 2575930.09 frames. ], batch size: 102, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:33:09,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=526863.3333333334, ans=0.125 2024-06-22 04:33:12,700 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=526881.6666666666, ans=0.2 2024-06-22 04:33:17,967 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.284e+02 2.397e+02 2.647e+02 3.168e+02, threshold=4.794e+02, percent-clipped=0.0 2024-06-22 04:33:18,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=526900.0, ans=0.0 2024-06-22 04:33:21,376 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526900.0, ans=0.1 2024-06-22 04:33:29,667 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.58 vs. 
limit=15.0 2024-06-22 04:33:36,460 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.63 vs. limit=15.0 2024-06-22 04:33:36,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=526936.6666666666, ans=0.0 2024-06-22 04:33:42,037 INFO [train.py:1028] (0/2) Epoch 29, batch 4150, loss[loss=0.1719, simple_loss=0.2305, pruned_loss=0.0567, over 13193.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2263, pruned_loss=0.05649, over 2574073.57 frames. ], batch size: 55, lr: 1.99e-03, grad_scale: 32.0 2024-06-22 04:33:42,218 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526955.0, ans=0.1 2024-06-22 04:34:04,400 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.36 vs. limit=15.0 2024-06-22 04:34:07,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.10 vs. limit=22.5 2024-06-22 04:34:15,419 INFO [train.py:1028] (0/2) Epoch 29, batch 4200, loss[loss=0.1604, simple_loss=0.2143, pruned_loss=0.05319, over 13106.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2262, pruned_loss=0.05651, over 2576931.55 frames. ], batch size: 102, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:34:29,334 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.899e+02 2.227e+02 2.373e+02 2.534e+02 3.280e+02, threshold=4.745e+02, percent-clipped=0.0 2024-06-22 04:34:31,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=527083.3333333334, ans=0.125 2024-06-22 04:34:36,548 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=527101.6666666666, ans=0.125 2024-06-22 04:34:48,839 INFO [train.py:1028] (0/2) Epoch 29, batch 4250, loss[loss=0.1551, simple_loss=0.2146, pruned_loss=0.04779, over 13305.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2259, pruned_loss=0.0563, over 2578497.39 frames. ], batch size: 46, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:34:49,681 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:34:52,091 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:34:56,561 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=527156.6666666666, ans=0.0 2024-06-22 04:35:14,329 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.60 vs. limit=15.0 2024-06-22 04:35:18,665 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=527211.6666666666, ans=0.125 2024-06-22 04:35:24,618 INFO [train.py:1028] (0/2) Epoch 29, batch 4300, loss[loss=0.1755, simple_loss=0.2285, pruned_loss=0.06126, over 13194.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2254, pruned_loss=0.05608, over 2579528.99 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 64.0
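
NOTE: grad_scale, which reads like a dynamic fp16 loss-scaling factor, doubles from 32.0 to 64.0 at batch 4200 above. That matches the usual dynamic-loss-scaling behaviour: grow the scale after a long run of overflow-free steps and back off when an overflow is seen. PyTorch's torch.cuda.amp.GradScaler exposes exactly these knobs; the numbers below are illustrative, not necessarily the values this run used:

    import torch

    # Double the scale every 2000 overflow-free optimizer steps, halve on overflow.
    scaler = torch.cuda.amp.GradScaler(
        init_scale=32.0,        # matches the grad_scale seen earlier in the log
        growth_factor=2.0,      # 32.0 -> 64.0, as at batch 4200 above
        backoff_factor=0.5,
        growth_interval=2000,   # illustrative; the run's interval is not logged
    )
    # Typical step: scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()
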
2024-06-22 04:35:27,914 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:35:33,377 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:35:41,583 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. limit=15.0 2024-06-22 04:35:41,782 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.249e+02 2.335e+02 2.591e+02 4.091e+02, threshold=4.670e+02, percent-clipped=0.0 2024-06-22 04:35:43,641 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=12.0 2024-06-22 04:35:44,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=527266.6666666666, ans=0.125 2024-06-22 04:35:50,827 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=527285.0, ans=0.125 2024-06-22 04:36:00,354 INFO [train.py:1028] (0/2) Epoch 29, batch 4350, loss[loss=0.1629, simple_loss=0.2247, pruned_loss=0.05052, over 13166.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.225, pruned_loss=0.05583, over 2584421.44 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:36:02,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.25 vs. limit=15.0 2024-06-22 04:36:08,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=527340.0, ans=0.09899494936611666 2024-06-22 04:36:33,885 INFO [train.py:1028] (0/2) Epoch 29, batch 4400, loss[loss=0.1591, simple_loss=0.2132, pruned_loss=0.0525, over 13218.00 frames. ], tot_loss[loss=0.1684, simple_loss=0.2249, pruned_loss=0.05591, over 2585373.29 frames. ], batch size: 83, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:36:35,002 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2024-06-22 04:36:35,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=527413.3333333334, ans=0.1 2024-06-22 04:36:39,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=527431.6666666666, ans=0.125 2024-06-22 04:36:47,249 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.303e+02 2.449e+02 2.649e+02 3.365e+02, threshold=4.898e+02, percent-clipped=0.0 2024-06-22 04:36:50,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.58 vs. limit=22.5 2024-06-22 04:36:52,156 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=527450.0, ans=0.0 2024-06-22 04:36:54,010 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=527468.3333333334, ans=0.05 2024-06-22 04:37:11,776 INFO [train.py:1028] (0/2) Epoch 29, batch 4450, loss[loss=0.1799, simple_loss=0.2371, pruned_loss=0.06132, over 12844.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2253, pruned_loss=0.0563, over 2580585.77 frames. ], batch size: 33, lr: 1.99e-03, grad_scale: 64.0
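
NOTE: each Whitening entry compares a per-module "metric" against a limit (e.g. metric=11.48 vs. limit=15.0 above); the metric grows as the channel covariance of a module's output drifts away from a multiple of the identity, and the check only fires a log line when the limit is approached or exceeded. A generic whiteness measure with that flavour is n*tr(C^2)/tr(C)^2, which equals 1.0 for perfectly white features; this is an illustration of the idea, not the exact formula in scaling.py:

    import torch

    def whitening_metric(x, eps=1e-8):
        # x: (num_frames, num_channels). Returns 1.0 when the channel
        # covariance C is proportional to the identity, larger otherwise.
        x = x - x.mean(dim=0)
        cov = (x.t() @ x) / x.shape[0]
        n = cov.shape[0]
        return (n * (cov * cov).sum() / (cov.diag().sum() ** 2 + eps)).item()

    x = torch.randn(4000, 256)
    print(whitening_metric(x))                                 # ~1.0 for white noise
    print(whitening_metric(x * torch.linspace(0.1, 10, 256)))  # badly scaled -> much larger
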
2024-06-22 04:37:17,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=527505.0, ans=0.125 2024-06-22 04:37:20,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=527523.3333333334, ans=15.0 2024-06-22 04:37:20,139 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2024-06-22 04:37:25,543 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=527541.6666666666, ans=0.125 2024-06-22 04:37:25,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=527541.6666666666, ans=0.0 2024-06-22 04:37:42,148 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2024-06-22 04:37:42,580 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=527578.3333333334, ans=0.1 2024-06-22 04:37:47,947 INFO [train.py:1028] (0/2) Epoch 29, batch 4500, loss[loss=0.1621, simple_loss=0.2159, pruned_loss=0.05413, over 13237.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.225, pruned_loss=0.05617, over 2585349.05 frames. ], batch size: 89, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:38:02,141 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 2.236e+02 2.334e+02 2.463e+02 3.396e+02, threshold=4.668e+02, percent-clipped=0.0 2024-06-22 04:38:12,949 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.91 vs. limit=12.0 2024-06-22 04:38:13,210 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=527651.6666666666, ans=0.2 2024-06-22 04:38:21,759 INFO [train.py:1028] (0/2) Epoch 29, batch 4550, loss[loss=0.1578, simple_loss=0.2117, pruned_loss=0.05194, over 13268.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2248, pruned_loss=0.05624, over 2588282.27 frames. ], batch size: 52, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:38:23,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=527688.3333333334, ans=0.125 2024-06-22 04:38:25,591 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=527688.3333333334, ans=0.95 2024-06-22 04:38:28,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=527706.6666666666, ans=0.125 2024-06-22 04:38:36,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=527725.0, ans=0.0 2024-06-22 04:38:40,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=527743.3333333334, ans=0.125 2024-06-22 04:38:47,974 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.07 vs. limit=10.0
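
NOTE: each train.py entry prints two losses: loss[...] for the current batch alone and tot_loss[...] as a running figure "over" roughly 2.58 million frames. A frame-weighted moving average with exponential forgetting reproduces that shape; with ~13k frames per batch, a decay near 0.995 settles around 13000/0.005, about 2.6e6 effective frames, close to the counts above. Both the mechanism and the decay value are guesses for illustration, not the train.py implementation:

    class RunningLoss:
        def __init__(self, decay=0.995):
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of per-frame losses
            self.frames = 0.0     # decayed count of frames

        def update(self, batch_loss_per_frame, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss_per_frame * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self):
            # the "tot_loss[loss=..., over N frames]" figure
            return self.loss_sum / max(self.frames, 1.0)
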
2024-06-22 04:38:52,917 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=527761.6666666666, ans=0.0 2024-06-22 04:38:54,143 INFO [train.py:1028] (0/2) Epoch 29, batch 4600, loss[loss=0.1854, simple_loss=0.2338, pruned_loss=0.06855, over 12536.00 frames. ], tot_loss[loss=0.1684, simple_loss=0.2247, pruned_loss=0.05603, over 2583424.57 frames. ], batch size: 202, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:38:57,428 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=527780.0, ans=0.025 2024-06-22 04:39:10,371 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=527816.6666666666, ans=0.0 2024-06-22 04:39:10,742 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.306e+02 2.444e+02 2.636e+02 3.242e+02, threshold=4.889e+02, percent-clipped=0.0 2024-06-22 04:39:13,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=527816.6666666666, ans=0.0 2024-06-22 04:39:25,579 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=527853.3333333334, ans=0.2 2024-06-22 04:39:27,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=527853.3333333334, ans=0.1 2024-06-22 04:39:29,260 INFO [train.py:1028] (0/2) Epoch 29, batch 4650, loss[loss=0.1678, simple_loss=0.214, pruned_loss=0.06077, over 13119.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2247, pruned_loss=0.05623, over 2586746.54 frames. ], batch size: 132, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:39:33,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=527871.6666666666, ans=0.125 2024-06-22 04:40:05,372 INFO [train.py:1028] (0/2) Epoch 29, batch 4700, loss[loss=0.1587, simple_loss=0.2237, pruned_loss=0.04686, over 12863.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2243, pruned_loss=0.05601, over 2583210.87 frames. ], batch size: 26, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:40:18,177 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-288000.pt 2024-06-22 04:40:24,279 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.254e+02 2.407e+02 2.567e+02 3.679e+02, threshold=4.813e+02, percent-clipped=0.0 2024-06-22 04:40:32,230 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=528018.3333333334, ans=0.0 2024-06-22 04:40:37,316 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=528036.6666666666, ans=0.04949747468305833 2024-06-22 04:40:37,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2024-06-22 04:40:42,820 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=528055.0, ans=0.0 2024-06-22 04:40:43,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=528055.0, ans=15.0 2024-06-22 04:40:43,276 INFO [train.py:1028] (0/2) Epoch 29, batch 4750, loss[loss=0.187, simple_loss=0.2323, pruned_loss=0.07083, over 12525.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2239, pruned_loss=0.05602, over 2579435.85 frames. ], batch size: 202, lr: 1.99e-03, grad_scale: 64.0
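
NOTE: checkpoint.py saves a batch-indexed checkpoint (zipformer/exp/checkpoint-288000.pt) in the middle of the epoch above, separate from any end-of-epoch checkpoints. A minimal version of such a save is sketched below; the field names are illustrative, and the real checkpoint very likely carries more state (e.g. the sampler and the grad scaler):

    import torch

    def save_checkpoint(filename, model, optimizer, scheduler, epoch, batch_idx):
        # Bundle everything needed to resume mid-epoch, keyed by the
        # global batch index, like checkpoint-288000.pt in the entry above.
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
                "epoch": epoch,
                "batch_idx": batch_idx,
            },
            filename,
        )
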
2024-06-22 04:40:45,128 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.80 vs. limit=15.0 2024-06-22 04:40:48,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=528055.0, ans=0.025 2024-06-22 04:40:48,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=528055.0, ans=0.0 2024-06-22 04:40:51,558 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.33 vs. limit=15.0 2024-06-22 04:41:03,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528091.6666666666, ans=0.125 2024-06-22 04:41:09,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=528110.0, ans=0.025 2024-06-22 04:41:20,100 INFO [train.py:1028] (0/2) Epoch 29, batch 4800, loss[loss=0.1474, simple_loss=0.2005, pruned_loss=0.04712, over 13290.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2239, pruned_loss=0.05588, over 2576308.83 frames. ], batch size: 63, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:41:27,034 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=528165.0, ans=0.125 2024-06-22 04:41:32,472 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=528165.0, ans=0.04949747468305833 2024-06-22 04:41:34,165 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.194e+02 2.371e+02 2.592e+02 3.538e+02, threshold=4.742e+02, percent-clipped=0.0 2024-06-22 04:41:43,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.47 vs. limit=22.5 2024-06-22 04:41:51,494 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=15.0 2024-06-22 04:41:55,989 INFO [train.py:1028] (0/2) Epoch 29, batch 4850, loss[loss=0.1495, simple_loss=0.2024, pruned_loss=0.04827, over 13212.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.2235, pruned_loss=0.05558, over 2573585.84 frames. ], batch size: 89, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:41:57,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=528238.3333333334, ans=0.0 2024-06-22 04:42:00,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=528238.3333333334, ans=0.125 2024-06-22 04:42:01,444 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528238.3333333334, ans=0.125 2024-06-22 04:42:09,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.09 vs.
limit=22.5 2024-06-22 04:42:15,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=528293.3333333334, ans=0.125 2024-06-22 04:42:29,384 INFO [train.py:1028] (0/2) Epoch 29, batch 4900, loss[loss=0.1465, simple_loss=0.2103, pruned_loss=0.04135, over 13235.00 frames. ], tot_loss[loss=0.1679, simple_loss=0.2238, pruned_loss=0.056, over 2574941.78 frames. ], batch size: 59, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:42:36,168 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=528348.3333333334, ans=0.125 2024-06-22 04:42:43,111 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 2.205e+02 2.336e+02 2.528e+02 3.228e+02, threshold=4.673e+02, percent-clipped=0.0 2024-06-22 04:42:43,976 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=528366.6666666666, ans=0.125 2024-06-22 04:42:55,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=528403.3333333334, ans=0.0 2024-06-22 04:42:58,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=528403.3333333334, ans=0.0 2024-06-22 04:43:05,563 INFO [train.py:1028] (0/2) Epoch 29, batch 4950, loss[loss=0.1766, simple_loss=0.2207, pruned_loss=0.06626, over 11003.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2238, pruned_loss=0.05624, over 2569441.39 frames. ], batch size: 304, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:43:05,611 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528421.6666666666, ans=0.1 2024-06-22 04:43:06,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528421.6666666666, ans=0.1 2024-06-22 04:43:14,161 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528440.0, ans=0.125 2024-06-22 04:43:25,398 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.42 vs. limit=12.0 2024-06-22 04:43:33,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=528476.6666666666, ans=0.0 2024-06-22 04:43:36,078 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.0 2024-06-22 04:43:39,964 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2024-06-22 04:43:41,610 INFO [train.py:1028] (0/2) Epoch 29, batch 5000, loss[loss=0.1639, simple_loss=0.2206, pruned_loss=0.05358, over 13177.00 frames. ], tot_loss[loss=0.1679, simple_loss=0.2238, pruned_loss=0.05595, over 2574041.38 frames. 
], batch size: 95, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:43:43,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=528513.3333333334, ans=0.125 2024-06-22 04:43:49,022 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=528531.6666666666, ans=0.0 2024-06-22 04:43:51,571 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. limit=15.0 2024-06-22 04:43:54,008 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=528531.6666666666, ans=0.0 2024-06-22 04:43:55,883 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.198e+02 2.354e+02 2.504e+02 3.990e+02, threshold=4.708e+02, percent-clipped=0.0 2024-06-22 04:44:01,268 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528568.3333333334, ans=0.1 2024-06-22 04:44:04,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=528568.3333333334, ans=0.125 2024-06-22 04:44:06,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528568.3333333334, ans=0.1 2024-06-22 04:44:08,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=528586.6666666666, ans=0.025 2024-06-22 04:44:11,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=528586.6666666666, ans=0.1 2024-06-22 04:44:11,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=528586.6666666666, ans=22.5 2024-06-22 04:44:15,275 INFO [train.py:1028] (0/2) Epoch 29, batch 5050, loss[loss=0.1628, simple_loss=0.2214, pruned_loss=0.05214, over 12998.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2241, pruned_loss=0.05609, over 2574328.88 frames. ], batch size: 36, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:44:17,408 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=528605.0, ans=0.2 2024-06-22 04:44:28,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528641.6666666666, ans=0.1 2024-06-22 04:44:28,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=528641.6666666666, ans=0.2 2024-06-22 04:44:48,045 INFO [train.py:1028] (0/2) Epoch 29, batch 5100, loss[loss=0.1761, simple_loss=0.2441, pruned_loss=0.05409, over 12912.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.2243, pruned_loss=0.05621, over 2571083.57 frames. ], batch size: 39, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:44:54,714 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528715.0, ans=0.1 2024-06-22 04:45:03,710 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.11 vs. 
limit=15.0 2024-06-22 04:45:05,186 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.226e+02 2.357e+02 2.573e+02 3.326e+02, threshold=4.714e+02, percent-clipped=0.0 2024-06-22 04:45:07,337 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:45:10,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=528751.6666666666, ans=0.125 2024-06-22 04:45:21,492 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=528770.0, ans=0.0 2024-06-22 04:45:24,000 INFO [train.py:1028] (0/2) Epoch 29, batch 5150, loss[loss=0.158, simple_loss=0.2084, pruned_loss=0.05378, over 13125.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2236, pruned_loss=0.05621, over 2573104.23 frames. ], batch size: 132, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:45:39,661 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2024-06-22 04:45:40,770 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=528825.0, ans=0.0 2024-06-22 04:45:42,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=528825.0, ans=22.5 2024-06-22 04:45:45,895 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=528825.0, ans=0.025 2024-06-22 04:45:49,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=528843.3333333334, ans=0.125 2024-06-22 04:45:58,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528861.6666666666, ans=0.1 2024-06-22 04:46:00,746 INFO [train.py:1028] (0/2) Epoch 29, batch 5200, loss[loss=0.1722, simple_loss=0.2255, pruned_loss=0.05943, over 13136.00 frames. ], tot_loss[loss=0.1677, simple_loss=0.2235, pruned_loss=0.05598, over 2575139.73 frames. ], batch size: 95, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:46:01,143 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.84 vs. 
limit=15.0 2024-06-22 04:46:12,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=528898.3333333334, ans=0.125 2024-06-22 04:46:14,591 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.229e+02 2.356e+02 2.469e+02 3.093e+02, threshold=4.712e+02, percent-clipped=0.0 2024-06-22 04:46:23,542 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=528935.0, ans=0.125 2024-06-22 04:46:29,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528953.3333333334, ans=0.125 2024-06-22 04:46:29,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=528953.3333333334, ans=0.025 2024-06-22 04:46:30,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten.whitening_limit, batch_count=528953.3333333334, ans=15.0 2024-06-22 04:46:34,637 INFO [train.py:1028] (0/2) Epoch 29, batch 5250, loss[loss=0.1572, simple_loss=0.2093, pruned_loss=0.05255, over 13257.00 frames. ], tot_loss[loss=0.1675, simple_loss=0.2234, pruned_loss=0.0558, over 2571112.53 frames. ], batch size: 52, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:46:41,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=528990.0, ans=0.025 2024-06-22 04:46:53,125 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=529008.3333333334, ans=0.125 2024-06-22 04:47:05,096 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=529026.6666666666, ans=0.125 2024-06-22 04:47:06,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529045.0, ans=0.1 2024-06-22 04:47:09,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=529045.0, ans=0.2 2024-06-22 04:47:13,439 INFO [train.py:1028] (0/2) Epoch 29, batch 5300, loss[loss=0.1494, simple_loss=0.2018, pruned_loss=0.04848, over 13077.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2228, pruned_loss=0.05532, over 2568083.65 frames. ], batch size: 144, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:47:15,242 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2024-06-22 04:47:27,569 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.263e+02 2.362e+02 2.545e+02 3.339e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-22 04:47:40,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=529118.3333333334, ans=0.125 2024-06-22 04:47:44,067 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=529136.6666666666, ans=0.1 2024-06-22 04:47:50,716 INFO [train.py:1028] (0/2) Epoch 29, batch 5350, loss[loss=0.1683, simple_loss=0.231, pruned_loss=0.05276, over 11372.00 frames. ], tot_loss[loss=0.1674, simple_loss=0.2233, pruned_loss=0.05576, over 2574883.74 frames. 
], batch size: 16, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:47:56,109 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=529155.0, ans=0.125 2024-06-22 04:47:58,535 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2024-06-22 04:48:03,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529191.6666666666, ans=0.125 2024-06-22 04:48:14,744 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.76 vs. limit=22.5 2024-06-22 04:48:18,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=529228.3333333334, ans=0.125 2024-06-22 04:48:23,125 INFO [train.py:1028] (0/2) Epoch 29, batch 5400, loss[loss=0.1863, simple_loss=0.23, pruned_loss=0.07128, over 12206.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2235, pruned_loss=0.05621, over 2567897.17 frames. ], batch size: 240, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:48:26,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=529246.6666666666, ans=0.125 2024-06-22 04:48:36,858 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.242e+02 2.378e+02 2.630e+02 3.499e+02, threshold=4.756e+02, percent-clipped=0.0 2024-06-22 04:48:38,262 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=529283.3333333334, ans=0.125 2024-06-22 04:48:40,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.48 vs. limit=15.0 2024-06-22 04:48:41,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=529301.6666666666, ans=0.125 2024-06-22 04:48:46,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=529301.6666666666, ans=0.025 2024-06-22 04:48:47,246 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=529301.6666666666, ans=0.125 2024-06-22 04:48:59,387 INFO [train.py:1028] (0/2) Epoch 29, batch 5450, loss[loss=0.1621, simple_loss=0.2157, pruned_loss=0.05421, over 12467.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2235, pruned_loss=0.05603, over 2571268.50 frames. ], batch size: 25, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:49:07,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529356.6666666666, ans=0.125 2024-06-22 04:49:07,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.37 vs. limit=10.0 2024-06-22 04:49:12,715 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2024-06-22 04:49:15,405 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.58 vs. 
limit=22.5 2024-06-22 04:49:35,237 INFO [train.py:1028] (0/2) Epoch 29, batch 5500, loss[loss=0.1987, simple_loss=0.2417, pruned_loss=0.0778, over 12069.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2236, pruned_loss=0.05618, over 2563586.97 frames. ], batch size: 240, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:49:42,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=529448.3333333334, ans=0.0 2024-06-22 04:49:48,861 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.223e+02 2.348e+02 2.538e+02 2.951e+02, threshold=4.697e+02, percent-clipped=0.0 2024-06-22 04:50:06,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2024-06-22 04:50:08,234 INFO [train.py:1028] (0/2) Epoch 29, batch 5550, loss[loss=0.1747, simple_loss=0.2335, pruned_loss=0.05797, over 13198.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2237, pruned_loss=0.05617, over 2567168.35 frames. ], batch size: 43, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:50:13,745 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.87 vs. limit=15.0 2024-06-22 04:50:22,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=529558.3333333334, ans=0.125 2024-06-22 04:50:25,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=529558.3333333334, ans=0.125 2024-06-22 04:50:33,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529595.0, ans=0.1 2024-06-22 04:50:39,788 INFO [train.py:1028] (0/2) Epoch 29, batch 5600, loss[loss=0.1568, simple_loss=0.2128, pruned_loss=0.05036, over 13283.00 frames. ], tot_loss[loss=0.1674, simple_loss=0.223, pruned_loss=0.05587, over 2570118.85 frames. ], batch size: 89, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:50:44,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529613.3333333334, ans=0.1 2024-06-22 04:50:53,848 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.287e+02 2.446e+02 2.598e+02 3.545e+02, threshold=4.892e+02, percent-clipped=0.0 2024-06-22 04:51:00,311 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. limit=6.0 2024-06-22 04:51:06,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=529668.3333333334, ans=0.0 2024-06-22 04:51:15,966 INFO [train.py:1028] (0/2) Epoch 29, batch 5650, loss[loss=0.1724, simple_loss=0.2242, pruned_loss=0.06029, over 12590.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2229, pruned_loss=0.05556, over 2574723.10 frames. ], batch size: 202, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:51:16,386 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.84 vs. 
limit=22.5 2024-06-22 04:51:24,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=529723.3333333334, ans=0.1 2024-06-22 04:51:32,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=529741.6666666666, ans=0.0 2024-06-22 04:51:33,482 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=529741.6666666666, ans=0.0 2024-06-22 04:51:42,943 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2024-06-22 04:51:51,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=529796.6666666666, ans=0.0 2024-06-22 04:51:52,298 INFO [train.py:1028] (0/2) Epoch 29, batch 5700, loss[loss=0.1617, simple_loss=0.2255, pruned_loss=0.04897, over 13204.00 frames. ], tot_loss[loss=0.1671, simple_loss=0.2228, pruned_loss=0.05566, over 2578361.55 frames. ], batch size: 63, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:51:54,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=529796.6666666666, ans=0.025 2024-06-22 04:52:00,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=529815.0, ans=0.0 2024-06-22 04:52:05,417 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.310e+02 2.416e+02 2.647e+02 3.297e+02, threshold=4.832e+02, percent-clipped=0.0 2024-06-22 04:52:21,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=529870.0, ans=0.125 2024-06-22 04:52:21,918 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.32 vs. limit=10.0 2024-06-22 04:52:22,891 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529870.0, ans=0.1 2024-06-22 04:52:23,441 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:52:24,558 INFO [train.py:1028] (0/2) Epoch 29, batch 5750, loss[loss=0.1904, simple_loss=0.237, pruned_loss=0.07185, over 12758.00 frames. ], tot_loss[loss=0.168, simple_loss=0.224, pruned_loss=0.05598, over 2579334.04 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:52:27,117 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=529888.3333333334, ans=0.0 2024-06-22 04:52:27,765 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.const_attention_rate, batch_count=529888.3333333334, ans=0.025 2024-06-22 04:52:28,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=529888.3333333334, ans=0.0 2024-06-22 04:52:29,018 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=529888.3333333334, ans=0.2 2024-06-22 04:52:46,388 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2024-06-22 04:52:49,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2024-06-22 04:53:00,603 INFO [train.py:1028] (0/2) Epoch 29, batch 5800, loss[loss=0.1771, simple_loss=0.2311, pruned_loss=0.06153, over 12798.00 frames. ], tot_loss[loss=0.1695, simple_loss=0.2253, pruned_loss=0.05686, over 2578017.60 frames. ], batch size: 176, lr: 1.99e-03, grad_scale: 64.0 2024-06-22 04:53:07,912 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529998.3333333334, ans=0.125 2024-06-22 04:53:13,981 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.327e+02 2.489e+02 2.704e+02 3.501e+02, threshold=4.979e+02, percent-clipped=0.0 2024-06-22 04:53:18,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=530016.6666666666, ans=0.125 2024-06-22 04:53:30,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=530053.3333333334, ans=0.0 2024-06-22 04:53:35,610 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=530071.6666666666, ans=0.125 2024-06-22 04:53:36,062 INFO [train.py:1028] (0/2) Epoch 29, batch 5850, loss[loss=0.1845, simple_loss=0.2399, pruned_loss=0.06453, over 12567.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2263, pruned_loss=0.05712, over 2576731.89 frames. ], batch size: 202, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:53:45,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=530090.0, ans=0.125 2024-06-22 04:53:50,617 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2024-06-22 04:53:59,490 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=530126.6666666666, ans=0.0 2024-06-22 04:54:08,405 INFO [train.py:1028] (0/2) Epoch 29, batch 5900, loss[loss=0.1537, simple_loss=0.2074, pruned_loss=0.04999, over 13136.00 frames. ], tot_loss[loss=0.1718, simple_loss=0.228, pruned_loss=0.05778, over 2577028.42 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:54:12,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530163.3333333334, ans=0.1 2024-06-22 04:54:13,794 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.32 vs. 
limit=15.0 2024-06-22 04:54:19,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=530181.6666666666, ans=0.025 2024-06-22 04:54:22,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=530200.0, ans=0.0 2024-06-22 04:54:22,659 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.352e+02 2.530e+02 2.765e+02 4.168e+02, threshold=5.060e+02, percent-clipped=0.0 2024-06-22 04:54:30,324 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530218.3333333334, ans=0.125 2024-06-22 04:54:40,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.41 vs. limit=12.0 2024-06-22 04:54:41,255 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.27 vs. limit=15.0 2024-06-22 04:54:41,493 INFO [train.py:1028] (0/2) Epoch 29, batch 5950, loss[loss=0.1518, simple_loss=0.2026, pruned_loss=0.05054, over 13094.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2294, pruned_loss=0.0583, over 2582638.16 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:54:44,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=530255.0, ans=0.2 2024-06-22 04:54:56,200 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.72 vs. limit=10.0 2024-06-22 04:55:12,829 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=530328.3333333334, ans=0.0 2024-06-22 04:55:17,126 INFO [train.py:1028] (0/2) Epoch 29, batch 6000, loss[loss=0.2156, simple_loss=0.2685, pruned_loss=0.08132, over 12093.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2307, pruned_loss=0.05901, over 2575815.14 frames. ], batch size: 240, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:55:17,127 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 04:55:26,069 INFO [train.py:1060] (0/2) Epoch 29, validation: loss=0.1938, simple_loss=0.2522, pruned_loss=0.06764, over 351949.00 frames. 2024-06-22 04:55:26,069 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 04:55:27,932 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.38 vs. 
limit=15.0 2024-06-22 04:55:29,431 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=530346.6666666666, ans=0.125 2024-06-22 04:55:34,214 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=530365.0, ans=0.0 2024-06-22 04:55:36,236 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=530365.0, ans=0.2 2024-06-22 04:55:37,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=530365.0, ans=0.0 2024-06-22 04:55:39,911 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.348e+02 2.499e+02 2.704e+02 3.423e+02, threshold=4.998e+02, percent-clipped=0.0 2024-06-22 04:55:40,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530383.3333333334, ans=0.1 2024-06-22 04:55:40,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=530383.3333333334, ans=0.125 2024-06-22 04:55:55,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=530420.0, ans=0.0 2024-06-22 04:55:57,304 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=530420.0, ans=0.0 2024-06-22 04:55:59,138 INFO [train.py:1028] (0/2) Epoch 29, batch 6050, loss[loss=0.1627, simple_loss=0.2251, pruned_loss=0.05013, over 12967.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2319, pruned_loss=0.05932, over 2578656.92 frames. ], batch size: 39, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:55:59,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530438.3333333334, ans=0.1 2024-06-22 04:55:59,348 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=530438.3333333334, ans=0.0 2024-06-22 04:56:01,945 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530438.3333333334, ans=0.0 2024-06-22 04:56:05,465 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=530456.6666666666, ans=0.09899494936611666 2024-06-22 04:56:09,320 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=530456.6666666666, ans=0.2 2024-06-22 04:56:11,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=530456.6666666666, ans=0.07 2024-06-22 04:56:11,396 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2024-06-22 04:56:16,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=530475.0, ans=0.125 2024-06-22 04:56:19,782 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.16 vs. limit=15.0 2024-06-22 04:56:32,271 INFO [train.py:1028] (0/2) Epoch 29, batch 6100, loss[loss=0.1792, simple_loss=0.2316, pruned_loss=0.06339, over 13069.00 frames. 
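Periodically (every valid_interval batches; batch 6000 above is one such point) the trainer pauses, runs the dev loader, and logs a frame-weighted validation loss, here "validation: loss=0.1938 ... over 351949.00 frames". A minimal sketch of such a loop; the model(batch) interface returning a summed loss and a frame count is an assumption for illustration, not icefall's actual API:

```python
import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader):
    """Frame-weighted average loss over the dev set.

    Assumed interface: model(batch) returns (loss summed over the batch,
    number of frames in the batch); the real train.py differs in detail.
    """
    was_training = model.training
    model.eval()
    loss_sum, frames = 0.0, 0.0
    for batch in valid_loader:
        loss, num_frames = model(batch)
        loss_sum += float(loss)
        frames += num_frames
    if was_training:
        model.train()
    return loss_sum / max(frames, 1.0)
```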
], tot_loss[loss=0.176, simple_loss=0.2331, pruned_loss=0.05946, over 2580523.29 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:56:46,560 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.404e+02 2.542e+02 2.787e+02 3.764e+02, threshold=5.084e+02, percent-clipped=0.0 2024-06-22 04:56:47,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=530566.6666666666, ans=0.125 2024-06-22 04:57:04,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=530603.3333333334, ans=0.05 2024-06-22 04:57:04,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=530603.3333333334, ans=0.125 2024-06-22 04:57:09,313 INFO [train.py:1028] (0/2) Epoch 29, batch 6150, loss[loss=0.1834, simple_loss=0.2313, pruned_loss=0.06772, over 10691.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2345, pruned_loss=0.05987, over 2577682.21 frames. ], batch size: 303, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:57:10,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=530621.6666666666, ans=0.0 2024-06-22 04:57:18,964 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=530640.0, ans=0.2 2024-06-22 04:57:22,532 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.33 vs. limit=10.0 2024-06-22 04:57:31,702 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=530676.6666666666, ans=0.0 2024-06-22 04:57:45,661 INFO [train.py:1028] (0/2) Epoch 29, batch 6200, loss[loss=0.194, simple_loss=0.2553, pruned_loss=0.06638, over 13227.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2358, pruned_loss=0.06037, over 2574284.89 frames. ], batch size: 89, lr: 1.98e-03, grad_scale: 128.0 2024-06-22 04:57:51,886 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2024-06-22 04:58:00,160 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.503e+02 2.658e+02 3.131e+02 4.462e+02, threshold=5.316e+02, percent-clipped=0.0 2024-06-22 04:58:07,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=530768.3333333334, ans=0.125 2024-06-22 04:58:13,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=530786.6666666666, ans=0.125 2024-06-22 04:58:19,053 INFO [train.py:1028] (0/2) Epoch 29, batch 6250, loss[loss=0.1917, simple_loss=0.246, pruned_loss=0.06876, over 13236.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2373, pruned_loss=0.06101, over 2568128.55 frames. ], batch size: 83, lr: 1.98e-03, grad_scale: 64.0 2024-06-22 04:58:20,954 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.49 vs. 
limit=15.0 2024-06-22 04:58:21,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=530805.0, ans=0.125 2024-06-22 04:58:27,539 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=530823.3333333334, ans=0.0 2024-06-22 04:58:28,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530823.3333333334, ans=0.0 2024-06-22 04:58:40,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=530860.0, ans=0.1 2024-06-22 04:58:51,205 INFO [train.py:1028] (0/2) Epoch 29, batch 6300, loss[loss=0.1763, simple_loss=0.2351, pruned_loss=0.0587, over 11819.00 frames. ], tot_loss[loss=0.1812, simple_loss=0.239, pruned_loss=0.06166, over 2564151.34 frames. ], batch size: 16, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 04:58:55,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=530896.6666666666, ans=0.0 2024-06-22 04:58:59,871 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=530896.6666666666, ans=0.125 2024-06-22 04:59:00,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=530915.0, ans=0.125 2024-06-22 04:59:06,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=530915.0, ans=0.0 2024-06-22 04:59:06,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.60 vs. limit=6.0 2024-06-22 04:59:09,812 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.394e+02 2.578e+02 2.772e+02 3.631e+02, threshold=5.156e+02, percent-clipped=0.0 2024-06-22 04:59:10,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=530933.3333333334, ans=0.125 2024-06-22 04:59:13,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.26 vs. limit=15.0 2024-06-22 04:59:14,015 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=530951.6666666666, ans=0.125 2024-06-22 04:59:16,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=530951.6666666666, ans=0.0 2024-06-22 04:59:19,642 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 04:59:31,598 INFO [train.py:1028] (0/2) Epoch 29, batch 6350, loss[loss=0.1895, simple_loss=0.2496, pruned_loss=0.06464, over 12554.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2403, pruned_loss=0.06171, over 2573657.55 frames. 
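Most of the scaling.py:214 traffic above consists of ScheduledFloat values: hyperparameters such as conv_skip_rate, the various dropout_p entries and the balancer prob values are piecewise-linear functions of batch_count rather than constants, which is why each report carries the current batch_count alongside the resolved value (half a million batches in, most of these schedules have long since flattened out). A standalone sketch of that interpolation; the breakpoints below are illustrative, not the recipe's:

```python
import bisect

def scheduled_float(batch_count: float, points) -> float:
    """Piecewise-linear schedule over batch_count, clamped at both ends.

    points: [(batch_count, value), ...] sorted by batch_count, e.g. a
    dropout that holds at 0.1 and then decays to 0.0 by batch 20000.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if batch_count <= xs[0]:
        return ys[0]
    if batch_count >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, batch_count) - 1
    t = (batch_count - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

points = [(0.0, 0.1), (8000.0, 0.1), (20000.0, 0.0)]
print(scheduled_float(14000.0, points))   # 0.05, halfway down the ramp
print(scheduled_float(528800.0, points))  # 0.0, long past the last breakpoint
```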
], batch size: 203, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 04:59:41,508 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=531006.6666666666, ans=0.125 2024-06-22 04:59:49,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=531025.0, ans=0.0 2024-06-22 04:59:52,636 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2024-06-22 04:59:58,177 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=531061.6666666666, ans=0.0 2024-06-22 05:00:04,786 INFO [train.py:1028] (0/2) Epoch 29, batch 6400, loss[loss=0.1649, simple_loss=0.2327, pruned_loss=0.0485, over 13147.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2417, pruned_loss=0.06197, over 2574117.21 frames. ], batch size: 67, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:00:05,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=531080.0, ans=0.95 2024-06-22 05:00:06,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=531080.0, ans=0.125 2024-06-22 05:00:20,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=531116.6666666666, ans=0.125 2024-06-22 05:00:21,148 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.390e+02 2.537e+02 2.751e+02 3.479e+02, threshold=5.074e+02, percent-clipped=0.0 2024-06-22 05:00:24,985 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=531135.0, ans=0.0 2024-06-22 05:00:26,459 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.02 vs. limit=22.5 2024-06-22 05:00:27,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=531135.0, ans=0.125 2024-06-22 05:00:34,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=531153.3333333334, ans=0.125 2024-06-22 05:00:37,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0 2024-06-22 05:00:38,805 INFO [train.py:1028] (0/2) Epoch 29, batch 6450, loss[loss=0.2197, simple_loss=0.2735, pruned_loss=0.08289, over 12557.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2433, pruned_loss=0.06276, over 2580394.35 frames. ], batch size: 202, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:00:39,771 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=531171.6666666666, ans=0.1 2024-06-22 05:00:41,157 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=531171.6666666666, ans=0.125 2024-06-22 05:00:44,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.36 vs. 
limit=15.0 2024-06-22 05:00:56,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=531208.3333333334, ans=0.125 2024-06-22 05:01:18,554 INFO [train.py:1028] (0/2) Epoch 29, batch 6500, loss[loss=0.2004, simple_loss=0.2448, pruned_loss=0.07802, over 10939.00 frames. ], tot_loss[loss=0.1856, simple_loss=0.2449, pruned_loss=0.06319, over 2584301.82 frames. ], batch size: 303, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:01:39,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=531300.0, ans=0.0 2024-06-22 05:01:39,795 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.392e+02 2.572e+02 2.825e+02 3.701e+02, threshold=5.145e+02, percent-clipped=0.0 2024-06-22 05:01:49,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=531336.6666666666, ans=0.0 2024-06-22 05:01:56,686 INFO [train.py:1028] (0/2) Epoch 29, batch 6550, loss[loss=0.186, simple_loss=0.2594, pruned_loss=0.05628, over 12386.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2461, pruned_loss=0.06328, over 2588123.70 frames. ], batch size: 22, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:02:01,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=531355.0, ans=0.125 2024-06-22 05:02:11,017 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.60 vs. limit=15.0 2024-06-22 05:02:17,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=15.0 2024-06-22 05:02:30,057 INFO [train.py:1028] (0/2) Epoch 29, batch 6600, loss[loss=0.1826, simple_loss=0.2517, pruned_loss=0.05675, over 13228.00 frames. ], tot_loss[loss=0.187, simple_loss=0.247, pruned_loss=0.0635, over 2589874.57 frames. ], batch size: 72, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:02:36,976 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=22.5 2024-06-22 05:02:40,325 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=531465.0, ans=0.125 2024-06-22 05:02:40,338 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=531465.0, ans=0.0 2024-06-22 05:02:41,217 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.35 vs. 
limit=10.0 2024-06-22 05:02:44,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=531483.3333333334, ans=0.0 2024-06-22 05:02:46,132 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.123e+02 2.400e+02 2.578e+02 2.805e+02 3.906e+02, threshold=5.156e+02, percent-clipped=0.0 2024-06-22 05:02:52,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=531501.6666666666, ans=0.125 2024-06-22 05:02:57,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531520.0, ans=0.1 2024-06-22 05:02:58,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=531520.0, ans=0.035 2024-06-22 05:03:03,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=531538.3333333334, ans=15.0 2024-06-22 05:03:03,928 INFO [train.py:1028] (0/2) Epoch 29, batch 6650, loss[loss=0.2228, simple_loss=0.2818, pruned_loss=0.08187, over 12884.00 frames. ], tot_loss[loss=0.1884, simple_loss=0.2488, pruned_loss=0.06405, over 2585037.20 frames. ], batch size: 158, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:03:04,792 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:03:06,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=531538.3333333334, ans=0.125 2024-06-22 05:03:11,732 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=531556.6666666666, ans=0.0 2024-06-22 05:03:25,108 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2024-06-22 05:03:26,946 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.05 vs. limit=22.5 2024-06-22 05:03:39,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=531611.6666666666, ans=0.125 2024-06-22 05:03:40,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=531611.6666666666, ans=0.125 2024-06-22 05:03:44,129 INFO [train.py:1028] (0/2) Epoch 29, batch 6700, loss[loss=0.1875, simple_loss=0.2429, pruned_loss=0.06606, over 12719.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2492, pruned_loss=0.06455, over 2584677.66 frames. 
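The scaling.py:1023 Whitening lines monitor how far each module's activations are from having a "white" channel covariance: the metric is at least 1.0 and equals 1.0 exactly when the covariance is proportional to the identity, so each "metric=X vs. limit=Y" report is simply tracking X against the module's allowed limit Y. One way to compute such a metric, shown as a sketch; the actual scaling.py computation may differ in details such as channel grouping and mean handling:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """n * ||C||_F^2 / trace(C)^2 for the channel covariance C of x.

    By Cauchy-Schwarz this is >= 1.0, with equality exactly when C is a
    multiple of the identity, i.e. when the channels are "white".
    """
    x = x.reshape(-1, x.shape[-1]).float()
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    n = cov.shape[0]
    return float(n * (cov * cov).sum() / (cov.diagonal().sum() ** 2 + 1e-20))

torch.manual_seed(0)
white = torch.randn(4000, 384)
print(whitening_metric(white))                          # ~1.1, near the ideal 1.0
print(whitening_metric(white @ torch.randn(384, 384)))  # ~2, channels now correlated
```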
], batch size: 176, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:03:44,333 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=531630.0, ans=0.1 2024-06-22 05:03:48,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=531630.0, ans=0.125 2024-06-22 05:04:00,802 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.418e+02 2.540e+02 2.865e+02 4.425e+02, threshold=5.080e+02, percent-clipped=0.0 2024-06-22 05:04:09,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=531685.0, ans=0.0 2024-06-22 05:04:11,892 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=531703.3333333334, ans=0.0 2024-06-22 05:04:17,793 INFO [train.py:1028] (0/2) Epoch 29, batch 6750, loss[loss=0.2503, simple_loss=0.3038, pruned_loss=0.09839, over 12232.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2497, pruned_loss=0.06471, over 2578749.70 frames. ], batch size: 240, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:04:39,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=531776.6666666666, ans=0.0 2024-06-22 05:04:46,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=531795.0, ans=0.125 2024-06-22 05:04:50,621 INFO [train.py:1028] (0/2) Epoch 29, batch 6800, loss[loss=0.1881, simple_loss=0.2523, pruned_loss=0.06198, over 13232.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2509, pruned_loss=0.06487, over 2580510.50 frames. ], batch size: 67, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:04:52,631 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=531813.3333333334, ans=0.0 2024-06-22 05:05:03,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=531850.0, ans=0.2 2024-06-22 05:05:06,288 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.444e+02 2.607e+02 2.819e+02 4.067e+02, threshold=5.214e+02, percent-clipped=0.0 2024-06-22 05:05:09,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=531868.3333333334, ans=0.125 2024-06-22 05:05:26,719 INFO [train.py:1028] (0/2) Epoch 29, batch 6850, loss[loss=0.1911, simple_loss=0.2623, pruned_loss=0.05995, over 13217.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2512, pruned_loss=0.06461, over 2584438.21 frames. ], batch size: 63, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:05:26,875 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=531905.0, ans=0.09899494936611666 2024-06-22 05:05:29,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.70 vs. 
limit=15.0 2024-06-22 05:05:41,744 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=531941.6666666666, ans=0.1 2024-06-22 05:05:42,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=531941.6666666666, ans=0.1 2024-06-22 05:05:58,463 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=531978.3333333334, ans=0.2 2024-06-22 05:05:59,675 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=531978.3333333334, ans=0.015 2024-06-22 05:06:02,818 INFO [train.py:1028] (0/2) Epoch 29, batch 6900, loss[loss=0.1952, simple_loss=0.2522, pruned_loss=0.06905, over 13355.00 frames. ], tot_loss[loss=0.191, simple_loss=0.252, pruned_loss=0.06501, over 2585362.48 frames. ], batch size: 49, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:06:04,244 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=531996.6666666666, ans=0.0 2024-06-22 05:06:06,828 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=531996.6666666666, ans=0.125 2024-06-22 05:06:07,119 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.19 vs. limit=15.0 2024-06-22 05:06:10,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=532015.0, ans=0.125 2024-06-22 05:06:10,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=532015.0, ans=0.0 2024-06-22 05:06:19,468 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.453e+02 2.631e+02 2.859e+02 3.828e+02, threshold=5.261e+02, percent-clipped=0.0 2024-06-22 05:06:19,977 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.49 vs. limit=15.0 2024-06-22 05:06:20,406 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=532033.3333333334, ans=0.2 2024-06-22 05:06:26,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=532051.6666666666, ans=0.125 2024-06-22 05:06:28,903 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=532070.0, ans=0.09899494936611666 2024-06-22 05:06:32,459 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=532070.0, ans=0.04949747468305833 2024-06-22 05:06:36,534 INFO [train.py:1028] (0/2) Epoch 29, batch 6950, loss[loss=0.187, simple_loss=0.2521, pruned_loss=0.06092, over 11071.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2518, pruned_loss=0.06453, over 2577617.63 frames. ], batch size: 16, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:06:40,017 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532088.3333333334, ans=0.1 2024-06-22 05:06:40,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2024-06-22 05:06:42,204 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.05 vs. limit=22.5 2024-06-22 05:06:44,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=532106.6666666666, ans=0.2 2024-06-22 05:06:51,295 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=532125.0, ans=0.2 2024-06-22 05:07:08,064 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=532161.6666666666, ans=0.0 2024-06-22 05:07:09,738 INFO [train.py:1028] (0/2) Epoch 29, batch 7000, loss[loss=0.209, simple_loss=0.2638, pruned_loss=0.07705, over 12946.00 frames. ], tot_loss[loss=0.1899, simple_loss=0.2515, pruned_loss=0.06412, over 2573590.09 frames. ], batch size: 158, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:07:11,821 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=532180.0, ans=0.5 2024-06-22 05:07:14,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=532180.0, ans=0.125 2024-06-22 05:07:16,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=532198.3333333334, ans=0.125 2024-06-22 05:07:20,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=532198.3333333334, ans=0.0 2024-06-22 05:07:20,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=532198.3333333334, ans=0.0 2024-06-22 05:07:31,218 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.868e+02 2.467e+02 2.672e+02 2.949e+02 3.719e+02, threshold=5.343e+02, percent-clipped=0.0 2024-06-22 05:07:41,512 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=15.0 2024-06-22 05:07:43,694 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=532253.3333333334, ans=0.5 2024-06-22 05:07:50,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=532271.6666666666, ans=0.0 2024-06-22 05:07:51,373 INFO [train.py:1028] (0/2) Epoch 29, batch 7050, loss[loss=0.1966, simple_loss=0.2586, pruned_loss=0.06728, over 12766.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2529, pruned_loss=0.06462, over 2580973.71 frames. ], batch size: 176, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:07:57,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=532290.0, ans=0.125 2024-06-22 05:08:00,628 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. 
limit=6.0 2024-06-22 05:08:06,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=532308.3333333334, ans=0.035 2024-06-22 05:08:06,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532308.3333333334, ans=0.1 2024-06-22 05:08:07,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=532308.3333333334, ans=0.0 2024-06-22 05:08:19,037 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=15.0 2024-06-22 05:08:20,294 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2024-06-22 05:08:20,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=532345.0, ans=10.0 2024-06-22 05:08:23,853 INFO [train.py:1028] (0/2) Epoch 29, batch 7100, loss[loss=0.2161, simple_loss=0.2815, pruned_loss=0.07539, over 13117.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.254, pruned_loss=0.0653, over 2574894.05 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:08:24,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=532363.3333333334, ans=0.125 2024-06-22 05:08:26,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2024-06-22 05:08:29,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=532381.6666666666, ans=0.1 2024-06-22 05:08:40,061 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.586e+02 2.799e+02 3.039e+02 3.830e+02, threshold=5.597e+02, percent-clipped=0.0 2024-06-22 05:08:40,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=532400.0, ans=0.125 2024-06-22 05:08:52,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=532436.6666666666, ans=0.0 2024-06-22 05:08:52,263 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=532436.6666666666, ans=0.04949747468305833 2024-06-22 05:08:53,817 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.22 vs. limit=12.0 2024-06-22 05:08:56,873 INFO [train.py:1028] (0/2) Epoch 29, batch 7150, loss[loss=0.2227, simple_loss=0.2716, pruned_loss=0.08685, over 12541.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2544, pruned_loss=0.06535, over 2573075.56 frames. 
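The recurring optim.py:487 warnings summarize the recent distribution of gradient norms as five quantile points (min, 25%, median, 75%, max) and report the active clipping threshold. The threshold is Clipping_scale times the running median, e.g. 5.597e+02 is 2.0 x 2.799e+02 in the entry above, and percent-clipped=0.0 says that no batch in the window actually exceeded it. A rough standalone version of that policy; icefall folds this into ScaledAdam, and the history length below is an assumption:

```python
from collections import deque
import torch

class MedianGradClipper:
    """Clip the global grad norm at clipping_scale * median of recent norms."""

    def __init__(self, model: torch.nn.Module, clipping_scale: float = 2.0,
                 history: int = 1000):
        self.params = [p for p in model.parameters() if p.requires_grad]
        self.scale = clipping_scale
        self.norms: deque[float] = deque(maxlen=history)

    def clip(self) -> tuple[float, float]:
        grads = [p.grad for p in self.params if p.grad is not None]
        total = torch.norm(torch.stack([g.norm() for g in grads]))
        self.norms.append(float(total))
        threshold = self.scale * sorted(self.norms)[len(self.norms) // 2]
        if total > threshold:
            for g in grads:
                g.mul_(threshold / total)  # rescale in place, like clip_grad_norm_
        return float(total), threshold
```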
], batch size: 202, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:09:02,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=532473.3333333334, ans=0.125 2024-06-22 05:09:09,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=532491.6666666666, ans=0.2 2024-06-22 05:09:16,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=532510.0, ans=0.2 2024-06-22 05:09:16,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=532510.0, ans=0.025 2024-06-22 05:09:20,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=532510.0, ans=0.2 2024-06-22 05:09:20,905 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=532510.0, ans=0.125 2024-06-22 05:09:21,163 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.30 vs. limit=10.0 2024-06-22 05:09:27,842 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2024-06-22 05:09:28,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=532528.3333333334, ans=0.0 2024-06-22 05:09:28,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2024-06-22 05:09:29,811 INFO [train.py:1028] (0/2) Epoch 29, batch 7200, loss[loss=0.1997, simple_loss=0.2629, pruned_loss=0.06828, over 13203.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.256, pruned_loss=0.06583, over 2578559.87 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:09:32,501 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=532546.6666666666, ans=0.0 2024-06-22 05:09:33,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=532546.6666666666, ans=0.125 2024-06-22 05:09:35,637 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=532565.0, ans=0.0 2024-06-22 05:09:41,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532565.0, ans=0.1 2024-06-22 05:09:49,539 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.498e+02 2.673e+02 3.022e+02 4.304e+02, threshold=5.345e+02, percent-clipped=0.0 2024-06-22 05:09:59,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=532620.0, ans=0.0 2024-06-22 05:10:01,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=532620.0, ans=0.0 2024-06-22 05:10:01,926 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=532620.0, ans=0.125 2024-06-22 05:10:06,609 INFO [train.py:1028] (0/2) Epoch 29, batch 7250, loss[loss=0.2038, simple_loss=0.2708, pruned_loss=0.06841, over 12964.00 frames. 
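The grad_scale field in the batch reports is the fp16 loss scale from mixed-precision training (this run uses use_fp16), not a model quantity: the scaler halves it when a step overflows and grows it back after a run of clean steps, which is why it drifts between 128.0 and 16.0 across this stretch of the log. A generic PyTorch sketch of the mechanism; the model and loss are placeholders rather than the recipe's, and a CUDA device is assumed, as in this run:

```python
import torch

model = torch.nn.Linear(80, 512).cuda()   # placeholder, not the Zipformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.98e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=64.0)

def train_step(features: torch.Tensor) -> float:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(features.cuda()).square().mean()  # stand-in loss
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # skipped internally on overflow
    scaler.update()                 # halves or grows the scale
    return scaler.get_scale()       # the value logged as grad_scale
```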
], tot_loss[loss=0.1938, simple_loss=0.2562, pruned_loss=0.06576, over 2579936.26 frames. ], batch size: 36, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:10:06,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=532638.3333333334, ans=0.125 2024-06-22 05:10:11,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=532638.3333333334, ans=0.2 2024-06-22 05:10:13,560 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=532638.3333333334, ans=0.125 2024-06-22 05:10:14,875 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:10:27,657 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.98 vs. limit=22.5 2024-06-22 05:10:29,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=532675.0, ans=0.125 2024-06-22 05:10:39,984 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=532711.6666666666, ans=0.125 2024-06-22 05:10:43,561 INFO [train.py:1028] (0/2) Epoch 29, batch 7300, loss[loss=0.1914, simple_loss=0.2584, pruned_loss=0.06221, over 12985.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2571, pruned_loss=0.06613, over 2579615.62 frames. ], batch size: 36, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:10:49,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=532730.0, ans=0.0 2024-06-22 05:10:51,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=532748.3333333334, ans=0.1 2024-06-22 05:10:59,886 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.426e+02 2.576e+02 2.812e+02 3.503e+02, threshold=5.152e+02, percent-clipped=0.0 2024-06-22 05:11:07,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=532785.0, ans=0.025 2024-06-22 05:11:08,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=532785.0, ans=0.0 2024-06-22 05:11:16,242 INFO [train.py:1028] (0/2) Epoch 29, batch 7350, loss[loss=0.1928, simple_loss=0.2532, pruned_loss=0.06623, over 13294.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2574, pruned_loss=0.06637, over 2581905.69 frames. ], batch size: 46, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:11:17,326 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2024-06-22 05:11:22,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=532840.0, ans=0.2 2024-06-22 05:11:27,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.80 vs. 
limit=22.5 2024-06-22 05:11:28,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=532840.0, ans=0.0 2024-06-22 05:11:49,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=532895.0, ans=0.125 2024-06-22 05:11:49,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=532895.0, ans=0.125 2024-06-22 05:11:53,620 INFO [train.py:1028] (0/2) Epoch 29, batch 7400, loss[loss=0.2049, simple_loss=0.2784, pruned_loss=0.06566, over 13211.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2575, pruned_loss=0.06636, over 2586715.41 frames. ], batch size: 63, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:12:00,604 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532931.6666666666, ans=0.1 2024-06-22 05:12:02,220 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=532931.6666666666, ans=0.125 2024-06-22 05:12:08,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=532950.0, ans=0.125 2024-06-22 05:12:10,499 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.485e+02 2.688e+02 2.937e+02 3.615e+02, threshold=5.377e+02, percent-clipped=0.0 2024-06-22 05:12:20,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=532968.3333333334, ans=0.2 2024-06-22 05:12:24,729 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=532986.6666666666, ans=0.0 2024-06-22 05:12:30,401 INFO [train.py:1028] (0/2) Epoch 29, batch 7450, loss[loss=0.1687, simple_loss=0.2354, pruned_loss=0.05106, over 12628.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2575, pruned_loss=0.06626, over 2581226.27 frames. ], batch size: 29, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:12:34,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=533005.0, ans=0.0 2024-06-22 05:12:36,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=533023.3333333334, ans=10.0 2024-06-22 05:12:47,179 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=533041.6666666666, ans=0.0 2024-06-22 05:12:58,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=533078.3333333334, ans=0.2 2024-06-22 05:13:03,937 INFO [train.py:1028] (0/2) Epoch 29, batch 7500, loss[loss=0.2395, simple_loss=0.29, pruned_loss=0.09444, over 10654.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2587, pruned_loss=0.06664, over 2578498.30 frames. ], batch size: 303, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:13:06,465 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.53 vs. 
limit=8.0 2024-06-22 05:13:16,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=533115.0, ans=0.95 2024-06-22 05:13:19,541 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=533133.3333333334, ans=0.0 2024-06-22 05:13:21,195 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 2.463e+02 2.636e+02 2.922e+02 4.248e+02, threshold=5.271e+02, percent-clipped=0.0 2024-06-22 05:13:22,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=533133.3333333334, ans=0.95 2024-06-22 05:13:31,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=533170.0, ans=0.125 2024-06-22 05:13:35,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533188.3333333334, ans=0.125 2024-06-22 05:13:36,343 INFO [train.py:1028] (0/2) Epoch 29, batch 7550, loss[loss=0.204, simple_loss=0.2539, pruned_loss=0.07706, over 12927.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2593, pruned_loss=0.06716, over 2577421.64 frames. ], batch size: 158, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:13:38,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=533188.3333333334, ans=0.025 2024-06-22 05:13:41,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=533188.3333333334, ans=0.0 2024-06-22 05:13:42,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=533206.6666666666, ans=0.125 2024-06-22 05:13:54,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=533225.0, ans=0.5 2024-06-22 05:14:09,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=533261.6666666666, ans=0.125 2024-06-22 05:14:11,029 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=533261.6666666666, ans=0.95 2024-06-22 05:14:12,908 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=533261.6666666666, ans=0.0 2024-06-22 05:14:13,748 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2024-06-22 05:14:15,402 INFO [train.py:1028] (0/2) Epoch 29, batch 7600, loss[loss=0.1987, simple_loss=0.2605, pruned_loss=0.06843, over 13221.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2603, pruned_loss=0.06749, over 2577186.25 frames. 
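The recurring WARNING [optim.py:487] lines summarize the optimizer's recent gradient-norm history as five quantiles (min, 25%, median, 75%, max) plus a clipping threshold, and in every instance the threshold is the median scaled by Clipping_scale: here 2.0 * 2.636e+02 = 5.272e+02 against the logged 5.271e+02, the gap being rounding in the printed quartile. A minimal sketch of that bookkeeping, assuming threshold = clipping_scale * median and using torch.quantile as a stand-in for whatever optim.py tracks internally:

    import torch

    def clipping_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
        # Quantiles of the recent gradient-norm history: min, 25%, median, 75%, max.
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # twice the median, matching the log
        percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return q, threshold, percent_clipped

    # Synthetic history whose quantiles mimic the warning just above.
    norms = torch.tensor([220.2, 246.3, 263.6, 292.2, 424.8])
    print(clipping_report(norms))  # threshold ~= 527.2, percent_clipped = 0.0

On this reading, percent-clipped=0.0 simply means no recent step's gradient norm exceeded twice the running median.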
], batch size: 83, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:14:16,751 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=533280.0, ans=0.125 2024-06-22 05:14:32,416 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 2.439e+02 2.561e+02 2.764e+02 4.519e+02, threshold=5.123e+02, percent-clipped=0.0 2024-06-22 05:14:34,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=533335.0, ans=0.125 2024-06-22 05:14:41,765 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=12.99 vs. limit=15.0 2024-06-22 05:14:45,104 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:14:45,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.61 vs. limit=12.0 2024-06-22 05:14:46,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533353.3333333334, ans=0.1 2024-06-22 05:14:49,152 INFO [train.py:1028] (0/2) Epoch 29, batch 7650, loss[loss=0.1753, simple_loss=0.2383, pruned_loss=0.0561, over 12984.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2605, pruned_loss=0.06756, over 2572772.40 frames. ], batch size: 33, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:15:00,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=533390.0, ans=0.125 2024-06-22 05:15:04,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=533408.3333333334, ans=0.0 2024-06-22 05:15:06,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=533408.3333333334, ans=0.1 2024-06-22 05:15:11,569 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=533426.6666666666, ans=0.0 2024-06-22 05:15:22,208 INFO [train.py:1028] (0/2) Epoch 29, batch 7700, loss[loss=0.1852, simple_loss=0.2563, pruned_loss=0.05701, over 13250.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2611, pruned_loss=0.06794, over 2569238.74 frames. ], batch size: 63, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:15:22,938 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=533463.3333333334, ans=0.125 2024-06-22 05:15:28,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=533481.6666666666, ans=0.125 2024-06-22 05:15:32,315 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=533481.6666666666, ans=0.125 2024-06-22 05:15:38,219 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.40 vs. 
limit=15.0 2024-06-22 05:15:38,494 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.542e+02 2.663e+02 2.887e+02 4.323e+02, threshold=5.325e+02, percent-clipped=0.0 2024-06-22 05:15:52,972 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=533536.6666666666, ans=0.125 2024-06-22 05:15:57,711 INFO [train.py:1028] (0/2) Epoch 29, batch 7750, loss[loss=0.2037, simple_loss=0.2716, pruned_loss=0.06791, over 13259.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2616, pruned_loss=0.06839, over 2573387.99 frames. ], batch size: 72, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:16:17,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533591.6666666666, ans=0.1 2024-06-22 05:16:24,993 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=533610.0, ans=0.04949747468305833 2024-06-22 05:16:25,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=533610.0, ans=0.125 2024-06-22 05:16:28,357 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.64 vs. limit=15.0 2024-06-22 05:16:28,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=533628.3333333334, ans=0.2 2024-06-22 05:16:29,664 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.70 vs. limit=22.5 2024-06-22 05:16:33,862 INFO [train.py:1028] (0/2) Epoch 29, batch 7800, loss[loss=0.1962, simple_loss=0.2534, pruned_loss=0.06949, over 13165.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2625, pruned_loss=0.0687, over 2577282.49 frames. ], batch size: 95, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:16:34,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=533646.6666666666, ans=0.1 2024-06-22 05:16:38,013 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=533646.6666666666, ans=0.0 2024-06-22 05:16:42,033 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.72 vs. limit=15.0 2024-06-22 05:16:51,299 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.465e+02 2.627e+02 2.820e+02 3.564e+02, threshold=5.255e+02, percent-clipped=0.0 2024-06-22 05:16:54,990 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0 2024-06-22 05:16:58,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=533701.6666666666, ans=0.05 2024-06-22 05:17:02,436 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.12 vs. limit=15.0 2024-06-22 05:17:07,443 INFO [train.py:1028] (0/2) Epoch 29, batch 7850, loss[loss=0.1742, simple_loss=0.2312, pruned_loss=0.05862, over 11859.00 frames. 
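Each ScheduledFloat line reports a named hyperparameter, the global batch_count, and its current value ans. By this point in training most of them have settled: the skip rates read ans=0.0, the out_proj dropouts sit at ans=0.1, and the balancer probs at ans=0.125. A toy re-implementation of the idea, piecewise-linear in batch_count (the real class lives in icefall's scaling.py; the breakpoints below are made up for illustration and the constructor is not its exact API):

    class PiecewiseLinearSchedule:
        """A float whose value depends on the global batch count."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0.0, 0.5), (20000.0, 0.0)
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # A skip rate that decays to zero over the first 20k batches would log
    # ans=0.0 at batch_count ~= 533000, like the conv_skip_rate entries above.
    conv_skip_rate = PiecewiseLinearSchedule((0.0, 0.5), (20000.0, 0.0))
    assert conv_skip_rate(533775.0) == 0.0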
], tot_loss[loss=0.2007, simple_loss=0.2631, pruned_loss=0.06916, over 2571737.23 frames. ], batch size: 17, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:17:17,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533756.6666666666, ans=0.1 2024-06-22 05:17:21,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=533775.0, ans=15.0 2024-06-22 05:17:22,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=533775.0, ans=0.125 2024-06-22 05:17:22,468 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:17:24,799 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533775.0, ans=0.125 2024-06-22 05:17:27,845 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=15.0 2024-06-22 05:17:33,676 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=533811.6666666666, ans=0.04949747468305833 2024-06-22 05:17:33,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=533811.6666666666, ans=0.125 2024-06-22 05:17:45,044 INFO [train.py:1028] (0/2) Epoch 29, batch 7900, loss[loss=0.1905, simple_loss=0.2623, pruned_loss=0.05937, over 13154.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2634, pruned_loss=0.06909, over 2570294.97 frames. ], batch size: 77, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:17:51,259 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=533848.3333333334, ans=0.125 2024-06-22 05:17:57,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=533848.3333333334, ans=0.125 2024-06-22 05:18:03,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=533866.6666666666, ans=0.0 2024-06-22 05:18:06,385 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.595e+02 2.814e+02 3.073e+02 4.192e+02, threshold=5.629e+02, percent-clipped=0.0 2024-06-22 05:18:09,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=533885.0, ans=0.125 2024-06-22 05:18:15,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=533903.3333333334, ans=0.5 2024-06-22 05:18:17,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=533903.3333333334, ans=0.125 2024-06-22 05:18:19,070 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=15.0 2024-06-22 05:18:21,817 INFO [train.py:1028] (0/2) Epoch 29, batch 7950, loss[loss=0.2221, simple_loss=0.2729, pruned_loss=0.0856, over 10978.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2638, pruned_loss=0.06902, over 2574327.17 frames. 
], batch size: 305, lr: 1.98e-03, grad_scale: 16.0 2024-06-22 05:18:22,559 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=533921.6666666666, ans=0.125 2024-06-22 05:18:33,574 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=533940.0, ans=0.0 2024-06-22 05:18:36,963 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=533958.3333333334, ans=0.0 2024-06-22 05:18:38,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=533958.3333333334, ans=0.0 2024-06-22 05:18:42,050 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.83 vs. limit=15.0 2024-06-22 05:18:54,881 INFO [train.py:1028] (0/2) Epoch 29, batch 8000, loss[loss=0.2072, simple_loss=0.2692, pruned_loss=0.07264, over 12678.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2646, pruned_loss=0.06919, over 2571127.89 frames. ], batch size: 29, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:19:04,510 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=534031.6666666666, ans=0.125 2024-06-22 05:19:10,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=534050.0, ans=0.2 2024-06-22 05:19:12,250 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.508e+02 2.719e+02 2.945e+02 3.763e+02, threshold=5.438e+02, percent-clipped=0.0 2024-06-22 05:19:14,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=534068.3333333334, ans=0.0 2024-06-22 05:19:18,039 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=534068.3333333334, ans=0.1 2024-06-22 05:19:27,995 INFO [train.py:1028] (0/2) Epoch 29, batch 8050, loss[loss=0.1855, simple_loss=0.2457, pruned_loss=0.06262, over 13227.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2636, pruned_loss=0.06864, over 2571040.40 frames. ], batch size: 83, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:19:29,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=534105.0, ans=0.0 2024-06-22 05:20:04,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=534178.3333333334, ans=0.0 2024-06-22 05:20:07,193 INFO [train.py:1028] (0/2) Epoch 29, batch 8100, loss[loss=0.2008, simple_loss=0.2595, pruned_loss=0.07107, over 13195.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2635, pruned_loss=0.06877, over 2575950.60 frames. 
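grad_scale in the batch headers dips from 32.0 to 16.0 around batch 7500 and again near batch 7900, recovering to 32.0 within a few hundred batches each time. That is the signature of dynamic loss scaling in fp16 training: halve the scale when a step overflows, grow it back after a run of clean steps. Sketched with PyTorch's stock GradScaler (the recipe may use its own scaler and update cadence; growth_interval below is a placeholder, not a value taken from this run):

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=32.0,       # matches the grad_scale values logged here
        backoff_factor=0.5,    # 32.0 -> 16.0 on an overflowing step
        growth_factor=2.0,     # 16.0 -> 32.0 after enough clean steps
        growth_interval=2000,  # placeholder interval
    )

    def train_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skipped internally if gradients overflowed
        scaler.update()         # where the scale actually moves 32 -> 16 -> 32
        return loss.detach(), scaler.get_scale()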
], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:20:07,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=534196.6666666666, ans=0.0 2024-06-22 05:20:09,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=534196.6666666666, ans=0.125 2024-06-22 05:20:16,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=534215.0, ans=0.2 2024-06-22 05:20:24,735 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=534233.3333333334, ans=0.125 2024-06-22 05:20:25,130 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 2.465e+02 2.659e+02 2.840e+02 4.254e+02, threshold=5.318e+02, percent-clipped=0.0 2024-06-22 05:20:25,241 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:20:31,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534251.6666666666, ans=0.1 2024-06-22 05:20:40,733 INFO [train.py:1028] (0/2) Epoch 29, batch 8150, loss[loss=0.1907, simple_loss=0.2537, pruned_loss=0.06388, over 13090.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2637, pruned_loss=0.06854, over 2579026.76 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:20:42,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=534288.3333333334, ans=0.125 2024-06-22 05:20:42,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534288.3333333334, ans=0.0 2024-06-22 05:20:49,052 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=534306.6666666666, ans=0.09899494936611666 2024-06-22 05:20:54,247 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:20:56,643 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0 2024-06-22 05:20:59,199 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534325.0, ans=0.1 2024-06-22 05:21:02,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=534343.3333333334, ans=0.0 2024-06-22 05:21:08,816 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=534361.6666666666, ans=0.2 2024-06-22 05:21:14,063 INFO [train.py:1028] (0/2) Epoch 29, batch 8200, loss[loss=0.2009, simple_loss=0.2716, pruned_loss=0.06515, over 13146.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2637, pruned_loss=0.06821, over 2583293.71 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:21:27,227 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. 
limit=6.0 2024-06-22 05:21:29,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=534416.6666666666, ans=0.125 2024-06-22 05:21:31,986 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.499e+02 2.686e+02 2.955e+02 4.578e+02, threshold=5.372e+02, percent-clipped=0.0 2024-06-22 05:21:42,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=534435.0, ans=0.125 2024-06-22 05:21:49,135 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534453.3333333334, ans=0.1 2024-06-22 05:21:50,103 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.09 vs. limit=15.0 2024-06-22 05:21:50,478 INFO [train.py:1028] (0/2) Epoch 29, batch 8250, loss[loss=0.212, simple_loss=0.2877, pruned_loss=0.06817, over 13230.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.264, pruned_loss=0.06831, over 2583895.74 frames. ], batch size: 52, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:21:59,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=534471.6666666666, ans=0.0 2024-06-22 05:22:02,078 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=534490.0, ans=0.0 2024-06-22 05:22:03,878 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=534490.0, ans=0.025 2024-06-22 05:22:05,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=534490.0, ans=0.0 2024-06-22 05:22:07,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=534508.3333333334, ans=0.125 2024-06-22 05:22:11,047 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=534508.3333333334, ans=0.125 2024-06-22 05:22:12,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=534526.6666666666, ans=0.0 2024-06-22 05:22:23,663 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:22:25,701 INFO [train.py:1028] (0/2) Epoch 29, batch 8300, loss[loss=0.2034, simple_loss=0.2583, pruned_loss=0.07426, over 13034.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2634, pruned_loss=0.06804, over 2581656.64 frames. 
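The Whitening lines periodically sample a per-module whiteness statistic against that module's limit; most samples sit below the bound (metric=5.55 vs. limit=6.0 for the attention keys just above), with occasional excursions past it (e.g. metric=10.30 vs. limit=10.0 near the top of this excerpt), which is presumably when the constraint pushes back. One plausible statistic with the right behavior, assumed here rather than read out of scaling.py, is the ratio of the mean squared eigenvalue of the channel covariance to the square of its mean eigenvalue: exactly 1.0 when the covariance is a multiple of the identity (perfectly white), and growing as the spectrum concentrates in a few directions:

    import torch

    def whiteness_metric(x: torch.Tensor) -> torch.Tensor:
        """x: (num_frames, num_channels). Returns >= 1.0; equals 1.0 iff cov ~ c*I."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending
        return (eigs ** 2).mean() / eigs.mean() ** 2

    x = torch.randn(10000, 256)   # near-white activations
    print(whiteness_metric(x))    # close to 1.0
    print(whiteness_metric(x * torch.linspace(0.1, 3.0, 256)))  # lopsided -> larger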
], batch size: 102, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:22:27,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=534563.3333333334, ans=0.2 2024-06-22 05:22:32,351 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=534581.6666666666, ans=0.025 2024-06-22 05:22:43,449 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.500e+02 2.598e+02 2.782e+02 3.928e+02, threshold=5.197e+02, percent-clipped=0.0 2024-06-22 05:22:46,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=534618.3333333334, ans=0.125 2024-06-22 05:22:49,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=534618.3333333334, ans=0.125 2024-06-22 05:22:49,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.01 vs. limit=15.0 2024-06-22 05:22:53,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=534636.6666666666, ans=0.125 2024-06-22 05:22:56,899 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=534636.6666666666, ans=0.05 2024-06-22 05:22:58,681 INFO [train.py:1028] (0/2) Epoch 29, batch 8350, loss[loss=0.2038, simple_loss=0.2691, pruned_loss=0.06929, over 13195.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2639, pruned_loss=0.06813, over 2580982.60 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:23:04,650 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534673.3333333334, ans=0.0 2024-06-22 05:23:12,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=534691.6666666666, ans=0.1 2024-06-22 05:23:31,711 INFO [train.py:1028] (0/2) Epoch 29, batch 8400, loss[loss=0.2052, simple_loss=0.266, pruned_loss=0.07218, over 12972.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2635, pruned_loss=0.0682, over 2576780.47 frames. ], batch size: 39, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:23:37,053 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.56 vs. limit=12.0 2024-06-22 05:23:45,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=534765.0, ans=0.0 2024-06-22 05:23:52,676 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.527e+02 2.785e+02 2.997e+02 3.894e+02, threshold=5.571e+02, percent-clipped=0.0 2024-06-22 05:24:02,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=534801.6666666666, ans=0.025 2024-06-22 05:24:07,786 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.73 vs. limit=22.5 2024-06-22 05:24:11,139 INFO [train.py:1028] (0/2) Epoch 29, batch 8450, loss[loss=0.1994, simple_loss=0.2698, pruned_loss=0.0645, over 13173.00 frames. 
], tot_loss[loss=0.2005, simple_loss=0.2644, pruned_loss=0.06832, over 2578032.11 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:24:34,352 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=534893.3333333334, ans=0.125 2024-06-22 05:24:38,356 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=534911.6666666666, ans=0.0 2024-06-22 05:24:43,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=534930.0, ans=10.0 2024-06-22 05:24:44,280 INFO [train.py:1028] (0/2) Epoch 29, batch 8500, loss[loss=0.1889, simple_loss=0.2599, pruned_loss=0.05897, over 12594.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2653, pruned_loss=0.06846, over 2576513.79 frames. ], batch size: 29, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:24:54,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=534948.3333333334, ans=0.125 2024-06-22 05:24:58,194 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=534966.6666666666, ans=0.2 2024-06-22 05:24:59,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=534966.6666666666, ans=0.2 2024-06-22 05:25:01,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=534966.6666666666, ans=0.2 2024-06-22 05:25:02,824 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.520e+02 2.699e+02 2.936e+02 4.127e+02, threshold=5.399e+02, percent-clipped=0.0 2024-06-22 05:25:15,500 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.31 vs. limit=15.0 2024-06-22 05:25:17,670 INFO [train.py:1028] (0/2) Epoch 29, batch 8550, loss[loss=0.2203, simple_loss=0.2702, pruned_loss=0.08518, over 12724.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2648, pruned_loss=0.06823, over 2573998.79 frames. ], batch size: 22, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:25:23,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=535021.6666666666, ans=0.125 2024-06-22 05:25:24,382 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=535040.0, ans=0.025 2024-06-22 05:25:30,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=535058.3333333334, ans=0.0 2024-06-22 05:25:37,386 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=535076.6666666666, ans=0.2 2024-06-22 05:25:41,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2024-06-22 05:25:49,562 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.01 vs. limit=10.0 2024-06-22 05:25:57,223 INFO [train.py:1028] (0/2) Epoch 29, batch 8600, loss[loss=0.1905, simple_loss=0.2518, pruned_loss=0.06465, over 13140.00 frames. 
], tot_loss[loss=0.2014, simple_loss=0.2657, pruned_loss=0.06856, over 2571960.78 frames. ], batch size: 112, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:26:04,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535131.6666666666, ans=0.1 2024-06-22 05:26:15,201 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.516e+02 2.683e+02 3.020e+02 3.927e+02, threshold=5.365e+02, percent-clipped=0.0 2024-06-22 05:26:25,756 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=535186.6666666666, ans=0.0 2024-06-22 05:26:28,883 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=535186.6666666666, ans=0.0 2024-06-22 05:26:30,626 INFO [train.py:1028] (0/2) Epoch 29, batch 8650, loss[loss=0.1743, simple_loss=0.2395, pruned_loss=0.05457, over 13034.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2659, pruned_loss=0.06869, over 2575605.54 frames. ], batch size: 102, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:26:31,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=535205.0, ans=0.2 2024-06-22 05:26:40,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535223.3333333334, ans=0.1 2024-06-22 05:26:42,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=535223.3333333334, ans=0.125 2024-06-22 05:26:50,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=535260.0, ans=0.125 2024-06-22 05:26:59,367 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=535278.3333333334, ans=0.0 2024-06-22 05:27:03,417 INFO [train.py:1028] (0/2) Epoch 29, batch 8700, loss[loss=0.1855, simple_loss=0.2525, pruned_loss=0.05918, over 13279.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2661, pruned_loss=0.06895, over 2572350.61 frames. 
], batch size: 59, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:27:08,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=535296.6666666666, ans=0.0 2024-06-22 05:27:12,172 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=535315.0, ans=0.125 2024-06-22 05:27:16,053 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-292000.pt 2024-06-22 05:27:21,227 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=535333.3333333334, ans=0.125 2024-06-22 05:27:26,361 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.566e+02 2.778e+02 3.014e+02 3.613e+02, threshold=5.555e+02, percent-clipped=0.0 2024-06-22 05:27:26,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=535333.3333333334, ans=0.05 2024-06-22 05:27:38,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=535370.0, ans=0.0 2024-06-22 05:27:44,347 INFO [train.py:1028] (0/2) Epoch 29, batch 8750, loss[loss=0.2178, simple_loss=0.281, pruned_loss=0.07731, over 13098.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2667, pruned_loss=0.06936, over 2567856.34 frames. ], batch size: 121, lr: 1.98e-03, grad_scale: 32.0 2024-06-22 05:27:45,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=535388.3333333334, ans=0.2 2024-06-22 05:27:47,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=535388.3333333334, ans=0.2 2024-06-22 05:27:51,639 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.57 vs. limit=22.5 2024-06-22 05:28:01,171 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.22 vs. limit=15.0 2024-06-22 05:28:03,812 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.64 vs. limit=15.0 2024-06-22 05:28:06,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535443.3333333334, ans=0.1 2024-06-22 05:28:09,613 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=535443.3333333334, ans=0.05 2024-06-22 05:28:10,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=535443.3333333334, ans=0.2 2024-06-22 05:28:16,285 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.56 vs. limit=10.0 2024-06-22 05:28:20,870 INFO [train.py:1028] (0/2) Epoch 29, batch 8800, loss[loss=0.2071, simple_loss=0.2794, pruned_loss=0.06737, over 13251.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2659, pruned_loss=0.0689, over 2572333.16 frames. 
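The checkpoint record above writes zipformer/exp/checkpoint-292000.pt, named by the global training-batch index rather than by epoch, which is why a save can land mid-epoch like this one. 292000 is consistent with a fixed save-every-N-batches policy (N=4000 would fit), though the interval itself is not printed in this excerpt; a sketch of that assumed policy:

    from pathlib import Path
    from typing import Optional

    def maybe_checkpoint_path(batch_idx_train: int, exp_dir: Path,
                              save_every_n: int = 4000) -> Optional[Path]:
        # Assumed policy: save on every multiple of save_every_n global batches.
        if batch_idx_train % save_every_n != 0:
            return None
        return exp_dir / f"checkpoint-{batch_idx_train}.pt"

    print(maybe_checkpoint_path(292000, Path("zipformer/exp")))
    # -> zipformer/exp/checkpoint-292000.pt, matching the log line above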
], batch size: 72, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:28:22,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=535480.0, ans=0.125 2024-06-22 05:28:39,628 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.463e+02 2.620e+02 2.930e+02 4.131e+02, threshold=5.239e+02, percent-clipped=0.0 2024-06-22 05:28:53,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=535571.6666666666, ans=0.0 2024-06-22 05:28:54,366 INFO [train.py:1028] (0/2) Epoch 29, batch 8850, loss[loss=0.2116, simple_loss=0.274, pruned_loss=0.07461, over 12508.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2665, pruned_loss=0.06963, over 2562285.67 frames. ], batch size: 202, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:29:00,281 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.37 vs. limit=15.0 2024-06-22 05:29:03,408 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.26 vs. limit=10.0 2024-06-22 05:29:09,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=535608.3333333334, ans=0.2 2024-06-22 05:29:10,464 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.65 vs. limit=15.0 2024-06-22 05:29:16,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2024-06-22 05:29:23,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=535645.0, ans=0.04949747468305833 2024-06-22 05:29:27,572 INFO [train.py:1028] (0/2) Epoch 29, batch 8900, loss[loss=0.2341, simple_loss=0.3005, pruned_loss=0.08384, over 12993.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2679, pruned_loss=0.0703, over 2561326.85 frames. ], batch size: 33, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:29:40,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=535681.6666666666, ans=0.0 2024-06-22 05:29:45,569 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs. 
limit=15.0 2024-06-22 05:29:47,804 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:29:49,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=535700.0, ans=0.0 2024-06-22 05:29:49,745 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.537e+02 2.681e+02 2.865e+02 4.790e+02, threshold=5.362e+02, percent-clipped=0.0 2024-06-22 05:29:54,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535718.3333333334, ans=0.1 2024-06-22 05:30:04,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=535736.6666666666, ans=0.1 2024-06-22 05:30:08,147 INFO [train.py:1028] (0/2) Epoch 29, batch 8950, loss[loss=0.2422, simple_loss=0.2937, pruned_loss=0.09535, over 12529.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2681, pruned_loss=0.06996, over 2561179.46 frames. ], batch size: 202, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:30:15,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=535773.3333333334, ans=0.125 2024-06-22 05:30:18,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=535773.3333333334, ans=0.2 2024-06-22 05:30:23,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=535791.6666666666, ans=0.1 2024-06-22 05:30:42,053 INFO [train.py:1028] (0/2) Epoch 29, batch 9000, loss[loss=0.2066, simple_loss=0.275, pruned_loss=0.06913, over 13290.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2677, pruned_loss=0.06959, over 2567180.86 frames. ], batch size: 46, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:30:42,054 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 05:30:50,152 INFO [train.py:1060] (0/2) Epoch 29, validation: loss=0.1947, simple_loss=0.2528, pruned_loss=0.06827, over 351949.00 frames. 2024-06-22 05:30:50,153 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 05:30:51,966 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.61 vs. limit=22.5 2024-06-22 05:30:55,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=535846.6666666666, ans=0.1 2024-06-22 05:30:56,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2024-06-22 05:31:06,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=535883.3333333334, ans=0.025 2024-06-22 05:31:08,729 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.510e+02 2.660e+02 3.036e+02 4.689e+02, threshold=5.321e+02, percent-clipped=0.0 2024-06-22 05:31:11,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=535901.6666666666, ans=0.5 2024-06-22 05:31:23,114 INFO [train.py:1028] (0/2) Epoch 29, batch 9050, loss[loss=0.1685, simple_loss=0.2437, pruned_loss=0.04661, over 11542.00 frames. 
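Two things distinguish the validation record above from the training ones: it is computed over a fixed 351949.00 frames (presumably the full dev set, which keeps validation losses comparable across epochs), whereas the training tot_loss aggregates over roughly 2.57M frames whose fractional count suggests a decayed running sum rather than a plain one. Both are frame-weighted averages; a minimal tracker in that spirit (the real recipe keeps richer per-component statistics):

    class FrameWeightedLoss:
        """Frame-weighted running average, like tot_loss[... over N frames]."""

        def __init__(self) -> None:
            self.weighted_sum = 0.0
            self.num_frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            # Weight each batch by the number of acoustic frames it contributed,
            # so short and long batches influence the average proportionally.
            self.weighted_sum += batch_loss * batch_frames
            self.num_frames += batch_frames

        @property
        def value(self) -> float:
            return self.weighted_sum / max(self.num_frames, 1.0)

    tracker = FrameWeightedLoss()
    tracker.update(0.1685, 11542.0)  # the batch 9050 record above
    tracker.update(0.2100, 12000.0)  # hypothetical next batch
    print(round(tracker.value, 4))   # ~0.1897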
], tot_loss[loss=0.2043, simple_loss=0.269, pruned_loss=0.06984, over 2566324.56 frames. ], batch size: 17, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:31:28,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=535938.3333333334, ans=0.125 2024-06-22 05:31:30,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=535956.6666666666, ans=0.025 2024-06-22 05:31:31,673 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=535956.6666666666, ans=0.125 2024-06-22 05:31:32,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2024-06-22 05:31:33,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=535956.6666666666, ans=0.0 2024-06-22 05:31:37,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=535975.0, ans=0.05 2024-06-22 05:31:41,171 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=535975.0, ans=0.1 2024-06-22 05:31:50,696 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.76 vs. limit=10.0 2024-06-22 05:31:55,558 INFO [train.py:1028] (0/2) Epoch 29, batch 9100, loss[loss=0.2018, simple_loss=0.2772, pruned_loss=0.0632, over 13229.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2683, pruned_loss=0.06955, over 2567733.68 frames. ], batch size: 72, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:31:57,007 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=536030.0, ans=0.09899494936611666 2024-06-22 05:32:13,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=536066.6666666666, ans=0.09899494936611666 2024-06-22 05:32:13,595 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.491e+02 2.701e+02 2.937e+02 3.719e+02, threshold=5.402e+02, percent-clipped=0.0 2024-06-22 05:32:18,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=536085.0, ans=0.2 2024-06-22 05:32:19,475 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=536085.0, ans=0.0 2024-06-22 05:32:26,374 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=536103.3333333334, ans=0.035 2024-06-22 05:32:27,481 INFO [train.py:1028] (0/2) Epoch 29, batch 9150, loss[loss=0.1771, simple_loss=0.2516, pruned_loss=0.0513, over 13147.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.268, pruned_loss=0.06946, over 2569520.00 frames. ], batch size: 77, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:33:01,720 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536195.0, ans=0.1 2024-06-22 05:33:07,176 INFO [train.py:1028] (0/2) Epoch 29, batch 9200, loss[loss=0.1925, simple_loss=0.2601, pruned_loss=0.06242, over 12966.00 frames. 
], tot_loss[loss=0.2027, simple_loss=0.2676, pruned_loss=0.06885, over 2572656.97 frames. ], batch size: 36, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:33:08,768 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=536213.3333333334, ans=0.2 2024-06-22 05:33:10,055 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=536213.3333333334, ans=0.0 2024-06-22 05:33:24,992 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.486e+02 2.628e+02 2.903e+02 3.926e+02, threshold=5.256e+02, percent-clipped=0.0 2024-06-22 05:33:33,589 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=536286.6666666666, ans=0.025 2024-06-22 05:33:39,147 INFO [train.py:1028] (0/2) Epoch 29, batch 9250, loss[loss=0.1937, simple_loss=0.2623, pruned_loss=0.06252, over 13203.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2678, pruned_loss=0.06899, over 2574439.22 frames. ], batch size: 67, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:33:41,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=536305.0, ans=0.125 2024-06-22 05:33:42,853 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2024-06-22 05:33:43,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=536305.0, ans=0.125 2024-06-22 05:33:50,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=536323.3333333334, ans=0.1 2024-06-22 05:33:53,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=536341.6666666666, ans=0.125 2024-06-22 05:33:57,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=536360.0, ans=0.125 2024-06-22 05:34:00,119 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536360.0, ans=0.1 2024-06-22 05:34:01,438 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=536360.0, ans=0.09899494936611666 2024-06-22 05:34:11,243 INFO [train.py:1028] (0/2) Epoch 29, batch 9300, loss[loss=0.2089, simple_loss=0.2729, pruned_loss=0.07248, over 12970.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2679, pruned_loss=0.06918, over 2571362.61 frames. 
], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:34:21,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=536415.0, ans=0.0 2024-06-22 05:34:27,602 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=536433.3333333334, ans=0.0 2024-06-22 05:34:29,359 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.509e+02 2.642e+02 2.875e+02 3.721e+02, threshold=5.284e+02, percent-clipped=0.0 2024-06-22 05:34:33,175 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536451.6666666666, ans=0.1 2024-06-22 05:34:37,522 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=536470.0, ans=0.125 2024-06-22 05:34:43,320 INFO [train.py:1028] (0/2) Epoch 29, batch 9350, loss[loss=0.215, simple_loss=0.2794, pruned_loss=0.07537, over 12469.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2681, pruned_loss=0.06917, over 2568604.81 frames. ], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:34:45,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=536488.3333333334, ans=0.2 2024-06-22 05:34:47,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=536488.3333333334, ans=0.0 2024-06-22 05:35:14,276 INFO [train.py:1028] (0/2) Epoch 29, batch 9400, loss[loss=0.1891, simple_loss=0.2628, pruned_loss=0.05771, over 13256.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2687, pruned_loss=0.06958, over 2568616.02 frames. ], batch size: 52, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:35:14,678 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.14 vs. limit=15.0 2024-06-22 05:35:16,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=536580.0, ans=0.125 2024-06-22 05:35:23,194 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2024-06-22 05:35:31,328 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.519e+02 2.637e+02 2.775e+02 3.823e+02, threshold=5.273e+02, percent-clipped=0.0 2024-06-22 05:35:41,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536653.3333333334, ans=0.1 2024-06-22 05:35:45,058 INFO [train.py:1028] (0/2) Epoch 29, batch 9450, loss[loss=0.2216, simple_loss=0.2866, pruned_loss=0.07834, over 12391.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2689, pruned_loss=0.0696, over 2569476.69 frames. ], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:35:56,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=536690.0, ans=0.125 2024-06-22 05:35:57,645 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.29 vs. 
limit=22.5 2024-06-22 05:35:58,787 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.93 vs. limit=15.0 2024-06-22 05:36:03,473 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=536726.6666666666, ans=0.0 2024-06-22 05:36:07,023 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2024-06-22 05:36:07,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=536726.6666666666, ans=0.025 2024-06-22 05:36:21,349 INFO [train.py:1028] (0/2) Epoch 29, batch 9500, loss[loss=0.196, simple_loss=0.2621, pruned_loss=0.06497, over 13232.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2685, pruned_loss=0.06909, over 2578624.92 frames. ], batch size: 43, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:36:24,032 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=536763.3333333334, ans=0.125 2024-06-22 05:36:30,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=536781.6666666666, ans=0.125 2024-06-22 05:36:33,939 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=536800.0, ans=0.125 2024-06-22 05:36:37,142 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=536800.0, ans=0.0 2024-06-22 05:36:37,169 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=536800.0, ans=0.07 2024-06-22 05:36:38,946 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.482e+02 2.594e+02 2.841e+02 3.787e+02, threshold=5.189e+02, percent-clipped=0.0 2024-06-22 05:36:52,563 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.65 vs. limit=15.0 2024-06-22 05:36:52,818 INFO [train.py:1028] (0/2) Epoch 29, batch 9550, loss[loss=0.1702, simple_loss=0.2336, pruned_loss=0.05337, over 13116.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2679, pruned_loss=0.06895, over 2575430.13 frames. ], batch size: 40, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:36:53,620 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=536855.0, ans=0.125 2024-06-22 05:36:54,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=536855.0, ans=0.0 2024-06-22 05:36:58,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536873.3333333334, ans=0.1 2024-06-22 05:37:07,443 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=536891.6666666666, ans=0.2 2024-06-22 05:37:13,308 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=536910.0, ans=0.125 2024-06-22 05:37:23,609 INFO [train.py:1028] (0/2) Epoch 29, batch 9600, loss[loss=0.2154, simple_loss=0.2749, pruned_loss=0.07794, over 10349.00 frames. 
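Batch sizes in these records swing from 17 to 305 while per-batch frame counts stay in a narrow band, which is what duration-bucketed sampling produces: each batch is packed with cuts up to a roughly constant total duration, so a bucket of short utterances yields a large batch and a bucket of long ones a small batch. Back-of-the-envelope from three of the loss records in this log, assuming 10 ms frames counted after 4x temporal subsampling (neither constant is stated in this excerpt):

    FRAME_SEC = 0.01  # assumption: 10 ms acoustic frames
    SUBSAMPLING = 4   # assumption: logged frame counts are post-subsampling

    # (batch size, frames) pairs taken from loss records in this log
    for cuts, frames in [(305, 10349.0), (40, 13116.0), (17, 11859.0)]:
        total_sec = frames * SUBSAMPLING * FRAME_SEC
        print(f"{cuts:>3} cuts, {total_sec:5.0f} s total, {total_sec / cuts:5.1f} s/cut")

    # ~414 s, ~525 s and ~474 s per batch regardless of batch size: a fixed
    # duration budget filled with many short cuts or a few long ones.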
], tot_loss[loss=0.2029, simple_loss=0.2679, pruned_loss=0.06894, over 2573496.01 frames. ], batch size: 305, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:37:37,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=536983.3333333334, ans=0.125 2024-06-22 05:37:40,671 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.632e+02 2.821e+02 3.152e+02 4.333e+02, threshold=5.642e+02, percent-clipped=0.0 2024-06-22 05:37:42,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.80 vs. limit=15.0 2024-06-22 05:37:42,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.19 vs. limit=6.0 2024-06-22 05:37:46,035 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.60 vs. limit=10.0 2024-06-22 05:37:54,211 INFO [train.py:1028] (0/2) Epoch 29, batch 9650, loss[loss=0.1872, simple_loss=0.2425, pruned_loss=0.06592, over 13083.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2678, pruned_loss=0.06943, over 2563462.87 frames. ], batch size: 132, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:37:59,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=537038.3333333334, ans=0.025 2024-06-22 05:38:03,330 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=537056.6666666666, ans=0.025 2024-06-22 05:38:07,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.58 vs. limit=22.5 2024-06-22 05:38:12,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=537075.0, ans=0.025 2024-06-22 05:38:22,616 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=537111.6666666666, ans=0.0 2024-06-22 05:38:25,552 INFO [train.py:1028] (0/2) Epoch 29, batch 9700, loss[loss=0.2162, simple_loss=0.2705, pruned_loss=0.08099, over 13046.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2677, pruned_loss=0.06975, over 2558986.93 frames. ], batch size: 144, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:38:45,846 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.571e+02 2.807e+02 3.065e+02 4.294e+02, threshold=5.615e+02, percent-clipped=0.0 2024-06-22 05:39:00,063 INFO [train.py:1028] (0/2) Epoch 29, batch 9750, loss[loss=0.2079, simple_loss=0.2695, pruned_loss=0.0732, over 13061.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2668, pruned_loss=0.06922, over 2554548.92 frames. ], batch size: 132, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:39:20,298 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=537276.6666666666, ans=0.2 2024-06-22 05:39:24,933 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:39:31,613 INFO [train.py:1028] (0/2) Epoch 29, batch 9800, loss[loss=0.1793, simple_loss=0.2449, pruned_loss=0.05689, over 12940.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2662, pruned_loss=0.06854, over 2548017.10 frames. 
], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:39:48,762 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.531e+02 2.680e+02 2.870e+02 3.704e+02, threshold=5.360e+02, percent-clipped=0.0 2024-06-22 05:40:02,495 INFO [train.py:1028] (0/2) Epoch 29, batch 9850, loss[loss=0.2089, simple_loss=0.267, pruned_loss=0.07542, over 12994.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2652, pruned_loss=0.06802, over 2539637.79 frames. ], batch size: 102, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:40:03,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=537405.0, ans=0.0 2024-06-22 05:40:11,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=537423.3333333334, ans=0.125 2024-06-22 05:40:20,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=537441.6666666666, ans=0.09899494936611666 2024-06-22 05:40:34,213 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=537478.3333333334, ans=0.125 2024-06-22 05:40:35,318 INFO [train.py:1028] (0/2) Epoch 29, batch 9900, loss[loss=0.1829, simple_loss=0.2607, pruned_loss=0.05251, over 12877.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2648, pruned_loss=0.06826, over 2532010.99 frames. ], batch size: 39, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:40:53,031 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.480e+02 2.608e+02 2.805e+02 4.117e+02, threshold=5.216e+02, percent-clipped=0.0 2024-06-22 05:40:53,154 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=537533.3333333334, ans=0.0 2024-06-22 05:40:56,835 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=537551.6666666666, ans=0.125 2024-06-22 05:40:59,442 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:41:05,588 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2024-06-22 05:41:07,173 INFO [train.py:1028] (0/2) Epoch 29, batch 9950, loss[loss=0.2345, simple_loss=0.2924, pruned_loss=0.08827, over 12584.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.264, pruned_loss=0.0686, over 2526596.79 frames. ], batch size: 29, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:41:09,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.31 vs. 
limit=6.0 2024-06-22 05:41:09,793 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:41:09,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=537588.3333333334, ans=0.0 2024-06-22 05:41:10,558 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=537588.3333333334, ans=0.0 2024-06-22 05:41:12,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=537588.3333333334, ans=0.0 2024-06-22 05:41:21,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=537625.0, ans=0.025 2024-06-22 05:41:29,131 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=537643.3333333334, ans=0.125 2024-06-22 05:41:40,654 INFO [train.py:1028] (0/2) Epoch 29, batch 10000, loss[loss=0.2269, simple_loss=0.2904, pruned_loss=0.08172, over 12555.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2643, pruned_loss=0.06911, over 2488693.02 frames. ], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:41:46,838 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2024-06-22 05:41:48,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.43 vs. limit=12.0 2024-06-22 05:41:50,099 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=537698.3333333334, ans=0.125 2024-06-22 05:41:51,855 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=537698.3333333334, ans=0.025 2024-06-22 05:41:54,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=537716.6666666666, ans=0.2 2024-06-22 05:41:58,746 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.485e+02 2.655e+02 2.968e+02 3.749e+02, threshold=5.310e+02, percent-clipped=0.0 2024-06-22 05:42:03,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=537735.0, ans=0.05 2024-06-22 05:42:06,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=537753.3333333334, ans=0.125 2024-06-22 05:42:06,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=537753.3333333334, ans=0.2 2024-06-22 05:42:06,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.26 vs. limit=12.0 2024-06-22 05:42:11,794 INFO [train.py:1028] (0/2) Epoch 29, batch 10050, loss[loss=0.2027, simple_loss=0.2693, pruned_loss=0.06801, over 12550.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2646, pruned_loss=0.06991, over 2446888.49 frames. ], batch size: 22, lr: 1.97e-03, grad_scale: 32.0 2024-06-22 05:42:13,470 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.64 vs. 
limit=15.0 2024-06-22 05:42:24,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=537808.3333333334, ans=0.125 2024-06-22 05:42:26,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=537808.3333333334, ans=0.0 2024-06-22 05:42:33,659 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=15.0 2024-06-22 05:42:41,681 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=537863.3333333334, ans=0.125 2024-06-22 05:42:42,300 INFO [train.py:1028] (0/2) Epoch 29, batch 10100, loss[loss=0.2023, simple_loss=0.2649, pruned_loss=0.06992, over 10740.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2648, pruned_loss=0.06954, over 2425323.12 frames. ], batch size: 16, lr: 1.97e-03, grad_scale: 16.0 2024-06-22 05:42:48,564 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.97 vs. limit=15.0 2024-06-22 05:42:51,692 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2024-06-22 05:42:55,093 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-29.pt 2024-06-22 05:44:49,221 INFO [train.py:1028] (0/2) Epoch 30, batch 0, loss[loss=0.1778, simple_loss=0.2447, pruned_loss=0.05547, over 12966.00 frames. ], tot_loss[loss=0.1778, simple_loss=0.2447, pruned_loss=0.05547, over 12966.00 frames. ], batch size: 36, lr: 1.94e-03, grad_scale: 32.0 2024-06-22 05:44:49,223 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 05:44:53,028 INFO [zipformer.py:1858] (0/2) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.5134, 3.1226, 1.9701, 3.2106], device='cuda:0') 2024-06-22 05:44:56,331 INFO [train.py:1060] (0/2) Epoch 30, validation: loss=0.1949, simple_loss=0.2533, pruned_loss=0.06824, over 351949.00 frames. 2024-06-22 05:44:56,332 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 05:45:05,331 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.422e+02 2.550e+02 2.747e+02 3.297e+02, threshold=5.101e+02, percent-clipped=0.0 2024-06-22 05:45:10,303 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=537929.3333333334, ans=0.0 2024-06-22 05:45:13,972 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.23 vs. limit=15.0 2024-06-22 05:45:18,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=537947.6666666666, ans=0.125 2024-06-22 05:45:28,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=537966.0, ans=0.125 2024-06-22 05:45:32,865 INFO [train.py:1028] (0/2) Epoch 30, batch 50, loss[loss=0.1762, simple_loss=0.2438, pruned_loss=0.05428, over 12478.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2458, pruned_loss=0.06295, over 575048.70 frames. 
], batch size: 29, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:45:37,586 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=537984.3333333334, ans=0.0 2024-06-22 05:45:38,426 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.06 vs. limit=10.0 2024-06-22 05:45:45,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=538021.0, ans=0.0 2024-06-22 05:45:45,232 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2024-06-22 05:45:46,307 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=538021.0, ans=0.2 2024-06-22 05:45:46,313 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=538021.0, ans=0.125 2024-06-22 05:45:48,822 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=538021.0, ans=0.125 2024-06-22 05:45:51,046 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2024-06-22 05:46:07,037 INFO [train.py:1028] (0/2) Epoch 30, batch 100, loss[loss=0.1689, simple_loss=0.2372, pruned_loss=0.05036, over 13256.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2451, pruned_loss=0.06262, over 1017470.40 frames. ], batch size: 46, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:46:08,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=538076.0, ans=0.2 2024-06-22 05:46:15,605 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.380e+02 2.505e+02 2.685e+02 3.497e+02, threshold=5.010e+02, percent-clipped=0.0 2024-06-22 05:46:28,781 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:46:38,694 INFO [train.py:1028] (0/2) Epoch 30, batch 150, loss[loss=0.1856, simple_loss=0.2466, pruned_loss=0.06232, over 13107.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2445, pruned_loss=0.06129, over 1365833.83 frames. ], batch size: 30, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:46:39,421 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=538167.6666666666, ans=0.2 2024-06-22 05:46:47,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=538186.0, ans=0.0 2024-06-22 05:46:50,135 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.17 vs. 
limit=15.0 2024-06-22 05:46:58,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538222.6666666666, ans=0.1 2024-06-22 05:47:02,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538222.6666666666, ans=0.0 2024-06-22 05:47:04,919 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=538241.0, ans=0.0 2024-06-22 05:47:08,509 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=538241.0, ans=0.1 2024-06-22 05:47:10,374 INFO [train.py:1028] (0/2) Epoch 30, batch 200, loss[loss=0.205, simple_loss=0.2619, pruned_loss=0.074, over 12595.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2446, pruned_loss=0.06139, over 1634759.51 frames. ], batch size: 202, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:47:13,136 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=538259.3333333334, ans=0.125 2024-06-22 05:47:14,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.68 vs. limit=15.0 2024-06-22 05:47:22,579 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 2.312e+02 2.412e+02 2.619e+02 3.031e+02, threshold=4.825e+02, percent-clipped=0.0 2024-06-22 05:47:29,893 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.99 vs. limit=22.5 2024-06-22 05:47:37,471 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=538314.3333333334, ans=0.0 2024-06-22 05:47:39,341 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=538332.6666666666, ans=10.0 2024-06-22 05:47:45,594 INFO [train.py:1028] (0/2) Epoch 30, batch 250, loss[loss=0.1755, simple_loss=0.2173, pruned_loss=0.06683, over 12987.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2449, pruned_loss=0.06139, over 1845797.88 frames. ], batch size: 144, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:48:03,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=538387.6666666666, ans=0.2 2024-06-22 05:48:17,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538424.3333333334, ans=0.1 2024-06-22 05:48:19,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.const_attention_rate, batch_count=538424.3333333334, ans=0.025 2024-06-22 05:48:21,103 INFO [train.py:1028] (0/2) Epoch 30, batch 300, loss[loss=0.1859, simple_loss=0.2371, pruned_loss=0.06733, over 13168.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.245, pruned_loss=0.06139, over 2009484.75 frames. 
], batch size: 112, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:48:21,207 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=538442.6666666666, ans=0.125 2024-06-22 05:48:22,006 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=538442.6666666666, ans=0.125 2024-06-22 05:48:30,337 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.350e+02 2.494e+02 2.660e+02 3.197e+02, threshold=4.989e+02, percent-clipped=0.0 2024-06-22 05:48:36,747 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538479.3333333334, ans=0.0 2024-06-22 05:48:38,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=538479.3333333334, ans=0.0 2024-06-22 05:48:40,786 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=15.0 2024-06-22 05:48:42,233 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538497.6666666666, ans=0.1 2024-06-22 05:48:51,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=538516.0, ans=0.125 2024-06-22 05:48:52,964 INFO [train.py:1028] (0/2) Epoch 30, batch 350, loss[loss=0.1905, simple_loss=0.258, pruned_loss=0.06154, over 12986.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2443, pruned_loss=0.06117, over 2139078.30 frames. ], batch size: 33, lr: 1.94e-03, grad_scale: 16.0 2024-06-22 05:48:55,293 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=538534.3333333334, ans=0.025 2024-06-22 05:49:04,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=538552.6666666666, ans=0.05 2024-06-22 05:49:15,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2024-06-22 05:49:22,906 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=538607.6666666666, ans=0.125 2024-06-22 05:49:27,759 INFO [train.py:1028] (0/2) Epoch 30, batch 400, loss[loss=0.1837, simple_loss=0.2468, pruned_loss=0.0603, over 13306.00 frames. ], tot_loss[loss=0.1827, simple_loss=0.2442, pruned_loss=0.06067, over 2239351.46 frames. ], batch size: 63, lr: 1.94e-03, grad_scale: 32.0 2024-06-22 05:49:28,963 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.34 vs. limit=8.0 2024-06-22 05:49:36,790 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.360e+02 2.507e+02 2.786e+02 3.534e+02, threshold=5.014e+02, percent-clipped=0.0 2024-06-22 05:49:37,822 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.11 vs. limit=15.0 2024-06-22 05:49:46,768 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.26 vs. 
limit=15.0 2024-06-22 05:49:53,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=538699.3333333334, ans=0.125 2024-06-22 05:49:55,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=538699.3333333334, ans=0.125 2024-06-22 05:49:56,975 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=538699.3333333334, ans=0.0 2024-06-22 05:50:03,250 INFO [train.py:1028] (0/2) Epoch 30, batch 450, loss[loss=0.1798, simple_loss=0.2466, pruned_loss=0.05652, over 13266.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2439, pruned_loss=0.06032, over 2313872.98 frames. ], batch size: 67, lr: 1.94e-03, grad_scale: 32.0 2024-06-22 05:50:03,975 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:50:04,916 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.04 vs. limit=15.0 2024-06-22 05:50:06,628 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538717.6666666666, ans=0.0 2024-06-22 05:50:33,127 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=538791.0, ans=0.0 2024-06-22 05:50:34,971 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=538809.3333333334, ans=0.2 2024-06-22 05:50:35,427 INFO [train.py:1028] (0/2) Epoch 30, batch 500, loss[loss=0.1968, simple_loss=0.2489, pruned_loss=0.07237, over 13132.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2442, pruned_loss=0.0603, over 2376262.22 frames. ], batch size: 121, lr: 1.94e-03, grad_scale: 32.0 2024-06-22 05:50:37,184 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=538809.3333333334, ans=22.5 2024-06-22 05:50:38,121 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538809.3333333334, ans=0.1 2024-06-22 05:50:41,120 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5 2024-06-22 05:50:44,294 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.331e+02 2.433e+02 2.631e+02 3.376e+02, threshold=4.865e+02, percent-clipped=0.0 2024-06-22 05:50:44,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=538827.6666666666, ans=0.2 2024-06-22 05:50:45,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=538827.6666666666, ans=0.0 2024-06-22 05:50:46,007 INFO [scaling.py:1023] (0/2) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. 
limit=5.0 2024-06-22 05:50:56,234 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=538864.3333333334, ans=0.125 2024-06-22 05:50:56,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=538864.3333333334, ans=0.0 2024-06-22 05:51:04,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.35 vs. limit=15.0 2024-06-22 05:51:05,336 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=538882.6666666666, ans=0.2 2024-06-22 05:51:06,127 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2024-06-22 05:51:07,098 INFO [train.py:1028] (0/2) Epoch 30, batch 550, loss[loss=0.1892, simple_loss=0.2453, pruned_loss=0.06652, over 12953.00 frames. ], tot_loss[loss=0.1827, simple_loss=0.2443, pruned_loss=0.06059, over 2421241.76 frames. ], batch size: 158, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:51:13,761 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=538919.3333333334, ans=0.125 2024-06-22 05:51:18,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=538919.3333333334, ans=0.125 2024-06-22 05:51:19,504 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=538937.6666666666, ans=0.04949747468305833 2024-06-22 05:51:21,770 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=15.0 2024-06-22 05:51:22,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=538937.6666666666, ans=0.125 2024-06-22 05:51:28,117 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:51:41,774 INFO [train.py:1028] (0/2) Epoch 30, batch 600, loss[loss=0.1822, simple_loss=0.2378, pruned_loss=0.06329, over 13005.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2446, pruned_loss=0.06076, over 2459520.70 frames. ], batch size: 144, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:51:50,755 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.354e+02 2.518e+02 2.774e+02 4.347e+02, threshold=5.035e+02, percent-clipped=0.0 2024-06-22 05:51:50,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=539011.0, ans=0.0 2024-06-22 05:51:52,306 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=539011.0, ans=0.0 2024-06-22 05:52:07,061 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=539047.6666666666, ans=0.2 2024-06-22 05:52:17,390 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=539084.3333333334, ans=0.07 2024-06-22 05:52:17,829 INFO [train.py:1028] (0/2) Epoch 30, batch 650, loss[loss=0.1677, simple_loss=0.2362, pruned_loss=0.04957, over 13210.00 frames. 
], tot_loss[loss=0.1824, simple_loss=0.2444, pruned_loss=0.0602, over 2489980.04 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:52:18,600 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539084.3333333334, ans=0.1 2024-06-22 05:52:18,733 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=539084.3333333334, ans=0.125 2024-06-22 05:52:20,515 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 05:52:23,509 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.04 vs. limit=15.0 2024-06-22 05:52:44,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539157.6666666666, ans=0.1 2024-06-22 05:52:49,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=539176.0, ans=0.025 2024-06-22 05:52:50,065 INFO [train.py:1028] (0/2) Epoch 30, batch 700, loss[loss=0.1902, simple_loss=0.2553, pruned_loss=0.06261, over 13309.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2441, pruned_loss=0.06022, over 2512209.82 frames. ], batch size: 46, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:52:51,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=539176.0, ans=0.125 2024-06-22 05:52:54,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539176.0, ans=0.1 2024-06-22 05:52:58,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.84 vs. limit=15.0 2024-06-22 05:52:59,099 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.362e+02 2.508e+02 2.737e+02 3.151e+02, threshold=5.015e+02, percent-clipped=0.0 2024-06-22 05:53:00,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=539194.3333333334, ans=0.2 2024-06-22 05:53:03,009 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=539212.6666666666, ans=0.125 2024-06-22 05:53:06,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=539212.6666666666, ans=0.125 2024-06-22 05:53:22,958 INFO [train.py:1028] (0/2) Epoch 30, batch 750, loss[loss=0.1499, simple_loss=0.2205, pruned_loss=0.03968, over 13259.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2445, pruned_loss=0.05988, over 2526974.63 frames. ], batch size: 63, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:53:30,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.55 vs. 
limit=22.5 2024-06-22 05:53:42,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=539304.3333333334, ans=0.0 2024-06-22 05:53:44,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=539304.3333333334, ans=0.125 2024-06-22 05:53:59,886 INFO [train.py:1028] (0/2) Epoch 30, batch 800, loss[loss=0.1696, simple_loss=0.2341, pruned_loss=0.05254, over 12999.00 frames. ], tot_loss[loss=0.1827, simple_loss=0.245, pruned_loss=0.0602, over 2539826.90 frames. ], batch size: 36, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:54:09,524 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.337e+02 2.477e+02 2.640e+02 3.223e+02, threshold=4.954e+02, percent-clipped=0.0 2024-06-22 05:54:09,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=539377.6666666666, ans=0.125 2024-06-22 05:54:17,416 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=539396.0, ans=10.0 2024-06-22 05:54:35,302 INFO [train.py:1028] (0/2) Epoch 30, batch 850, loss[loss=0.1779, simple_loss=0.2349, pruned_loss=0.06041, over 13150.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2449, pruned_loss=0.06005, over 2550393.56 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:54:40,144 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=539451.0, ans=0.025 2024-06-22 05:54:53,533 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.67 vs. limit=10.0 2024-06-22 05:54:57,173 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=539506.0, ans=0.125 2024-06-22 05:55:07,654 INFO [train.py:1028] (0/2) Epoch 30, batch 900, loss[loss=0.19, simple_loss=0.2496, pruned_loss=0.06523, over 12914.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2446, pruned_loss=0.06024, over 2554625.97 frames. ], batch size: 36, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:55:10,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=539542.6666666666, ans=0.125 2024-06-22 05:55:14,272 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=539561.0, ans=0.125 2024-06-22 05:55:17,890 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.408e+02 2.515e+02 2.772e+02 3.380e+02, threshold=5.030e+02, percent-clipped=0.0 2024-06-22 05:55:24,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=539579.3333333334, ans=0.125 2024-06-22 05:55:32,531 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=539597.6666666666, ans=0.125 2024-06-22 05:55:44,902 INFO [train.py:1028] (0/2) Epoch 30, batch 950, loss[loss=0.1778, simple_loss=0.2386, pruned_loss=0.05846, over 12912.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2448, pruned_loss=0.06057, over 2558449.23 frames. 
], batch size: 39, lr: 1.93e-03, grad_scale: 16.0 2024-06-22 05:55:52,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=539652.6666666666, ans=0.1 2024-06-22 05:55:53,034 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2024-06-22 05:55:55,853 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=539652.6666666666, ans=0.125 2024-06-22 05:56:01,074 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0 2024-06-22 05:56:13,685 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=539707.6666666666, ans=0.025 2024-06-22 05:56:18,907 INFO [train.py:1028] (0/2) Epoch 30, batch 1000, loss[loss=0.1944, simple_loss=0.2566, pruned_loss=0.06616, over 13352.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2445, pruned_loss=0.06057, over 2560954.86 frames. ], batch size: 49, lr: 1.93e-03, grad_scale: 16.0 2024-06-22 05:56:29,076 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.364e+02 2.532e+02 2.803e+02 4.075e+02, threshold=5.064e+02, percent-clipped=0.0 2024-06-22 05:56:50,720 INFO [train.py:1028] (0/2) Epoch 30, batch 1050, loss[loss=0.1794, simple_loss=0.2459, pruned_loss=0.05643, over 13126.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2449, pruned_loss=0.06056, over 2564809.50 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 16.0 2024-06-22 05:56:52,692 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=539817.6666666666, ans=0.1 2024-06-22 05:57:02,704 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.74 vs. limit=15.0 2024-06-22 05:57:07,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=539854.3333333334, ans=0.0 2024-06-22 05:57:08,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=539854.3333333334, ans=0.125 2024-06-22 05:57:10,138 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2024-06-22 05:57:11,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=539872.6666666666, ans=0.125 2024-06-22 05:57:13,876 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=539872.6666666666, ans=0.125 2024-06-22 05:57:22,780 INFO [train.py:1028] (0/2) Epoch 30, batch 1100, loss[loss=0.1833, simple_loss=0.2446, pruned_loss=0.06101, over 13231.00 frames. ], tot_loss[loss=0.1832, simple_loss=0.2454, pruned_loss=0.06051, over 2570447.14 frames. 
], batch size: 52, lr: 1.93e-03, grad_scale: 16.0 2024-06-22 05:57:24,694 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.412e+01 2024-06-22 05:57:32,994 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.376e+02 2.467e+02 2.614e+02 3.978e+02, threshold=4.933e+02, percent-clipped=0.0 2024-06-22 05:57:33,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=539927.6666666666, ans=0.125 2024-06-22 05:57:44,442 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=539964.3333333334, ans=0.1 2024-06-22 05:57:52,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=539982.6666666666, ans=0.125 2024-06-22 05:57:55,695 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=539982.6666666666, ans=0.125 2024-06-22 05:57:57,560 INFO [train.py:1028] (0/2) Epoch 30, batch 1150, loss[loss=0.176, simple_loss=0.2411, pruned_loss=0.05546, over 13245.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2455, pruned_loss=0.06066, over 2570961.18 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 16.0 2024-06-22 05:58:02,521 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.66 vs. limit=6.0 2024-06-22 05:58:14,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2024-06-22 05:58:16,451 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=540037.6666666666, ans=0.2 2024-06-22 05:58:33,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=540074.3333333334, ans=0.125 2024-06-22 05:58:33,671 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540092.6666666666, ans=0.1 2024-06-22 05:58:34,174 INFO [train.py:1028] (0/2) Epoch 30, batch 1200, loss[loss=0.1751, simple_loss=0.2444, pruned_loss=0.05292, over 13205.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2456, pruned_loss=0.06101, over 2573371.91 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:58:35,549 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=540092.6666666666, ans=0.125 2024-06-22 05:58:37,802 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.40 vs. 
limit=15.0 2024-06-22 05:58:44,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=540111.0, ans=0.05 2024-06-22 05:58:44,486 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.319e+02 2.510e+02 2.704e+02 3.574e+02, threshold=5.020e+02, percent-clipped=0.0 2024-06-22 05:59:00,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=540166.0, ans=0.0 2024-06-22 05:59:04,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=540166.0, ans=0.1 2024-06-22 05:59:05,736 INFO [train.py:1028] (0/2) Epoch 30, batch 1250, loss[loss=0.1846, simple_loss=0.2494, pruned_loss=0.05988, over 13162.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2453, pruned_loss=0.0609, over 2582468.03 frames. ], batch size: 112, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:59:13,597 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=540202.6666666666, ans=0.025 2024-06-22 05:59:16,005 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=540202.6666666666, ans=0.025 2024-06-22 05:59:16,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=540202.6666666666, ans=0.125 2024-06-22 05:59:28,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=540239.3333333334, ans=0.2 2024-06-22 05:59:29,383 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=540239.3333333334, ans=0.2 2024-06-22 05:59:34,792 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=540257.6666666666, ans=0.0 2024-06-22 05:59:37,164 INFO [train.py:1028] (0/2) Epoch 30, batch 1300, loss[loss=0.1869, simple_loss=0.244, pruned_loss=0.06485, over 12736.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2458, pruned_loss=0.06101, over 2582913.05 frames. ], batch size: 176, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 05:59:42,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=540276.0, ans=0.125 2024-06-22 05:59:43,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2024-06-22 05:59:50,072 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.389e+02 2.492e+02 2.652e+02 3.442e+02, threshold=4.984e+02, percent-clipped=0.0 2024-06-22 05:59:58,164 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.37 vs. limit=15.0 2024-06-22 05:59:59,248 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540331.0, ans=0.1 2024-06-22 06:00:15,490 INFO [train.py:1028] (0/2) Epoch 30, batch 1350, loss[loss=0.1778, simple_loss=0.2467, pruned_loss=0.0544, over 13209.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2458, pruned_loss=0.06093, over 2585752.93 frames. 
], batch size: 59, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:00:28,360 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=540404.3333333334, ans=0.125 2024-06-22 06:00:38,364 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=540422.6666666666, ans=0.125 2024-06-22 06:00:40,296 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=540422.6666666666, ans=0.0 2024-06-22 06:00:49,325 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2024-06-22 06:00:49,469 INFO [train.py:1028] (0/2) Epoch 30, batch 1400, loss[loss=0.1858, simple_loss=0.2557, pruned_loss=0.05796, over 12499.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2456, pruned_loss=0.06097, over 2587981.29 frames. ], batch size: 25, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:01:00,083 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.386e+02 2.475e+02 2.724e+02 3.520e+02, threshold=4.950e+02, percent-clipped=0.0 2024-06-22 06:01:06,969 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.76 vs. limit=15.0 2024-06-22 06:01:14,269 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=540514.3333333334, ans=0.0 2024-06-22 06:01:19,568 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.const_attention_rate, batch_count=540532.6666666666, ans=0.025 2024-06-22 06:01:22,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=540551.0, ans=0.1 2024-06-22 06:01:22,477 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2024-06-22 06:01:22,595 INFO [train.py:1028] (0/2) Epoch 30, batch 1450, loss[loss=0.1716, simple_loss=0.2254, pruned_loss=0.05888, over 13064.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2451, pruned_loss=0.06107, over 2587526.28 frames. ], batch size: 121, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:01:26,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540551.0, ans=0.1 2024-06-22 06:01:28,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=540569.3333333334, ans=0.0 2024-06-22 06:01:33,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=15.0 2024-06-22 06:01:48,794 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=540606.0, ans=0.2 2024-06-22 06:01:58,161 INFO [train.py:1028] (0/2) Epoch 30, batch 1500, loss[loss=0.1868, simple_loss=0.2446, pruned_loss=0.06454, over 13182.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2455, pruned_loss=0.06133, over 2589725.47 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:01:59,797 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.12 vs. 
limit=15.0 2024-06-22 06:02:00,927 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=540642.6666666666, ans=0.125 2024-06-22 06:02:00,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=540642.6666666666, ans=0.125 2024-06-22 06:02:00,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=540642.6666666666, ans=0.2 2024-06-22 06:02:08,551 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.410e+02 2.593e+02 2.839e+02 3.544e+02, threshold=5.186e+02, percent-clipped=0.0 2024-06-22 06:02:11,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=540679.3333333334, ans=0.0 2024-06-22 06:02:19,213 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=15.0 2024-06-22 06:02:19,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=540679.3333333334, ans=0.1 2024-06-22 06:02:20,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=540697.6666666666, ans=0.0 2024-06-22 06:02:33,850 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540734.3333333334, ans=0.1 2024-06-22 06:02:34,413 INFO [train.py:1028] (0/2) Epoch 30, batch 1550, loss[loss=0.1934, simple_loss=0.2499, pruned_loss=0.06852, over 12991.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.2457, pruned_loss=0.06123, over 2584734.47 frames. ], batch size: 102, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:02:42,947 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=540752.6666666666, ans=0.0 2024-06-22 06:02:51,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=540771.0, ans=0.0 2024-06-22 06:02:52,410 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=15.0 2024-06-22 06:02:55,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=540789.3333333334, ans=10.0 2024-06-22 06:02:56,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=540789.3333333334, ans=0.0 2024-06-22 06:02:56,978 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540789.3333333334, ans=0.1 2024-06-22 06:03:02,368 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=540807.6666666666, ans=0.125 2024-06-22 06:03:04,225 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=540807.6666666666, ans=0.0 2024-06-22 06:03:07,061 INFO [train.py:1028] (0/2) Epoch 30, batch 1600, loss[loss=0.1751, simple_loss=0.2397, pruned_loss=0.05529, over 13154.00 frames. 
], tot_loss[loss=0.1841, simple_loss=0.2457, pruned_loss=0.06119, over 2579575.64 frames. ], batch size: 77, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:03:12,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=540826.0, ans=0.125 2024-06-22 06:03:12,843 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=540844.3333333334, ans=0.125 2024-06-22 06:03:17,153 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.453e+02 2.672e+02 2.946e+02 4.286e+02, threshold=5.345e+02, percent-clipped=0.0 2024-06-22 06:03:18,657 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=540844.3333333334, ans=0.125 2024-06-22 06:03:31,879 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540881.0, ans=0.1 2024-06-22 06:03:32,481 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=540881.0, ans=0.1 2024-06-22 06:03:40,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540899.3333333334, ans=0.1 2024-06-22 06:03:42,031 INFO [train.py:1028] (0/2) Epoch 30, batch 1650, loss[loss=0.1963, simple_loss=0.2496, pruned_loss=0.07152, over 13135.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2459, pruned_loss=0.0614, over 2574913.17 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:03:43,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=540917.6666666666, ans=0.125 2024-06-22 06:04:00,111 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=540954.3333333334, ans=0.125 2024-06-22 06:04:17,533 INFO [train.py:1028] (0/2) Epoch 30, batch 1700, loss[loss=0.1919, simple_loss=0.2632, pruned_loss=0.06025, over 12971.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.246, pruned_loss=0.06114, over 2580770.77 frames. ], batch size: 26, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:04:25,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=541027.6666666666, ans=0.0 2024-06-22 06:04:27,497 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.418e+02 2.543e+02 2.829e+02 4.265e+02, threshold=5.087e+02, percent-clipped=0.0 2024-06-22 06:04:42,107 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=541082.6666666666, ans=0.125 2024-06-22 06:04:47,786 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=541082.6666666666, ans=0.1 2024-06-22 06:04:49,475 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2024-06-22 06:04:49,742 INFO [train.py:1028] (0/2) Epoch 30, batch 1750, loss[loss=0.1818, simple_loss=0.2499, pruned_loss=0.05683, over 12515.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2462, pruned_loss=0.06117, over 2581660.98 frames. 
], batch size: 22, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:04:54,229 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.00 vs. limit=22.5 2024-06-22 06:04:58,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=541119.3333333334, ans=0.125 2024-06-22 06:05:01,943 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=541119.3333333334, ans=0.125 2024-06-22 06:05:13,901 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.02 vs. limit=22.5 2024-06-22 06:05:22,201 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=541192.6666666666, ans=0.125 2024-06-22 06:05:22,603 INFO [train.py:1028] (0/2) Epoch 30, batch 1800, loss[loss=0.1761, simple_loss=0.2418, pruned_loss=0.05519, over 13248.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2461, pruned_loss=0.06126, over 2581550.69 frames. ], batch size: 67, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:05:25,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=541192.6666666666, ans=0.025 2024-06-22 06:05:31,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=541211.0, ans=0.0 2024-06-22 06:05:32,498 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.42 vs. limit=15.0 2024-06-22 06:05:33,144 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.82 vs. limit=15.0 2024-06-22 06:05:33,345 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.371e+02 2.539e+02 2.705e+02 3.332e+02, threshold=5.078e+02, percent-clipped=0.0 2024-06-22 06:05:36,460 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=541229.3333333334, ans=22.5 2024-06-22 06:05:44,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=541247.6666666666, ans=0.025 2024-06-22 06:05:47,950 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=541247.6666666666, ans=0.5 2024-06-22 06:05:57,422 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=541266.0, ans=0.0 2024-06-22 06:05:58,478 INFO [train.py:1028] (0/2) Epoch 30, batch 1850, loss[loss=0.1733, simple_loss=0.2286, pruned_loss=0.05904, over 13226.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2467, pruned_loss=0.06154, over 2583388.73 frames. 
], batch size: 83, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:05:59,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=541284.3333333334, ans=0.125 2024-06-22 06:06:00,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=541284.3333333334, ans=0.0 2024-06-22 06:06:01,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=541284.3333333334, ans=10.0 2024-06-22 06:06:09,261 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=541302.6666666666, ans=0.125 2024-06-22 06:06:23,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=541339.3333333334, ans=0.0 2024-06-22 06:06:33,795 INFO [train.py:1028] (0/2) Epoch 30, batch 1900, loss[loss=0.1749, simple_loss=0.2382, pruned_loss=0.05579, over 13158.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2464, pruned_loss=0.06155, over 2586137.55 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:06:44,663 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.388e+02 2.495e+02 2.679e+02 3.422e+02, threshold=4.990e+02, percent-clipped=0.0 2024-06-22 06:06:56,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=541431.0, ans=0.125 2024-06-22 06:06:59,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=541431.0, ans=0.0 2024-06-22 06:07:02,265 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.18 vs. limit=22.5 2024-06-22 06:07:06,943 INFO [train.py:1028] (0/2) Epoch 30, batch 1950, loss[loss=0.1638, simple_loss=0.2292, pruned_loss=0.04923, over 13310.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2461, pruned_loss=0.06148, over 2592880.72 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:07:07,394 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=15.0 2024-06-22 06:07:15,871 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.24 vs. limit=22.5 2024-06-22 06:07:30,203 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=541522.6666666666, ans=0.1 2024-06-22 06:07:41,560 INFO [train.py:1028] (0/2) Epoch 30, batch 2000, loss[loss=0.182, simple_loss=0.2548, pruned_loss=0.05455, over 12800.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2457, pruned_loss=0.06142, over 2588262.88 frames. 
], batch size: 22, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:07:47,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=541559.3333333334, ans=0.1 2024-06-22 06:07:51,768 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.361e+02 2.521e+02 2.752e+02 3.459e+02, threshold=5.043e+02, percent-clipped=0.0 2024-06-22 06:07:58,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=541596.0, ans=0.125 2024-06-22 06:08:04,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=541614.3333333334, ans=0.2 2024-06-22 06:08:16,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=541651.0, ans=0.125 2024-06-22 06:08:17,069 INFO [train.py:1028] (0/2) Epoch 30, batch 2050, loss[loss=0.1899, simple_loss=0.2555, pruned_loss=0.06216, over 12639.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2464, pruned_loss=0.06189, over 2583774.64 frames. ], batch size: 29, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:08:27,238 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=541669.3333333334, ans=0.0 2024-06-22 06:08:27,897 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=541669.3333333334, ans=0.125 2024-06-22 06:08:32,865 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=541687.6666666666, ans=0.125 2024-06-22 06:08:38,907 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.82 vs. limit=15.0 2024-06-22 06:08:40,699 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.55 vs. limit=15.0 2024-06-22 06:08:43,219 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=541724.3333333334, ans=0.07 2024-06-22 06:08:50,123 INFO [train.py:1028] (0/2) Epoch 30, batch 2100, loss[loss=0.1738, simple_loss=0.2396, pruned_loss=0.05401, over 13220.00 frames. ], tot_loss[loss=0.1844, simple_loss=0.2462, pruned_loss=0.06134, over 2586625.01 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:08:50,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=541742.6666666666, ans=0.2 2024-06-22 06:08:55,550 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=541742.6666666666, ans=0.2 2024-06-22 06:09:00,611 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.370e+02 2.519e+02 2.733e+02 3.736e+02, threshold=5.037e+02, percent-clipped=0.0 2024-06-22 06:09:08,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.13 vs. limit=10.0 2024-06-22 06:09:23,263 INFO [train.py:1028] (0/2) Epoch 30, batch 2150, loss[loss=0.174, simple_loss=0.2405, pruned_loss=0.05377, over 13293.00 frames. ], tot_loss[loss=0.1841, simple_loss=0.246, pruned_loss=0.06105, over 2589333.84 frames. 
], batch size: 52, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:09:25,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=541834.3333333334, ans=0.125 2024-06-22 06:09:28,222 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2024-06-22 06:09:44,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=541871.0, ans=0.125 2024-06-22 06:09:54,683 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=541907.6666666666, ans=0.125 2024-06-22 06:09:56,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=541907.6666666666, ans=0.125 2024-06-22 06:09:59,268 INFO [train.py:1028] (0/2) Epoch 30, batch 2200, loss[loss=0.1833, simple_loss=0.2399, pruned_loss=0.06341, over 13189.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2465, pruned_loss=0.06123, over 2589217.90 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:10:05,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=541944.3333333334, ans=0.025 2024-06-22 06:10:09,393 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.335e+02 2.468e+02 2.613e+02 4.116e+02, threshold=4.937e+02, percent-clipped=0.0 2024-06-22 06:10:12,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=541962.6666666666, ans=0.1 2024-06-22 06:10:30,282 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=541999.3333333334, ans=0.125 2024-06-22 06:10:34,783 INFO [train.py:1028] (0/2) Epoch 30, batch 2250, loss[loss=0.1818, simple_loss=0.2457, pruned_loss=0.059, over 13262.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2465, pruned_loss=0.06139, over 2587882.14 frames. ], batch size: 63, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:10:43,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=542036.0, ans=0.125 2024-06-22 06:10:48,599 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.54 vs. limit=15.0 2024-06-22 06:10:48,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=542054.3333333334, ans=0.07 2024-06-22 06:10:59,689 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=542072.6666666666, ans=0.0 2024-06-22 06:11:03,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=542091.0, ans=0.125 2024-06-22 06:11:07,704 INFO [train.py:1028] (0/2) Epoch 30, batch 2300, loss[loss=0.177, simple_loss=0.2396, pruned_loss=0.05714, over 12868.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2464, pruned_loss=0.06141, over 2582664.30 frames. 
], batch size: 33, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:11:12,391 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=542109.3333333334, ans=0.1 2024-06-22 06:11:15,731 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542127.6666666666, ans=0.1 2024-06-22 06:11:18,256 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.338e+02 2.548e+02 2.789e+02 4.157e+02, threshold=5.095e+02, percent-clipped=0.0 2024-06-22 06:11:19,958 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0 2024-06-22 06:11:19,989 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.04 vs. limit=22.5 2024-06-22 06:11:25,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=542146.0, ans=0.125 2024-06-22 06:11:27,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542164.3333333334, ans=0.1 2024-06-22 06:11:30,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542164.3333333334, ans=0.1 2024-06-22 06:11:33,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=542182.6666666666, ans=0.125 2024-06-22 06:11:43,065 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=542201.0, ans=0.125 2024-06-22 06:11:43,594 INFO [train.py:1028] (0/2) Epoch 30, batch 2350, loss[loss=0.1902, simple_loss=0.247, pruned_loss=0.06669, over 13200.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.2469, pruned_loss=0.06174, over 2586280.47 frames. ], batch size: 67, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:12:18,401 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=542274.3333333334, ans=0.125 2024-06-22 06:12:19,681 INFO [train.py:1028] (0/2) Epoch 30, batch 2400, loss[loss=0.1675, simple_loss=0.2291, pruned_loss=0.05289, over 13307.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2461, pruned_loss=0.06146, over 2588961.14 frames. 
], batch size: 46, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:12:20,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=542292.6666666666, ans=0.125 2024-06-22 06:12:23,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=542292.6666666666, ans=0.0 2024-06-22 06:12:30,021 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.354e+02 2.481e+02 2.785e+02 3.991e+02, threshold=4.963e+02, percent-clipped=0.0 2024-06-22 06:12:39,570 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542347.6666666666, ans=0.1 2024-06-22 06:12:40,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=542347.6666666666, ans=0.125 2024-06-22 06:12:45,273 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542366.0, ans=0.1 2024-06-22 06:12:46,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=542366.0, ans=0.125 2024-06-22 06:12:52,287 INFO [train.py:1028] (0/2) Epoch 30, batch 2450, loss[loss=0.1672, simple_loss=0.2347, pruned_loss=0.04988, over 13260.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2449, pruned_loss=0.06134, over 2585825.85 frames. ], batch size: 63, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:13:09,247 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=542421.0, ans=15.0 2024-06-22 06:13:13,101 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.13 vs. limit=12.0 2024-06-22 06:13:15,004 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=542439.3333333334, ans=0.125 2024-06-22 06:13:24,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542476.0, ans=0.1 2024-06-22 06:13:24,812 INFO [train.py:1028] (0/2) Epoch 30, batch 2500, loss[loss=0.1728, simple_loss=0.2311, pruned_loss=0.05723, over 13205.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2441, pruned_loss=0.06118, over 2588494.29 frames. ], batch size: 83, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:13:39,411 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=542494.3333333334, ans=0.125 2024-06-22 06:13:39,795 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.396e+02 2.550e+02 2.760e+02 4.922e+02, threshold=5.101e+02, percent-clipped=0.0 2024-06-22 06:13:42,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=542512.6666666666, ans=0.0 2024-06-22 06:14:02,685 INFO [train.py:1028] (0/2) Epoch 30, batch 2550, loss[loss=0.1947, simple_loss=0.2571, pruned_loss=0.06618, over 12774.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2432, pruned_loss=0.06092, over 2587658.94 frames. 
], batch size: 22, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:14:11,518 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.07 vs. limit=6.0 2024-06-22 06:14:17,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=542586.0, ans=0.0 2024-06-22 06:14:25,348 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.78 vs. limit=22.5 2024-06-22 06:14:27,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=542622.6666666666, ans=0.125 2024-06-22 06:14:38,219 INFO [train.py:1028] (0/2) Epoch 30, batch 2600, loss[loss=0.1798, simple_loss=0.2469, pruned_loss=0.05636, over 13241.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2424, pruned_loss=0.06054, over 2587015.51 frames. ], batch size: 52, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:14:40,501 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-296000.pt 2024-06-22 06:14:46,849 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.89 vs. limit=15.0 2024-06-22 06:14:47,743 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=542659.3333333334, ans=0.0 2024-06-22 06:14:52,297 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=542677.6666666666, ans=0.0 2024-06-22 06:14:54,231 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.340e+02 2.502e+02 2.665e+02 3.339e+02, threshold=5.004e+02, percent-clipped=0.0 2024-06-22 06:14:59,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=542696.0, ans=0.125 2024-06-22 06:15:00,734 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=542696.0, ans=0.2 2024-06-22 06:15:02,663 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=542714.3333333334, ans=0.0 2024-06-22 06:15:08,470 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542714.3333333334, ans=0.1 2024-06-22 06:15:16,534 INFO [train.py:1028] (0/2) Epoch 30, batch 2650, loss[loss=0.1815, simple_loss=0.2335, pruned_loss=0.06474, over 12988.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2408, pruned_loss=0.05975, over 2588844.99 frames. ], batch size: 144, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:15:17,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542751.0, ans=0.1 2024-06-22 06:15:19,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=542751.0, ans=0.95 2024-06-22 06:15:22,885 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2024-06-22 06:15:54,474 INFO [train.py:1028] (0/2) Epoch 30, batch 2700, loss[loss=0.1866, simple_loss=0.2376, pruned_loss=0.06779, over 13265.00 frames. 
], tot_loss[loss=0.1795, simple_loss=0.2393, pruned_loss=0.05984, over 2586694.60 frames. ], batch size: 89, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:15:59,951 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=542842.6666666666, ans=0.1 2024-06-22 06:16:05,091 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.341e+02 2.483e+02 2.699e+02 3.731e+02, threshold=4.966e+02, percent-clipped=0.0 2024-06-22 06:16:19,105 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=542897.6666666666, ans=0.125 2024-06-22 06:16:22,991 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=542916.0, ans=0.125 2024-06-22 06:16:30,502 INFO [train.py:1028] (0/2) Epoch 30, batch 2750, loss[loss=0.167, simple_loss=0.226, pruned_loss=0.05401, over 13290.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2383, pruned_loss=0.05916, over 2583280.82 frames. ], batch size: 43, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:16:30,951 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.34 vs. limit=15.0 2024-06-22 06:16:39,006 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:16:40,671 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.52 vs. limit=6.0 2024-06-22 06:16:49,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542971.0, ans=0.1 2024-06-22 06:16:50,557 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.29 vs. limit=22.5 2024-06-22 06:16:53,125 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=12.0 2024-06-22 06:17:03,209 INFO [train.py:1028] (0/2) Epoch 30, batch 2800, loss[loss=0.1855, simple_loss=0.2362, pruned_loss=0.06743, over 10882.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2377, pruned_loss=0.05911, over 2581506.30 frames. 
], batch size: 304, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:17:07,795 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.const_attention_rate, batch_count=543026.0, ans=0.025 2024-06-22 06:17:13,640 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.319e+02 2.458e+02 2.654e+02 3.588e+02, threshold=4.915e+02, percent-clipped=0.0 2024-06-22 06:17:15,929 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=543062.6666666666, ans=0.2 2024-06-22 06:17:16,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=543062.6666666666, ans=0.025 2024-06-22 06:17:31,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=543081.0, ans=0.125 2024-06-22 06:17:32,653 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=543099.3333333334, ans=0.125 2024-06-22 06:17:33,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=543099.3333333334, ans=0.2 2024-06-22 06:17:37,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=543099.3333333334, ans=0.1 2024-06-22 06:17:37,552 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2024-06-22 06:17:38,642 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=543117.6666666666, ans=0.125 2024-06-22 06:17:39,209 INFO [train.py:1028] (0/2) Epoch 30, batch 2850, loss[loss=0.1848, simple_loss=0.2511, pruned_loss=0.05929, over 13304.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.237, pruned_loss=0.05885, over 2579156.23 frames. ], batch size: 49, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:17:43,158 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=543117.6666666666, ans=0.0 2024-06-22 06:17:48,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543136.0, ans=0.125 2024-06-22 06:17:49,458 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:18:07,396 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=543191.0, ans=0.125 2024-06-22 06:18:13,581 INFO [train.py:1028] (0/2) Epoch 30, batch 2900, loss[loss=0.1713, simple_loss=0.2359, pruned_loss=0.05329, over 13175.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.236, pruned_loss=0.05878, over 2586897.34 frames. 
], batch size: 55, lr: 1.93e-03, grad_scale: 32.0 2024-06-22 06:18:17,894 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=543209.3333333334, ans=0.125 2024-06-22 06:18:18,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=543209.3333333334, ans=0.0 2024-06-22 06:18:24,492 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.317e+02 2.460e+02 2.689e+02 4.047e+02, threshold=4.921e+02, percent-clipped=0.0 2024-06-22 06:18:24,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=543227.6666666666, ans=0.2 2024-06-22 06:18:35,191 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=543264.3333333334, ans=0.2 2024-06-22 06:18:41,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=543282.6666666666, ans=0.125 2024-06-22 06:18:45,877 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=543282.6666666666, ans=0.2 2024-06-22 06:18:46,936 INFO [train.py:1028] (0/2) Epoch 30, batch 2950, loss[loss=0.1684, simple_loss=0.2308, pruned_loss=0.05297, over 13283.00 frames. ], tot_loss[loss=0.177, simple_loss=0.2362, pruned_loss=0.05894, over 2579814.37 frames. ], batch size: 43, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:19:10,384 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=543356.0, ans=0.0 2024-06-22 06:19:17,618 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=543374.3333333334, ans=0.025 2024-06-22 06:19:18,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=543374.3333333334, ans=0.125 2024-06-22 06:19:24,251 INFO [train.py:1028] (0/2) Epoch 30, batch 3000, loss[loss=0.169, simple_loss=0.232, pruned_loss=0.05301, over 13198.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2351, pruned_loss=0.05844, over 2579175.55 frames. ], batch size: 59, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:19:24,252 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 06:19:32,228 INFO [train.py:1060] (0/2) Epoch 30, validation: loss=0.194, simple_loss=0.252, pruned_loss=0.06799, over 351949.00 frames. 2024-06-22 06:19:32,228 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 06:19:37,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=543392.6666666666, ans=0.2 2024-06-22 06:19:41,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=543411.0, ans=0.0 2024-06-22 06:19:43,401 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.334e+02 2.438e+02 2.640e+02 3.568e+02, threshold=4.875e+02, percent-clipped=0.0 2024-06-22 06:19:44,497 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0 2024-06-22 06:19:45,027 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.50 vs. 
limit=22.5 2024-06-22 06:19:48,137 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=543429.3333333334, ans=0.125 2024-06-22 06:19:52,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=543447.6666666666, ans=0.125 2024-06-22 06:20:09,603 INFO [train.py:1028] (0/2) Epoch 30, batch 3050, loss[loss=0.1732, simple_loss=0.2325, pruned_loss=0.05698, over 13324.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2344, pruned_loss=0.05862, over 2579323.22 frames. ], batch size: 46, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:20:20,529 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=543502.6666666666, ans=0.0 2024-06-22 06:20:23,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=543521.0, ans=0.0 2024-06-22 06:20:25,208 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=543521.0, ans=0.0 2024-06-22 06:20:29,251 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=543539.3333333334, ans=0.07 2024-06-22 06:20:29,776 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=543539.3333333334, ans=0.125 2024-06-22 06:20:41,074 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:20:41,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=543576.0, ans=0.125 2024-06-22 06:20:41,561 INFO [train.py:1028] (0/2) Epoch 30, batch 3100, loss[loss=0.1752, simple_loss=0.2265, pruned_loss=0.06197, over 13052.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.2334, pruned_loss=0.05794, over 2579906.03 frames. 
], batch size: 144, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:20:43,098 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=543576.0, ans=0.0 2024-06-22 06:20:52,097 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.306e+02 2.496e+02 2.751e+02 3.896e+02, threshold=4.992e+02, percent-clipped=0.0 2024-06-22 06:20:52,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=543594.3333333334, ans=0.125 2024-06-22 06:20:53,632 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=543594.3333333334, ans=0.0 2024-06-22 06:21:00,698 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=543631.0, ans=0.125 2024-06-22 06:21:02,019 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543631.0, ans=0.1 2024-06-22 06:21:02,112 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=543631.0, ans=0.125 2024-06-22 06:21:12,448 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=543649.3333333334, ans=0.035 2024-06-22 06:21:13,618 INFO [train.py:1028] (0/2) Epoch 30, batch 3150, loss[loss=0.1738, simple_loss=0.2214, pruned_loss=0.06305, over 12940.00 frames. ], tot_loss[loss=0.1738, simple_loss=0.2323, pruned_loss=0.05764, over 2581343.61 frames. ], batch size: 158, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:21:24,242 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=543686.0, ans=0.0 2024-06-22 06:21:27,515 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.85 vs. limit=15.0 2024-06-22 06:21:32,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=543704.3333333334, ans=0.0 2024-06-22 06:21:47,626 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=543741.0, ans=0.125 2024-06-22 06:21:49,673 INFO [train.py:1028] (0/2) Epoch 30, batch 3200, loss[loss=0.1849, simple_loss=0.249, pruned_loss=0.06043, over 13151.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2322, pruned_loss=0.05733, over 2582384.34 frames. ], batch size: 55, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:21:50,093 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.32 vs. limit=15.0 2024-06-22 06:21:59,106 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=543777.6666666666, ans=0.025 2024-06-22 06:21:59,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=543777.6666666666, ans=0.0 2024-06-22 06:22:00,253 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.301e+02 2.438e+02 2.621e+02 3.469e+02, threshold=4.876e+02, percent-clipped=0.0 2024-06-22 06:22:08,865 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.08 vs. 
limit=15.0 2024-06-22 06:22:15,041 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=543814.3333333334, ans=0.2 2024-06-22 06:22:19,152 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=12.0 2024-06-22 06:22:24,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=543832.6666666666, ans=0.0 2024-06-22 06:22:25,113 INFO [train.py:1028] (0/2) Epoch 30, batch 3250, loss[loss=0.1512, simple_loss=0.2178, pruned_loss=0.04232, over 13256.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2317, pruned_loss=0.05728, over 2586886.71 frames. ], batch size: 72, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:22:25,415 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2024-06-22 06:22:31,658 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.09 vs. limit=22.5 2024-06-22 06:22:36,884 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=543869.3333333334, ans=0.125 2024-06-22 06:22:48,071 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=543906.0, ans=0.125 2024-06-22 06:22:49,873 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.63 vs. limit=22.5 2024-06-22 06:22:53,369 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=543924.3333333334, ans=0.1 2024-06-22 06:22:56,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=543924.3333333334, ans=0.1 2024-06-22 06:22:57,727 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.58 vs. limit=15.0 2024-06-22 06:22:58,603 INFO [train.py:1028] (0/2) Epoch 30, batch 3300, loss[loss=0.1749, simple_loss=0.23, pruned_loss=0.05987, over 12771.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2313, pruned_loss=0.0573, over 2583926.42 frames. 
], batch size: 176, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:23:03,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=543942.6666666666, ans=0.05 2024-06-22 06:23:09,239 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.367e+02 2.528e+02 2.813e+02 3.945e+02, threshold=5.056e+02, percent-clipped=0.0 2024-06-22 06:23:16,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=543979.3333333334, ans=0.1 2024-06-22 06:23:30,170 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=544016.0, ans=0.125 2024-06-22 06:23:30,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=544016.0, ans=0.02 2024-06-22 06:23:32,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=544016.0, ans=0.0 2024-06-22 06:23:32,160 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=544016.0, ans=0.0 2024-06-22 06:23:32,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=544016.0, ans=0.1 2024-06-22 06:23:34,621 INFO [train.py:1028] (0/2) Epoch 30, batch 3350, loss[loss=0.1753, simple_loss=0.2287, pruned_loss=0.06092, over 12951.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2311, pruned_loss=0.05765, over 2577870.68 frames. ], batch size: 158, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:23:34,848 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=544034.3333333334, ans=0.125 2024-06-22 06:23:34,864 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=544034.3333333334, ans=0.125 2024-06-22 06:23:36,072 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=544034.3333333334, ans=0.125 2024-06-22 06:23:37,226 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=544034.3333333334, ans=0.125 2024-06-22 06:23:39,986 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=544034.3333333334, ans=0.05 2024-06-22 06:23:46,714 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2024-06-22 06:23:47,359 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.94 vs. limit=22.5 2024-06-22 06:24:11,203 INFO [train.py:1028] (0/2) Epoch 30, batch 3400, loss[loss=0.1766, simple_loss=0.2424, pruned_loss=0.05539, over 12556.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2306, pruned_loss=0.05785, over 2575994.06 frames. 
], batch size: 22, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:24:11,389 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=544126.0, ans=0.125 2024-06-22 06:24:11,904 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544126.0, ans=0.1 2024-06-22 06:24:18,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=544144.3333333334, ans=0.125 2024-06-22 06:24:21,460 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.325e+02 2.497e+02 2.812e+02 4.183e+02, threshold=4.995e+02, percent-clipped=0.0 2024-06-22 06:24:29,846 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.56 vs. limit=22.5 2024-06-22 06:24:31,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=544181.0, ans=0.0 2024-06-22 06:24:43,787 INFO [train.py:1028] (0/2) Epoch 30, batch 3450, loss[loss=0.1822, simple_loss=0.2344, pruned_loss=0.06499, over 12771.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.23, pruned_loss=0.05743, over 2577107.96 frames. ], batch size: 176, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:24:45,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=544217.6666666666, ans=0.2 2024-06-22 06:24:48,783 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=544217.6666666666, ans=0.2 2024-06-22 06:24:48,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544217.6666666666, ans=0.1 2024-06-22 06:24:56,428 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.68 vs. limit=10.0 2024-06-22 06:24:58,363 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2024-06-22 06:25:07,317 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=544272.6666666666, ans=12.0 2024-06-22 06:25:12,237 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=544291.0, ans=0.1 2024-06-22 06:25:16,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=544291.0, ans=0.125 2024-06-22 06:25:17,167 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2024-06-22 06:25:17,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=544291.0, ans=0.125 2024-06-22 06:25:19,404 INFO [train.py:1028] (0/2) Epoch 30, batch 3500, loss[loss=0.1692, simple_loss=0.2227, pruned_loss=0.05783, over 12949.00 frames. ], tot_loss[loss=0.1723, simple_loss=0.2297, pruned_loss=0.05743, over 2575490.84 frames. 
], batch size: 33, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:25:20,294 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=544309.3333333334, ans=0.5 2024-06-22 06:25:23,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.const_attention_rate, batch_count=544309.3333333334, ans=0.025 2024-06-22 06:25:23,966 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=544309.3333333334, ans=0.125 2024-06-22 06:25:29,834 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.344e+02 2.476e+02 2.641e+02 3.205e+02, threshold=4.951e+02, percent-clipped=0.0 2024-06-22 06:25:39,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=544364.3333333334, ans=0.025 2024-06-22 06:25:40,488 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=544364.3333333334, ans=0.125 2024-06-22 06:25:40,574 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2024-06-22 06:25:42,327 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=544364.3333333334, ans=0.0 2024-06-22 06:25:50,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.const_attention_rate, batch_count=544382.6666666666, ans=0.025 2024-06-22 06:25:52,096 INFO [train.py:1028] (0/2) Epoch 30, batch 3550, loss[loss=0.1638, simple_loss=0.2218, pruned_loss=0.0529, over 13165.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.2295, pruned_loss=0.05735, over 2577271.05 frames. ], batch size: 95, lr: 1.93e-03, grad_scale: 64.0 2024-06-22 06:26:11,486 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=544437.6666666666, ans=0.0 2024-06-22 06:26:14,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=544456.0, ans=0.125 2024-06-22 06:26:25,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544474.3333333334, ans=0.1 2024-06-22 06:26:25,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=544474.3333333334, ans=0.025 2024-06-22 06:26:27,401 INFO [train.py:1028] (0/2) Epoch 30, batch 3600, loss[loss=0.1743, simple_loss=0.2374, pruned_loss=0.05564, over 12994.00 frames. ], tot_loss[loss=0.1718, simple_loss=0.2291, pruned_loss=0.05727, over 2579832.57 frames. ], batch size: 48, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:26:33,625 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544511.0, ans=0.1 2024-06-22 06:26:34,993 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:26:35,418 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. 
limit=15.0 2024-06-22 06:26:38,254 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.274e+02 2.363e+02 2.491e+02 3.554e+02, threshold=4.727e+02, percent-clipped=0.0 2024-06-22 06:26:39,186 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=544511.0, ans=0.0 2024-06-22 06:26:40,557 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=544529.3333333334, ans=0.125 2024-06-22 06:26:50,633 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=544547.6666666666, ans=0.125 2024-06-22 06:26:55,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=544566.0, ans=0.125 2024-06-22 06:27:00,467 INFO [train.py:1028] (0/2) Epoch 30, batch 3650, loss[loss=0.1773, simple_loss=0.2339, pruned_loss=0.06039, over 13081.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.229, pruned_loss=0.05715, over 2577906.76 frames. ], batch size: 102, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:27:02,427 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=544584.3333333334, ans=0.2 2024-06-22 06:27:04,647 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.25 vs. limit=22.5 2024-06-22 06:27:07,178 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.91 vs. limit=10.0 2024-06-22 06:27:08,778 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=544602.6666666666, ans=0.2 2024-06-22 06:27:08,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=544602.6666666666, ans=0.125 2024-06-22 06:27:21,082 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544621.0, ans=0.1 2024-06-22 06:27:21,090 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=544621.0, ans=0.125 2024-06-22 06:27:27,397 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=544639.3333333334, ans=0.0 2024-06-22 06:27:37,198 INFO [train.py:1028] (0/2) Epoch 30, batch 3700, loss[loss=0.1605, simple_loss=0.2192, pruned_loss=0.05084, over 13277.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2285, pruned_loss=0.05695, over 2583230.74 frames. 
], batch size: 72, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:27:42,577 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=544676.0, ans=0.125 2024-06-22 06:27:47,813 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.265e+02 2.404e+02 2.569e+02 3.591e+02, threshold=4.808e+02, percent-clipped=0.0 2024-06-22 06:27:47,992 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=544694.3333333334, ans=0.125 2024-06-22 06:27:48,769 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544694.3333333334, ans=0.1 2024-06-22 06:28:00,212 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=544731.0, ans=0.025 2024-06-22 06:28:04,151 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.74 vs. limit=15.0 2024-06-22 06:28:11,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=544749.3333333334, ans=10.0 2024-06-22 06:28:13,243 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=544767.6666666666, ans=0.025 2024-06-22 06:28:13,739 INFO [train.py:1028] (0/2) Epoch 30, batch 3750, loss[loss=0.176, simple_loss=0.2377, pruned_loss=0.05719, over 12373.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2285, pruned_loss=0.05687, over 2585090.69 frames. ], batch size: 22, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:28:16,661 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=544767.6666666666, ans=0.0 2024-06-22 06:28:22,499 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=544786.0, ans=0.5 2024-06-22 06:28:31,036 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=544804.3333333334, ans=0.0 2024-06-22 06:28:36,224 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=544822.6666666666, ans=0.0 2024-06-22 06:28:37,660 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=544822.6666666666, ans=0.0 2024-06-22 06:28:46,202 INFO [train.py:1028] (0/2) Epoch 30, batch 3800, loss[loss=0.1723, simple_loss=0.2223, pruned_loss=0.06111, over 13252.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2282, pruned_loss=0.05668, over 2583784.79 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:28:52,309 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=22.5 2024-06-22 06:28:55,414 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=544877.6666666666, ans=0.0 2024-06-22 06:28:55,593 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.70 vs. 
limit=15.0 2024-06-22 06:28:57,882 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.338e+02 2.569e+02 2.719e+02 3.938e+02, threshold=5.137e+02, percent-clipped=0.0 2024-06-22 06:29:02,742 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544896.0, ans=0.1 2024-06-22 06:29:05,443 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=22.5 2024-06-22 06:29:10,833 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=544914.3333333334, ans=0.1 2024-06-22 06:29:24,695 INFO [train.py:1028] (0/2) Epoch 30, batch 3850, loss[loss=0.1685, simple_loss=0.2241, pruned_loss=0.05641, over 12992.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2277, pruned_loss=0.05609, over 2582811.75 frames. ], batch size: 144, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:29:42,163 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=544987.6666666666, ans=0.125 2024-06-22 06:29:52,601 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=545024.3333333334, ans=0.05 2024-06-22 06:29:56,427 INFO [train.py:1028] (0/2) Epoch 30, batch 3900, loss[loss=0.1614, simple_loss=0.2146, pruned_loss=0.05413, over 13200.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2277, pruned_loss=0.05638, over 2585890.42 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:29:57,093 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:29:59,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=545042.6666666666, ans=0.125 2024-06-22 06:30:01,083 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=545042.6666666666, ans=0.125 2024-06-22 06:30:02,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=545061.0, ans=0.0 2024-06-22 06:30:07,294 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.279e+02 2.443e+02 2.743e+02 3.515e+02, threshold=4.886e+02, percent-clipped=0.0 2024-06-22 06:30:16,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=545079.3333333334, ans=0.125 2024-06-22 06:30:17,108 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.const_attention_rate, batch_count=545079.3333333334, ans=0.025 2024-06-22 06:30:23,040 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.17 vs. limit=15.0 2024-06-22 06:30:25,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=545097.6666666666, ans=0.125 2024-06-22 06:30:33,946 INFO [train.py:1028] (0/2) Epoch 30, batch 3950, loss[loss=0.1528, simple_loss=0.2, pruned_loss=0.05281, over 13072.00 frames. ], tot_loss[loss=0.169, simple_loss=0.2267, pruned_loss=0.05566, over 2586956.35 frames. 
], batch size: 132, lr: 1.92e-03, grad_scale: 16.0 2024-06-22 06:30:42,046 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=545152.6666666666, ans=0.95 2024-06-22 06:30:43,423 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=545152.6666666666, ans=0.0 2024-06-22 06:30:46,933 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545152.6666666666, ans=0.1 2024-06-22 06:31:03,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=545207.6666666666, ans=0.125 2024-06-22 06:31:08,358 INFO [train.py:1028] (0/2) Epoch 30, batch 4000, loss[loss=0.161, simple_loss=0.2231, pruned_loss=0.0494, over 12966.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.226, pruned_loss=0.05558, over 2582043.74 frames. ], batch size: 39, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:31:16,958 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=545244.3333333334, ans=0.125 2024-06-22 06:31:19,553 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=545244.3333333334, ans=0.0 2024-06-22 06:31:20,623 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.255e+02 2.431e+02 2.673e+02 3.583e+02, threshold=4.862e+02, percent-clipped=0.0 2024-06-22 06:31:45,002 INFO [train.py:1028] (0/2) Epoch 30, batch 4050, loss[loss=0.1703, simple_loss=0.2169, pruned_loss=0.0619, over 11115.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2254, pruned_loss=0.05529, over 2580048.88 frames. ], batch size: 304, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:31:53,573 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545336.0, ans=0.1 2024-06-22 06:31:56,857 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=545336.0, ans=0.1 2024-06-22 06:32:20,347 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.25 vs. limit=15.0 2024-06-22 06:32:21,331 INFO [train.py:1028] (0/2) Epoch 30, batch 4100, loss[loss=0.1815, simple_loss=0.2266, pruned_loss=0.0682, over 13002.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.2254, pruned_loss=0.05564, over 2576292.21 frames. 
], batch size: 102, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:32:22,239 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=545409.3333333334, ans=0.5 2024-06-22 06:32:33,674 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.345e+02 2.523e+02 2.737e+02 3.715e+02, threshold=5.046e+02, percent-clipped=0.0 2024-06-22 06:32:34,447 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=545446.0, ans=0.125 2024-06-22 06:32:41,831 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=545464.3333333334, ans=0.125 2024-06-22 06:32:46,720 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2024-06-22 06:32:47,180 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:32:47,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=545482.6666666666, ans=0.0 2024-06-22 06:32:54,361 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=545501.0, ans=0.0 2024-06-22 06:32:54,926 INFO [train.py:1028] (0/2) Epoch 30, batch 4150, loss[loss=0.17, simple_loss=0.2282, pruned_loss=0.05587, over 13062.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2251, pruned_loss=0.0553, over 2574626.32 frames. ], batch size: 55, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:33:12,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=545537.6666666666, ans=0.0 2024-06-22 06:33:15,461 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=545556.0, ans=0.2 2024-06-22 06:33:23,860 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545574.3333333334, ans=0.1 2024-06-22 06:33:31,191 INFO [train.py:1028] (0/2) Epoch 30, batch 4200, loss[loss=0.1846, simple_loss=0.2368, pruned_loss=0.06623, over 13212.00 frames. ], tot_loss[loss=0.1676, simple_loss=0.2247, pruned_loss=0.05526, over 2577266.01 frames. 
], batch size: 103, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:33:35,592 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=545592.6666666666, ans=0.2 2024-06-22 06:33:38,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=545611.0, ans=0.1 2024-06-22 06:33:40,536 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=545611.0, ans=0.2 2024-06-22 06:33:43,047 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.285e+02 2.390e+02 2.552e+02 3.545e+02, threshold=4.781e+02, percent-clipped=0.0 2024-06-22 06:33:47,703 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545629.3333333334, ans=0.1 2024-06-22 06:33:55,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=545647.6666666666, ans=0.125 2024-06-22 06:34:04,132 INFO [train.py:1028] (0/2) Epoch 30, batch 4250, loss[loss=0.1685, simple_loss=0.2333, pruned_loss=0.05183, over 13317.00 frames. ], tot_loss[loss=0.1674, simple_loss=0.2246, pruned_loss=0.05508, over 2579309.59 frames. ], batch size: 46, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:34:08,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=545684.3333333334, ans=0.1 2024-06-22 06:34:10,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=545702.6666666666, ans=0.0 2024-06-22 06:34:29,983 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=545739.3333333334, ans=0.125 2024-06-22 06:34:33,510 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2024-06-22 06:34:34,803 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.30 vs. limit=22.5 2024-06-22 06:34:37,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=545757.6666666666, ans=0.125 2024-06-22 06:34:39,802 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=545776.0, ans=0.0 2024-06-22 06:34:40,427 INFO [train.py:1028] (0/2) Epoch 30, batch 4300, loss[loss=0.1717, simple_loss=0.2335, pruned_loss=0.055, over 13224.00 frames. ], tot_loss[loss=0.1676, simple_loss=0.2247, pruned_loss=0.05521, over 2581277.28 frames. 
], batch size: 59, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:34:41,856 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=545776.0, ans=0.2 2024-06-22 06:34:48,358 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545794.3333333334, ans=0.1 2024-06-22 06:34:52,111 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.278e+02 2.453e+02 2.629e+02 3.559e+02, threshold=4.906e+02, percent-clipped=0.0 2024-06-22 06:34:56,069 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=545812.6666666666, ans=0.125 2024-06-22 06:34:56,206 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=545812.6666666666, ans=0.125 2024-06-22 06:35:00,228 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545831.0, ans=0.1 2024-06-22 06:35:08,054 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=545849.3333333334, ans=0.035 2024-06-22 06:35:08,501 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.82 vs. limit=10.0 2024-06-22 06:35:13,427 INFO [train.py:1028] (0/2) Epoch 30, batch 4350, loss[loss=0.1597, simple_loss=0.2193, pruned_loss=0.05002, over 13254.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2239, pruned_loss=0.05477, over 2585903.97 frames. ], batch size: 59, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:35:36,959 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2024-06-22 06:35:40,415 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=545922.6666666666, ans=0.125 2024-06-22 06:35:41,122 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=545922.6666666666, ans=0.125 2024-06-22 06:35:42,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=545941.0, ans=0.125 2024-06-22 06:35:48,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=545941.0, ans=0.0 2024-06-22 06:35:49,561 INFO [train.py:1028] (0/2) Epoch 30, batch 4400, loss[loss=0.1765, simple_loss=0.229, pruned_loss=0.06198, over 13234.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2236, pruned_loss=0.05486, over 2586351.37 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:35:52,529 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. 
limit=15.0 2024-06-22 06:36:01,572 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.305e+02 2.446e+02 2.686e+02 4.055e+02, threshold=4.893e+02, percent-clipped=0.0 2024-06-22 06:36:12,446 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=546014.3333333334, ans=0.125 2024-06-22 06:36:13,788 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=546014.3333333334, ans=0.0 2024-06-22 06:36:26,190 INFO [train.py:1028] (0/2) Epoch 30, batch 4450, loss[loss=0.1462, simple_loss=0.2105, pruned_loss=0.04092, over 12892.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.224, pruned_loss=0.0553, over 2581393.08 frames. ], batch size: 33, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:36:34,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=546069.3333333334, ans=0.0 2024-06-22 06:36:39,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.const_attention_rate, batch_count=546087.6666666666, ans=0.025 2024-06-22 06:36:43,350 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546087.6666666666, ans=0.1 2024-06-22 06:36:54,468 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=546124.3333333334, ans=0.125 2024-06-22 06:36:58,216 INFO [train.py:1028] (0/2) Epoch 30, batch 4500, loss[loss=0.1511, simple_loss=0.2084, pruned_loss=0.04688, over 13234.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2236, pruned_loss=0.05493, over 2585823.39 frames. ], batch size: 89, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:37:04,209 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=546161.0, ans=0.1 2024-06-22 06:37:06,666 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=546161.0, ans=0.125 2024-06-22 06:37:07,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=546161.0, ans=0.0 2024-06-22 06:37:09,866 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.301e+02 2.474e+02 2.689e+02 3.444e+02, threshold=4.949e+02, percent-clipped=0.0 2024-06-22 06:37:13,571 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=546179.3333333334, ans=12.0 2024-06-22 06:37:14,622 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=546179.3333333334, ans=0.07 2024-06-22 06:37:19,318 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=22.5 2024-06-22 06:37:30,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=546216.0, ans=0.2 2024-06-22 06:37:33,867 INFO [train.py:1028] (0/2) Epoch 30, batch 4550, loss[loss=0.1591, simple_loss=0.2158, pruned_loss=0.05118, over 13228.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2236, pruned_loss=0.05485, over 2589781.80 frames. 
], batch size: 52, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:37:52,711 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=546271.0, ans=0.2 2024-06-22 06:38:01,310 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=546307.6666666666, ans=0.025 2024-06-22 06:38:07,115 INFO [train.py:1028] (0/2) Epoch 30, batch 4600, loss[loss=0.1816, simple_loss=0.2338, pruned_loss=0.06467, over 12513.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.2235, pruned_loss=0.05481, over 2585847.05 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:38:08,746 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=546326.0, ans=0.05 2024-06-22 06:38:22,134 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=546344.3333333334, ans=0.0 2024-06-22 06:38:22,188 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=546344.3333333334, ans=0.025 2024-06-22 06:38:22,293 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=12.0 2024-06-22 06:38:24,700 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.296e+02 2.398e+02 2.618e+02 3.492e+02, threshold=4.795e+02, percent-clipped=0.0 2024-06-22 06:38:26,314 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=546362.6666666666, ans=0.125 2024-06-22 06:38:29,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=546362.6666666666, ans=0.125 2024-06-22 06:38:34,370 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.79 vs. limit=10.0 2024-06-22 06:38:45,467 INFO [train.py:1028] (0/2) Epoch 30, batch 4650, loss[loss=0.1771, simple_loss=0.2252, pruned_loss=0.0645, over 13076.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2231, pruned_loss=0.05491, over 2588668.11 frames. ], batch size: 132, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:39:00,445 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=546454.3333333334, ans=0.0 2024-06-22 06:39:04,471 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2024-06-22 06:39:08,166 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=546472.6666666666, ans=0.025 2024-06-22 06:39:16,128 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=546491.0, ans=0.125 2024-06-22 06:39:18,072 INFO [train.py:1028] (0/2) Epoch 30, batch 4700, loss[loss=0.1608, simple_loss=0.2239, pruned_loss=0.04885, over 12974.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.223, pruned_loss=0.05489, over 2584403.15 frames. 
], batch size: 26, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:39:20,395 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=546509.3333333334, ans=0.0 2024-06-22 06:39:26,597 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=22.5 2024-06-22 06:39:33,258 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.253e+02 2.363e+02 2.510e+02 3.512e+02, threshold=4.725e+02, percent-clipped=0.0 2024-06-22 06:39:36,598 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=546546.0, ans=0.0 2024-06-22 06:39:47,797 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=546582.6666666666, ans=0.2 2024-06-22 06:39:51,583 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546582.6666666666, ans=0.1 2024-06-22 06:39:54,228 INFO [train.py:1028] (0/2) Epoch 30, batch 4750, loss[loss=0.1808, simple_loss=0.2285, pruned_loss=0.06656, over 12591.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2227, pruned_loss=0.05504, over 2581395.81 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:39:55,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=546601.0, ans=0.125 2024-06-22 06:39:55,841 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=546601.0, ans=0.2 2024-06-22 06:39:58,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=546601.0, ans=0.125 2024-06-22 06:39:59,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=546601.0, ans=0.125 2024-06-22 06:40:08,838 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=546637.6666666666, ans=0.09899494936611666 2024-06-22 06:40:10,812 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=546637.6666666666, ans=0.1 2024-06-22 06:40:14,493 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2024-06-22 06:40:30,614 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=546674.3333333334, ans=0.025 2024-06-22 06:40:31,840 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=546692.6666666666, ans=0.125 2024-06-22 06:40:32,369 INFO [train.py:1028] (0/2) Epoch 30, batch 4800, loss[loss=0.1629, simple_loss=0.2233, pruned_loss=0.05131, over 13313.00 frames. ], tot_loss[loss=0.1663, simple_loss=0.2227, pruned_loss=0.05501, over 2577641.12 frames. ], batch size: 63, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:40:37,606 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. 
limit=15.0 2024-06-22 06:40:42,588 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=546711.0, ans=0.0 2024-06-22 06:40:44,295 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.304e+02 2.431e+02 2.591e+02 3.257e+02, threshold=4.862e+02, percent-clipped=0.0 2024-06-22 06:40:51,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=546747.6666666666, ans=0.2 2024-06-22 06:40:58,655 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=546766.0, ans=0.025 2024-06-22 06:41:04,977 INFO [train.py:1028] (0/2) Epoch 30, batch 4850, loss[loss=0.1606, simple_loss=0.2208, pruned_loss=0.05021, over 13295.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2234, pruned_loss=0.05516, over 2574715.63 frames. ], batch size: 89, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:41:24,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=546839.3333333334, ans=0.125 2024-06-22 06:41:42,408 INFO [train.py:1028] (0/2) Epoch 30, batch 4900, loss[loss=0.1679, simple_loss=0.2254, pruned_loss=0.05522, over 13197.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2232, pruned_loss=0.05511, over 2575697.56 frames. ], batch size: 59, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:41:45,289 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=546876.0, ans=0.2 2024-06-22 06:41:46,742 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.06 vs. limit=15.0 2024-06-22 06:41:49,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=546894.3333333334, ans=0.125 2024-06-22 06:41:54,238 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.252e+02 2.436e+02 2.671e+02 3.763e+02, threshold=4.871e+02, percent-clipped=0.0 2024-06-22 06:41:55,740 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546912.6666666666, ans=0.1 2024-06-22 06:42:11,417 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=546949.3333333334, ans=0.1 2024-06-22 06:42:14,825 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=546967.6666666666, ans=0.0 2024-06-22 06:42:15,174 INFO [train.py:1028] (0/2) Epoch 30, batch 4950, loss[loss=0.1812, simple_loss=0.2254, pruned_loss=0.06848, over 10963.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2232, pruned_loss=0.05542, over 2570377.47 frames. ], batch size: 304, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:42:30,799 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.81 vs. 
limit=15.0 2024-06-22 06:42:35,453 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=547004.3333333334, ans=0.05 2024-06-22 06:42:42,819 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=547022.6666666666, ans=0.125 2024-06-22 06:42:50,436 INFO [train.py:1028] (0/2) Epoch 30, batch 5000, loss[loss=0.1655, simple_loss=0.2179, pruned_loss=0.05653, over 13101.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2232, pruned_loss=0.05509, over 2573869.06 frames. ], batch size: 95, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:42:52,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=547059.3333333334, ans=0.125 2024-06-22 06:42:59,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547077.6666666666, ans=0.1 2024-06-22 06:43:03,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.363e+02 2.524e+02 2.722e+02 3.510e+02, threshold=5.049e+02, percent-clipped=0.0 2024-06-22 06:43:13,679 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2024-06-22 06:43:14,118 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=547114.3333333334, ans=0.025 2024-06-22 06:43:15,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=547114.3333333334, ans=0.0 2024-06-22 06:43:27,447 INFO [train.py:1028] (0/2) Epoch 30, batch 5050, loss[loss=0.163, simple_loss=0.2229, pruned_loss=0.0516, over 13000.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.2231, pruned_loss=0.05468, over 2573322.24 frames. ], batch size: 36, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:43:29,089 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=547151.0, ans=0.0 2024-06-22 06:43:30,377 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547151.0, ans=0.1 2024-06-22 06:43:34,391 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. 
limit=15.0 2024-06-22 06:43:36,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=547169.3333333334, ans=0.025 2024-06-22 06:43:42,449 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=547187.6666666666, ans=0.04949747468305833 2024-06-22 06:43:46,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=547206.0, ans=0.5 2024-06-22 06:43:53,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=547224.3333333334, ans=0.025 2024-06-22 06:43:53,918 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=547224.3333333334, ans=0.125 2024-06-22 06:43:58,859 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=547224.3333333334, ans=0.2 2024-06-22 06:44:00,732 INFO [train.py:1028] (0/2) Epoch 30, batch 5100, loss[loss=0.1578, simple_loss=0.2266, pruned_loss=0.04451, over 12813.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2235, pruned_loss=0.05502, over 2569486.49 frames. ], batch size: 39, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:44:12,643 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.365e+02 2.515e+02 2.693e+02 3.527e+02, threshold=5.030e+02, percent-clipped=0.0 2024-06-22 06:44:34,101 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=547316.0, ans=0.015 2024-06-22 06:44:34,274 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=547316.0, ans=0.125 2024-06-22 06:44:36,563 INFO [train.py:1028] (0/2) Epoch 30, batch 5150, loss[loss=0.1581, simple_loss=0.2106, pruned_loss=0.05282, over 13142.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2231, pruned_loss=0.05524, over 2571405.89 frames. ], batch size: 132, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:44:40,066 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=547334.3333333334, ans=0.2 2024-06-22 06:45:00,538 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=547389.3333333334, ans=0.125 2024-06-22 06:45:04,982 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.42 vs. limit=10.0 2024-06-22 06:45:08,940 INFO [train.py:1028] (0/2) Epoch 30, batch 5200, loss[loss=0.1727, simple_loss=0.2183, pruned_loss=0.06354, over 13165.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2228, pruned_loss=0.05494, over 2574342.40 frames. ], batch size: 95, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:45:13,287 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=547426.0, ans=0.125 2024-06-22 06:45:24,035 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.292e+02 2.447e+02 2.570e+02 3.700e+02, threshold=4.894e+02, percent-clipped=0.0 2024-06-22 06:45:25,847 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.78 vs. 
limit=15.0 2024-06-22 06:45:45,950 INFO [train.py:1028] (0/2) Epoch 30, batch 5250, loss[loss=0.1708, simple_loss=0.2274, pruned_loss=0.05707, over 13270.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2234, pruned_loss=0.05498, over 2571203.62 frames. ], batch size: 52, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:45:49,292 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=547517.6666666666, ans=0.125 2024-06-22 06:45:58,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=547536.0, ans=0.0 2024-06-22 06:46:00,603 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=15.0 2024-06-22 06:46:23,121 INFO [train.py:1028] (0/2) Epoch 30, batch 5300, loss[loss=0.1653, simple_loss=0.2134, pruned_loss=0.0586, over 12994.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.2231, pruned_loss=0.05464, over 2568454.00 frames. ], batch size: 144, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:46:25,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=547609.3333333334, ans=0.0 2024-06-22 06:46:27,869 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=547609.3333333334, ans=0.0 2024-06-22 06:46:33,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=547627.6666666666, ans=0.0 2024-06-22 06:46:34,930 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.257e+02 2.353e+02 2.499e+02 2.882e+02, threshold=4.707e+02, percent-clipped=0.0 2024-06-22 06:46:37,668 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=547646.0, ans=0.125 2024-06-22 06:46:39,062 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=547646.0, ans=0.0 2024-06-22 06:46:42,410 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=547664.3333333334, ans=0.1 2024-06-22 06:46:44,365 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=547664.3333333334, ans=0.1 2024-06-22 06:46:45,774 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=547664.3333333334, ans=0.0 2024-06-22 06:46:53,556 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2024-06-22 06:46:56,702 INFO [train.py:1028] (0/2) Epoch 30, batch 5350, loss[loss=0.1807, simple_loss=0.2443, pruned_loss=0.05859, over 11976.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2226, pruned_loss=0.05476, over 2574602.05 frames. 
], batch size: 17, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:46:58,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=547701.0, ans=10.0 2024-06-22 06:46:59,678 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=547701.0, ans=0.125 2024-06-22 06:47:05,736 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:47:09,672 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=547737.6666666666, ans=0.125 2024-06-22 06:47:10,991 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.53 vs. limit=15.0 2024-06-22 06:47:32,900 INFO [train.py:1028] (0/2) Epoch 30, batch 5400, loss[loss=0.1679, simple_loss=0.2175, pruned_loss=0.05917, over 12300.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2231, pruned_loss=0.0555, over 2567113.84 frames. ], batch size: 241, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:47:33,107 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:47:45,054 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.328e+02 2.500e+02 2.720e+02 3.221e+02, threshold=5.001e+02, percent-clipped=0.0 2024-06-22 06:47:59,344 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=547866.0, ans=0.0 2024-06-22 06:48:01,842 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=547866.0, ans=0.125 2024-06-22 06:48:05,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=547866.0, ans=0.125 2024-06-22 06:48:06,472 INFO [train.py:1028] (0/2) Epoch 30, batch 5450, loss[loss=0.1791, simple_loss=0.2347, pruned_loss=0.0617, over 12210.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.223, pruned_loss=0.05502, over 2571184.12 frames. ], batch size: 25, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:48:21,690 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=547902.6666666666, ans=0.125 2024-06-22 06:48:38,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=547957.6666666666, ans=0.0 2024-06-22 06:48:43,312 INFO [train.py:1028] (0/2) Epoch 30, batch 5500, loss[loss=0.1969, simple_loss=0.2427, pruned_loss=0.07556, over 12175.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.2225, pruned_loss=0.05492, over 2565210.96 frames. 
], batch size: 240, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:48:49,354 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=547994.3333333334, ans=10.0 2024-06-22 06:48:50,035 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=547994.3333333334, ans=0.125 2024-06-22 06:48:54,812 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.242e+02 2.394e+02 2.576e+02 3.164e+02, threshold=4.787e+02, percent-clipped=0.0 2024-06-22 06:48:55,331 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.98 vs. limit=15.0 2024-06-22 06:48:57,837 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=548012.6666666666, ans=0.09899494936611666 2024-06-22 06:49:03,446 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.99 vs. limit=6.0 2024-06-22 06:49:20,235 INFO [train.py:1028] (0/2) Epoch 30, batch 5550, loss[loss=0.1805, simple_loss=0.2308, pruned_loss=0.06509, over 13233.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2224, pruned_loss=0.05485, over 2568716.56 frames. ], batch size: 43, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:49:33,433 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=548104.3333333334, ans=0.125 2024-06-22 06:49:42,381 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=548122.6666666666, ans=0.125 2024-06-22 06:49:43,143 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=548122.6666666666, ans=0.0 2024-06-22 06:49:44,868 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:49:52,498 INFO [train.py:1028] (0/2) Epoch 30, batch 5600, loss[loss=0.1528, simple_loss=0.2077, pruned_loss=0.04896, over 13262.00 frames. ], tot_loss[loss=0.1658, simple_loss=0.2219, pruned_loss=0.0548, over 2570476.69 frames. ], batch size: 89, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:50:00,606 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=548177.6666666666, ans=0.125 2024-06-22 06:50:05,079 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.245e+02 2.382e+02 2.569e+02 3.098e+02, threshold=4.763e+02, percent-clipped=0.0 2024-06-22 06:50:07,716 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2024-06-22 06:50:08,376 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=12.0 2024-06-22 06:50:08,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=548196.0, ans=0.09899494936611666 2024-06-22 06:50:24,629 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.31 vs. 
limit=15.0 2024-06-22 06:50:29,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=548232.6666666666, ans=0.0 2024-06-22 06:50:31,022 INFO [train.py:1028] (0/2) Epoch 30, batch 5650, loss[loss=0.1698, simple_loss=0.2209, pruned_loss=0.05937, over 12493.00 frames. ], tot_loss[loss=0.166, simple_loss=0.2224, pruned_loss=0.05479, over 2575196.48 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:50:33,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=548251.0, ans=10.0 2024-06-22 06:50:34,154 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.62 vs. limit=12.0 2024-06-22 06:50:44,624 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=548287.6666666666, ans=0.07 2024-06-22 06:50:47,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=548287.6666666666, ans=0.125 2024-06-22 06:50:49,898 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=548287.6666666666, ans=0.1 2024-06-22 06:51:01,789 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=548324.3333333334, ans=0.0 2024-06-22 06:51:04,497 INFO [train.py:1028] (0/2) Epoch 30, batch 5700, loss[loss=0.1561, simple_loss=0.2182, pruned_loss=0.04698, over 13245.00 frames. ], tot_loss[loss=0.1663, simple_loss=0.2226, pruned_loss=0.05497, over 2579473.42 frames. ], batch size: 63, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:51:11,062 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2024-06-22 06:51:11,786 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2024-06-22 06:51:16,335 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.294e+02 2.443e+02 2.624e+02 3.434e+02, threshold=4.885e+02, percent-clipped=0.0 2024-06-22 06:51:22,737 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=22.5 2024-06-22 06:51:24,562 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=548379.3333333334, ans=0.0 2024-06-22 06:51:29,811 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.59 vs. limit=22.5 2024-06-22 06:51:35,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.38 vs. limit=10.0 2024-06-22 06:51:37,400 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=548416.0, ans=0.125 2024-06-22 06:51:40,532 INFO [train.py:1028] (0/2) Epoch 30, batch 5750, loss[loss=0.1794, simple_loss=0.2292, pruned_loss=0.06475, over 12674.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2232, pruned_loss=0.05518, over 2579002.70 frames. 
], batch size: 176, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:51:59,523 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=548489.3333333334, ans=0.0 2024-06-22 06:51:59,783 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.69 vs. limit=6.0 2024-06-22 06:52:16,347 INFO [train.py:1028] (0/2) Epoch 30, batch 5800, loss[loss=0.1655, simple_loss=0.219, pruned_loss=0.05596, over 12777.00 frames. ], tot_loss[loss=0.1679, simple_loss=0.2243, pruned_loss=0.05577, over 2578662.41 frames. ], batch size: 176, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:52:27,989 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.265e+02 2.415e+02 2.528e+02 3.231e+02, threshold=4.829e+02, percent-clipped=0.0 2024-06-22 06:52:28,896 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=548562.6666666666, ans=0.125 2024-06-22 06:52:30,249 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=548562.6666666666, ans=0.0 2024-06-22 06:52:40,909 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.51 vs. limit=22.5 2024-06-22 06:52:41,447 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=15.0 2024-06-22 06:52:48,816 INFO [train.py:1028] (0/2) Epoch 30, batch 5850, loss[loss=0.1832, simple_loss=0.2371, pruned_loss=0.06463, over 12565.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2262, pruned_loss=0.05646, over 2577520.61 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:52:49,030 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=548617.6666666666, ans=0.125 2024-06-22 06:52:53,802 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.04 vs. limit=15.0 2024-06-22 06:53:05,148 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-06-22 06:53:12,703 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.45 vs. limit=15.0 2024-06-22 06:53:23,110 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548691.0, ans=0.1 2024-06-22 06:53:24,842 INFO [train.py:1028] (0/2) Epoch 30, batch 5900, loss[loss=0.1626, simple_loss=0.2157, pruned_loss=0.05478, over 13112.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2275, pruned_loss=0.05667, over 2578261.56 frames. 
], batch size: 121, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:53:30,279 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=548709.3333333334, ans=0.0 2024-06-22 06:53:35,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=548727.6666666666, ans=0.0 2024-06-22 06:53:36,828 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.319e+02 2.458e+02 2.746e+02 3.634e+02, threshold=4.916e+02, percent-clipped=0.0 2024-06-22 06:53:38,229 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=548746.0, ans=0.0 2024-06-22 06:53:38,387 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=548746.0, ans=0.0 2024-06-22 06:53:46,808 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=548764.3333333334, ans=0.125 2024-06-22 06:53:46,975 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.65 vs. limit=15.0 2024-06-22 06:53:58,331 INFO [train.py:1028] (0/2) Epoch 30, batch 5950, loss[loss=0.1607, simple_loss=0.2165, pruned_loss=0.0525, over 13172.00 frames. ], tot_loss[loss=0.1716, simple_loss=0.2287, pruned_loss=0.05726, over 2582294.85 frames. ], batch size: 121, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:54:28,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=548874.3333333334, ans=0.025 2024-06-22 06:54:34,782 INFO [train.py:1028] (0/2) Epoch 30, batch 6000, loss[loss=0.2149, simple_loss=0.2639, pruned_loss=0.08299, over 12340.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2297, pruned_loss=0.05767, over 2574381.70 frames. ], batch size: 241, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:54:34,782 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 06:54:42,444 INFO [train.py:1060] (0/2) Epoch 30, validation: loss=0.1955, simple_loss=0.253, pruned_loss=0.06898, over 351949.00 frames. 2024-06-22 06:54:42,445 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18096MB 2024-06-22 06:54:54,434 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.382e+02 2.554e+02 2.764e+02 3.575e+02, threshold=5.107e+02, percent-clipped=0.0 2024-06-22 06:55:00,743 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.86 vs. limit=15.0 2024-06-22 06:55:03,777 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.80 vs. limit=15.0 2024-06-22 06:55:16,037 INFO [train.py:1028] (0/2) Epoch 30, batch 6050, loss[loss=0.1776, simple_loss=0.2355, pruned_loss=0.05983, over 12909.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2318, pruned_loss=0.05846, over 2577143.99 frames. ], batch size: 39, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:55:43,377 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2024-06-22 06:55:52,233 INFO [train.py:1028] (0/2) Epoch 30, batch 6100, loss[loss=0.1671, simple_loss=0.2175, pruned_loss=0.05832, over 13117.00 frames. 
], tot_loss[loss=0.1755, simple_loss=0.2331, pruned_loss=0.05897, over 2578859.78 frames. ], batch size: 121, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:55:54,413 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=549076.0, ans=0.125 2024-06-22 06:55:58,052 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.29 vs. limit=15.0 2024-06-22 06:56:04,331 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.373e+02 2.551e+02 2.819e+02 3.758e+02, threshold=5.101e+02, percent-clipped=0.0 2024-06-22 06:56:05,197 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=549112.6666666666, ans=0.0 2024-06-22 06:56:10,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549112.6666666666, ans=0.125 2024-06-22 06:56:11,224 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.43 vs. limit=10.0 2024-06-22 06:56:12,759 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.59 vs. limit=15.0 2024-06-22 06:56:14,060 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2024-06-22 06:56:26,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.const_attention_rate, batch_count=549149.3333333334, ans=0.025 2024-06-22 06:56:28,780 INFO [train.py:1028] (0/2) Epoch 30, batch 6150, loss[loss=0.1947, simple_loss=0.2475, pruned_loss=0.07097, over 10824.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2338, pruned_loss=0.05917, over 2578719.46 frames. ], batch size: 304, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:56:32,930 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=549167.6666666666, ans=0.0 2024-06-22 06:56:47,499 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=22.5 2024-06-22 06:56:51,145 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2024-06-22 06:57:02,220 INFO [train.py:1028] (0/2) Epoch 30, batch 6200, loss[loss=0.2017, simple_loss=0.2698, pruned_loss=0.06681, over 13245.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2358, pruned_loss=0.05985, over 2576151.58 frames. 
], batch size: 89, lr: 1.92e-03, grad_scale: 64.0 2024-06-22 06:57:02,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=549259.3333333334, ans=0.125 2024-06-22 06:57:04,503 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=549259.3333333334, ans=0.125 2024-06-22 06:57:04,718 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=549259.3333333334, ans=12.0 2024-06-22 06:57:05,977 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.04 vs. limit=15.0 2024-06-22 06:57:11,059 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=549277.6666666666, ans=0.0 2024-06-22 06:57:13,012 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=549277.6666666666, ans=0.07 2024-06-22 06:57:18,109 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.467e+02 2.622e+02 2.815e+02 3.651e+02, threshold=5.245e+02, percent-clipped=0.0 2024-06-22 06:57:19,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=549296.0, ans=0.125 2024-06-22 06:57:26,435 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=549314.3333333334, ans=0.0 2024-06-22 06:57:32,736 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=549332.6666666666, ans=0.125 2024-06-22 06:57:33,664 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=549332.6666666666, ans=10.0 2024-06-22 06:57:37,528 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.76 vs. limit=12.0 2024-06-22 06:57:39,115 INFO [train.py:1028] (0/2) Epoch 30, batch 6250, loss[loss=0.1719, simple_loss=0.229, pruned_loss=0.05741, over 13186.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2369, pruned_loss=0.06015, over 2568756.17 frames. ], batch size: 83, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:57:52,448 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.15 vs. limit=15.0 2024-06-22 06:58:00,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=549406.0, ans=0.0 2024-06-22 06:58:00,691 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=549406.0, ans=0.2 2024-06-22 06:58:02,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.20 vs. limit=15.0 2024-06-22 06:58:03,375 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=549406.0, ans=0.1 2024-06-22 06:58:07,721 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.54 vs. 
limit=15.0 2024-06-22 06:58:12,474 INFO [train.py:1028] (0/2) Epoch 30, batch 6300, loss[loss=0.1942, simple_loss=0.2592, pruned_loss=0.06458, over 10998.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2386, pruned_loss=0.06074, over 2563112.44 frames. ], batch size: 16, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:58:13,340 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=549442.6666666666, ans=0.125 2024-06-22 06:58:14,701 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=549442.6666666666, ans=0.2 2024-06-22 06:58:28,998 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.420e+02 2.576e+02 2.847e+02 4.433e+02, threshold=5.151e+02, percent-clipped=0.0 2024-06-22 06:58:38,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=549497.6666666666, ans=0.0 2024-06-22 06:58:49,385 INFO [train.py:1028] (0/2) Epoch 30, batch 6350, loss[loss=0.1921, simple_loss=0.2534, pruned_loss=0.06543, over 12539.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2402, pruned_loss=0.06091, over 2572411.80 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:59:25,825 INFO [train.py:1028] (0/2) Epoch 30, batch 6400, loss[loss=0.1572, simple_loss=0.2212, pruned_loss=0.04665, over 13232.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2417, pruned_loss=0.06124, over 2574199.64 frames. ], batch size: 67, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 06:59:31,480 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=13.47 vs. limit=15.0 2024-06-22 06:59:38,076 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.430e+02 2.566e+02 2.845e+02 4.157e+02, threshold=5.132e+02, percent-clipped=0.0 2024-06-22 06:59:40,298 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.37 vs. limit=15.0 2024-06-22 06:59:55,456 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=549699.3333333334, ans=0.2 2024-06-22 06:59:57,764 INFO [train.py:1028] (0/2) Epoch 30, batch 6450, loss[loss=0.2471, simple_loss=0.2941, pruned_loss=0.1001, over 12465.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2429, pruned_loss=0.06184, over 2579770.23 frames. ], batch size: 202, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:00:02,677 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2024-06-22 07:00:09,319 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=549736.0, ans=15.0 2024-06-22 07:00:11,387 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.54 vs. 
limit=22.5 2024-06-22 07:00:15,148 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=549754.3333333334, ans=0.125 2024-06-22 07:00:19,120 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=549772.6666666666, ans=0.0 2024-06-22 07:00:24,641 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=549772.6666666666, ans=0.125 2024-06-22 07:00:27,285 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=549791.0, ans=0.0 2024-06-22 07:00:34,363 INFO [train.py:1028] (0/2) Epoch 30, batch 6500, loss[loss=0.1743, simple_loss=0.2316, pruned_loss=0.05847, over 10796.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2443, pruned_loss=0.06213, over 2583212.62 frames. ], batch size: 303, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:00:36,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=549809.3333333334, ans=0.025 2024-06-22 07:00:37,691 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2024-06-22 07:00:47,370 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 2.503e+02 2.680e+02 3.120e+02 3.917e+02, threshold=5.361e+02, percent-clipped=0.0 2024-06-22 07:00:48,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=549846.0, ans=0.2 2024-06-22 07:00:48,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=549846.0, ans=0.2 2024-06-22 07:00:51,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=549846.0, ans=0.5 2024-06-22 07:01:00,536 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.72 vs. limit=22.5 2024-06-22 07:01:02,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=549882.6666666666, ans=0.95 2024-06-22 07:01:02,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=549882.6666666666, ans=0.0 2024-06-22 07:01:05,620 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. limit=6.0 2024-06-22 07:01:08,049 INFO [train.py:1028] (0/2) Epoch 30, batch 6550, loss[loss=0.2001, simple_loss=0.2681, pruned_loss=0.06604, over 12747.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2452, pruned_loss=0.06231, over 2587605.06 frames. ], batch size: 22, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:01:17,491 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=549919.3333333334, ans=0.0 2024-06-22 07:01:27,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=549956.0, ans=0.0 2024-06-22 07:01:40,234 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.96 vs. 
limit=12.0 2024-06-22 07:01:41,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549974.3333333334, ans=0.125 2024-06-22 07:01:41,981 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=549974.3333333334, ans=0.2 2024-06-22 07:01:42,851 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2024-06-22 07:01:43,081 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=549974.3333333334, ans=0.125 2024-06-22 07:01:44,223 INFO [train.py:1028] (0/2) Epoch 30, batch 6600, loss[loss=0.1742, simple_loss=0.2441, pruned_loss=0.0521, over 13232.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2453, pruned_loss=0.06235, over 2590218.76 frames. ], batch size: 72, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:01:45,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=549992.6666666666, ans=0.2 2024-06-22 07:01:45,813 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=549992.6666666666, ans=0.2 2024-06-22 07:01:46,448 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/checkpoint-300000.pt 2024-06-22 07:02:02,088 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.465e+02 2.619e+02 2.836e+02 4.476e+02, threshold=5.239e+02, percent-clipped=0.0 2024-06-22 07:02:03,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=550029.3333333334, ans=0.125 2024-06-22 07:02:04,881 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=550029.3333333334, ans=0.2 2024-06-22 07:02:09,271 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=550047.6666666666, ans=0.125 2024-06-22 07:02:20,513 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=550066.0, ans=0.125 2024-06-22 07:02:22,914 INFO [train.py:1028] (0/2) Epoch 30, batch 6650, loss[loss=0.2051, simple_loss=0.2592, pruned_loss=0.07555, over 12949.00 frames. ], tot_loss[loss=0.186, simple_loss=0.2466, pruned_loss=0.06269, over 2582640.49 frames. ], batch size: 158, lr: 1.92e-03, grad_scale: 32.0 2024-06-22 07:02:23,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=550084.3333333334, ans=0.125 2024-06-22 07:02:23,807 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=550084.3333333334, ans=0.0 2024-06-22 07:02:26,572 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=550084.3333333334, ans=0.2 2024-06-22 07:02:30,769 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.15 vs. 
limit=22.5 2024-06-22 07:02:33,826 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=550102.6666666666, ans=0.125 2024-06-22 07:02:46,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550139.3333333334, ans=0.1 2024-06-22 07:02:58,291 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.84 vs. limit=15.0 2024-06-22 07:03:00,990 INFO [train.py:1028] (0/2) Epoch 30, batch 6700, loss[loss=0.1916, simple_loss=0.2442, pruned_loss=0.06956, over 12755.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2476, pruned_loss=0.06306, over 2582445.82 frames. ], batch size: 176, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:03:01,858 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=550176.0, ans=0.0 2024-06-22 07:03:13,900 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.466e+02 2.723e+02 3.020e+02 4.955e+02, threshold=5.446e+02, percent-clipped=0.0 2024-06-22 07:03:17,110 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.87 vs. limit=15.0 2024-06-22 07:03:22,373 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=550231.0, ans=0.2 2024-06-22 07:03:22,394 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=550231.0, ans=0.1 2024-06-22 07:03:25,216 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.const_attention_rate, batch_count=550231.0, ans=0.025 2024-06-22 07:03:33,113 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.38 vs. limit=6.0 2024-06-22 07:03:40,729 INFO [train.py:1028] (0/2) Epoch 30, batch 6750, loss[loss=0.2336, simple_loss=0.2853, pruned_loss=0.09089, over 12220.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2477, pruned_loss=0.0632, over 2576322.09 frames. ], batch size: 241, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:03:47,584 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.43 vs. limit=6.0 2024-06-22 07:03:47,994 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=550286.0, ans=0.0 2024-06-22 07:03:55,804 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.88 vs. 
limit=15.0 2024-06-22 07:03:58,980 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=550304.3333333334, ans=0.0 2024-06-22 07:04:07,995 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=550341.0, ans=0.035 2024-06-22 07:04:08,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=550341.0, ans=0.125 2024-06-22 07:04:13,865 INFO [train.py:1028] (0/2) Epoch 30, batch 6800, loss[loss=0.1871, simple_loss=0.2538, pruned_loss=0.06018, over 13187.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2489, pruned_loss=0.06331, over 2579106.42 frames. ], batch size: 67, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:04:16,719 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=550359.3333333334, ans=0.125 2024-06-22 07:04:17,332 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=550359.3333333334, ans=0.125 2024-06-22 07:04:19,124 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=550359.3333333334, ans=0.125 2024-06-22 07:04:26,182 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.471e+02 2.558e+02 2.700e+02 3.304e+02, threshold=5.116e+02, percent-clipped=0.0 2024-06-22 07:04:28,424 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=550396.0, ans=0.125 2024-06-22 07:04:38,281 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=550414.3333333334, ans=0.125 2024-06-22 07:04:50,246 INFO [train.py:1028] (0/2) Epoch 30, batch 6850, loss[loss=0.1835, simple_loss=0.254, pruned_loss=0.05656, over 13266.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2494, pruned_loss=0.06308, over 2582869.20 frames. ], batch size: 63, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:04:59,578 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=550469.3333333334, ans=0.0 2024-06-22 07:05:05,014 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=550487.6666666666, ans=0.125 2024-06-22 07:05:13,851 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=550506.0, ans=0.125 2024-06-22 07:05:23,747 INFO [train.py:1028] (0/2) Epoch 30, batch 6900, loss[loss=0.1914, simple_loss=0.2567, pruned_loss=0.06301, over 13252.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2499, pruned_loss=0.06315, over 2585363.82 frames. 
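The checkpoint.py:75 entry a little above records a periodic snapshot, "Saving checkpoint to zipformer/exp/checkpoint-300000.pt". A minimal sketch of that bookkeeping, assuming checkpoints are written on fixed multiples of the global batch index (the interval of 4000 below is an assumption; the log only shows that the save lands on the round index 300000):

```python
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                          exp_dir="zipformer/exp", save_every_n=4000):
    # save_every_n is an assumed interval, chosen so that a save falls
    # exactly on batch 300000 as in the log line above.
    if batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            f"{exp_dir}/checkpoint-{batch_idx_train}.pt",
        )
```

Naming checkpoints by batch index rather than by epoch is what makes mid-epoch resumption possible.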
], batch size: 49, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:05:30,398 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=550561.0, ans=0.125 2024-06-22 07:05:36,092 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.491e+02 2.750e+02 2.896e+02 4.282e+02, threshold=5.500e+02, percent-clipped=0.0 2024-06-22 07:05:38,724 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=550579.3333333334, ans=0.09899494936611666 2024-06-22 07:05:40,979 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. limit=6.0 2024-06-22 07:05:55,723 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=550616.0, ans=0.05 2024-06-22 07:05:59,615 INFO [train.py:1028] (0/2) Epoch 30, batch 6950, loss[loss=0.1798, simple_loss=0.2321, pruned_loss=0.06373, over 11307.00 frames. ], tot_loss[loss=0.1885, simple_loss=0.2505, pruned_loss=0.06328, over 2579931.05 frames. ], batch size: 16, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:06:08,817 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=550652.6666666666, ans=0.2 2024-06-22 07:06:11,957 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2024-06-22 07:06:24,796 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=550689.3333333334, ans=0.0 2024-06-22 07:06:32,354 INFO [train.py:1028] (0/2) Epoch 30, batch 7000, loss[loss=0.2006, simple_loss=0.2555, pruned_loss=0.0728, over 12917.00 frames. ], tot_loss[loss=0.1883, simple_loss=0.2503, pruned_loss=0.06313, over 2576175.42 frames. ], batch size: 158, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:06:38,378 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=550744.3333333334, ans=0.125 2024-06-22 07:06:39,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=550744.3333333334, ans=0.0 2024-06-22 07:06:40,474 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=550744.3333333334, ans=0.125 2024-06-22 07:06:40,535 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=550744.3333333334, ans=0.125 2024-06-22 07:06:41,176 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=550744.3333333334, ans=0.125 2024-06-22 07:06:41,300 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=550744.3333333334, ans=0.125 2024-06-22 07:06:45,005 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. 
limit=15.0 2024-06-22 07:06:45,297 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.506e+02 2.697e+02 2.954e+02 3.946e+02, threshold=5.393e+02, percent-clipped=0.0 2024-06-22 07:06:52,196 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=550762.6666666666, ans=0.125 2024-06-22 07:07:03,782 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=550799.3333333334, ans=0.125 2024-06-22 07:07:05,045 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=550799.3333333334, ans=0.125 2024-06-22 07:07:09,634 INFO [train.py:1028] (0/2) Epoch 30, batch 7050, loss[loss=0.1991, simple_loss=0.2599, pruned_loss=0.06912, over 12759.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2513, pruned_loss=0.0632, over 2582997.13 frames. ], batch size: 176, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:07:15,949 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=550836.0, ans=0.125 2024-06-22 07:07:16,603 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=550836.0, ans=0.125 2024-06-22 07:07:16,612 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=550836.0, ans=0.2 2024-06-22 07:07:16,644 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=550836.0, ans=0.125 2024-06-22 07:07:20,083 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.76 vs. limit=15.0 2024-06-22 07:07:36,931 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=550891.0, ans=0.125 2024-06-22 07:07:41,184 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2024-06-22 07:07:42,101 INFO [train.py:1028] (0/2) Epoch 30, batch 7100, loss[loss=0.1926, simple_loss=0.2585, pruned_loss=0.06331, over 13205.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2521, pruned_loss=0.06389, over 2575410.84 frames. 
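Throughout these entries the headline loss is consistent with a fixed combination of the two pruned-transducer terms, loss ≈ 0.5 × simple_loss + pruned_loss: at batch 7050 just above, 0.5 × 0.2599 + 0.06912 ≈ 0.1991, exactly the printed value, and the running tot_loss obeys the same relation. A quick numeric check (the 0.5 weight is inferred from the printed numbers, not read out of the recipe code):

```python
# (loss, simple_loss, pruned_loss) triples copied from the batch-7050 entry.
logged = [
    (0.1991, 0.2599, 0.06912),  # per-batch loss
    (0.1889, 0.2513, 0.0632),   # running tot_loss at the same step
]
for loss, simple, pruned in logged:
    reconstructed = 0.5 * simple + pruned
    # agrees with the printed value to its four significant digits
    assert abs(reconstructed - loss) < 5e-4, (loss, reconstructed)
```

Both the per-batch values and the frame-weighted running averages satisfy the same relation, which is a useful sanity check when skimming these logs.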
], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:07:50,399 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=550909.3333333334, ans=0.0 2024-06-22 07:07:55,129 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=550927.6666666666, ans=0.2 2024-06-22 07:07:58,186 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.459e+02 2.637e+02 2.810e+02 3.800e+02, threshold=5.273e+02, percent-clipped=0.0 2024-06-22 07:07:59,636 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=550946.0, ans=0.125 2024-06-22 07:07:59,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=550946.0, ans=0.125 2024-06-22 07:08:04,315 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:08:09,752 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=550964.3333333334, ans=0.0 2024-06-22 07:08:18,773 INFO [train.py:1028] (0/2) Epoch 30, batch 7150, loss[loss=0.2176, simple_loss=0.2786, pruned_loss=0.07826, over 12544.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2525, pruned_loss=0.0637, over 2573491.89 frames. ], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:08:42,913 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=551056.0, ans=0.125 2024-06-22 07:08:50,828 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0 2024-06-22 07:08:51,565 INFO [train.py:1028] (0/2) Epoch 30, batch 7200, loss[loss=0.2122, simple_loss=0.2748, pruned_loss=0.07477, over 13186.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.254, pruned_loss=0.06409, over 2578556.66 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:08:59,948 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=551092.6666666666, ans=0.125 2024-06-22 07:09:03,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=551111.0, ans=0.125 2024-06-22 07:09:03,334 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=551111.0, ans=0.125 2024-06-22 07:09:07,422 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.551e+02 2.712e+02 2.969e+02 3.972e+02, threshold=5.423e+02, percent-clipped=0.0 2024-06-22 07:09:17,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=551147.6666666666, ans=0.125 2024-06-22 07:09:27,979 INFO [train.py:1028] (0/2) Epoch 30, batch 7250, loss[loss=0.1773, simple_loss=0.2461, pruned_loss=0.05426, over 12911.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2545, pruned_loss=0.06389, over 2579511.26 frames. 
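The dense scaling.py:214 traffic reports the current value (ans) of named ScheduledFloat hyperparameters as a function of batch_count: by this point in training every *_skip_rate resolves to 0.0, balancer probs sit at 0.125, bypass scale_min at 0.2, and so on. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints below are illustrative, not the recipe's actual ones:

```python
import bisect

def scheduled_float(points, batch_count):
    """Piecewise-linear lookup over (batch_count, value) pairs,
    clamped to the first/last value outside the breakpoints."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if batch_count <= xs[0]:
        return ys[0]
    if batch_count >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, batch_count)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# A skip-rate-like value that decayed to zero long before this epoch,
# matching e.g. ff2_skip_rate: batch_count=550909.3, ans=0.0 above.
print(scheduled_float([(0.0, 0.1), (20000.0, 0.0)], 550909.3))  # -> 0.0
```

Logging the resolved values rather than the schedule itself is what makes these lines so repetitive, but it also makes any regularizer that is still active this late in training immediately visible.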
], batch size: 36, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:09:38,706 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=551202.6666666666, ans=10.0 2024-06-22 07:09:40,739 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=551221.0, ans=0.125 2024-06-22 07:09:44,924 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.const_attention_rate, batch_count=551221.0, ans=0.025 2024-06-22 07:09:46,582 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.27 vs. limit=22.5 2024-06-22 07:10:03,922 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=551276.0, ans=0.125 2024-06-22 07:10:04,372 INFO [train.py:1028] (0/2) Epoch 30, batch 7300, loss[loss=0.1703, simple_loss=0.2398, pruned_loss=0.05037, over 12865.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2559, pruned_loss=0.06478, over 2579435.95 frames. ], batch size: 36, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:10:12,866 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=551294.3333333334, ans=0.125 2024-06-22 07:10:15,457 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=551294.3333333334, ans=0.0 2024-06-22 07:10:16,535 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.517e+02 2.777e+02 3.038e+02 4.361e+02, threshold=5.554e+02, percent-clipped=0.0 2024-06-22 07:10:18,973 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=551312.6666666666, ans=0.04949747468305833 2024-06-22 07:10:19,649 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:10:21,830 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2024-06-22 07:10:25,487 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=551331.0, ans=0.125 2024-06-22 07:10:37,604 INFO [train.py:1028] (0/2) Epoch 30, batch 7350, loss[loss=0.2296, simple_loss=0.2933, pruned_loss=0.08294, over 13280.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2562, pruned_loss=0.06503, over 2581388.87 frames. ], batch size: 46, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:10:43,772 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=551386.0, ans=0.125 2024-06-22 07:10:55,507 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=551404.3333333334, ans=0.0 2024-06-22 07:10:59,156 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.84 vs. 
limit=22.5 2024-06-22 07:11:11,704 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=551441.0, ans=0.0 2024-06-22 07:11:13,418 INFO [train.py:1028] (0/2) Epoch 30, batch 7400, loss[loss=0.2028, simple_loss=0.2758, pruned_loss=0.06491, over 13304.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2573, pruned_loss=0.06535, over 2587257.14 frames. ], batch size: 63, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:11:14,312 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=551459.3333333334, ans=0.125 2024-06-22 07:11:16,459 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:11:23,868 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=551477.6666666666, ans=0.0 2024-06-22 07:11:26,415 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 2.476e+02 2.627e+02 2.896e+02 3.938e+02, threshold=5.254e+02, percent-clipped=0.0 2024-06-22 07:11:26,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=551496.0, ans=0.125 2024-06-22 07:11:36,915 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551514.3333333334, ans=0.1 2024-06-22 07:11:46,675 INFO [train.py:1028] (0/2) Epoch 30, batch 7450, loss[loss=0.1748, simple_loss=0.2443, pruned_loss=0.05269, over 12684.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2572, pruned_loss=0.06507, over 2581029.93 frames. ], batch size: 29, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:11:52,402 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2024-06-22 07:11:53,547 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=551569.3333333334, ans=0.125 2024-06-22 07:12:06,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=551587.6666666666, ans=0.125 2024-06-22 07:12:09,646 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2024-06-22 07:12:20,738 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=551624.3333333334, ans=0.0 2024-06-22 07:12:20,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=551624.3333333334, ans=0.0 2024-06-22 07:12:21,996 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551624.3333333334, ans=0.1 2024-06-22 07:12:23,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=551642.6666666666, ans=0.0 2024-06-22 07:12:23,858 INFO [train.py:1028] (0/2) Epoch 30, batch 7500, loss[loss=0.193, simple_loss=0.2453, pruned_loss=0.07028, over 10650.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2581, pruned_loss=0.06548, over 2578609.77 frames. 
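The optim.py:487 WARNING lines are a rolling five-number summary (min, 25%, median, 75%, max) of recent gradient norms. In every entry in this section the reported threshold is exactly 2.0 times the median, matching the printed Clipping_scale=2.0: just above, 2.0 × 2.627e+02 = 5.254e+02. A sketch of that rule, assuming the summary is taken over a sliding window of per-step gradient norms:

```python
import torch

def clipping_summary(grad_norms, clipping_scale=2.0):
    """Five-number summary of recent grad norms plus the clip threshold
    (threshold = clipping_scale * median, as observed in the log)."""
    norms = torch.tensor(grad_norms, dtype=torch.float32)
    quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    return quartiles, clipping_scale * quartiles[2]

# Illustrative window whose summary matches the logged line above.
quartiles, threshold = clipping_summary([208.3, 247.6, 262.7, 289.6, 393.8])
print(threshold.item())  # 525.4, i.e. the logged threshold=5.254e+02
```

Despite the WARNING level, percent-clipped=0.0 means nothing was actually clipped on these steps; a persistently nonzero percent-clipped would be the signal to watch.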
], batch size: 303, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:12:25,887 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=551642.6666666666, ans=0.125 2024-06-22 07:12:25,960 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=551642.6666666666, ans=0.125 2024-06-22 07:12:29,077 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=551642.6666666666, ans=0.125 2024-06-22 07:12:33,257 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=551661.0, ans=0.0 2024-06-22 07:12:36,204 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.502e+02 2.647e+02 2.944e+02 3.905e+02, threshold=5.293e+02, percent-clipped=0.0 2024-06-22 07:12:36,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=551679.3333333334, ans=0.125 2024-06-22 07:12:38,338 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2024-06-22 07:12:40,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=551679.3333333334, ans=0.1 2024-06-22 07:12:42,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=551679.3333333334, ans=0.125 2024-06-22 07:12:45,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=551697.6666666666, ans=10.0 2024-06-22 07:12:47,420 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=551697.6666666666, ans=0.0 2024-06-22 07:12:47,900 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=551697.6666666666, ans=0.125 2024-06-22 07:12:55,985 INFO [train.py:1028] (0/2) Epoch 30, batch 7550, loss[loss=0.2056, simple_loss=0.2631, pruned_loss=0.07404, over 12920.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2589, pruned_loss=0.06604, over 2577062.70 frames. ], batch size: 158, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:13:00,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=551734.3333333334, ans=0.0 2024-06-22 07:13:03,643 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=551734.3333333334, ans=0.2 2024-06-22 07:13:09,962 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=551752.6666666666, ans=10.0 2024-06-22 07:13:22,911 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=551789.3333333334, ans=0.0 2024-06-22 07:13:23,670 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=551789.3333333334, ans=0.0 2024-06-22 07:13:30,900 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.26 vs. 
limit=22.5 2024-06-22 07:13:32,462 INFO [train.py:1028] (0/2) Epoch 30, batch 7600, loss[loss=0.2178, simple_loss=0.2762, pruned_loss=0.07972, over 13212.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2598, pruned_loss=0.06643, over 2576436.03 frames. ], batch size: 83, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:13:32,880 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.71 vs. limit=15.0 2024-06-22 07:13:35,642 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.01 vs. limit=15.0 2024-06-22 07:13:45,514 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.528e+02 2.676e+02 2.983e+02 4.483e+02, threshold=5.353e+02, percent-clipped=0.0 2024-06-22 07:13:59,609 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=551881.0, ans=0.125 2024-06-22 07:14:00,907 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=551881.0, ans=0.1 2024-06-22 07:14:11,139 INFO [train.py:1028] (0/2) Epoch 30, batch 7650, loss[loss=0.21, simple_loss=0.27, pruned_loss=0.07495, over 12902.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2602, pruned_loss=0.06662, over 2573537.64 frames. ], batch size: 33, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:14:13,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2024-06-22 07:14:13,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.36 vs. limit=22.5 2024-06-22 07:14:14,192 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=551917.6666666666, ans=0.125 2024-06-22 07:14:15,593 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=551917.6666666666, ans=0.0 2024-06-22 07:14:16,331 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=551917.6666666666, ans=0.125 2024-06-22 07:14:35,425 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=551972.6666666666, ans=0.125 2024-06-22 07:14:45,601 INFO [train.py:1028] (0/2) Epoch 30, batch 7700, loss[loss=0.2086, simple_loss=0.279, pruned_loss=0.06914, over 13300.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2608, pruned_loss=0.06683, over 2570231.16 frames. ], batch size: 63, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:15:00,123 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=552027.6666666666, ans=0.1 2024-06-22 07:15:01,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.62 vs. 
limit=22.5 2024-06-22 07:15:01,196 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.577e+02 2.720e+02 3.051e+02 4.313e+02, threshold=5.440e+02, percent-clipped=0.0 2024-06-22 07:15:04,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=552046.0, ans=0.2 2024-06-22 07:15:06,292 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2024-06-22 07:15:14,946 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=552082.6666666666, ans=0.07 2024-06-22 07:15:22,105 INFO [train.py:1028] (0/2) Epoch 30, batch 7750, loss[loss=0.1872, simple_loss=0.2606, pruned_loss=0.05693, over 13199.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2607, pruned_loss=0.06709, over 2575288.44 frames. ], batch size: 72, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:15:42,696 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552156.0, ans=0.1 2024-06-22 07:15:50,753 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=552156.0, ans=0.0 2024-06-22 07:15:58,284 INFO [train.py:1028] (0/2) Epoch 30, batch 7800, loss[loss=0.2012, simple_loss=0.2587, pruned_loss=0.07181, over 13116.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.261, pruned_loss=0.06704, over 2578579.22 frames. ], batch size: 95, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:16:10,860 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.626e+02 2.735e+02 2.911e+02 3.898e+02, threshold=5.470e+02, percent-clipped=0.0 2024-06-22 07:16:13,254 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=552229.3333333334, ans=0.125 2024-06-22 07:16:29,107 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.22 vs. limit=12.0 2024-06-22 07:16:31,434 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=552284.3333333334, ans=0.2 2024-06-22 07:16:31,981 INFO [train.py:1028] (0/2) Epoch 30, batch 7850, loss[loss=0.1794, simple_loss=0.2407, pruned_loss=0.05909, over 10847.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2614, pruned_loss=0.0673, over 2571582.89 frames. ], batch size: 16, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:16:58,240 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.const_attention_rate, batch_count=552339.3333333334, ans=0.025 2024-06-22 07:17:07,387 INFO [train.py:1028] (0/2) Epoch 30, batch 7900, loss[loss=0.1985, simple_loss=0.2642, pruned_loss=0.06638, over 13133.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2616, pruned_loss=0.06739, over 2571753.18 frames. 
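The learning rate in these entries creeps from 1.92e-03 earlier in the epoch to 1.91e-03 here, consistent with a schedule that decays smoothly in both batch index and epoch, such as icefall's Eden scheduler. A sketch of that shape (the functional form is quoted from memory of icefall's optim.py and the constants are placeholders, so treat this as an assumption rather than the run's exact schedule):

```python
def eden_factor(batch, epoch, lr_batches=5000.0, lr_epochs=4.0):
    """Eden-style decay factor; lr = base_lr * eden_factor(...).
    lr_batches/lr_epochs here are illustrative, not this run's values."""
    batch_part = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_part = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return batch_part * epoch_part

# Deep into epoch 30 the factor is nearly flat between logged steps,
# which is why lr prints as 1.91e-03 for thousands of consecutive batches.
print(eden_factor(300000, 30) / eden_factor(295000, 30))  # ~0.99
```

The quarter-power decay means the three-significant-digit lr shown in the log can stay constant for a very long stretch even though the underlying value is still falling.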
], batch size: 77, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:17:20,273 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.574e+02 2.807e+02 3.027e+02 4.035e+02, threshold=5.615e+02, percent-clipped=0.0 2024-06-22 07:17:31,001 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=552431.0, ans=0.0 2024-06-22 07:17:31,002 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=552431.0, ans=0.125 2024-06-22 07:17:31,051 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=552431.0, ans=0.125 2024-06-22 07:17:44,215 INFO [train.py:1028] (0/2) Epoch 30, batch 7950, loss[loss=0.2172, simple_loss=0.2752, pruned_loss=0.0796, over 10607.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2619, pruned_loss=0.06752, over 2574816.46 frames. ], batch size: 304, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:17:56,551 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=552486.0, ans=0.07 2024-06-22 07:17:59,278 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=552504.3333333334, ans=0.125 2024-06-22 07:18:03,168 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2024-06-22 07:18:07,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=552522.6666666666, ans=0.0 2024-06-22 07:18:16,634 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.28 vs. limit=15.0 2024-06-22 07:18:17,595 INFO [train.py:1028] (0/2) Epoch 30, batch 8000, loss[loss=0.2035, simple_loss=0.2745, pruned_loss=0.0663, over 12536.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2624, pruned_loss=0.06742, over 2571998.55 frames. ], batch size: 29, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:18:23,192 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2024-06-22 07:18:27,272 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2024-06-22 07:18:30,009 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 2.538e+02 2.746e+02 2.956e+02 4.531e+02, threshold=5.491e+02, percent-clipped=0.0 2024-06-22 07:18:34,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552596.0, ans=0.1 2024-06-22 07:18:41,961 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=552614.3333333334, ans=0.125 2024-06-22 07:18:53,863 INFO [train.py:1028] (0/2) Epoch 30, batch 8050, loss[loss=0.2056, simple_loss=0.2729, pruned_loss=0.06909, over 13207.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2619, pruned_loss=0.06705, over 2572105.74 frames. 
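The scaling.py:1023 Whitening entries compare a per-module statistic against a limit (metric=11.28 vs. limit=15.0 just above). The statistic, as reconstructed here from icefall's scaling.py, measures how far the feature covariance is from a multiple of the identity: it is 1.0 for perfectly white features and grows as the covariance concentrates in a few directions. This is a hedged reimplementation, not the module's verbatim code (the real one also splits channels into groups, per the num_groups field):

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """Whiteness of features x with shape (num_frames, num_channels):
    1.0 iff the covariance is a multiple of the identity, larger otherwise."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = x.t() @ x / x.shape[0]
    mean_diag = cov.diagonal().mean()
    # mean diagonal of cov @ cov, using that cov is symmetric
    mean_diag_sq = (cov * cov).sum() / cov.shape[0]
    return mean_diag_sq / (mean_diag ** 2 + 1e-20)

white = torch.randn(10000, 256)
print(whitening_metric(white))     # close to 1.0: already white
low_rank = torch.randn(10000, 8) @ torch.randn(8, 256)
print(whitening_metric(low_rank))  # ~32 (= 256/8): far from white
```

The printed pairs make it easy to spot modules whose activations drift far from white (metric well above the limit) versus those comfortably inside it, like the two modules logged just above.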
], batch size: 83, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:18:59,584 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=552669.3333333334, ans=0.1 2024-06-22 07:19:15,801 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:19:21,596 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=552724.3333333334, ans=0.125 2024-06-22 07:19:21,605 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=552724.3333333334, ans=0.125 2024-06-22 07:19:21,627 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=552724.3333333334, ans=0.125 2024-06-22 07:19:25,597 INFO [train.py:1028] (0/2) Epoch 30, batch 8100, loss[loss=0.1946, simple_loss=0.2608, pruned_loss=0.06415, over 13158.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2624, pruned_loss=0.0673, over 2576036.57 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:19:27,222 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=552742.6666666666, ans=0.0 2024-06-22 07:19:28,418 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=552742.6666666666, ans=0.0 2024-06-22 07:19:29,677 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=552742.6666666666, ans=0.125 2024-06-22 07:19:37,679 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.489e+02 2.604e+02 2.730e+02 3.763e+02, threshold=5.208e+02, percent-clipped=0.0 2024-06-22 07:19:47,291 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552779.3333333334, ans=0.1 2024-06-22 07:19:59,286 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=552816.0, ans=0.125 2024-06-22 07:20:01,174 INFO [train.py:1028] (0/2) Epoch 30, batch 8150, loss[loss=0.1918, simple_loss=0.2566, pruned_loss=0.06347, over 13114.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2628, pruned_loss=0.06705, over 2579818.99 frames. ], batch size: 121, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:20:08,016 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=552852.6666666666, ans=0.125 2024-06-22 07:20:13,655 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:20:18,068 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=552871.0, ans=0.0 2024-06-22 07:20:22,130 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.80 vs. limit=10.0 2024-06-22 07:20:29,890 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2024-06-22 07:20:33,348 INFO [train.py:1028] (0/2) Epoch 30, batch 8200, loss[loss=0.2217, simple_loss=0.2827, pruned_loss=0.08031, over 13146.00 frames. 
], tot_loss[loss=0.199, simple_loss=0.2633, pruned_loss=0.06737, over 2583510.42 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:20:45,031 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=552944.3333333334, ans=0.0 2024-06-22 07:20:46,179 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.538e+02 2.710e+02 2.864e+02 3.723e+02, threshold=5.420e+02, percent-clipped=0.0 2024-06-22 07:20:57,784 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=552981.0, ans=0.125 2024-06-22 07:20:57,830 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=552981.0, ans=0.0 2024-06-22 07:21:03,649 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=552999.3333333334, ans=0.05 2024-06-22 07:21:09,855 INFO [train.py:1028] (0/2) Epoch 30, batch 8250, loss[loss=0.1926, simple_loss=0.2686, pruned_loss=0.05834, over 13301.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2642, pruned_loss=0.06785, over 2584884.33 frames. ], batch size: 52, lr: 1.91e-03, grad_scale: 64.0 2024-06-22 07:21:10,608 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=553017.6666666666, ans=0.2 2024-06-22 07:21:16,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=553036.0, ans=0.125 2024-06-22 07:21:22,890 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=553054.3333333334, ans=0.125 2024-06-22 07:21:28,877 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.29 vs. limit=6.0 2024-06-22 07:21:40,802 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.54 vs. limit=12.0 2024-06-22 07:21:45,136 INFO [train.py:1028] (0/2) Epoch 30, batch 8300, loss[loss=0.2141, simple_loss=0.275, pruned_loss=0.07659, over 13053.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2633, pruned_loss=0.06728, over 2580914.01 frames. ], batch size: 102, lr: 1.91e-03, grad_scale: 64.0 2024-06-22 07:21:48,255 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553109.3333333334, ans=0.1 2024-06-22 07:21:57,314 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.590e+02 2.754e+02 3.013e+02 4.165e+02, threshold=5.508e+02, percent-clipped=0.0 2024-06-22 07:22:01,072 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.06 vs. 
limit=10.0 2024-06-22 07:22:11,252 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=553182.6666666666, ans=0.125 2024-06-22 07:22:12,648 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=553182.6666666666, ans=0.0 2024-06-22 07:22:13,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=553182.6666666666, ans=0.125 2024-06-22 07:22:17,574 INFO [train.py:1028] (0/2) Epoch 30, batch 8350, loss[loss=0.1999, simple_loss=0.2634, pruned_loss=0.06825, over 13201.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2632, pruned_loss=0.06718, over 2581754.95 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:22:26,412 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=553219.3333333334, ans=0.0 2024-06-22 07:22:28,762 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=553219.3333333334, ans=0.0 2024-06-22 07:22:31,276 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=553237.6666666666, ans=0.2 2024-06-22 07:22:37,038 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=553256.0, ans=0.125 2024-06-22 07:22:38,647 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553256.0, ans=0.1 2024-06-22 07:22:49,775 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=553292.6666666666, ans=0.0 2024-06-22 07:22:50,410 INFO [train.py:1028] (0/2) Epoch 30, batch 8400, loss[loss=0.187, simple_loss=0.2522, pruned_loss=0.06091, over 12980.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2632, pruned_loss=0.06731, over 2578043.86 frames. ], batch size: 39, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:23:00,115 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.07 vs. limit=15.0 2024-06-22 07:23:02,576 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2024-06-22 07:23:06,550 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.483e+02 2.610e+02 2.783e+02 3.423e+02, threshold=5.220e+02, percent-clipped=0.0 2024-06-22 07:23:23,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=553366.0, ans=0.0 2024-06-22 07:23:25,770 INFO [train.py:1028] (0/2) Epoch 30, batch 8450, loss[loss=0.2041, simple_loss=0.2684, pruned_loss=0.06985, over 13141.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2643, pruned_loss=0.06772, over 2580376.80 frames. 
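Every line in this log carries the prefix (0/2): rank 0 of a 2-process DDP run, with each rank writing its own tagged stream. The layout (timestamp, level, [file:lineno], (rank/world_size), message) is reproducible with a standard logging formatter along these lines (a sketch of the observed format, not icefall's actual setup code):

```python
import logging

def setup_rank_logger(rank: int, world_size: int) -> None:
    # Reproduces the layout of the surrounding lines, e.g.
    # "2024-06-22 07:22:17,574 INFO [train.py:1028] (0/2) Epoch 30, ..."
    fmt = (
        "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] "
        f"({rank}/{world_size}) %(message)s"
    )
    logging.basicConfig(format=fmt, level=logging.INFO)

setup_rank_logger(0, 2)
logging.info("Epoch 30, batch 8450, ...")
```

Only rank 0's stream appears in this section; filtering on the (0/2) tag is the easy way to de-interleave when both ranks write to the same file.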
], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:23:26,697 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=553384.3333333334, ans=0.0 2024-06-22 07:23:33,311 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553402.6666666666, ans=0.1 2024-06-22 07:23:54,223 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=553439.3333333334, ans=0.0 2024-06-22 07:23:54,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=553439.3333333334, ans=0.0 2024-06-22 07:23:54,839 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=553457.6666666666, ans=0.125 2024-06-22 07:23:55,546 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=553457.6666666666, ans=0.2 2024-06-22 07:23:55,594 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=553457.6666666666, ans=0.125 2024-06-22 07:23:59,315 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=22.5 2024-06-22 07:24:02,256 INFO [train.py:1028] (0/2) Epoch 30, batch 8500, loss[loss=0.2223, simple_loss=0.2828, pruned_loss=0.08084, over 12500.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2646, pruned_loss=0.06783, over 2578931.52 frames. ], batch size: 29, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:24:05,093 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=553476.0, ans=0.0 2024-06-22 07:24:15,390 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.506e+02 2.648e+02 2.914e+02 3.831e+02, threshold=5.296e+02, percent-clipped=0.0 2024-06-22 07:24:35,318 INFO [train.py:1028] (0/2) Epoch 30, batch 8550, loss[loss=0.1745, simple_loss=0.2416, pruned_loss=0.05375, over 12772.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2642, pruned_loss=0.06734, over 2576623.40 frames. ], batch size: 22, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:24:41,645 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553586.0, ans=0.1 2024-06-22 07:24:55,469 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=553604.3333333334, ans=0.0 2024-06-22 07:25:01,936 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553622.6666666666, ans=0.1 2024-06-22 07:25:12,587 INFO [train.py:1028] (0/2) Epoch 30, batch 8600, loss[loss=0.2012, simple_loss=0.2585, pruned_loss=0.07195, over 13118.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2648, pruned_loss=0.06766, over 2574795.90 frames. ], batch size: 112, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:25:16,390 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.48 vs. 
limit=15.0 2024-06-22 07:25:17,024 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=553659.3333333334, ans=6.0 2024-06-22 07:25:26,134 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.516e+02 2.631e+02 2.881e+02 4.266e+02, threshold=5.261e+02, percent-clipped=0.0 2024-06-22 07:25:30,925 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=553696.0, ans=0.125 2024-06-22 07:25:49,672 INFO [train.py:1028] (0/2) Epoch 30, batch 8650, loss[loss=0.2126, simple_loss=0.2786, pruned_loss=0.07324, over 13007.00 frames. ], tot_loss[loss=0.2, simple_loss=0.265, pruned_loss=0.06747, over 2577505.17 frames. ], batch size: 102, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:25:52,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=553751.0, ans=0.125 2024-06-22 07:26:03,914 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=553787.6666666666, ans=0.025 2024-06-22 07:26:08,022 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.30 vs. limit=10.0 2024-06-22 07:26:16,832 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=553824.3333333334, ans=0.125 2024-06-22 07:26:22,090 INFO [train.py:1028] (0/2) Epoch 30, batch 8700, loss[loss=0.1978, simple_loss=0.2715, pruned_loss=0.06202, over 13214.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2651, pruned_loss=0.06781, over 2573347.68 frames. ], batch size: 59, lr: 1.91e-03, grad_scale: 16.0 2024-06-22 07:26:36,439 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 2.613e+02 2.762e+02 2.977e+02 3.624e+02, threshold=5.524e+02, percent-clipped=0.0 2024-06-22 07:26:42,520 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=553897.6666666666, ans=0.2 2024-06-22 07:26:47,565 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553897.6666666666, ans=0.1 2024-06-22 07:26:56,044 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=553916.0, ans=0.1 2024-06-22 07:26:59,244 INFO [train.py:1028] (0/2) Epoch 30, batch 8750, loss[loss=0.2132, simple_loss=0.2739, pruned_loss=0.07623, over 13060.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2647, pruned_loss=0.06782, over 2569321.74 frames. ], batch size: 121, lr: 1.91e-03, grad_scale: 16.0 2024-06-22 07:27:15,854 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=553971.0, ans=0.125 2024-06-22 07:27:32,619 INFO [train.py:1028] (0/2) Epoch 30, batch 8800, loss[loss=0.187, simple_loss=0.2583, pruned_loss=0.05786, over 13264.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2655, pruned_loss=0.06819, over 2574198.64 frames. 
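grad_scale in these summaries is the fp16 dynamic loss-scaling factor, and it moves in powers of two: 32.0 through most of the section, 64.0 around batches 8250-8300, back to 32.0, then 16.0 around batches 8700-8750. That is the classic dynamic-loss-scaling pattern: double after a stretch of overflow-free steps, halve when a step produces inf/nan gradients. A generic sketch with PyTorch's GradScaler (the growth/backoff policy shown is the library default, not necessarily what this recipe configures):

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,     # the scale seen through most of this section
    growth_factor=2.0,   # 32 -> 64 after enough stable steps
    backoff_factor=0.5,  # 64 -> 32 -> 16 after overflowing steps
)

def fp16_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # silently skipped if gradients overflowed
    scaler.update()         # grows or backs off the scale for the next step
```

A scale oscillating in a narrow band like 16-64 is healthy; a scale collapsing toward 1 would indicate persistent fp16 overflow.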
], batch size: 72, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:27:39,458 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=554026.0, ans=0.125 2024-06-22 07:27:39,929 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.12 vs. limit=22.5 2024-06-22 07:27:40,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=554026.0, ans=0.0 2024-06-22 07:27:42,342 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554044.3333333334, ans=0.1 2024-06-22 07:27:47,189 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=554044.3333333334, ans=0.125 2024-06-22 07:27:49,874 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=554062.6666666666, ans=0.125 2024-06-22 07:27:50,320 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.547e+02 2.772e+02 2.929e+02 3.634e+02, threshold=5.544e+02, percent-clipped=0.0 2024-06-22 07:27:53,804 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=554062.6666666666, ans=0.0 2024-06-22 07:28:06,060 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=554099.3333333334, ans=0.0 2024-06-22 07:28:06,322 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.00 vs. limit=12.0 2024-06-22 07:28:06,763 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=554099.3333333334, ans=0.0 2024-06-22 07:28:09,294 INFO [train.py:1028] (0/2) Epoch 30, batch 8850, loss[loss=0.2239, simple_loss=0.2852, pruned_loss=0.08127, over 12495.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2654, pruned_loss=0.06841, over 2561866.02 frames. ], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:28:28,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=554172.6666666666, ans=0.0 2024-06-22 07:28:33,993 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.70 vs. limit=22.5 2024-06-22 07:28:45,564 INFO [train.py:1028] (0/2) Epoch 30, batch 8900, loss[loss=0.2147, simple_loss=0.2766, pruned_loss=0.07638, over 12875.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2665, pruned_loss=0.06905, over 2560466.09 frames. 
], batch size: 33, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:28:49,669 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554209.3333333334, ans=0.125 2024-06-22 07:28:59,028 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 2.603e+02 2.749e+02 2.970e+02 3.767e+02, threshold=5.498e+02, percent-clipped=0.0 2024-06-22 07:29:00,419 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554246.0, ans=0.125 2024-06-22 07:29:06,862 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=554264.3333333334, ans=0.125 2024-06-22 07:29:10,161 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2024-06-22 07:29:14,563 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=554282.6666666666, ans=0.125 2024-06-22 07:29:17,706 INFO [train.py:1028] (0/2) Epoch 30, batch 8950, loss[loss=0.2074, simple_loss=0.2703, pruned_loss=0.07229, over 12458.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.267, pruned_loss=0.0688, over 2560576.94 frames. ], batch size: 202, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:29:20,527 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554301.0, ans=0.1 2024-06-22 07:29:26,737 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=554319.3333333334, ans=0.125 2024-06-22 07:29:31,214 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2024-06-22 07:29:31,722 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=554337.6666666666, ans=0.0 2024-06-22 07:29:42,521 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554356.0, ans=0.1 2024-06-22 07:29:44,337 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=554356.0, ans=0.0 2024-06-22 07:29:50,494 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2024-06-22 07:29:54,343 INFO [train.py:1028] (0/2) Epoch 30, batch 9000, loss[loss=0.1927, simple_loss=0.2615, pruned_loss=0.06196, over 13314.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2664, pruned_loss=0.06833, over 2567232.72 frames. ], batch size: 46, lr: 1.91e-03, grad_scale: 32.0 2024-06-22 07:29:54,343 INFO [train.py:1051] (0/2) Computing validation loss 2024-06-22 07:30:02,008 INFO [train.py:1060] (0/2) Epoch 30, validation: loss=0.1959, simple_loss=0.2531, pruned_loss=0.06928, over 351949.00 frames. 
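The train.py:1028 progress lines above follow a fixed pattern: loss[...] holds per-frame averages for the current batch, and tot_loss[...] a frames-weighted running average (the fractional frame counts such as 2567232.72 suggest a decayed running sum rather than a plain total). Judging from the logged values, the headline loss is a weighted combination of the simple and pruned transducer losses; the check below is a minimal sketch that reproduces the validation line just above. The 0.5 scale is inferred from the numbers themselves, so treat it as an observation about this log, not a quoted constant.

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Per-frame combination inferred from the log: the validation entry
    # above gives 0.5 * 0.2531 + 0.06928 = 0.1959.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.2531, 0.06928) - 0.1959) < 5e-4

The same relation holds for the per-batch entries, e.g. 0.5 * 0.265 + 0.06747 = 0.2 at batch 8650.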
2024-06-22 07:30:02,009 INFO [train.py:1061] (0/2) Maximum memory allocated so far is 18133MB
2024-06-22 07:30:04,363 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=554392.6666666666, ans=0.125
2024-06-22 07:30:09,656 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554411.0, ans=0.1
2024-06-22 07:30:16,283 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.569e+02 2.721e+02 2.913e+02 3.344e+02, threshold=5.442e+02, percent-clipped=0.0
2024-06-22 07:30:23,146 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=554447.6666666666, ans=0.125
2024-06-22 07:30:29,780 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=554466.0, ans=0.125
2024-06-22 07:30:35,681 INFO [train.py:1028] (0/2) Epoch 30, batch 9050, loss[loss=0.1907, simple_loss=0.2469, pruned_loss=0.06725, over 10890.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2673, pruned_loss=0.06891, over 2566679.34 frames. ], batch size: 16, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:30:36,534 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554484.3333333334, ans=0.125
2024-06-22 07:30:41,322 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=554484.3333333334, ans=0.125
2024-06-22 07:30:59,284 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=554539.3333333334, ans=0.125
2024-06-22 07:31:02,726 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=554557.6666666666, ans=0.0
2024-06-22 07:31:06,657 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-06-22 07:31:07,339 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=554557.6666666666, ans=0.1
2024-06-22 07:31:09,896 INFO [train.py:1028] (0/2) Epoch 30, batch 9100, loss[loss=0.1978, simple_loss=0.2633, pruned_loss=0.06617, over 13249.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2668, pruned_loss=0.06871, over 2568712.45 frames. ], batch size: 72, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:31:10,764 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=554576.0, ans=0.2
2024-06-22 07:31:12,147 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554576.0, ans=0.125
2024-06-22 07:31:12,741 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=554576.0, ans=0.0
2024-06-22 07:31:15,728 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.37 vs. limit=22.5
2024-06-22 07:31:18,518 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=554594.3333333334, ans=0.0
2024-06-22 07:31:24,207 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.515e+02 2.625e+02 2.856e+02 3.797e+02, threshold=5.251e+02, percent-clipped=0.0
2024-06-22 07:31:24,987 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=554612.6666666666, ans=0.125
2024-06-22 07:31:25,617 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554612.6666666666, ans=0.1
2024-06-22 07:31:39,164 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=554631.0, ans=0.125
2024-06-22 07:31:47,437 INFO [train.py:1028] (0/2) Epoch 30, batch 9150, loss[loss=0.188, simple_loss=0.2646, pruned_loss=0.05574, over 13144.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2677, pruned_loss=0.06898, over 2569880.21 frames. ], batch size: 77, lr: 1.91e-03, grad_scale: 16.0
2024-06-22 07:31:57,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=554686.0, ans=0.2
2024-06-22 07:32:19,612 INFO [train.py:1028] (0/2) Epoch 30, batch 9200, loss[loss=0.226, simple_loss=0.2968, pruned_loss=0.0776, over 12974.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2674, pruned_loss=0.06856, over 2572658.31 frames. ], batch size: 36, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:32:26,011 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=554777.6666666666, ans=0.125
2024-06-22 07:32:26,307 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.19 vs. limit=22.5
2024-06-22 07:32:33,800 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.492e+02 2.653e+02 2.807e+02 3.689e+02, threshold=5.307e+02, percent-clipped=0.0
2024-06-22 07:32:43,546 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-06-22 07:32:46,978 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=12.0
2024-06-22 07:32:50,340 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0
2024-06-22 07:32:51,792 INFO [train.py:1028] (0/2) Epoch 30, batch 9250, loss[loss=0.1829, simple_loss=0.2534, pruned_loss=0.05625, over 13197.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2672, pruned_loss=0.06823, over 2573634.42 frames. ], batch size: 67, lr: 1.91e-03, grad_scale: 32.0
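The scaling.py:214 ScheduledFloat entries record hyperparameters inside the Zipformer (dropout probabilities, skip rates, balancer bounds, whitening limits) whose values follow a schedule indexed by batch_count; ans is the value currently in effect. A minimal sketch of such a piecewise-linear schedule is below; the breakpoints in the example are illustrative, not the recipe's actual schedule, and icefall's real ScheduledFloat class carries more machinery (default values, arithmetic on schedules).

def piecewise_linear(batch_count: float,
                     points: list[tuple[float, float]]) -> float:
    # Interpolate linearly between (batch_count, value) breakpoints,
    # holding the endpoint values outside the given range.
    (x0, y0) = points[0]
    if batch_count <= x0:
        return y0
    for (x1, y1) in points[1:]:
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
        x0, y0 = x1, y1
    return y0  # past the last breakpoint, hold the final value

# e.g. a dropout annealing from 0.3 to 0.1 over the first 20k batches
# (hypothetical breakpoints); deep into training it sits at 0.1:
p = piecewise_linear(554594.0, [(0.0, 0.3), (20000.0, 0.1)])  # -> 0.1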
2024-06-22 07:32:55,882 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=554851.0, ans=0.125
2024-06-22 07:32:57,050 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=554851.0, ans=0.0
2024-06-22 07:33:06,198 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554869.3333333334, ans=0.1
2024-06-22 07:33:12,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=554887.6666666666, ans=0.125
2024-06-22 07:33:13,372 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=554906.0, ans=0.025
2024-06-22 07:33:13,923 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554906.0, ans=0.1
2024-06-22 07:33:16,662 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=554906.0, ans=0.0
2024-06-22 07:33:19,043 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554906.0, ans=0.125
2024-06-22 07:33:19,084 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=554906.0, ans=0.125
2024-06-22 07:33:23,349 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=554924.3333333334, ans=0.125
2024-06-22 07:33:25,968 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554924.3333333334, ans=0.1
2024-06-22 07:33:27,099 INFO [train.py:1028] (0/2) Epoch 30, batch 9300, loss[loss=0.1802, simple_loss=0.2464, pruned_loss=0.05702, over 13194.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2669, pruned_loss=0.06792, over 2571241.77 frames. ], batch size: 40, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:33:37,654 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554961.0, ans=0.125
2024-06-22 07:33:37,708 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=554961.0, ans=0.025
2024-06-22 07:33:38,974 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=554979.3333333334, ans=0.2
2024-06-22 07:33:40,902 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.517e+02 2.650e+02 2.815e+02 3.642e+02, threshold=5.301e+02, percent-clipped=0.0
2024-06-22 07:33:48,076 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.22 vs. limit=15.0
2024-06-22 07:33:51,607 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=555016.0, ans=0.125
2024-06-22 07:33:58,504 INFO [train.py:1028] (0/2) Epoch 30, batch 9350, loss[loss=0.1988, simple_loss=0.2715, pruned_loss=0.06302, over 12647.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2676, pruned_loss=0.06832, over 2567846.57 frames. ], batch size: 22, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:34:03,305 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555034.3333333334, ans=0.1
2024-06-22 07:34:11,599 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=555071.0, ans=0.0
2024-06-22 07:34:14,895 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0
2024-06-22 07:34:15,989 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=555071.0, ans=0.07
2024-06-22 07:34:19,115 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=555089.3333333334, ans=15.0
2024-06-22 07:34:25,941 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.81 vs. limit=15.0
2024-06-22 07:34:29,140 INFO [train.py:1028] (0/2) Epoch 30, batch 9400, loss[loss=0.2229, simple_loss=0.2922, pruned_loss=0.07675, over 13238.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2682, pruned_loss=0.06852, over 2566892.19 frames. ], batch size: 52, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:34:33,544 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555126.0, ans=0.1
2024-06-22 07:34:35,429 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.60 vs. limit=10.0
2024-06-22 07:34:37,190 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=555144.3333333334, ans=0.125
2024-06-22 07:34:39,554 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=555144.3333333334, ans=0.125
2024-06-22 07:34:41,532 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=555162.6666666666, ans=0.125
2024-06-22 07:34:42,578 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.573e+02 2.770e+02 2.922e+02 3.661e+02, threshold=5.539e+02, percent-clipped=0.0
2024-06-22 07:34:44,646 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555162.6666666666, ans=0.1
2024-06-22 07:34:47,652 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0
2024-06-22 07:34:49,699 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=555181.0, ans=0.07
2024-06-22 07:34:59,789 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5
2024-06-22 07:35:00,138 INFO [train.py:1028] (0/2) Epoch 30, batch 9450, loss[loss=0.2056, simple_loss=0.2718, pruned_loss=0.06975, over 12561.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2687, pruned_loss=0.06881, over 2567741.11 frames. ], batch size: 22, lr: 1.91e-03, grad_scale: 32.0
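The optim.py:487 WARNING lines summarize the optimizer's gradient-clipping bookkeeping: the five numbers are the min/25%/median/75%/max of recently observed gradient norms, and in every line above the threshold is approximately Clipping_scale times the median (e.g. 5.539e+02 vs. 2.0 * 2.770e+02), with percent-clipped reporting how often the threshold was exceeded. The sketch below reproduces those statistics from a buffer of per-batch norms; the buffer-based bookkeeping is an assumption, not ScaledAdam's exact implementation.

import torch

def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Quartiles of recent gradient norms, a threshold of
    # clipping_scale * median, and the fraction clipped.
    quartiles = torch.quantile(
        grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    )
    threshold = clipping_scale * quartiles[2]
    clipped = (grad_norms > threshold).float().mean() * 100.0
    return quartiles, threshold, clipped

norms = torch.tensor([226.2, 257.3, 277.0, 292.2, 366.1])  # illustrative
q, thr, pct = clipping_stats(norms)
print("grad-norm quartiles", q.tolist(),
      "threshold", thr.item(), f"percent-clipped={pct.item():.1f}")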
2024-06-22 07:35:01,512 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=555217.6666666666, ans=0.0
2024-06-22 07:35:05,552 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=555236.0, ans=0.2
2024-06-22 07:35:15,517 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=555254.3333333334, ans=10.0
2024-06-22 07:35:24,205 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=555272.6666666666, ans=0.2
2024-06-22 07:35:25,901 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555272.6666666666, ans=0.1
2024-06-22 07:35:27,957 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=555291.0, ans=0.0
2024-06-22 07:35:33,276 INFO [train.py:1028] (0/2) Epoch 30, batch 9500, loss[loss=0.1905, simple_loss=0.2605, pruned_loss=0.06026, over 13222.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2681, pruned_loss=0.06829, over 2577055.25 frames. ], batch size: 43, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:35:45,097 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=555346.0, ans=0.125
2024-06-22 07:35:46,672 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.528e+02 2.691e+02 2.937e+02 3.844e+02, threshold=5.381e+02, percent-clipped=0.0
2024-06-22 07:35:47,454 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=555346.0, ans=0.125
2024-06-22 07:35:59,344 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.28 vs. limit=22.5
2024-06-22 07:36:02,925 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.34 vs. limit=15.0
2024-06-22 07:36:03,785 INFO [train.py:1028] (0/2) Epoch 30, batch 9550, loss[loss=0.1935, simple_loss=0.2622, pruned_loss=0.06235, over 12954.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2677, pruned_loss=0.06854, over 2573168.96 frames. ], batch size: 39, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:36:24,178 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.const_attention_rate, batch_count=555456.0, ans=0.025
2024-06-22 07:36:28,638 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=555456.0, ans=0.2
2024-06-22 07:36:32,351 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.82 vs. limit=12.0
2024-06-22 07:36:36,602 INFO [train.py:1028] (0/2) Epoch 30, batch 9600, loss[loss=0.2169, simple_loss=0.2765, pruned_loss=0.07867, over 10171.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2674, pruned_loss=0.06828, over 2571247.47 frames. ], batch size: 304, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:36:39,301 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=555492.6666666666, ans=0.07
2024-06-22 07:36:50,257 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.470e+02 2.696e+02 2.979e+02 4.468e+02, threshold=5.393e+02, percent-clipped=0.0
2024-06-22 07:36:50,944 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=555529.3333333334, ans=0.125
2024-06-22 07:36:59,185 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=555547.6666666666, ans=0.025
2024-06-22 07:37:04,240 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.26 vs. limit=10.0
2024-06-22 07:37:07,492 INFO [train.py:1028] (0/2) Epoch 30, batch 9650, loss[loss=0.1852, simple_loss=0.2462, pruned_loss=0.06211, over 13102.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2672, pruned_loss=0.06884, over 2562097.99 frames. ], batch size: 132, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:37:07,619 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=555584.3333333334, ans=0.125
2024-06-22 07:37:15,059 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=11.14 vs. limit=15.0
2024-06-22 07:37:19,759 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=555621.0, ans=0.2
2024-06-22 07:37:25,100 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=555621.0, ans=6.0
2024-06-22 07:37:39,583 INFO [train.py:1028] (0/2) Epoch 30, batch 9700, loss[loss=0.2071, simple_loss=0.266, pruned_loss=0.07408, over 12969.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2665, pruned_loss=0.06877, over 2557120.55 frames. ], batch size: 144, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:37:40,379 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555676.0, ans=0.1
2024-06-22 07:37:44,732 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=15.0
2024-06-22 07:37:45,595 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=555694.3333333334, ans=0.125
2024-06-22 07:37:47,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=555694.3333333334, ans=0.0
2024-06-22 07:37:52,802 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.561e+02 2.813e+02 3.120e+02 3.842e+02, threshold=5.627e+02, percent-clipped=0.0
2024-06-22 07:38:11,918 INFO [train.py:1028] (0/2) Epoch 30, batch 9750, loss[loss=0.1927, simple_loss=0.2507, pruned_loss=0.06734, over 13105.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2658, pruned_loss=0.06831, over 2552850.65 frames. ], batch size: 132, lr: 1.91e-03, grad_scale: 32.0
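The lr values in the progress lines decay slowly (1.91e-03 here, dipping to 1.90e-03 a little later), consistent with icefall's Eden schedule, in which the learning rate shrinks with both the global step and the fractional epoch. The sketch below reproduces the logged value; base_lr, lr_batches, and lr_epochs mirror the recipe's configuration, and the step count is an assumption chosen to match the log, so treat the exact numbers as illustrative.

def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Eden-style decay (sketch): two independent power-law factors,
    # one in the step count and one in the epoch count.
    step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * step_factor * epoch_factor

# step=290_000 is a hypothetical optimizer-step count for this point
# in training; it yields ~1.9e-03, matching the logged 'lr: 1.91e-03'.
print(eden_lr(0.035, step=290_000, epoch=30))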
2024-06-22 07:38:14,304 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5
2024-06-22 07:38:21,346 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=555786.0, ans=0.125
2024-06-22 07:38:30,057 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=555822.6666666666, ans=0.0
2024-06-22 07:38:32,033 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=555822.6666666666, ans=0.0
2024-06-22 07:38:42,717 INFO [train.py:1028] (0/2) Epoch 30, batch 9800, loss[loss=0.1772, simple_loss=0.2426, pruned_loss=0.05586, over 12871.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2647, pruned_loss=0.06768, over 2547261.93 frames. ], batch size: 39, lr: 1.91e-03, grad_scale: 32.0
2024-06-22 07:38:46,152 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=555859.3333333334, ans=0.125
2024-06-22 07:38:52,145 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=555877.6666666666, ans=0.125
2024-06-22 07:38:56,292 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 2.554e+02 2.711e+02 3.023e+02 3.889e+02, threshold=5.423e+02, percent-clipped=0.0
2024-06-22 07:38:58,814 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=555896.0, ans=0.125
2024-06-22 07:39:11,253 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=555932.6666666666, ans=0.125
2024-06-22 07:39:13,613 INFO [train.py:1028] (0/2) Epoch 30, batch 9850, loss[loss=0.1949, simple_loss=0.2657, pruned_loss=0.06205, over 13044.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2644, pruned_loss=0.06749, over 2538364.54 frames. ], batch size: 102, lr: 1.91e-03, grad_scale: 16.0
2024-06-22 07:39:47,041 INFO [train.py:1028] (0/2) Epoch 30, batch 9900, loss[loss=0.199, simple_loss=0.273, pruned_loss=0.06251, over 12943.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2639, pruned_loss=0.06745, over 2529720.08 frames. ], batch size: 39, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:39:48,385 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=556042.6666666666, ans=0.2
2024-06-22 07:39:50,409 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=556042.6666666666, ans=0.125
2024-06-22 07:39:56,556 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=556061.0, ans=0.125
2024-06-22 07:40:01,085 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=556079.3333333334, ans=0.1
2024-06-22 07:40:01,484 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.552e+02 2.665e+02 2.824e+02 3.864e+02, threshold=5.329e+02, percent-clipped=0.0
2024-06-22 07:40:09,935 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.const_attention_rate, batch_count=556097.6666666666, ans=0.025
2024-06-22 07:40:15,667 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=556116.0, ans=0.125
2024-06-22 07:40:18,066 INFO [train.py:1028] (0/2) Epoch 30, batch 9950, loss[loss=0.2212, simple_loss=0.2776, pruned_loss=0.08238, over 12624.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2628, pruned_loss=0.06738, over 2524643.96 frames. ], batch size: 29, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:40:19,036 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.00 vs. limit=15.0
2024-06-22 07:40:34,786 INFO [scaling.py:1119] (0/2) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-06-22 07:40:36,634 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=556171.0, ans=0.0
2024-06-22 07:40:38,791 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=556189.3333333334, ans=0.0
2024-06-22 07:40:51,339 INFO [train.py:1028] (0/2) Epoch 30, batch 10000, loss[loss=0.202, simple_loss=0.2712, pruned_loss=0.06641, over 12628.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2631, pruned_loss=0.06757, over 2485964.73 frames. ], batch size: 22, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:40:51,436 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=556226.0, ans=0.125
2024-06-22 07:40:52,639 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=556226.0, ans=0.125
2024-06-22 07:40:59,476 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=556244.3333333334, ans=0.09899494936611666
2024-06-22 07:41:02,352 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.71 vs. limit=10.0
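The scaling.py:1023 Whitening lines compare a per-module statistic against a limit: the metric is about 1.0 when the (grouped) channel covariance of the module's activations is proportional to the identity and grows as the covariance becomes anisotropic; only when it exceeds the limit does the Whiten module push back via a gradient penalty. The function below is an approximate reconstruction from the logged quantities (num_groups, num_channels, metric), not a verbatim copy of icefall's implementation, which may differ across versions.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    # x: (..., num_channels). Returns ~1.0 for 'white' activations,
    # larger values for anisotropic channel covariances.
    x = x.reshape(-1, x.shape[-1])
    num_frames, num_channels = x.shape
    channels_per_group = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, channels_per_group).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)        # center per group
    covar = torch.matmul(x.transpose(1, 2), x)  # (groups, c, c)
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    sq_mean_diag = (covar ** 2).sum() / (num_groups * channels_per_group)
    return sq_mean_diag / (mean_diag ** 2 + 1e-20)

# White data stays near 1.0, far below limits such as 15.0 in the log:
print(whitening_metric(torch.randn(1000, 256), num_groups=1))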
2024-06-22 07:41:06,729 WARNING [optim.py:487] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.608e+02 2.797e+02 3.027e+02 3.766e+02, threshold=5.594e+02, percent-clipped=0.0
2024-06-22 07:41:16,497 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=556281.0, ans=0.125
2024-06-22 07:41:19,429 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=556299.3333333334, ans=0.0
2024-06-22 07:41:20,912 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.02 vs. limit=15.0
2024-06-22 07:41:23,424 INFO [train.py:1028] (0/2) Epoch 30, batch 10050, loss[loss=0.2083, simple_loss=0.2799, pruned_loss=0.06835, over 12560.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2642, pruned_loss=0.06863, over 2445167.09 frames. ], batch size: 22, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:41:28,419 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0
2024-06-22 07:41:29,098 INFO [scaling.py:1023] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.35 vs. limit=6.0
2024-06-22 07:41:33,705 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=556336.0, ans=0.125
2024-06-22 07:41:47,088 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.const_attention_rate, batch_count=556391.0, ans=0.025
2024-06-22 07:41:53,885 INFO [train.py:1028] (0/2) Epoch 30, batch 10100, loss[loss=0.1711, simple_loss=0.232, pruned_loss=0.0551, over 12089.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2632, pruned_loss=0.06821, over 2425038.12 frames. ], batch size: 17, lr: 1.90e-03, grad_scale: 16.0
2024-06-22 07:41:56,502 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=556409.3333333334, ans=0.125
2024-06-22 07:42:07,486 INFO [checkpoint.py:75] (0/2) Saving checkpoint to zipformer/exp/epoch-30.pt
2024-06-22 07:42:17,272 INFO [train.py:1282] (0/2) Done!
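Training ends by writing the final epoch checkpoint (zipformer/exp/epoch-30.pt). For decoding, icefall recipes usually average the parameters of several saved checkpoints rather than using the last one alone; below is a simplified sketch of that averaging, not icefall's exact average_checkpoints (which can also average the periodically saved model_avg states).

import torch

def average_checkpoints(paths):
    # Average the floating-point tensors of the 'model' state dicts;
    # non-float buffers are taken from the first checkpoint.
    avg, n = None, len(paths)
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().double() if v.is_floating_point() else v.clone()
                   for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    avg[k] += v.double()
    return {k: (v / n).float() if v.is_floating_point() else v
            for k, v in avg.items()}

# e.g. avg = average_checkpoints(
#     [f"zipformer/exp/epoch-{i}.pt" for i in (29, 30)])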